vllm

By mkurman · zorai

Fast LLM inference engine. PagedAttention, continuous batching, tensor parallelism, speculative decoding, and prefix caching. OpenAI-compatible API server. Supports Llama, Mistral, Qwen, DeepSeek, and hundreds of other models.

npx skills add https://github.com/mkurman/zorai --skill vllm

Overview

vLLM is a high-performance, memory-efficient LLM inference engine. It features PagedAttention (near-zero memory waste), continuous batching, tensor parallelism, speculative decoding, prefix caching, and an OpenAI-compatible API.
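
To make the "near-zero memory waste" claim concrete, here is a toy calculation, not vLLM's actual allocator. It compares the unused KV-cache slots when a sequence is stored in fixed-size blocks against reserving a contiguous buffer for the maximum length up front. The block size of 16 tokens and the 2048-token maximum are assumptions chosen for illustration, and both function names are hypothetical.

```python
import math


def paged_waste(seq_len: int, block_size: int = 16) -> int:
    """Unused token slots when the cache is allocated block by block.

    block_size=16 is an assumed value for illustration.
    """
    blocks = math.ceil(seq_len / block_size)
    return blocks * block_size - seq_len


def preallocated_waste(seq_len: int, max_len: int) -> int:
    """Unused token slots when max_len slots are reserved in advance."""
    return max_len - seq_len


# A 100-token sequence: at most one partially filled block is wasted,
# versus everything between the actual length and the reserved maximum.
print(paged_waste(100))               # 12
print(preallocated_waste(100, 2048))  # 1948
```

With paging, the waste per sequence is bounded by one block minus one token, regardless of how long the sequence is allowed to grow.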

Installation

uv pip install vllm

Offline inference

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

outputs = llm.generate(["What is the capital of France?"], params)
for o in outputs:
    print(o.outputs[0].text)
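
Since the model above is an instruct model, a chat-style call may be a better fit than a raw prompt. The following is a minimal sketch that assumes `LLM.chat` accepts an OpenAI-style message list and returns the same kind of output objects as `generate`. The helpers `to_messages` and `ask` are hypothetical names introduced here, not part of vLLM.

```python
def to_messages(question, system_prompt=None):
    """Build an OpenAI-style message list for a single-turn chat."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return messages


def ask(llm, question, sampling_params=None, system_prompt=None):
    """Send one question through llm.chat and return the first completion's text.

    `llm` is expected to be a vllm.LLM instance (assumption: it exposes
    chat(messages, sampling_params=...)).
    """
    outputs = llm.chat(
        to_messages(question, system_prompt),
        sampling_params=sampling_params,
    )
    return outputs[0].outputs[0].text


# Usage, assuming vLLM is installed and a GPU is available:
#   from vllm import LLM, SamplingParams
#   llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
#   print(ask(llm, "What is the capital of France?", SamplingParams(max_tokens=64)))
```

Passing the `llm` object in, rather than constructing it inside the helper, lets one loaded model serve many calls.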

API server

vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000
# OpenAI-compatible client request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
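
The same request can be sent from Python. This sketch uses only the standard library and assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI chat-completions response shape (`choices[0].message.content`). The function names are hypothetical helpers, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    model="Qwen/Qwen2.5-1.5B-Instruct",
    base_url="http://localhost:8000/v1",
):
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a parsed chat-completions response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `chat("Hello!")` should return the model's reply as a string.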

Multi-GPU

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B", tensor_parallel_size=2)  # shard the model across 2 GPUs
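
Tensor parallelism can also be requested when serving. The sketch below rests on two assumptions: that the tensor-parallel size must evenly divide the model's attention head count (32 for Llama-3.1-8B), and that `vllm serve` accepts a `--tensor-parallel-size` flag. Both helper functions are hypothetical and only assemble a command string; they do not launch anything.

```python
import shlex


def pick_tensor_parallel_size(num_attention_heads: int, num_gpus: int) -> int:
    """Largest size <= num_gpus that evenly divides the attention head count."""
    for size in range(num_gpus, 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1


def serve_command(model: str, port: int, tp_size: int) -> str:
    """Assemble a `vllm serve` command line with tensor parallelism enabled."""
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tp_size),
    ]
    return shlex.join(args)


# With 3 GPUs and 32 heads, 3 does not divide 32, so fall back to 2.
tp = pick_tensor_parallel_size(num_attention_heads=32, num_gpus=3)
print(serve_command("meta-llama/Llama-3.1-8B", 8000, tp))
# vllm serve meta-llama/Llama-3.1-8B --port 8000 --tensor-parallel-size 2
```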

References

Stars: 319
Discovered: 2026-05-18
Language: Python
Updated: 2026-05-05
License: MIT
Latest release: v0.9.35 · 2026-07-19
Source: GitHub

Project health
Last push: yesterday
Forks: 27
Open issues: 2
Watchers: 4
