Overview
Text Generation Inference (TGI) is Hugging Face's production-ready LLM serving solution. It provides optimized inference with continuous batching, quantization (GPTQ, AWQ), tensor parallelism, flash attention, and an OpenAI-compatible API.
Installation
# Docker deployment (recommended)
docker run --gpus all -p 8080:80 -v $HOME/models:/data ghcr.io/huggingface/text-generation-inference:latest --model-id Qwen/Qwen2.5-1.5B-Instruct
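The container needs time to download and load the model before it accepts requests. A minimal sketch of a readiness wait, assuming the server is published on http://localhost:8080 as in the command above and that it answers HTTP 200 on its /health route once the model is loaded. The helper names (probe_health, wait_until_ready) and the timeout values are illustrative, not part of TGI.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(base_url, timeout_s=300.0, interval_s=2.0, probe=probe_health):
    """Poll the server until the probe succeeds or the timeout expires.

    `probe` is injectable so the loop can be exercised without a server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_until_ready("http://localhost:8080") before the first request avoids connection errors while the model is still loading; it returns False if the server has not come up within the timeout.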
Client
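from openai import OpenAI

# Point the OpenAI client at the local TGI server.
# The client requires an api_key value; "none" is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="tgi",  # placeholder name; the server runs the model given by --model-id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)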
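Besides the OpenAI-compatible route, TGI also exposes a native /generate route that can be called without the openai package. The following is a sketch using only the standard library, assuming the same http://localhost:8080 address and a response body of the form {"generated_text": "..."}. The function names and the default max_new_tokens value are illustrative choices.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=64):
    """Build a POST request for the native /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, max_new_tokens=64):
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["generated_text"]
```

With a server running, generate("http://localhost:8080", "Hello!") returns the completion as a plain string. Note that this route takes a raw prompt rather than a list of chat messages.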
Streaming
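# Reuses the `client` created in the Client section.
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None (for example on the final chunk), hence the fallback.
    print(chunk.choices[0].delta.content or "", end="", flush=True)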
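When the full text is needed after streaming (for logging or post-processing), the deltas can be accumulated instead of printed. A small sketch, assuming chunks shaped like the ones iterated over above (chunk.choices[0].delta.content); collect_stream is a hypothetical helper, and the guard for chunks with an empty choices list is a defensive assumption rather than documented TGI behaviour.

```python
def collect_stream(stream):
    """Concatenate the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Defensive: skip chunks that carry no choices at all.
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skips None and empty strings
            parts.append(delta)
    return "".join(parts)
```

Usage would be text = collect_stream(stream), with stream being the object returned by client.chat.completions.create(..., stream=True). A stream can only be consumed once, so choose between printing and collecting, or do both inside a single loop.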