# Text Generation Inference
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.
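Once a TGI server is running, text generation is a single HTTP call to its `/generate` endpoint. The sketch below builds the JSON body that endpoint expects and sends it with the standard library; the server URL (`localhost:8080`) is an assumption based on the common port mapping, so adjust it to wherever your server listens.

```python
import json
from urllib import request

def build_generate_request(prompt, max_new_tokens=64, temperature=0.7):
    """Build the JSON body for TGI's /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

def send(body, url="http://localhost:8080/generate"):
    """POST the request body; the response carries `generated_text`."""
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

For example, `send(build_generate_request("Why is the sky blue?"))` returns a JSON object whose `generated_text` field holds the completion.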
Text Generation Inference implements many optimizations and features, such as:
- Simple launcher to serve the most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes and GPTQ
- Safetensors weight loading
- Watermarking with *A Watermark for Large Language Models*
- Logits warper (temperature scaling, top-p, top-k, repetition penalty)
- Stop sequences
- Log probabilities
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance.
- Guidance: Enable function calling and tool-use by forcing the model to generate structured outputs based on your own predefined output schemas.
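The token-streaming feature above delivers tokens over Server-Sent Events: each event is a `data: {...}` line whose JSON payload carries one token, with the final event also carrying the full `generated_text`. A minimal sketch of parsing such a stream (the event shape shown is illustrative, not an exhaustive description of TGI's schema):

```python
import json

def parse_sse_events(lines):
    """Extract JSON payloads from Server-Sent Event lines.

    Each event arrives as a `data: {...}` line; blank lines
    separate events and are skipped.
    """
    events = []
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

# Illustrative stream: one token per event, the last event also
# carrying the complete generated text.
sample = [
    'data: {"token": {"text": "Hello"}, "generated_text": null}',
    "",
    'data: {"token": {"text": " world"}, "generated_text": "Hello world"}',
]
tokens = [e["token"]["text"] for e in parse_sse_events(sample)]
# tokens == ["Hello", " world"]
```

Streaming this way lets a client render tokens as they are produced instead of waiting for the whole completion.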
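The logits warpers listed above (temperature scaling, top-k, top-p, repetition penalty) all reshape the next-token distribution before sampling. A minimal pure-Python sketch of two of them, temperature and top-k, not TGI's actual implementation:

```python
import math

def warp_logits(logits, temperature=1.0, top_k=0):
    """Apply temperature scaling, then top-k filtering, to raw logits.

    temperature < 1 sharpens the distribution, > 1 flattens it;
    top_k > 0 keeps only the k highest logits and masks the rest
    to -inf so they get zero probability after softmax.
    """
    warped = [x / temperature for x in logits]
    if top_k > 0:
        threshold = sorted(warped, reverse=True)[top_k - 1]
        warped = [x if x >= threshold else float("-inf") for x in warped]
    return warped

def softmax(logits):
    """Turn (possibly masked) logits into a probability distribution."""
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(warp_logits([2.0, 1.0, 0.1], temperature=0.5, top_k=2))
# The third logit falls outside the top 2, so probs[2] == 0.0
```

The sampler then draws the next token from `probs`; chaining several warpers in sequence is how parameters like `top_p` and `repetition_penalty` compose in practice.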
Text Generation Inference is used in production by multiple projects, such as:
- HuggingChat, an open-source interface for open-access models, such as OpenAssistant and Llama
- OpenAssistant, an open-source community effort to train LLMs in the open
- nat.dev, a playground to explore and compare LLMs.