Aphrodite is an inference engine that optimizes the serving of HuggingFace-compatible models at scale. Built on vLLM's Paged Attention technology, it delivers high-performance model inference for multiple concurrent users. Developed through a collaboration between PygmalionAI and Ruliad, Aphrodite serves as the backend engine powering both organizations' chat platforms and API infrastructure.
Aphrodite builds upon and integrates the exceptional work from various projects, primarily vLLM.
(09/2024) v0.6.1 is here. You can now load FP16 models in FP2 to FP7 quant formats, to achieve extremely high throughput and save on memory.
(09/2024) v0.6.0 is released, with huge throughput improvements, many new quantization formats (including FP8 and llm-compressor), asymmetric tensor parallelism, pipeline parallelism, and more! Please check out the exhaustive documentation for the User and Developer guides.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
If you're not serving at scale, you can append the --single-user-mode flag to limit memory usage.
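For example, a minimal single-user launch is just the quickstart command above with the flag appended:

aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --single-user-mode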
This will create an OpenAI-compatible API server accessible at port 2242 on localhost. You can plug the API into any UI that supports the OpenAI API, such as SillyTavern.
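As a quick sanity check, you can query the server with any OpenAI-style client. Here's a minimal sketch using curl against the standard /v1/chat/completions route (assuming the default host and port, and no API key configured):

# Assumes the server from the quickstart is running on localhost:2242
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'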
Please refer to the documentation for the full list of arguments and flags you can pass to the engine.
You can play around with the engine in the demo here:
Docker
Additionally, we provide a Docker image for easy deployment. Here's a basic command to get you started:
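A minimal sketch, assuming the image is published on Docker Hub as alpindale/aphrodite-openai and that the NVIDIA Container Toolkit is installed (adjust the image tag and model to your setup):

# Hypothetical invocation; the image name and mounted cache path are assumptions
docker run --gpus all --ipc=host \
  -p 2242:2242 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  alpindale/aphrodite-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct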
This will pull the Aphrodite Engine image (a ~8GiB download) and launch the engine with the Llama-3.1-8B-Instruct model at port 2242.
Requirements
Operating System: Linux, Windows (needs building from source)
Python: 3.9 to 3.12
Build Requirements:
CUDA >= 12
For supported devices, see here. Generally speaking, all semi-modern GPUs are supported, down to Pascal (GTX 10xx, P40, etc.). We also support AMD GPUs, Intel CPUs and GPUs, Google TPU, and AWS Inferentia.
Notes
By design, Aphrodite takes up 90% of your GPU's VRAM. If you're not serving an LLM at scale, you may want to limit how much memory it takes up. You can do this in the API example above by launching the server with --gpu-memory-utilization 0.6 (0.6 means 60%), or with --single-user-mode to allocate only as much memory as a single sequence needs.
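For example, to cap usage at roughly 60% of VRAM:

aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --gpu-memory-utilization 0.6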
You can view the full list of commands by running aphrodite run --help.
Acknowledgements
Aphrodite Engine would not have been possible without the phenomenal work of other open-source projects. A (non-exhaustive) list: