A curated list of tools, frameworks, platforms, and resources for Large Language Model Operations (LLMOps) — enabling production-ready, scalable, and reliable LLM applications.
LLMOps is the emerging practice of managing the lifecycle of large language models, including fine-tuning, deployment, monitoring, evaluation, versioning, and observability — similar to MLOps but optimized for LLMs and generative AI systems.
## Contents

- Overview & Learning
- Model Training & Fine-Tuning
- Evaluation & Benchmarking
- Serving & Inference
- Monitoring & Observability
- Prompt Engineering & Management
- Data Management
- Security & Safety
- Platforms & Frameworks
- Tooling Ecosystem
- Related Awesome Lists
## Overview & Learning

- LLMOps Guide (Weights & Biases) – High-level overview of LLMOps concepts and tools.
- LLMOps Field Guide (Fiddler) – A breakdown of the infrastructure stack for LLMOps.
- LangChain Cookbook – Recipes for building with LangChain and LLMs.
- Full Stack Deep Learning – Practical coverage of the full LLM lifecycle, from training to deployment.
## Model Training & Fine-Tuning

- Hugging Face Transformers – Leading library for pre-trained and fine-tunable LLMs.
- PEFT – Parameter-Efficient Fine-Tuning methods (such as LoRA) for LLMs; see the sketch after this list.
- LoRA – Low-Rank Adaptation, a lightweight fine-tuning method that trains small low-rank update matrices instead of the full model weights.
- Colossal-AI – Framework for efficient distributed LLM training.
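A minimal fine-tuning sketch using Transformers with PEFT's LoRA support; the base model, rank, and hyperparameters below are illustrative assumptions rather than recommendations.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face Transformers + PEFT.
# The base model, rank, and dropout below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "facebook/opt-350m"  # assumed small model to keep the example light
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model so only the low-rank adapter weights are trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of the update matrices
    lora_alpha=16,      # scaling factor applied to the adapter output
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, train as usual (e.g. with transformers.Trainer) and save only
# the small adapter with model.save_pretrained("my-lora-adapter").
```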
## Evaluation & Benchmarking

- Open LLM Leaderboard – Benchmarking open LLMs.
- HELM – Stanford’s Holistic Evaluation of Language Models framework for evaluating LLMs across tasks and metrics.
- LM Evaluation Harness – EleutherAI’s test harness for evaluating LLMs across many benchmarks; see the sketch after this list.
- TruLens – LLM observability and feedback tracking.
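A hedged sketch of running LM Evaluation Harness programmatically; the `simple_evaluate` entry point assumes a recent (v0.4+) release, and the model and task names are placeholders.

```python
# Sketch of a programmatic evaluation with EleutherAI's lm-evaluation-harness
# (v0.4+ Python API assumed; older releases exposed a CLI entry point instead).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                      # Hugging Face backend
    model_args="pretrained=gpt2",    # assumed small model for illustration
    tasks=["hellaswag"],             # any registered task names
    num_fewshot=0,
)
print(results["results"]["hellaswag"])  # per-task metrics for this run
```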
## Serving & Inference

- vLLM – Fast and memory-efficient inference for LLMs with continuous batching; see the sketch after this list.
- TGI (Text Generation Inference) – High-performance inference server by Hugging Face.
- DeepSpeed MII – Low-latency inference for Hugging Face models.
- Ray Serve – Scalable model serving via Ray.
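A minimal offline-inference sketch with vLLM; the model name and sampling settings are assumptions chosen to keep the example small.

```python
# Minimal offline-inference sketch with vLLM; the model is an illustrative
# assumption. vLLM batches concurrent requests continuously under the hood.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # assumed small model for the example
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Summarize what LLMOps covers in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```

For production serving, vLLM also ships an OpenAI-compatible HTTP server, so existing OpenAI client code can point at a self-hosted model.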
## Monitoring & Observability

- PromptLayer – Log, monitor, and manage prompts across LLM providers.
- Arize AI – LLM monitoring, evaluation, and prompt tracing.
- WhyLabs – Observability for ML and LLM deployments.
- TruLens – Feedback loop framework for evaluating and improving LLM apps.
## Prompt Engineering & Management

- LangChain – Modular framework for chaining LLM calls and prompt templates; see the sketch after this list.
- Prompt Engineering Guide – Structured guide to writing effective prompts.
- PromptFoo – Compare, test, and evaluate LLM prompts easily.
- Guidance – Prompt programming with structured control over model output.
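A small sketch of defining a reusable prompt template with LangChain’s core prompt classes; no model call is made, and the template text is an assumption for illustration.

```python
# Sketch of a reusable, parameterized prompt with LangChain's prompt templates
# (langchain-core assumed installed); rendering only, no model call is made.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical assistant."),
    ("human", "Explain {topic} in two sentences."),
])

# Render the template into chat messages that can be passed to any chat model.
messages = prompt.format_messages(topic="retrieval-augmented generation")
for m in messages:
    print(m.type, ":", m.content)
```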
## Data Management

- Label Studio – Open-source data labeling for fine-tuning and RAG pipelines.
- Weaviate – Vector database for semantic search and hybrid retrieval.
- Pinecone – Managed vector DB for similarity search and retrieval-augmented generation.
- ChromaDB – Open-source embedding database built for LLM applications; see the sketch after this list.
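A minimal sketch of storing and querying documents with Chroma; the collection name and texts are assumptions, and the default embedding function is used.

```python
# Minimal Chroma sketch: store a few documents and run a semantic query.
# Collection name and texts are assumptions; the default embedder is used.
import chromadb

client = chromadb.Client()  # in-memory client; use PersistentClient for disk
collection = client.create_collection("llmops-docs")

collection.add(
    ids=["1", "2"],
    documents=[
        "vLLM serves models with continuous batching.",
        "LoRA fine-tunes a small number of adapter weights.",
    ],
)

results = collection.query(
    query_texts=["How do I serve a model efficiently?"],
    n_results=1,
)
print(results["documents"][0])
```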
## Security & Safety

- Guardrails AI – Validating and controlling LLM outputs.
- Rebuff – Open-source framework for prompt injection defense.
- Giskard – Testing, debugging, and securing LLM applications.
- OpenAI Moderation API – API for flagging harmful or unsafe content; see the sketch after this list.
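A sketch of screening user input with the OpenAI Moderation API before passing it to a model; assumes the v1 Python SDK and an `OPENAI_API_KEY` in the environment.

```python
# Sketch of a pre-generation safety check with the OpenAI Moderation API
# (openai Python SDK v1 assumed; requires OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
response = client.moderations.create(input="User-supplied text to screen.")

result = response.results[0]
if result.flagged:
    print("Blocked by moderation:", result.categories)
else:
    print("Safe to pass to the LLM.")
```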
## Platforms & Frameworks

- LangChain – Infrastructure to build end-to-end LLM-powered apps.
- LlamaIndex – Data framework for connecting external data sources to LLMs via indexing and retrieval; see the sketch after this list.
- Haystack – Open-source framework for building retrieval-augmented generation (RAG) pipelines.
- FastChat – Open platform for serving and fine-tuning chat LLMs.
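A minimal retrieval-augmented query with LlamaIndex; import paths assume llama-index 0.10+, the `./data` folder is a placeholder, and the default (OpenAI-backed) embedding and LLM settings are assumed.

```python
# Minimal LlamaIndex RAG sketch (llama-index >= 0.10 import paths assumed).
# Indexes local files in ./data and answers one query over them; the default
# configuration uses OpenAI for embeddings and completion unless overridden.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # assumed local folder
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What does this project do?"))
```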
## Tooling Ecosystem

- Weights & Biases – Track and visualize model training and performance.
- MLflow – Open-source platform for managing the ML lifecycle, including experiment tracking and a model registry; see the sketch after this list.
- PromptLayer – Middleware for logging and versioning prompt inputs and outputs.
- OpenLLM – Open-source platform to deploy and manage LLMs in production.
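A sketch of tracking a fine-tuning run with MLflow; the experiment name, parameters, and metric values are illustrative assumptions.

```python
# Sketch of experiment tracking with MLflow: log fine-tuning parameters and
# an evaluation metric for one run. Names and values are illustrative.
import mlflow

mlflow.set_experiment("llm-finetuning")

with mlflow.start_run(run_name="lora-r8"):
    mlflow.log_params({"base_model": "opt-350m", "lora_rank": 8, "lr": 2e-4})
    mlflow.log_metric("eval_loss", 1.73)
    mlflow.set_tag("stage", "experiment")
```

Runs logged this way can then be compared side by side in the MLflow UI (`mlflow ui`).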
Contributions are welcome!