v0.2.0


Concurrency

Ollama 0.2.0 is now available with concurrency support. This unlocks two features:

Parallel requests

Ollama can now serve multiple requests at the same time, using only a small amount of additional memory for each request. This enables use cases such as:

Handling multiple chat sessions at the same time
Hosting a code completion LLM for your internal team
Processing different parts of a document simultaneously
Running several agents at the same time
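
As a rough illustration of parallel requests, the sketch below sends several prompts at once from a thread pool. It assumes a server at the default local address (http://localhost:11434), the /api/generate endpoint with streaming disabled, and a pulled llama3 model. The helper names generate and generate_all are hypothetical, chosen for this example only.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: default local Ollama address; adjust if your server differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt, model="llama3"):
    """Send one non-streaming generation request and return the response text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def generate_all(prompts, send=generate, max_workers=4):
    """Issue all prompts concurrently; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

With a server running, generate_all(["Why is the sky blue?", "Summarize TCP in one line."]) would return one answer per prompt, in the same order. The send parameter exists only so the fan-out logic can be exercised without a server.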


Multiple models

Ollama now supports loading different models at the same time, dramatically improving:

Retrieval Augmented Generation (RAG): both the embedding and text completion models can be loaded into memory simultaneously.
Agents: multiple different agents can now run simultaneously.
Running large and small models side-by-side

Models are automatically loaded and unloaded based on requests and how much GPU memory is available.
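
To make the RAG case concrete, here is a minimal sketch that uses an embedding model and a text completion model in the same flow. It assumes the default local server address, the /api/embeddings and /api/generate endpoints, and that all-minilm and llama3 have been pulled. The functions post, cosine, and answer are hypothetical names for this example, and the retrieval step is deliberately naive (it embeds every document on each call).

```python
import json
import math
import urllib.request

# Assumption: default local Ollama address.
BASE_URL = "http://localhost:11434"


def post(path, body):
    """POST a JSON body to the Ollama server and decode the JSON reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer(question, documents, call=post):
    """Pick the most similar document with the embedding model, then ask the text model."""

    def embed(text):
        reply = call("/api/embeddings", {"model": "all-minilm", "prompt": text})
        return reply["embedding"]

    query_vector = embed(question)
    best = max(documents, key=lambda doc: cosine(query_vector, embed(doc)))
    prompt = f"Using this context: {best}\n\nAnswer this question: {question}"
    reply = call(
        "/api/generate", {"model": "llama3", "prompt": prompt, "stream": False}
    )
    return reply["response"]
```

Because both models can stay loaded, alternating between the embedding and generation calls should not force a model swap on each request, provided enough memory is available.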

To see which models are loaded, run ollama ps:

% ollama ps
NAME                    ID              SIZE    PROCESSOR       UNTIL
gemma:2b                030ee63283b5    2.8 GB  100% GPU        4 minutes from now
all-minilm:latest       1b226e2802db    530 MB  100% GPU        4 minutes from now
llama3:latest           365c0bd3c000    6.7 GB  100% GPU        4 minutes from now
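
For scripting, the table above can be turned into structured records. The following is a sketch only: it assumes the column layout shown here (columns separated by two or more spaces, or tabs) and that the ollama CLI is on PATH. The names parse_ps and loaded_models are hypothetical.

```python
import re
import subprocess


def parse_ps(text):
    """Turn `ollama ps` table text into a list of dicts keyed by lowercased column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []

    def split(line):
        # Columns are separated by runs of two or more whitespace characters
        # or by tabs; single spaces occur inside values such as "2.8 GB".
        return re.split(r"\s{2,}|\t+", line.strip())

    header = [name.lower() for name in split(lines[0])]
    return [dict(zip(header, split(line))) for line in lines[1:]]


def loaded_models():
    """Run `ollama ps` and parse its output (requires the ollama CLI on PATH)."""
    result = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return parse_ps(result.stdout)
```

Splitting on wide gaps rather than fixed offsets keeps the parser tolerant of changing column widths, though it would break if a future version changed the separators.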

For more information on concurrency, see the FAQ.

New models

GLM-4: A strong multilingual general language model with performance competitive with Llama 3.
CodeGeeX4: A versatile model for AI software development scenarios, including code completion.
Gemma 2: Improved output quality, with base text generation models now available.

What's Changed

Improved Gemma 2
- Fixed an issue where the model would generate invalid tokens after hitting the context window
- Fixed inference output issues with gemma2:27b
- Re-downloading the model may be required: ollama pull gemma2 or ollama pull gemma2:27b
Ollama will now show a clearer error if a model architecture isn't supported
Improved handling of quotes and spaces in Modelfile FROM lines
On Linux, Ollama will now return an error if the system does not have enough memory to run a model

New Contributors

@Muku784 made their first contribution in #5382
@abitrolly made their first contribution in #4821

Full Changelog: v0.1.48...v0.2.0
