Inference of Facebook's LLaMA model in Golang with embedded C/C++.
Description
This project embeds the work of llama.cpp in a Golang binary.
The main goal is to run the model with 4-bit quantization on consumer-grade CPU hardware.
At startup, the model is loaded and an interactive prompt is shown;
after the results for one prompt have been printed, the next prompt can be entered.
The program can be quit using ctrl+c.
This project was tested on Linux but should also work on macOS.
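For illustration, the following is a minimal sketch of the mechanism used here: cgo lets a Go binary compile and call C/C++ directly. The stub function below is hypothetical and stands in for the real llama.cpp calls; it is not this project's actual binding.

package main

/*
#include <stdlib.h>

// Stand-in for the real llama.cpp call: the actual project compiles the
// llama.cpp sources into the binary and exposes similar C functions to Go.
static const char *llama_predict_stub(const char *prompt) {
	(void)prompt;
	return "(model output would appear here)";
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	// Copy the Go string into C memory and release it when done.
	prompt := C.CString("Some good pun names for a pet groomer:")
	defer C.free(unsafe.Pointer(prompt))

	// Call into the embedded C code and convert the result back to Go.
	fmt.Println(C.GoString(C.llama_predict_stub(prompt)))
}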
Requirements
The memory requirements depend on the model size and are dominated by the quantized weights.
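As a rough rule of thumb, 4-bit quantization stores about half a byte per weight, so the 7B model needs on the order of 7 × 10⁹ × 0.5 bytes ≈ 3.5 GB of RAM, with the 13B, 30B, and 65B models scaling roughly in proportion; some extra memory is needed for the context and scratch buffers.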
# build this repo
git clone https://github.com/cornelk/llama-go
cd llama-go
make
# install Python dependencies
python3 -m pip install torch numpy sentencepiece
Obtain the original LLaMA model weights and place them in ./models,
for example by downloading them with the https://github.com/shawwn/llama-dl script.
Use the following steps to convert the LLaMA-7B model to a compatible ggml format:
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1
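# (the trailing 1 selects FP16 output; as far as I recall, 0 would produce FP32)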
# quantize the model to 4-bits
./quantize.sh 7B
When running the larger models, make sure you have enough disk space to store all the intermediate files.
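As a worked example, the ggml FP16 intermediate stores two bytes per weight, so for the 7B model it alone takes about 7 × 10⁹ × 2 bytes ≈ 13 GB, on top of the original checkpoint and the roughly 3.5 GB quantized output.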
Usage
./llama-go -m ./models/13B/ggml-model-q4_0.bin -t 4 -n 128
Loading model ./models/13B/ggml-model-q4_0.bin...
Model loaded successfully.
>>> Some good pun names for a pet groomer:
Some good pun names for a pet groomer:
Rub-a-Dub, Scooby Doo
Hair Force One
Duck and Cover, Two Fleas, One Duck
...
>>>
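The interactive loop shown above behaves roughly like the following sketch (illustrative Go under assumed names, not the project's actual source): read a line, run inference, print the result, and exit on ctrl+c.

package main

import (
	"bufio"
	"fmt"
	"os"
	"os/signal"
)

func main() {
	// Exit cleanly when the user presses ctrl+c.
	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, os.Interrupt)
	go func() {
		<-interrupt
		fmt.Println("\nexiting")
		os.Exit(0)
	}()

	scanner := bufio.NewScanner(os.Stdin)
	for {
		fmt.Print(">>> ")
		if !scanner.Scan() { // EOF also ends the loop
			return
		}
		prompt := scanner.Text()
		// The real program would run llama.cpp inference on the prompt here.
		fmt.Println("(generated text for: " + prompt + ")")
	}
}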
The settings can be changed at runtime, and multiple values can be given at once: