Code generator for inference on quantized Large Language Models (LLMs). Quantization is done using GPTQ.
Current features
Support for LLaMA and OPT
4-, 3-, and 2-bit inference
x86 with AVX2 support
Support for PyTorch and Hugging Face Transformers
Support for generic quantization group size
TODOs
Support for ARM Neon
Support for AVX512
Including quantization error analysis in code generation
Usage
Installation
Install dependencies via `pip install -r requirements.txt`
Install transformers from source: `pip install git+https://github.com/huggingface/transformers`
Install the Python module with `python setup.py install`. This runs a search to find the best parameters for register usage.
Example
We provide an example notebook in `demo.ipynb`. The basic workflow is:
load the floating-point model,
load the quantized checkpoint produced by GPTQ,
call the `infergen.swap_modules_llama(model, quantized_checkpoint, bits=4, p=64, l1=l1, inplace=False)` function, where `model` is the full-size model, `quantized_checkpoint` is the quantized checkpoint, `bits` is the number of bits used for the quantization, `l1` is the size of the L1 data cache in bits, `p` is the number of cores to use, and `inplace` is a flag that selects between swapping the modules in place and returning a copy,
use the quantized model as a normal transformer.
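
A minimal sketch of this workflow, assuming a hypothetical model id and checkpoint path; the `swap_modules_llama` call and its parameters follow the description above, and the `l1` value assumes a 32 KiB L1 data cache (32 * 1024 * 8 = 262144 bits):

```python
# Minimal sketch of the workflow above. The model id, checkpoint path, and
# l1 value are placeholders; swap_modules_llama and its arguments follow
# the description in this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import infergen

model_id = "huggyllama/llama-7b"  # placeholder LLaMA checkpoint

# 1. Load the floating-point model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 2. Load the quantized checkpoint produced by GPTQ (placeholder path).
quantized_checkpoint = torch.load("llama-7b-4bit.pt")

# 3. Swap the linear modules for the generated quantized kernels.
#    l1 is the L1 data cache size in bits (32 KiB = 262144 bits here),
#    p is the number of cores; with inplace=False we assume the swapped
#    copy is returned.
qmodel = infergen.swap_modules_llama(
    model, quantized_checkpoint, bits=4, p=64, l1=32 * 1024 * 8, inplace=False
)

# 4. Use the quantized model as a normal transformer.
inputs = tokenizer("Quantized inference on CPUs", return_tensors="pt")
out = qmodel.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Set `p` to the number of cores on your machine and `l1` to your CPU's per-core L1 data cache size in bits.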