SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding
Welcome to the SpecHub repository! This repository contains the implementation of SpecHub, a novel approach to accelerating the inference process of Large Language Models (LLMs) through an optimized speculative decoding framework.
Overview
SpecHub addresses the inefficiencies of traditional Multi-Draft Speculative Decoding (MDSD) methods by optimizing the acceptance rate of draft tokens using an Optimal Transport (OT) formulation. By strategically designing the joint distribution of draft tokens and the accepted token, SpecHub significantly accelerates the decoding process and achieves higher acceptance rates compared to existing methods.
Key Features
Improved Efficiency: SpecHub enhances batch efficiency, generating 0.05-0.27 more tokens per step than Recursive Rejection Sampling (RRS) and matching RRS's batch efficiency at half the concurrency.
Optimal Transport Formulation: Utilizes a simplified Linear Programming (LP) model to optimize the acceptance rate of draft tokens.
Seamless Integration: Can be integrated into existing MDSD frameworks with minimal computational overhead.
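To make the Optimal Transport formulation concrete, the sketch below sets up the acceptance-rate problem as a small linear program with SciPy. This is an illustrative toy, not SpecHub's actual implementation or its simplified LP: it finds the best achievable probability that the accepted token coincides with one of two i.i.d. draft tokens, by optimizing over joint distributions (couplings) whose marginals match the target distribution `p` and the product draft distribution. The function name `optimal_acceptance_rate` and the tiny 3-token vocabulary are hypothetical choices for this example.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def optimal_acceptance_rate(p, q, num_drafts=2):
    """Toy OT linear program: best achievable acceptance rate when
    `num_drafts` i.i.d. draft tokens are sampled from q and the
    verified token must be distributed exactly according to p."""
    n = len(p)
    drafts = list(itertools.product(range(n), repeat=num_drafts))
    # One LP variable per joint outcome (draft tuple d, accepted token y).
    idx = {(d, y): k
           for k, (d, y) in enumerate(itertools.product(drafts, range(n)))}
    nvar = len(idx)

    # Objective: maximize the probability that y matches one of the
    # drafts (linprog minimizes, so we negate the coefficients).
    c = np.zeros(nvar)
    for (d, y), k in idx.items():
        if y in d:
            c[k] = -1.0

    A_eq, b_eq = [], []
    # Coupling marginal over draft tuples must equal the target p(y).
    for y in range(n):
        row = np.zeros(nvar)
        for d in drafts:
            row[idx[(d, y)]] = 1.0
        A_eq.append(row)
        b_eq.append(p[y])
    # Coupling marginal over y must equal the product draft distribution.
    for d in drafts:
        row = np.zeros(nvar)
        for y in range(n):
            row[idx[(d, y)]] = 1.0
        A_eq.append(row)
        b_eq.append(np.prod([q[i] for i in d]))

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nvar, method="highs")
    return -res.fun

p = np.array([0.5, 0.3, 0.2])   # target model distribution (toy)
q = np.array([0.6, 0.3, 0.1])   # draft model distribution (toy)
rate = optimal_acceptance_rate(p, q)
```

With two drafts, the optimal rate is never worse than the classic single-draft bound `sum(min(p_i, q_i))`, since a feasible coupling can simply ignore the second draft; the LP quantifies how much the extra draft helps.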
Usage
To use SpecHub in your projects, follow the steps below: