# Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum
🤗 Huggingface | 📄 Paper
- Overview
- Repository Structure
- Installation
- Training
- Evaluation
- Datasets
- Acknowledgements
- Citation
## Overview

Supervised fine-tuning (SFT) is the standard post-training approach for large language models (LLMs), but its default objective, negative log-likelihood (NLL), is not universally optimal. NLL is classically optimal when training from scratch, yet post-training operates in a different regime: models already encode task-relevant priors, and supervision can be long and noisy, which can violate the assumptions under which NLL is optimal.
In addition, language models are trained to be general-purpose, but downstream tasks differ vastly and should not all be treated the same way. Tasks differ in how much useful prior knowledge is already encoded from pretraining, so a single objective may not work well across all of them.
To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. We first categorize objectives by how they distribute gradient weight across tokens (a minimal sketch follows this list):

- Prior-leaning objectives: emphasize mid- to high-probability tokens (e.g., `-p`, `-p^10`, and thresholded variants), leveraging model priors to refine already plausible predictions.
- Prior-averse objectives: emphasize low-probability tokens (e.g., `-log p`, i.e., standard NLL), encouraging the model to learn broadly even when priors are weak or misaligned.
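The snippet below sketches these objectives as per-token functions of the target-token probability. It is a minimal illustration, not the repository's API: the function name `probability_loss` and the `objective`/`alpha` arguments are hypothetical, and the actual objectives live in `main_verl/trainer/fsdp_sft_trainer.py`.

```python
import torch
import torch.nn.functional as F

def probability_loss(logits, targets, objective="nll", alpha=10.0):
    """Illustrative per-token probability-based SFT losses (hypothetical API)."""
    log_p = F.log_softmax(logits, dim=-1)                            # (batch, seq, vocab)
    tok_log_p = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    p = tok_log_p.exp()                                              # target-token probability

    if objective == "nll":        # prior-averse: -log p puts most weight on low-p tokens
        loss = -tok_log_p
    elif objective == "p":        # prior-leaning: -p shifts weight toward plausible tokens
        loss = -p
    elif objective == "p_alpha":  # prior-leaning: -p^alpha concentrates on high-p tokens
        loss = -p.pow(alpha)
    else:
        raise ValueError(f"unknown objective: {objective}")
    return loss.mean()
```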
Building on this categorization, we introduce the model-capability continuum that characterizes the effectiveness of different objectives:
- Model-Strong (MS): Base models already encode strong priors (e.g., math). Prior-leaning objectives consistently outperform NLL by focusing on reliable signals.
- Model-Intermediate (MI): Models have partial priors (e.g., medical reasoning). No single objective dominates; performance depends on data and supervision.
- Model-Weak (MW): Models lack useful priors (e.g., novel puzzles). NLL remains superior by enforcing learning from low-probability tokens.
This framework provides a principled view of when and why different SFT objectives succeed or fail.
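To make the gradient-weight intuition concrete, here is a short derivation (our illustration of the mechanism, using standard softmax calculus): for a per-token loss $\ell(p)$ on the target-token probability $p$ with target logit $z$,

$$\frac{\partial \ell}{\partial z} = \ell'(p)\,p\,(1-p).$$

For NLL, $\ell(p) = -\log p$, the gradient weight is proportional to $1-p$, largest on low-probability tokens (prior-averse). For $\ell(p) = -p^{\alpha}$, the weight is $\alpha\,p^{\alpha}(1-p)$, which peaks at $p = \alpha/(\alpha+1)$ (about $0.91$ for $\alpha = 10$), concentrating learning on tokens the model already finds plausible (prior-leaning).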
## Repository Structure

```
Beyond-Log-Likelihood/
│
├── data/                        # Data processing files
│   ├── data_process_figfont.py
│   ├── data_process_math.py
│   ├── data_process_medical.py
│   └── download_data.py
│
├── evaluations/                 # Evaluation pipelines for different tasks
│   ├── figfont/
│   ├── math/
│   └── medical/
│
├── main_verl/                   # Core training framework
│   ├── trainer/
│   │   ├── config/
│   │   └── fsdp_sft_trainer.py  # Main trainer
│   └── utils/
│
├── scripts/                     # Scripts for running experiments
│   ├── evaluation/
│   ├── training/
│   └── one_click/               # Train and evaluate in one click with your passed-in parameters
│
├── .gitignore
└── README.md
```
## Installation

The installation requirements are minimal, and you may use your own environment for running the code. The main dependencies are:

```
verl==0.4.0.dev0
torch
vllm
flash_attn
```
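If you need a starting point, a setup along the following lines may work; the exact package sources are our assumption (in particular, a `.dev0` build of `verl` may need to be installed from the veRL repository rather than PyPI):

```bash
# Hypothetical environment setup; adjust versions to your CUDA/PyTorch stack.
pip install torch vllm
pip install flash-attn --no-build-isolation  # PyPI name for flash_attn
pip install verl==0.4.0.dev0                 # dev builds may require installing from source
```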
## Training

Before training, run the following command to download all necessary data (or generate your own training data by following the files described in the Datasets section below):

```bash
python data/download_data.py
```

Training scripts are provided in `scripts/training/`. Each dataset has exemplar `.sh` files for quick use. In addition, we provide a one-shot script that automatically generates and runs the training command.
To run training and evaluation in one step, use:
```bash
python scripts/one_click/script_generator.py \
    --dataset $DATASET \
    --model_save_name $MODEL_KEY \
    --trainer_objective_trans $OBJECTIVE \
    [--run_script]
```
- `--dataset`: Specifies the dataset to use. Choose from: `[math, medical, figfont]`.
- `--model_save_name`: Specifies the model key from the mapping below:

```python
MODEL_MAPPING = {
    "qwen-2.5-math-1.5b": "Qwen/Qwen2.5-Math-1.5B",
    "qwen-2.5-math-7b": "Qwen/Qwen2.5-Math-7B",
    "qwen-2.5-1.5b": "Qwen/Qwen2.5-1.5B",
    "qwen-2.5-7b": "Qwen/Qwen2.5-7B",
    "llama-3.1-8b": "meta-llama/Llama-3.1-8B",
    "llama-3.2-3b": "meta-llama/Llama-3.2-3B",
    "deepseek-math-7b": "deepseek-ai/deepseek-math-7b-base",
}
```

- `--trainer_objective_trans`: The most important argument. Specifies the training objective from the following options (more to be added):
| Key | Description |
|---|---|
| `original` | Original implementation of SFT (standard NLL, i.e., `-log p`) |
| `GeneralFamily-alpha` | The general-family function `-p^alpha` with exponent `alpha` (e.g., `GeneralFamily-8` uses `-p^8`) |
| `OnlyTopP-q` | The thresholded function: `-p` applied only to tokens whose probability is above the threshold `q` |
| `OnlyBottomP-q` | The thresholded function: `-p` applied only to tokens whose probability is below the threshold `q` |
| `OnlyTopLogP-q` | The thresholded function: `-log p` applied only to tokens whose probability is above the threshold `q` |
| `OnlyBottomLogP-q` | The thresholded function: `-log p` applied only to tokens whose probability is below the threshold `q` |
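As a rough illustration of how the thresholded variants could be realized (assuming `q` acts as a probability threshold, which is our reading of the key names rather than the repository's exact code; see `main_verl/trainer/fsdp_sft_trainer.py` for the real implementation):

```python
import torch

def thresholded_loss(p: torch.Tensor, q: float, use_log: bool, top: bool) -> torch.Tensor:
    """Sketch of the OnlyTop*/OnlyBottom* objectives on target-token probabilities `p`."""
    per_token = -torch.log(p.clamp_min(1e-12)) if use_log else -p
    mask = (p >= q) if top else (p < q)  # keep tokens on one side of the threshold
    # Average over the selected tokens only (guard against an empty selection).
    return (per_token * mask).sum() / mask.sum().clamp_min(1)

# e.g., OnlyTopP-0.5 would correspond to:
# loss = thresholded_loss(p, q=0.5, use_log=False, top=True)
```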
- `--run_script`: (Optional) Boolean flag. If specified, directly executes the generated training command.
- `--nproc_per_node`: (Optional) Specifies the number of GPUs to use.
- `--cuda_visible_devices`: (Optional) Specifies which GPU devices to use (e.g., `--cuda_visible_devices 0,1,2,3`).

Example usage:
```bash
# Math dataset with Qwen2.5-Math-1.5B using the GeneralFamily objective (alpha=8)
python scripts/one_click/script_generator.py \
    --dataset math \
    --model_save_name qwen-2.5-math-1.5b \
    --trainer_objective_trans GeneralFamily-8 \
    --run_script

# Medical dataset with Qwen2.5-1.5B using original SFT
python scripts/one_click/script_generator.py \
    --dataset medical \
    --model_save_name qwen-2.5-1.5b \
    --trainer_objective_trans original \
    --run_script

# Figfont dataset with Qwen2.5-7B using original SFT
python scripts/one_click/script_generator.py \
    --dataset figfont \
    --model_save_name qwen-2.5-7b \
    --trainer_objective_trans original \
    --run_script
```

## Evaluation

The evaluation scripts are provided in `scripts/evaluation/`. For convenience, you may instead use the one-shot training & evaluation script. Our evaluation logs all output runs for transparent comparison.
## Datasets

Dataset processing and download code live in `data/`. You can generate custom splits using similar preprocessing steps, and you are welcome to add new datasets via pull request, following the logic in `scripts/one_click/script_generator.py`. Our paper uses the following datasets for training: NuminaMath-CoT, m23k, and reasoning-gym. We are extremely grateful for these open-source contributions.
## Acknowledgements

The implementation of this repository is built upon veRL and DFT. We sincerely appreciate these teams' contributions to open-source research and development.
## Citation

If you find this repository useful, please cite:

```bibtex
@article{li2025beyond,
  title={Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum},
  author={Li, Gaotang and Qiu, Ruizhong and Chen, Xiusi and Ji, Heng and Tong, Hanghang},
  journal={arXiv preprint arXiv:2510.00526},
  year={2025}
}
```