This repository contains the code and data for reproducing experiments from our paper, ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering.
- Features
- Installation
- Configuration
- Getting Started
- User Study Data Generation: Use the `proxann.data_formatter` module to generate the JSON files containing the required topic model information to carry out the evaluation (or user study).
- Proxy-Based Evaluation: Perform LLM proxy evaluations using the `proxann.llm_annotations` module.
- Topic Model Training: Train topic and clustering models (currently, LDA-Mallet, LDA-Tomotopy, and BERTopic) under a unified structure using the `proxann.topic_models.train` module.
We recommend uv for installing the necessary dependencies.

- Install uv by following the official guide.
- Create a local environment (it will use the Python version specified in `pyproject.toml`): `uv venv`
- Install dependencies: `uv pip install -e .`
- Run scripts in this repository with either `uv run <bash script>.sh` or `uv run python <python script>.py`. You can also first run `source .venv/bin/activate` to avoid the need for `uv run`.
To use GPT models via the OpenAI API, create a `.env` file in the root directory with the following content:

```
OPENAI_API_KEY=[your_open_ai_api_key]
```

You can also modify the path to the `.env` file in the configuration file.
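As a quick, optional sanity check (this sketch assumes the `python-dotenv` package and a `.env` file in the working directory; ProxAnn itself resolves the `.env` path via its configuration):

```python
# Hedged sanity check: assumes python-dotenv is installed and .env is in the working directory.
import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into the process environment
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not found -- check your .env file")
```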
We rely on vLLM for evaluation with open-source large language models. You must have the model running and specify the endpoint where it is deployed in the configuration file.
The `src.train` module supports multiple topic modeling backends. No extra setup is required for most of them.

Only if you're using LDA-Mallet, follow these steps:

- Download the latest release of Mallet.
- Place the contents in the `src/train` directory.
- Optionally, you can use the provided script to automate the download: `bash bash_scripts/wget_mallet.sh`

This section will guide you through the process of setting up and using ProxAnn, from choosing the right LLM to running your first metric.
ProxAnn’s performance varies depending on the language model used. This section summarizes how different LLMs perform across tasks and datasets, helping you balance accuracy with computational cost.
⚠️ Recommendation: For best overall alignment with human judgments, GPT-4o and Qwen 2.5–72B perform the strongest across both Fit and Rank steps.
Qwen 1.5–32B is a solid cost-effective alternative.
Avoid Llama 3.1–8B, which consistently underperforms.
This test estimates how often ProxAnn (with a given LLM) performs as well as or better than a random human annotator.
Metrics are advantage probabilities. Asterisks (*) and daggers (†) mark statistical significance: * indicates the LLM outperforms a random human annotator (p < 0.05, t-test); † shows significance under a Wilcoxon signed-rank test.
| Model | Doc ρ (Fit) | Doc ρ (Rank) | Topic ρ (Fit) | Topic ρ (Rank) |
|---|---|---|---|---|
| GPT-4o | 0.56*† | 0.68*† | 0.66† | 0.55† |
| Llama 3.1 8B | 0.22 | 0.36 | 0.05 | 0.11 |
| Llama 3.1 70B | 0.57*† | 0.67*† | 0.58† | 0.50† |
| Qwen 1.5 8B | 0.56*† | 0.58† | 0.46 | 0.39 |
| Qwen 1.5 32B | 0.55*† | 0.63† | 0.47 | 0.42 |
| Qwen 2.5 72B | 0.52† | 0.68*† | 0.66† | 0.46 |
The plot and table below show how well ProxAnn’s topic rankings align with human judgments, using Kendall’s τ as the correlation metric. The Human row reflects inter-annotator agreement, and NPMI provides a traditional baseline.
| Metric / Model | Wiki (Fit) | Bills (Fit) | Wiki (Rank) | Bills (Rank) |
|---|---|---|---|---|
| NPMI | -0.15 (0.14) | 0.01 (0.10) | -0.18 (0.10) | -0.02 (0.12) |
| GPT-4o | 0.22 (0.13) | 0.31 (0.13) | 0.27 (0.14) | 0.29 (0.11) |
| Llama 3.1 8B | 0.19 (0.18) | 0.16 (0.18) | -0.35 (0.14) | 0.15 (0.14) |
| Qwen 1.5 8B | 0.35 (0.16) | 0.12 (0.16) | 0.33 (0.16) | 0.28 (0.13) |
| Qwen 1.5 32B | 0.20 (0.18) | 0.34 (0.11) | 0.51 (0.11) | 0.30 (0.13) |
| Llama 3.1 70B | 0.41 (0.14) | 0.26 (0.15) | 0.36 (0.13) | 0.19 (0.13) |
| Qwen 2.5 72B | 0.48 (0.13) | 0.22 (0.17) | 0.36 (0.12) | 0.21 (0.15) |
| Human (HTM) | 0.41 (0.09) | 0.09 (0.14) | 0.34 (0.09) | 0.18 (0.12) |
ProxAnn expects output from traditional topic models, where each document is represented by a distribution over topics ($\theta_d$) and each topic by a distribution over the vocabulary ($\beta_k$).

To be used with ProxAnn, models must be saved as NumPy arrays (`.npy` or `.npz`), along with:

- A JSON file containing the model vocabulary (i.e., the words indexing the columns in $\beta_k$).
- A plain-text corpus file (one document per line).
You can download example files that meet these requirements here.
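If you already have a fitted model from another toolkit, a minimal sketch of exporting its outputs in this layout might look as follows (the array shapes and file names are illustrative; in particular, the exact vocabulary JSON layout should be checked against the example files above):

```python
# Illustrative export of topic-model outputs in the format ProxAnn expects.
# Shapes: thetas is (num_docs x num_topics), betas is (num_topics x vocab_size).
import json
import numpy as np

rng = np.random.default_rng(0)
thetas = rng.dirichlet(np.ones(10), size=500)        # document-topic matrix (placeholder values)
betas = rng.dirichlet(np.ones(2000), size=10)        # topic-word matrix (placeholder values)
vocabulary = [f"word_{i}" for i in range(2000)]      # words indexing the columns of betas
documents = ["first document text", "second document text"]

np.save("thetas.npy", thetas)
np.save("betas.npy", betas)

with open("vocabulary.json", "w") as f:
    json.dump(vocabulary, f)                         # assumed layout: a plain list of terms

with open("corpus.txt", "w") as f:
    f.write("\n".join(documents))                    # plain text, one document per line
```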
Alternatively, you can train topic models directly using ProxAnn's training module. In that case, only the corpus is required. See bash_scripts/train_models.sh for an example of how to invoke src/train/tm_trainer.py.
ProxAnn creates a JSON file that serves as input for both human and LLM-based evaluations. This file contains:

- Top words for each topic (`topic_words`)
- Representative documents (`exemplar_docs`) using various selection methods (`thetas`, `thetas_sample`, `sall`, etc.)
- Evaluation documents with topic assignment probabilities (`eval_docs`)
- A distractor document for each topic
Example structure:
```json
{
  "<topic_id>": {
    "topic_words": ["word1", "word2", "word3"],
    "exemplar_docs": [
      {"doc_id": 1, "text": "...", "prob": 0.9},
      {"doc_id": 2, "text": "...", "prob": 0.8}
    ],
    "eval_docs": [
      {"doc_id": 3, "text": "...", "prob": 0.9, "assigned_to_k": 1},
      {"doc_id": 4, "text": "...", "prob": 0.8, "assigned_to_k": 1}
    ],
    "distractor_doc": {"doc_id": 100, "text": "..."}
  }
}
```

To generate the above JSON files, you’ll need a YAML config file like those in `config/user_study`. Each config should specify how to load model outputs depending on how the model was trained:
- If the model was not trained with ProxAnn (`trained_with_thetas_eval=False`), provide:
  - `thetas_path`: Document-topic matrix (docs × topics)
  - `betas_path`: Topic-word matrix (topics × vocab size)
  - `vocabulary_path`: Vocabulary file
  - `corpus_path`: Original documents (one per line)
- If the model was trained using ProxAnn (`trained_with_thetas_eval=True`), provide:
  - `model_path`: Path to the trained model
  - `corpus_path`: As above
- You can also specify `remove_topic_ids` to exclude topics from evaluation. A minimal config sketch is shown below.
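For illustration, the following sketch writes such a config programmatically; the key names come from the list above, while the paths, the output filename, and the use of PyYAML are placeholders and assumptions rather than requirements:

```python
# Hypothetical sketch: builds a user-study config for a model trained outside ProxAnn.
# Key names follow the documentation above; all paths are placeholders, and any
# additional required keys should be checked against the examples in config/user_study.
import yaml  # PyYAML

config = {
    "trained_with_thetas_eval": False,
    "thetas_path": "data/models/my_model/thetas.npy",        # document-topic matrix (docs x topics)
    "betas_path": "data/models/my_model/betas.npy",          # topic-word matrix (topics x vocab)
    "vocabulary_path": "data/models/my_model/vocabulary.json",
    "corpus_path": "data/training_data/corpus.txt",          # one document per line
    "remove_topic_ids": [],                                   # optionally exclude topics
}

with open("my_user_study_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```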
To create the user study data files, run:

```bash
python3 get_user_study_data.py --user_study_config <path_to_config_file>
```

You can see examples of generated JSONs in `data/json_out`.
To evaluate your topic model using ProxAnn:
```python
from proxann.llm_annotations.proxann import ProxAnn

proxann = ProxAnn()
```

Use a user study configuration file (see examples in `config/user_study`) to produce the input JSON for evaluation:
```python
status, tm_model_data_path = proxann.generate_user_provided_json(path_user_study_config_file)
```

- If `status == 0`, the JSON was created successfully.
- Otherwise, an error occurred, and evaluation should be halted.
```python
corr_data, _ = proxann.run_metric(
    tm_model_data_path.as_posix(),
    llm_models=["gpt-4o-mini-2024-07-18"],
    q1_temp=1.0,
    q2_temp=0.0,
    q3_temp=0.0,
    custom_seeds=[122, 133, 144, 155, 166],
    nruns=5,
)
```

- `llm_models` is a list of LLMs to use for evaluation.
- These must be pre-defined in your `config/config.yaml`, under the deployment section you're using (e.g., `vllm`, `openai`, etc.).
Example `config.yaml` snippet for a VLLM setup:

```yaml
llm:
  vllm:
    available_models:
      "Qwen/Qwen3-8B": ...
    host: https://localhost:8000/v1
```

See `proxann_eval.py` for a minimal runnable example.
You can also run ProxAnn as a REST API server:

```bash
python3 -m proxann.llm_annotations.frontend.back
```

This launches a local web server that exposes ProxAnn’s evaluation pipeline via HTTP endpoints.
Alternatively, use the hosted instance at: 👉 https://proxann.uc3m.es/
ProxAnn uses a unified wrapper class, Prompter, to standardize API calls across different LLM backends. It currently supports OpenAI, VLLM, and Ollama.
⚠️ Note: Only OpenAI and VLLM support logprobs, which are required for ProxAnn’s evaluation. Ollama is currently not compatible for this reason.
The Prompter class includes a caching mechanism that ensures repeated prompts return the same result without reissuing an API call, improving speed and efficiency during evaluation.
```python
from proxann.proxann.prompter import Prompter

llm_model = "Qwen/Qwen3-8B"  # Must match a model defined in `available_models` for your deployment type (e.g., VLLM, OpenAI)
prompter = Prompter(model_type=llm_model)
```

You can also override configuration parameters such as `temperature`, `max_tokens`, etc., by passing them as keyword arguments. If not specified, defaults are taken from `config/config.yaml`.

```python
result, logprobs = prompter.prompt(system_prompt, question_prompt)
```

- `system_prompt` is optional and can be left as `None`.
- You may also override the `temperature` or other generation parameters at call time.
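As a usage note on the caching behavior mentioned above, repeating an identical call should be served from the cache rather than re-querying the API; a small illustrative sketch (the prompt text is made up, and the call signature follows the example above):

```python
# Illustrative only: the prompt text is invented; system_prompt is left as None as allowed above.
question_prompt = "Do these top words describe a single coherent category? Answer yes or no."

result, logprobs = prompter.prompt(None, question_prompt)

# Re-issuing the identical prompt should hit the Prompter cache and return the same
# result without a new API call (per the caching mechanism described above).
result_cached, _ = prompter.prompt(None, question_prompt)
assert result == result_cached
```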
User responses were collected through Prolific using the user_annotations/annotation_server.py server. More details on this will be provided soon.
To generate LLM-based annotations, run `proxann_user_study.py` (or the bash wrapper `bash_scripts/run_proxann_multiple.sh`) with the following parameters; an illustrative invocation is sketched after this list:

- `--model_type "$MODEL_TYPE"`: LLM(s) to be used for generating annotations. Multiple models can be specified, separated by commas. These must be defined in `available_models` under the chosen deployment section in `config/config.yaml`.
- `--tm_model_data_path "$TM_MODEL_DATA_PATH"`: Path to the JSON file containing model output, generated in the setup phase (e.g., using `get_user_study_data.py`).
- `--dataset_key "$DATASET_KEY"`: Identifier for the dataset being evaluated (e.g., `Wiki`, `Bills`).
- `--response_csv "$RESPONSE_CSV"`: Path to a CSV file containing human annotation responses (e.g., from Qualtrics).
- `--path_save_results "$SAVE_PATH"`: Directory where generated annotations and results will be saved.
- `--prompt_mode "$PROMPT_MODE"` (default: `q1_then_q3_mean,q1_then_q2_mean`): Which evaluation steps to perform. Comma-separated list of options:
  - `q1_then_q2_mean`: Category Identification → Relevance Judgment
  - `q1_then_q3_mean`: Category Identification → Representativeness Ranking
- `--config_path "$CONFIG_PATH"` (default: `src/proxann/config/config.yaml`): Path to the main configuration YAML file.
- `--running_mode "$MODE"` (default: `run`): Mode of execution: `run` or `eval`.
- `--removal_condition "$REMOVAL_CONDITION"` (default: `loose`): Condition for disqualifying invalid responses:
  - `loose`: Exclude if any evaluation fails.
  - `strict`: Exclude only if all fail.
- `--temperatures "$TEMP_LIST"`: Comma-separated temperatures for LLM generation in Q1/Q2/Q3 (e.g., `"0.7,0.3,0.5"`).
- `--seed "$SEED"`: Integer seed for random number generation (for reproducibility).
- `--max_tokens "$MAX_TOKENS"`: Maximum number of tokens allowed in LLM completions.
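For illustration only, here is a hedged sketch of assembling such an invocation from Python (the paths, model name, and flag values are placeholders that mirror the list above; for real runs, `bash_scripts/run_proxann_multiple.sh` remains the reference):

```python
# Hypothetical invocation sketch: every path and value below is a placeholder.
import subprocess

cmd = [
    "python3", "proxann_user_study.py",
    "--model_type", "gpt-4o-mini-2024-07-18",
    "--tm_model_data_path", "data/json_out/config_wiki_part1.json",
    "--dataset_key", "Wiki",
    "--response_csv", "data/human_annotations/responses.csv",  # placeholder CSV of human responses
    "--path_save_results", "data/llm_out/",
    "--prompt_mode", "q1_then_q3_mean,q1_then_q2_mean",
    "--temperatures", "1.0,0.0,0.0",
    "--seed", "174",
]
subprocess.run(cmd, check=True)
```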
To reproduce the results from our paper using ProxAnn, ensure the following datasets, models, and configuration files are placed in the correct directories. These paths correspond to those expected by the script bash_scripts/run_proxann_multiple.sh.
- Datasets: Preprocessed Wiki and Bills datasets (15,000-term vocabulary) from Hoyle et al. (2022).
  Download: 🤗.
  Save to: `data/training_data/`
- Trained Topic Models: Includes LDA-Mallet and CTM models from Hoyle et al., and BERTopic models trained using `proxann.topic_models.train.BERTopicTrainer`. All models are available for download on 🤗.
  Save to: `data/models/`
- User Study Configuration Files: Configuration files for sampling 8 topics per model per dataset.
  Available in: `data/user_study/`
- User Study JSON Files: Generated using `get_user_study_data.py`.
  Save to: `data/json_out/`
  Expected files:
  - `data/json_out/config_wiki_part1.json`
  - `data/json_out/config_wiki_part2.json`
  - `data/json_out/config_bills_part1.json`
  - `data/json_out/config_bills_part2.json`
- Human Annotations: Collected via Qualtrics, matching the user study JSON files.
  Save to:
  - `data/human_annotations/Cluster+Evaluation+-+Sort+and+Rank_December+12,+2024_05.19.csv`
  - `data/human_annotations/Cluster+Evaluation+-+Sort+and+Rank+-+Bills_December+14,+2024_13.20.csv`
- LLM Annotations: LLM outputs for all models evaluated in the paper.
  Save to: `data/llm_out/`
- Coherence Scores: Topic coherence evaluation metrics.
  Save to: `data/cohrs/`
Once the files are in place, you can run:
```bash
bash bash_scripts/run_proxann_multiple.sh
```

To ensure full reproducibility, use the same random seeds that appear in the filenames of our saved evaluation outputs. For example, in the file

`data/llm_out/mean/wiki/Qwen2.5-72B-Instruct-AWQ/q1_then_q3_mean,q1_then_q2_mean_temp1.0_0.0_0.0_seed174_20250529_1352`
the seed used is 174.
Once you have both the LLM-generated and human annotations in place, you can reproduce all tables and figures from the paper using the notebooks and scripts provided in evaluation_scripts.


