A comprehensive toolkit for streamlining data editing, search, and inspection for large-scale language model training and interpretability.
TokenSmith is a powerful Python package designed to simplify dataset management for large language model training. It provides a unified interface for editing, inspecting, searching, sampling, and exporting tokenized datasets, making it easier to work with training data at scale.
- Search & Index: Fast token sequence search with n-gram indexing
- Dataset Inspection: Examine samples, batches, and document metadata
- Smart Sampling: Flexible sampling with policy-based selection
- Dataset Editing: Inject and modify training samples with precision
- Export Utilities: Export data in multiple formats
- Ingest Utilities: Ingest data from multiple formats
- Interactive UI: Streamlit-based web interface for visual exploration
- Memory Efficient: Chunked processing for large datasets
TokenSmith is built around a central DatasetManager that coordinates six specialized handlers:
DatasetManager
├── SearchHandler    # Token sequence search and indexing
├── InspectHandler   # Dataset examination and visualization
├── SampleHandler    # Flexible data sampling strategies
├── EditHandler      # Dataset modification and injection
├── ExportHandler    # Multi-format data export
└── IngestHandler    # Multi-format data ingestion
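A typical workflow constructs a single DatasetManager, points it at a tokenized dataset, and then accesses each handler through an attribute such as `manager.search` or `manager.inspect`. The sketch below is illustrative only: the setup method and its parameters are assumptions about the API, so consult the API reference for the exact signatures.

```python
from tokensmith.manager import DatasetManager

# Illustrative sketch -- the setup method and parameter names below are assumptions,
# not a definitive spec of the TokenSmith API; see the API docs for exact signatures.
manager = DatasetManager()
manager.setup_edit_inspect_sample_export(
    dataset_prefix="data/my_corpus_text_document",  # prefix of the Megatron-style .bin/.idx pair
    batch_info_save_prefix="data/batch_info",       # where batch/shuffle metadata is stored
    train_iters=1000,
    train_batch_size=32,
    train_seq_len=2048,
    seed=42,
)

# Handlers are then available as attributes, e.g.:
# manager.search, manager.inspect, manager.sample, manager.edit, manager.export, manager.ingest
```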
TokenSmith can be installed in several ways depending on your use case.
Note: Apart from search, all features assume that GPT-NeoX is installed in order to use Megatron. You can install it by following the steps provided here.
If you only need the core functionality (data editing, sampling, importing, exporting, inspection):
git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e .

If you plan to build or serve the documentation locally:
git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e ".[docs]"Once installed, you can build and serve the docs:
mkdocs serve

If you want the interactive interface for exploring data:
git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e ".[ui]"For advanced token-level search and n-gram utilities:
git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e ".[search]"To install all optional features (does not include docs):
git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e ".[all]"This includes docs, UI, and search extras.
If you're contributing to tokensmith:
git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e ".[all,docs]"This sets up a local environment with all extras for development.
We provide an example project to help you quickly set up TokenSmith on Modal, a serverless cloud platform, using its Notebooks feature. To get started, follow the instructions in the modal_example directory.
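The usage examples below assume you already have a configured `manager` (see the architecture sketch above) and a `tokenizer` whose vocabulary matches the tokenized dataset. As a hedged example, a Hugging Face tokenizer could be loaded like this; the checkpoint name is only a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint -- use the tokenizer that was used to tokenize your dataset.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
```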
# Search for token sequences
query = [101, 2023, 102] # Token IDs
count = manager.search.count(query)
positions = manager.search.positions(query)
contains = manager.search.contains(query)
# Get next token distributions
next_tokens = manager.search.count_next(query)

# Inspect individual samples
sample = manager.inspect.inspect_sample_by_id(
sample_id=42,
return_detokenized=True,
tokenizer=tokenizer,
return_doc_details=True
)
# Inspect entire batches
batch = manager.inspect.inspect_sample_by_batch(
batch_id=0,
batch_size=32,
return_detokenized=True,
tokenizer=tokenizer
)

# Sample by specific indices
samples = manager.sample.get_samples_by_indices(
indices=[1, 5, 10, 42],
return_detokenized=True,
tokenizer=tokenizer
)
# Sample batches by ID
batches = manager.sample.get_batches_by_ids(
batch_ids=[0, 1, 2],
batch_size=32,
return_detokenized=True,
tokenizer=tokenizer
)
# Policy-based sampling
def random_policy(n_samples):
import random
return random.sample(range(1000), n_samples)
policy_samples = manager.sample.get_samples_by_policy(
policy_fn=random_policy,
n_samples=10,
return_detokenized=True,
tokenizer=tokenizer
)

# Inject text into specific locations
manager.edit.inject_and_preview(
text="This is injected content",
tokenizer=tokenizer,
injection_loc=100,
injection_type="seq_shuffle", # or "seq_start"
dry_run=False
)

# Export specific batches
manager.export.export_batches(
batch_ids=[0, 1, 2],
batch_size=32,
output_path="exports/batches.jsonl",
format_type="jsonl",
return_detokenized=True,
tokenizer=tokenizer,
include_doc_details=True
)
# Export sequence ranges
manager.export.export_sequence_range(
start_idx=0,
end_idx=1000,
output_path="exports/sequences.csv",
format_type="csv",
return_detokenized=True,
tokenizer=tokenizer
)
# Export entire dataset (in chunks)
manager.export.export_entire_dataset(
output_path="exports/full_dataset.jsonl",
format_type="jsonl",
return_detokenized=True,
tokenizer=tokenizer,
chunk_size=1000
)

TokenSmith includes a Streamlit-based web interface for visual dataset exploration:
# Launch the web UI using the convenience script
cd tokensmith/ui
./run_ui.sh

Edit run_ui.sh to change modes and arguments.
The web interface provides:
- Search Page: Interactive token sequence search with visualization
- Inspect Page: Browse and examine dataset samples and batches
- View Documents Page: View individual documents in training or corpus order
tokensmith/
├── manager.py          # Central DatasetManager class
├── utils.py            # Utility functions and classes
├── edit/               # Dataset editing functionality
│   └── handler.py
├── inspect/            # Dataset inspection tools
│   └── handler.py
├── search/             # Search and indexing
│   └── handler.py
├── sample/             # Sampling strategies
│   └── handler.py
├── export/             # Data export utilities
│   └── handler.py
├── ingest/             # Data ingestion utilities
│   └── handler.py
└── ui/                 # Streamlit web interface
    ├── app.py
    └── pages/
        ├── search.py
        ├── inspect.py
        └── view_documents.py
Complete API documentation with automatically generated docstrings is available at: https://aflah02.github.io/TokenSmith
Comprehensive tutorials and examples are available in the tutorials/ directory:
- Basic Setup Tutorial
- Dataset Inspection Tutorial
- Dataset Sampling Tutorial
- Dataset Editing Tutorial
- Dataset Searching Tutorial
To build and serve the documentation locally:
# Make sure the docs extra is installed (see the installation commands above)
# Serve locally (auto-reloads on changes)
mkdocs serve
# or use the convenience script
./serve-docs.sh

The documentation will be available at http://127.0.0.1:8000.
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the Apache 2.0 License - see this for further details.
- Built on top of the tokengrams library for efficient n-gram indexing
- Uses Megatron-style dataset indexing for compatibility with existing training pipelines
- π Issues: GitHub Issues
- π Documentation: https://aflah02.github.io/TokenSmith
If you find this library useful or build upon it, please cite our work:
@misc{khan2025tokensmithstreamliningdataediting,
title={TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability},
author={Mohammad Aflah Khan and Ameya Godbole and Johnny Tian-Zheng Wei and Ryan Wang and James Flemings and Krishna Gummadi and Willie Neiswanger and Robin Jia},
year={2025},
eprint={2507.19419},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.19419},
}