v2.3.0 -- Extending cleanlab beyond label errors into a complete library for data-centric AI
Commit: b8c034f
Cleanlab was originally open-sourced as code to accompany a research paper on label errors in classification tasks, to prove to skeptical researchers that it's possible to utilize ML models to discover mislabeled data and then train even better versions of these same models. We've been hard at work since then, turning this into an industry-grade library that helps you handle label errors in many ML tasks such as: entity recognition, image/document tagging, data labeled by multiple annotators, etc. While label errors are critical to deal with in real-world ML applications, data-centric AI involves utilizing trained ML models to improve the data in other ways as well.
With the newest release, cleanlab v2.3 can now automatically:
- find mislabeled data + train robust models
- detect outliers and out-of-distribution data
- estimate consensus + annotator-quality for multi-annotator datasets
- suggest which data is best to (re)label next
As always, the cleanlab library works with almost any ML model (no matter how it was trained) and type of data (image, text, tabular, audio, etc). We have user-friendly 5min tutorials to get started with any of the above objectives and easily improve your data!
We're aiming for this library to provide all the key functionalities needed to practice data-centric AI. Much of this involves inventing new algorithms for data quality, and we transparently publish all of these algorithms in scientific papers. Read these to understand how particular cleanlab methods work under the hood and see extensive benchmarks of how effective they are on real data.
Highlights of what’s new in 2.3.0:
We have added new functionality for active learning and for easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory using new methods added to help users with massive datasets. This release is non-breaking when upgrading from v2.2.0 (except for certain methods in `cleanlab.experimental` that have been moved).
Active Learning with ActiveLab
For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately, data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. ActiveLab is a new algorithm invented by our team that automatically answers the question: which new data should I label, or which of my current labels should be checked again? ActiveLab is highly practical: it runs quickly and works with any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).
Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected).
```python
from cleanlab.multiannotator import get_active_learning_scores

scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
```
The examples with the lowest scores are the most informative to collect an additional label for (scores between the labeled and unlabeled pools are directly comparable). You can either have a new annotator label the batch of examples with the lowest scores, or distribute them amongst your previous annotators as is most convenient. ActiveLab is also effective for standard active learning, where you collect at most one label per example (no re-labeling), as well as active label cleaning (with no unlabeled pool), where you only want to re-label examples to ensure 100% correct consensus labels with the least amount of re-labeling.
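For illustration, picking the next batch to (re)label from these scores is just a matter of ranking them together. The score values and batch size below are hypothetical, purely to show the selection step:

```python
import numpy as np

# Hypothetical ActiveLab scores (lower = more informative to label next)
scores_labeled_pool = np.array([0.90, 0.15, 0.75, 0.40])
scores_unlabeled_pool = np.array([0.55, 0.10, 0.80])

# Scores from both pools are directly comparable, so rank them jointly
all_scores = np.concatenate([scores_labeled_pool, scores_unlabeled_pool])
batch_size = 3
to_label_next = np.argsort(all_scores)[:batch_size]  # lowest-scoring examples

# Indices < len(scores_labeled_pool) refer to the labeled pool (re-label these);
# the remaining indices refer to the unlabeled pool (collect a first label).
print(to_label_next)  # -> [5 1 3]
```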
Get started running ActiveLab with the tutorial notebook in our examples repo, which contains many other examples as well.
KerasWrapper
We've introduced one-line wrappers for TensorFlow/Keras models that enable you to use them within scikit-learn workflows with features like `Pipeline`, `GridSearch`, and more. Just change one line of code to make your existing TensorFlow/Keras model compatible with scikit-learn's rich ecosystem! All you have to do is swap out `keras.Model` → `KerasWrapperModel`, or `keras.Sequential` → `KerasSequentialWrapper`. Imported from `cleanlab.models.keras`, the wrapper objects have all the same methods as their Keras counterparts, plus you can use them with tons of handy scikit-learn methods.
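The reason such a wrapper is possible is that scikit-learn only expects an estimator to expose a small contract: `fit`, `predict`, `get_params`, and `set_params`. The toy class below is a rough sketch of that contract, purely illustrative and not cleanlab's actual implementation:

```python
class ToyWrapperModel:
    """Minimal sketch of the sklearn estimator contract that wrappers
    like KerasWrapperModel satisfy (illustrative only)."""

    def __init__(self, model_builder=None, epochs=1):
        # sklearn convention: __init__ only stores hyperparameters
        self.model_builder = model_builder
        self.epochs = epochs

    def get_params(self, deep=True):
        # lets GridSearchCV enumerate/clone hyperparameters
        return {"model_builder": self.model_builder, "epochs": self.epochs}

    def set_params(self, **params):
        # lets GridSearchCV try new hyperparameter settings
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # a real wrapper would build and train the Keras model here
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # a real wrapper would call the trained Keras model here
        return [self.classes_[0] for _ in X]


model = ToyWrapperModel(epochs=5).set_params(epochs=10)
print(model.get_params())  # {'model_builder': None, 'epochs': 10}
```

Because the wrappers honor this contract, tools like `Pipeline` and `GridSearch` can treat the underlying Keras model like any other scikit-learn estimator.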
Resources to get started include:
- Blog post and Jupyter notebook demonstrating how to make a HuggingFace Transformer (BERT model) sklearn-compatible.
- Jupyter notebook showing how to fit these sklearn-compatible models to a TensorFlow Dataset.
- Revamped tutorial on label errors in text classification data, which has been updated to use this new wrapper.
Computational improvements for detecting label issues
Through extensive optimization of our multiprocessing code (thanks to @clu0), `find_label_issues` has been made ~10x faster on Linux machines that have many CPU cores.
For massive datasets, `find_label_issues` may require too much memory to run on your machine. We've added new methods in `cleanlab.experimental.label_issues_batched` that can compute label issues with far less memory via mini-batch estimation. You can use these with billion-scale memmap arrays or Zarr arrays like this:
```python
import zarr

from cleanlab.experimental.label_issues_batched import find_label_issues_batched

labels = zarr.convenience.open("LABELS.zarr", mode="r")
pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=100000)
```
By choosing a sufficiently small `batch_size`, you should be able to handle pretty much any dataset (set it as large as your memory will allow for best efficiency). With default arguments, the batched methods closely approximate the results of: `cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")`. This and `filter_by="low_normalized_margin"` are new `find_label_issues()` options added in v2.3, which require less computation and still output accurate estimates of the label errors.
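To give intuition for what these options rank by, here is a rough NumPy sketch of the self-confidence idea: the model's predicted probability for an example's given label, where low values suggest a likely label error. This toy data and code are illustrative, not the library's exact implementation:

```python
import numpy as np

# Toy predicted class probabilities (3 classes) and the given labels
pred_probs = np.array([
    [0.9, 0.05, 0.05],  # model confidently agrees with its label
    [0.2, 0.7, 0.1],    # model agrees with its label
    [0.6, 0.3, 0.1],    # model disagrees with its label -> suspicious
])
labels = np.array([0, 1, 2])

# Self-confidence: probability the model assigns to each given label
self_confidence = pred_probs[np.arange(len(labels)), labels]  # [0.9, 0.7, 0.1]

# Rank examples by self-confidence (most suspicious first)
ranked = np.argsort(self_confidence)
print(ranked)  # -> [2 1 0]
```

Computing this score only needs one row of `pred_probs` at a time, which is why it lends itself to memory-efficient mini-batch estimation over memmap or Zarr arrays.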
Other changes to be aware of
- Like all major ML frameworks, we have dropped support for Python 3.6.
- We have moved some particularly useful models (fasttext, keras) from `cleanlab.experimental` → `cleanlab.models`.
Change Log
- Shorten tutorial titles in docs for readability by @ulya-tkch in #553
- Swap CI workflow to actions by @huiwengoh in #560
- Remove .pylintrc by @elisno in #564
- Tutorial fixes by @huiwengoh in #565
- Fix typo in CONTRIBUTING.md by @ulya-tkch in #566
- Multiannotator Active Learning Support by @huiwengoh in #538
- multiannotator explanation improvements by @jwmueller in #570
- Specify Sphinx to order functions by source code order by @huiwengoh in #571
- Fix example in ema docstring by @elisno in #563, #573
- update paper list and applications beyond label error detection in readme by @jwmueller in #574, #580
- Drop Python 3.6 support (by @jwmueller in #558, #577; by @anishathalye in #562; by @krmayankb in #578; by @sanjanag in #579)
- add maximum line length by @cgnorthcutt in #583
- Update github actions by @ulya-tkch in #589
- Revamp text tutorial by @huiwengoh in #584
- clarify thresholding in issues_from_scores by @jwmueller in #582
- Remove temp scaling from single annotator case by @huiwengoh in #590
- Update docs dependencies by @huiwengoh in #593
- Use euclidean distance for identifying outliers for lower dimensional features by @ulya-tkch in #581
- changing copyright year 2017-2022 to 2017-2023 by @aditya1503 in #594
- Handle missing type parameters for generic type "ndarray" by @elisno in #587
- Remove temp scaling for single-label case in ensemble method by @huiwengoh in #597
- Adding type hints for mypy strict compatibility by @unna97 in #585
- fix typo in outliers.ipynb by @eltociear in #603
- 10x speedup in find_label_issues on linux via better multiprocessing by @clu0 in #596
- Update tabular tutorial with better language by @cmauck10 in #609
- Improve num_label_issues() to reflect most accurate num issues by @ulya-tkch in #610
- Removed duplicate classifier from setup.py by @sanjanag in #612
- Add two methods to filter.find_label_issues by @cgnorthcutt in #595
- Fix dictionary type annotation for OutOfDistribution object by @ulya-tkch in #616
- Fix format compatibility with latest black==23. release by @ulya-tkch in #620
- Create new cleanlab.models module by @huiwengoh in #601
- upgrade torch in docs by @jwmueller in #607
- fix bug: confidences -> confidence by @jwmueller in #623
- Fixed duplicate issue removal in find_label_issues by @ulya-tkch in #624
- Method to estimate label issues with limited memory via mini-batches by @jwmueller in #615, #629, #632, #635
- Fix KerasWrapper summary method by @huiwengoh in #631
- Clarify rank.py not for multi-label classification by @ulya-tkch in #626
- Removed $ from shell commands to avoid it being copied by @sanjanag in #625
- label_issues_batched multiprocessing by @clu0 in #630, #634
- Switch to typing.Self by @anishathalye in #489
- Documentation improvements by @huiwengoh in #643
- add 2.3.0 to release versions by @jwmueller in #644
New Contributors
- @krmayankb made their first contribution in #578
- @sanjanag made their first contribution in #579
- @unna97 made their first contribution in #585
- @eltociear made their first contribution in #603
- @clu0 made their first contribution in #596
Full Changelog: v2.2.0...v2.3.0