v2.3.0 -- Extending cleanlab beyond label errors into a complete library for data-centric AI
Commit: b8c034f
Cleanlab was originally open-sourced as code to accompany a research paper on label errors in classification tasks, to prove to skeptical researchers that it's possible to utilize ML models to discover mislabeled data and then train even better versions of these same models. We've been hard at work since then, turning this into an industry-grade library that helps you handle label errors in many ML tasks such as: entity recognition, image/document tagging, data labeled by multiple annotators, etc. While label errors are critical to deal with in real-world ML applications, data-centric AI involves utilizing trained ML models to improve the data in other ways as well.
With the newest release, cleanlab v2.3 can now automatically:
- find mislabeled data + train robust models
- detect outliers and out-of-distribution data
- estimate consensus + annotator-quality for multi-annotator datasets
- suggest which data is best to (re)label next
As always, the cleanlab library works with almost any ML model (no matter how it was trained) and type of data (image, text, tabular, audio, etc). We have user-friendly 5min tutorials to get started with any of the above objectives and easily improve your data!
We're aiming for this library to provide all the key functionalities needed to practice data-centric AI. Much of this involves inventing new algorithms for data quality, and we transparently publish all of these algorithms in scientific papers. Read these to understand how particular cleanlab methods work under the hood and see extensive benchmarks of how effective they are on real data.
Highlights of what’s new in 2.3.0:
We have added new functionality for active learning and for easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory using new methods added to help users with massive datasets. This release is non-breaking when upgrading from v2.2.0 (except for certain methods in `cleanlab.experimental` that have been moved).
Active Learning with ActiveLab
For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately, data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. ActiveLab is a new algorithm invented by our team that automatically answers the question: which new data should I label, or which of my current labels should be checked again? ActiveLab is highly practical: it runs quickly and works with any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).
Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected).
```python
from cleanlab.multiannotator import get_active_learning_scores

scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
```
The examples with the lowest scores are the most informative to collect an additional label for (scores between the labeled and unlabeled pools are directly comparable). You can either have a new annotator label the batch of examples with the lowest scores, or distribute them amongst your previous annotators as is most convenient. ActiveLab is also effective for standard active learning, where you collect at most one label per example (no re-labeling), as well as active label cleaning (with no unlabeled pool), where you only want to re-label examples to ensure 100% correct consensus labels with the least amount of re-labeling.
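For illustration, picking the next batch to (re)label from these scores is just a matter of ranking them together. The score values and batch size below are hypothetical, purely to show the selection step:

```python
import numpy as np

# Hypothetical ActiveLab scores (lower = more informative to label next)
scores_labeled_pool = np.array([0.90, 0.15, 0.75, 0.40])
scores_unlabeled_pool = np.array([0.55, 0.10, 0.80])

# Scores from both pools are directly comparable, so rank them jointly
all_scores = np.concatenate([scores_labeled_pool, scores_unlabeled_pool])
batch_size = 3
to_label_next = np.argsort(all_scores)[:batch_size]  # lowest-scoring examples

# Indices < len(scores_labeled_pool) refer to the labeled pool (re-label these);
# the remaining indices refer to the unlabeled pool (collect a first label).
print(to_label_next)  # -> [5 1 3]
```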
Get started running ActiveLab with the tutorial notebook in our examples repo, which contains many other examples as well.
KerasWrapper
We've introduced one-line wrappers for TensorFlow/Keras models that enable you to use them within scikit-learn workflows with features like `Pipeline`, `GridSearch`, and more. Just change one line of code to make your existing TensorFlow/Keras model compatible with scikit-learn's rich ecosystem! All you have to do is swap out `keras.Model` → `KerasWrapperModel`, or `keras.Sequential` → `KerasSequentialWrapper`. Imported from `cleanlab.models.keras`, the wrapper objects have all the same methods as their Keras counterparts, plus you can use them with tons of handy scikit-learn methods.
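The reason such a wrapper is possible is that scikit-learn only expects an estimator to expose a small contract: `fit`, `predict`, `get_params`, and `set_params`. The toy class below is a rough sketch of that contract, purely illustrative and not cleanlab's actual implementation:

```python
class ToyWrapperModel:
    """Minimal sketch of the sklearn estimator contract that wrappers
    like KerasWrapperModel satisfy (illustrative only)."""

    def __init__(self, model_builder=None, epochs=1):
        # sklearn convention: __init__ only stores hyperparameters
        self.model_builder = model_builder
        self.epochs = epochs

    def get_params(self, deep=True):
        # lets GridSearchCV enumerate/clone hyperparameters
        return {"model_builder": self.model_builder, "epochs": self.epochs}

    def set_params(self, **params):
        # lets GridSearchCV try new hyperparameter settings
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # a real wrapper would build and train the Keras model here
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # a real wrapper would call the trained Keras model here
        return [self.classes_[0] for _ in X]


model = ToyWrapperModel(epochs=5).set_params(epochs=10)
print(model.get_params())  # {'model_builder': None, 'epochs': 10}
```

Because the wrappers honor this contract, tools like `Pipeline` and `GridSearch` can treat the underlying Keras model like any other scikit-learn estimator.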
Resources to get started include:
- Blog post and Jupyter notebook demonstrating how to make a HuggingFace Transformer (BERT model) sklearn-compatible.
- Jupyter notebook showing how to fit these sklearn-compatible models to a TensorFlow Dataset.
- Revamped tutorial on label errors in text classification data, which has been updated to use this new wrapper.
Computational improvements for detecting label issues
Through extensive optimization of our multiprocessing code (thanks to @clu0), `find_label_issues` has been made ~10x faster on Linux machines that have many CPU cores.
For massive datasets, `find_label_issues` may require too much memory to run on your machine. We've added new methods in `cleanlab.experimental.label_issues_batched` that can compute label issues with far less memory via mini-batch estimation. You can use these with billion-scale memmap arrays or Zarr arrays like this:
```python
import zarr

from cleanlab.experimental.label_issues_batched import find_label_issues_batched

labels = zarr.convenience.open("LABELS.zarr", mode="r")
pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=100000)
```
By choosing a sufficiently small `batch_size`, you should be able to handle pretty much any dataset (set it as large as your memory will allow for best efficiency). With default arguments, the batched methods closely approximate the results of: `cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")`. This and `filter_by="low_normalized_margin"` are new `find_label_issues()` options added in v2.3, which require less computation and still output accurate estimates of the label errors.
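To give intuition for what these options rank by, here is a rough NumPy sketch of the self-confidence idea: the model's predicted probability for an example's given label, where low values suggest a likely label error. This toy data and code are illustrative, not the library's exact implementation:

```python
import numpy as np

# Toy predicted class probabilities (3 classes) and the given labels
pred_probs = np.array([
    [0.9, 0.05, 0.05],  # model confidently agrees with its label
    [0.2, 0.7, 0.1],    # model agrees with its label
    [0.6, 0.3, 0.1],    # model disagrees with its label -> suspicious
])
labels = np.array([0, 1, 2])

# Self-confidence: probability the model assigns to each given label
self_confidence = pred_probs[np.arange(len(labels)), labels]  # [0.9, 0.7, 0.1]

# Rank examples by self-confidence (most suspicious first)
ranked = np.argsort(self_confidence)
print(ranked)  # -> [2 1 0]
```

Computing this score only needs one row of `pred_probs` at a time, which is why it lends itself to memory-efficient mini-batch estimation over memmap or Zarr arrays.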
Other changes to be aware of
- Like all major ML frameworks, we have dropped support for Python 3.6.
- We have moved some particularly useful models (fasttext, keras) from `cleanlab.experimental` → `cleanlab.models`.
Change Log
- Shorten tutorial titles in docs for readability by @ulya-tkch in #553
- Swap CI workflow to actions by @huiwengoh in #560
- Remove .pylintrc by @elisno in #564
- Tutorial fixes by @huiwengoh in #565
- Fix typo in CONTRIBUTING.md by @ulya-tkch in #566
- Multiannotator Active Learning Support by @huiwengoh in #538
- multiannotator explanation improvements by @jwmueller in #570
- Specify Sphinx to order functions by source code order by @huiwengoh in #571
- Fix example in ema docstring by @elisno in #563, #573
- update paper list and applications beyond label error detection in readme by @jwmueller in #574, #580
- Drop Python 3.6 support (by @jwmueller in #558, #577; by @anishathalye in #562; by @krmayankb in #578; by @sanjanag in #579)
- add maximum line length by @cgnorthcutt in #583
- Update github actions by @ulya-tkch in #589
- Revamp text tutorial by @huiwengoh in #584
- clarify thresholding in issues_from_scores by @jwmueller in #582
- Remove temp scaling from single annotator case by @huiwengoh in #590
- Update docs dependencies by @huiwengoh in #593
- Use euclidean distance for identifying outliers for lower dimensional features by @ulya-tkch in #581
- changing copyright year 2017-2022 to 2017-2023 by @aditya1503 in #594
- Handle missing type parameters for generic type "ndarray" by @elisno in #587
- Remove temp scaling for single-label case in ensemble method by @huiwengoh in #597
- Adding type hints for mypy strict compatibility by @unna97 in #585
- fix typo in outliers.ipynb by @eltociear in #603
- 10x speedup in find_label_issues on linux via better multiprocessing by @clu0 in #596
- Update tabular tutorial with better language by @cmauck10 in #609
- Improve num_label_issues() to reflect most accurate num issues by @ulya-tkch in #610
- Removed duplicate classifier from setup.py by @sanjanag in #612
- Add two methods to filter.find_label_issues by @cgnorthcutt in #595
- Fix dictionary type annotation for OutOfDistribution object by @ulya-tkch in #616
- Fix format compatibility with latest black==23. release by @ulya-tkch in #620
- Create new cleanlab.models module by @huiwengoh in #601
- upgrade torch in docs by @jwmueller in #607
- fix bug: confidences -> confidence by @jwmueller in #623
- Fixed duplicate issue removal in find_label_issues by @ulya-tkch in #624
- Method to estimate label issues with limited memory via mini-batches by @jwmueller in #615, #629, #632, #635
- Fix KerasWrapper summary method by @huiwengoh in #631
- Clarify rank.py not for multi-label classification by @ulya-tkch in #626
- Removed $ from shell commands to avoid it being copied by @sanjanag in #625
- label_issues_batched multiprocessing by @clu0 in #630, #634
- Switch to typing.Self by @anishathalye in #489
- Documentation improvements by @huiwengoh in #643
- add 2.3.0 to release versions by @jwmueller in #644
New Contributors
- @krmayankb made their first contribution in #578
- @sanjanag made their first contribution in #579
- @unna97 made their first contribution in #585
- @eltociear made their first contribution in #603
- @clu0 made their first contribution in #596
Full Changelog: v2.2.0...v2.3.0