v2.4.0 -- One line of code to detect all sorts of dataset issues
Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.
Introducing Datalab
Now we've added a unified platform called Datalab for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset, like label errors, outliers, near duplicates, etc.:
```python
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset
```
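Datalab works best with out-of-sample pred_probs, which you can obtain via cross-validation. A minimal sketch using scikit-learn (the toy dataset, model, and variable names here are illustrative stand-ins, not part of cleanlab's API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for your real dataset and model
X, labels = make_classification(
    n_samples=200, n_classes=3, n_informative=4, random_state=0
)
model = LogisticRegression(max_iter=1000)

# Out-of-sample predicted class probabilities via 5-fold cross-validation,
# so no datapoint is scored by a model that was trained on it
pred_probs = cross_val_predict(model, X, labels, cv=5, method="predict_proba")

# Here the raw features double as feature_embeddings; for images/text you
# would instead use vector representations from a trained network
feature_embeddings = X
```

These arrays can then be passed to `lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)` as shown above.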
Follow our blog to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue Datalab can detect is provided in this guide, but we recommend starting with the tutorials, which show how easy it is to run Datalab on your own dataset. Datalab can do things like find label issues with string class labels (whereas the prior find_label_issues() method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! Datalab uses these internally to detect data issues.
Our goal is for Datalab to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware that some Datalab APIs may change in subsequent package versions, as noted in the documentation.
You can easily run the issue checks in Datalab together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms to Datalab. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via Slack.
Revamped Tutorials
We've updated some of our existing tutorials with more interesting datasets and ML models. For the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions that detect issues in these same datasets with Datalab instead (see Datalab Tutorials). This should help existing users quickly ramp up on Datalab and see how much more powerful this comprehensive data audit can be.
Improvements for Multi-label Classification
To provide a better experience for users with multi-label classification datasets, we have explicitly separated this functionality into the cleanlab.multilabel_classification module. Please start there rather than specifying the multi_label=True flag in certain methods outside of this module, as that option will be deprecated in the future.
Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the cleanlab.multilabel_classification.dataset module.
While moving methods to the cleanlab.multilabel_classification module, we noticed some bugs in existing methods. We removed these methods entirely (replacing them with new ones in the cleanlab.multilabel_classification module), so some changes may appear backwards incompatible, even though the original code didn't function as intended in the first place.
Backwards incompatible changes
Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway, due to bugs that have since been fixed). Here are the changes you must make for your code to work with newer cleanlab versions:
- `cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
- `cleanlab.dataset.find_overlapping_classes(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)`
  The multi_label=False/True argument will be removed from the former method in the future. The returned DataFrame is slightly different; please refer to the new method's documentation.
- `cleanlab.dataset.overall_label_health_score(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.overall_label_health_score(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
- `cleanlab.dataset.health_summary(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
There are no other backwards incompatible changes in the package with this release.
Deprecated workflows
We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work for now, though). Here are the changes we recommend:
- `cleanlab.filter.find_label_issues(..., multi_label=True)` → `cleanlab.multilabel_classification.filter.find_label_issues(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
- `from cleanlab.multilabel_classification import get_label_quality_scores` → `from cleanlab.multilabel_classification.rank import get_label_quality_scores`
Remember: All of the code to work with multi-label data now lives in the cleanlab.multilabel_classification module.
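In the multi-label setting, each datapoint's label is a list of the class indices that apply to it, while some workflows instead use a one-hot (binary indicator) matrix. A minimal numpy sketch of converting between the two formats, similar in spirit to the int2onehot/onehot2int helpers shown in the multi-label tutorial (these particular functions are illustrative re-implementations, not the library's own helpers):

```python
import numpy as np

def to_onehot(labels, num_classes):
    """Convert list-of-lists multi-label format to a binary indicator matrix."""
    onehot = np.zeros((len(labels), num_classes), dtype=int)
    for i, classes in enumerate(labels):
        onehot[i, classes] = 1
    return onehot

def from_onehot(onehot):
    """Convert a binary indicator matrix back to list-of-lists format."""
    return [list(np.flatnonzero(row)) for row in onehot]

# Example multi-label annotations for 4 datapoints (one has no labels)
labels = [[0, 2], [1], [], [0, 1, 2]]
onehot = to_onehot(labels, num_classes=3)
assert from_onehot(onehot) == labels  # round-trip recovers the original format
```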
Change Log
- readme updates by @jwmueller in #659, #660, #713
- CI updates (by @sanjanag in #701; by @huiwengoh in #671; by @elisno in #695, #706)
- Documentation updates (by @jwmueller in #669, #710, #711, #716, #719, #720; by @huiwengoh in #714, #717; by @elisno in #678, #684)
- Documentation: use default rules for shorter, more readable links by @DerWeh in #700
- Added installation instructions for package extras by @sanjanag in #697
- Pass confident joint computed in CleanLearning to filter.find_label_issues by @huiwengoh in #661
- Add Example codeblock to the docstrings of important functions in the dataset module by @Steven-Yiran in #662, #663, #668
- Remove batch size check in label_issues_batched by @huiwengoh in #665
- adding multilabel dataset issue summaries by @aditya1503 in #657
- move int2onehot, onehot2int to top of multilabel tutorial by @jwmueller in #666
- Update softmax to more stable variant by @ulya-tkch in #667
- Revamp text and tabular tutorial by @huiwengoh in #673, #693
- allow for kwargs in token find_label_issues by @jwmueller in #686
- Update numpy.typing import and annotations by @elisno in #688
- Standardize documentation and simplify code for outliers by @DerWeh in #689
- Extract function for computing OOD scores from distances by @elisno in #664
- Introduce Datalab by @elisno in #614
- Introduce NonIID issue type by @jecummin in #614
- Further Datalab updates by @elisno in #680, #683, #687, #690, #691, #699, #705, #709, #712
- Add descriptions of issues that Datalab can detect by @elisno in #682
- Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by @jwmueller in #692
- Improve NonIID issue checks by @elisno in #694, #707
New Contributors
- @Steven-Yiran made their first contribution in #662
- @DerWeh made their first contribution in #689
- @jecummin made their first contribution in #614
Full Changelog: v2.3.1...v2.4.0