v2.4.0 -- One line of code to detect all sorts of dataset issues
Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.
Introducing Datalab
Now we've added a unified platform called Datalab for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset, like label errors, outliers, near duplicates, etc.:
```python
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset
```
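Datalab works best with out-of-sample pred_probs, which you can obtain via cross-validation. A minimal sketch using scikit-learn (the toy dataset, model, and variable names here are illustrative stand-ins, not part of cleanlab's API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for your real dataset and model
X, labels = make_classification(
    n_samples=200, n_classes=3, n_informative=4, random_state=0
)
model = LogisticRegression(max_iter=1000)

# Out-of-sample predicted class probabilities via 5-fold cross-validation,
# so no datapoint is scored by a model that was trained on it
pred_probs = cross_val_predict(model, X, labels, cv=5, method="predict_proba")

# Here the raw features double as feature_embeddings; for images/text you
# would instead use vector representations from a trained network
feature_embeddings = X
```

These arrays can then be passed to `lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)` as shown above.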
Follow our blog to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue Datalab can detect is provided in this guide, but we recommend starting with the tutorials, which show how easy it is to run Datalab on your own dataset. Datalab can do things like find label issues with string class labels (whereas the prior find_label_issues() method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! Datalab uses these internally to detect data issues.
Our goal is for Datalab to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware that some Datalab APIs may change in subsequent package versions, as noted in the documentation.
You can easily run the issue checks in Datalab together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms to Datalab. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via Slack.
Revamped Tutorials
We've updated some of our existing tutorials with more interesting datasets and ML models. For the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions that detect issues in these same datasets with Datalab instead (see Datalab Tutorials). This should help existing users quickly ramp up on Datalab and see how much more powerful this comprehensive data audit can be.
Improvements for Multi-label Classification
To provide a better experience for users with multi-label classification datasets, we have explicitly separated this functionality into the cleanlab.multilabel_classification module. Please start there rather than specifying the multi_label=True flag in certain methods outside of this module, as that option will be deprecated in the future.
Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the cleanlab.multilabel_classification.dataset module.
While moving methods to the cleanlab.multilabel_classification module, we noticed some bugs in existing methods. We removed these methods entirely (replacing them with new ones in the cleanlab.multilabel_classification module), so some changes may appear backwards incompatible, even though the original code didn't function as intended in the first place.
Backwards incompatible changes
Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway, due to bugs that have since been fixed). Here are the changes you must make for your code to work with newer cleanlab versions:
- `cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
- `cleanlab.dataset.find_overlapping_classes(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)`
  The multi_label=False/True argument will be removed from the former method in the future. The returned DataFrame is slightly different; please refer to the new method's documentation.
- `cleanlab.dataset.overall_label_health_score(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.overall_label_health_score(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
- `cleanlab.dataset.health_summary(..., multi_label=True)` → `cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
There are no other backwards incompatible changes in the package with this release.
Deprecated workflows
We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work for now, though). Here are the changes we recommend:
- `cleanlab.filter.find_label_issues(..., multi_label=True)` → `cleanlab.multilabel_classification.filter.find_label_issues(...)`
  The multi_label=False/True argument will be removed from the former method in the future.
- `from cleanlab.multilabel_classification import get_label_quality_scores` → `from cleanlab.multilabel_classification.rank import get_label_quality_scores`
Remember: All of the code to work with multi-label data now lives in the cleanlab.multilabel_classification module.
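In the multi-label setting, each datapoint's label is a list of the class indices that apply to it, while some workflows instead use a one-hot (binary indicator) matrix. A minimal numpy sketch of converting between the two formats, similar in spirit to the int2onehot/onehot2int helpers shown in the multi-label tutorial (these particular functions are illustrative re-implementations, not the library's own helpers):

```python
import numpy as np

def to_onehot(labels, num_classes):
    """Convert list-of-lists multi-label format to a binary indicator matrix."""
    onehot = np.zeros((len(labels), num_classes), dtype=int)
    for i, classes in enumerate(labels):
        onehot[i, classes] = 1
    return onehot

def from_onehot(onehot):
    """Convert a binary indicator matrix back to list-of-lists format."""
    return [list(np.flatnonzero(row)) for row in onehot]

# Example multi-label annotations for 4 datapoints (one has no labels)
labels = [[0, 2], [1], [], [0, 1, 2]]
onehot = to_onehot(labels, num_classes=3)
assert from_onehot(onehot) == labels  # round-trip recovers the original format
```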
Change Log
- readme updates by @jwmueller in #659, #660, #713
- CI updates (by @sanjanag in #701; by @huiwengoh in #671; by @elisno in #695, #706)
- Documentation updates (by @jwmueller in #669, #710, #711, #716, #719, #720; by @huiwengoh in #714, #717; by @elisno in #678, #684)
- Documentation: use default rules for shorter, more readable links by @DerWeh in #700
- Added installation instructions for package extras by @sanjanag in #697
- Pass confident joint computed in CleanLearning to filter.find_label_issues by @huiwengoh in #661
- Add Example codeblock to the docstrings of important functions in the dataset module by @Steven-Yiran in #662, #663, #668
- Remove batch size check in label_issues_batched by @huiwengoh in #665
- adding multilabel dataset issue summaries by @aditya1503 in #657
- move int2onehot, onehot2int to top of multilabel tutorial by @jwmueller in #666
- Update softmax to more stable variant by @ulya-tkch in #667
- Revamp text and tabular tutorial by @huiwengoh in #673, #693
- allow for kwargs in token find_label_issues by @jwmueller in #686
- Update numpy.typing import and annotations by @elisno in #688
- Standardize documentation and simplify code for outliers by @DerWeh in #689
- Extract function for computing OOD scores from distances by @elisno in #664
- Introduce Datalab by @elisno in #614
- Introduce NonIID issue type by @jecummin in #614
- Further Datalab updates by @elisno in #680, #683, #687, #690, #691, #699, #705, #709, #712
- Add descriptions of issues that Datalab can detect by @elisno in #682
- Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by @jwmueller in #692
- Improve NonIID issue checks by @elisno in #694, #707
New Contributors
- @Steven-Yiran made their first contribution in #662
- @DerWeh made their first contribution in #689
- @jecummin made their first contribution in #614
Full Changelog: v2.3.1...v2.4.0