v2.6.0 -- Elevating Data Insights: Comprehensive Issue Checks & Expanded ML Task Compatibility
3f07a88
This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.
However, this release drops support for Python 3.7 while adding support for Python 3.11.
Enhancements to Datalab
In this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. Specifically, it can now:
- Identify `null` values in your dataset.
- Detect `class_imbalance`.
- Highlight an `underperforming_group`, which refers to a subset of data points where your model exhibits poorer performance compared to others.
See our FAQ for more information on how to provide pre-defined groups for this issue type.
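To make the idea of an underperforming group concrete, here is a minimal sketch that compares per-group accuracy against overall accuracy. It only illustrates the concept and is not how Datalab implements the check; the function name `find_underperforming_group` and the `min_gap` threshold are hypothetical, chosen just for this example.

```python
from collections import defaultdict


def find_underperforming_group(groups, labels, predictions, min_gap=0.1):
    """Return (group, group_accuracy, overall_accuracy) for the worst group.

    Returns None when no group trails the overall accuracy by at least
    `min_gap`. Hypothetical helper for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, label, prediction in zip(groups, labels, predictions):
        total[group] += 1
        correct[group] += int(label == prediction)

    overall = sum(correct.values()) / sum(total.values())
    accuracy = {group: correct[group] / total[group] for group in total}

    # The candidate is the group with the lowest accuracy.
    worst = min(accuracy, key=accuracy.get)
    if overall - accuracy[worst] >= min_gap:
        return worst, accuracy[worst], overall
    return None


# Toy data: pre-defined group ids, given labels, and model predictions.
groups = ["a", "a", "a", "b", "b", "b"]
labels = [0, 1, 0, 1, 1, 0]
predictions = [0, 1, 0, 0, 0, 0]

result = find_underperforming_group(groups, labels, predictions)
print(result)
```

In this toy data the model is right on every example in group `a` but on only one of three examples in group `b`, so `b` is reported.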
Additionally, Datalab can now optionally:
- Assess the value of data points in your dataset using KNN-Shapley scores as a measure of `data_valuation`.
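For readers unfamiliar with KNN-Shapley, the sketch below applies the closed-form recursion commonly used for an unweighted KNN classifier, for a single test point and one-dimensional features. It is an illustration under those simplifying assumptions, not cleanlab's implementation; the function name `knn_shapley_single` is hypothetical, and Datalab's `data_valuation` scores may be computed and scaled differently.

```python
def knn_shapley_single(train_x, train_y, test_x, test_y, k):
    """KNN-Shapley value of each training point for one test point.

    Assumes 1-D features and absolute distance. Hypothetical helper for
    illustration only.
    """
    n = len(train_x)
    # Training indices ordered from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: abs(train_x[i] - test_x))
    # 1.0 where the training label agrees with the test label, else 0.0.
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    by_rank = [0.0] * n
    # The farthest point gets the base value; nearer points follow recursively.
    by_rank[n - 1] = match[n - 1] / n
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-indexed rank of the point at this position
        by_rank[pos] = (
            by_rank[pos + 1]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map the values back to the original training order.
    values = [0.0] * n
    for pos, idx in enumerate(order):
        values[idx] = by_rank[pos]
    return values


values = knn_shapley_single(
    train_x=[0.0, 1.0, 5.0],
    train_y=["a", "b", "a"],
    test_x=0.1,
    test_y="a",
    k=1,
)
print(values)
```

The nearest, correctly labeled point receives the highest value, while the mislabeled-looking neighbor receives a negative one. The values sum to the utility of the full training set, which is 1 here because the nearest neighbor matches the test label.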
If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!
Expanded Datalab Support for New ML Tasks
With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.
This release introduces the `task` parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.
from cleanlab import Datalab
lab = Datalab(..., task="regression")
The `task`s currently supported are:
- classification (default): Includes all previously supported issue-checking capabilities based on `pred_probs`, `features`, or a `knn_graph`, and the new features introduced earlier.
- regression (new):
  - Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated regression tutorial.
  - Find other issues utilizing `features` or a `knn_graph`.
- multilabel (new):
  - Detect label errors in multilabel classification datasets using `pred_probs` exclusively. Explore the updated capabilities in our multilabel tutorial.
  - Find various other types of issues based on `features` or a `knn_graph`.
Improved Object Detection Dataset Exploration
New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.
Learn how to leverage some of these functions in our object detection tutorial.
Other Major Improvements
- Rescaled Near Duplicate and Outlier Scores:
  - Note that what matters for all cleanlab issue scores is not their absolute magnitudes but rather how these scores rank the data points from most to least severe instances of the issue. But based on user feedback, we have updated the near duplicate and outlier scores to display a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
- Consistency in counting label issues:
  - `cleanlab.dataset.health_summary()` now returns the same number of issues as `cleanlab.classification.find_label_issues()` and `cleanlab.count.num_label_issues()`.
- Improved handling of non-iid issues:
  - The non-iid issue check in Datalab now handles `pred_probs` as input.
- Better reporting in Datalab:
  - Simplified `Datalab.report()` now highlights only detected issue types. To view all checked issue types, use `Datalab.report(show_all_issues=True)`.
- Enhanced Handling of Binary Classification Tasks:
  - Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors, improving the handling of binary classification tasks.
- Experimental Functionality:
  - cleanlab now offers experimental functionality for detecting label issues in span categorization tasks with a single class, enhancing its applicability in natural language processing projects.
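The point about rescaled scores keeping the same ranking can be checked with a small sketch: any strictly increasing transform changes the magnitudes but not the order of the data points. The transform below is hypothetical and chosen only for illustration; it is not the rescaling cleanlab applies to its near duplicate or outlier scores.

```python
import math


def rescale(score, scale=1.0):
    """Map a non-negative raw score into [0, 1) with a strictly increasing
    function. Hypothetical transform, for illustration only."""
    return 1.0 - math.exp(-scale * score)


def ranking(scores):
    """Indices of the data points sorted by ascending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


raw_scores = [0.02, 0.5, 0.001, 3.0]
rescaled_scores = [rescale(s) for s in raw_scores]

print(ranking(raw_scores))
print(ranking(rescaled_scores))
```

Both calls print the same index order, so which data points are considered most severe is unaffected by the rescaling.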
New Contributors
We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:
- @smttsp made their first contribution in #867
- @abhijitpal1247 made their first contribution in #856
- @01PrathamS made their first contribution in #893
- @mglowacki100 made their first contribution in #796
- @gibsonliketheguitar made their first contribution in #831
- @kylegallatin made their first contribution in #885
- @ryansingman made their first contribution in #919
- @R-Peleg made their first contribution in #948
Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.
Change Log
Significant changes in this release include:
- Update FAQ section in docs by @tataganesh in #869; @elisno in #913
- Improve Object Detection module by @Steven-Yiran in #840, #877; @aditya1503 in #883, #969, #968
- Clearer documentation/tutorials/readme by @jwmueller in #851, #931, #981, #983, #1001, #978, #994, #1010; @01PrathamS in #893; @elisno in #878, #1007, #992, #1015, #1016; @huiwengoh in #984; @sanjanag in #936; @tataganesh in #916; @ulya-tkch in #954
- CI updates by @aditya1503 in #864; @elisno in #879, #961, #963, #965, #1008, #975, #1011, #1012, #1013, #1014; @jwmueller in #852, #865; @tataganesh in #900; @anishathalye in #956; @sanjanag in #1009
- Docs system updates by @elisno in #880, #881, #958, #959, #960, #964
- Add Null Issue Manager by @abhijitpal1247 in #856; @tataganesh in #927, #917
- Add Data Valuation Issue Manager by @coding-famer in #850, #925
- Extend non-iid issue check to run if only pred_probs are provided by @abhijitpal1247 in #857; @tataganesh in #896, #897
- Add Underperforming Group Issue Manager by @tataganesh in #838, #907; @elisno in #990
- Add Class Imbalance issue type to Datalab defaults by @tataganesh in #912, #933; @jwmueller in #924, #934; @elisno in #940
- Add regression task to Datalab by @mglowacki100 in #796; @elisno in #902
- Add multilabel task to Datalab by @tataganesh in #929
- 702 - Shorten Refs of classes and functions in Docs by @gibsonliketheguitar in #831
- Update near duplicate issues and sets by @ryansingman in #919; @elisno in #895
- Rescale near duplicate scores by @elisno in #943
- Rescale outlier scores by @elisno in #953
- List comprehension to numpy ops for efficiency by @tataganesh in #844
- Reduce memory usage of filter.find_label_issues() by @kylegallatin in #885
- Updates to tests by @aditya1503 in #945; @elisno in #985, #998
- Refactor Datalab functionality by @elisno in #971, #1006
- Minor fixes for Datalab by @elisno in #997, #999, #1000, #1003, #1005, #979
- Drop Python 3.7 support and add Python 3.11 support by @elisno in #980
- Add a `show_all_issues` optional argument to Datalab.report() by @elisno in #970
- Single Class Span Classification Support by @Steven-Yiran in #982
- Ensure near-predicted labels are not flagged as label issues by @aditya1503 in #950
- PR template added and gitignore improved by @smttsp in #867
- Update label issue count in dataset.health_summary() by @ulya-tkch in #875
- Update segmentation.ipynb by @R-Peleg in #948
- Refactor batching logic in cleanlab.segmentation.filter.find_label_issues by @elisno in #918
For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.