v2.6.0 -- Elevating Data Insights: Comprehensive Issue Checks & Expanded ML Task Compatibility

@smttsp

This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.
However, this release drops support for Python 3.7 while adding support for Python 3.11.

Enhancements to Datalab

In this release, Datalab, our dataset analysis platform, detects more types of issues by default, giving you a more comprehensive analysis of your dataset. Specifically, it can now:

Identify null values in your dataset.
Detect class_imbalance.
Highlight an underperforming_group, which refers to a subset of data points where your model exhibits poorer performance compared to others.
See our FAQ for more information on how to provide pre-defined groups for this issue type.
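
To make the first two checks concrete, here is a minimal standard-library sketch of the quantities they are concerned with: how much of each row is missing, and how rare each class is. This illustrates the idea only and is not Datalab's implementation; the helper names `null_fraction_per_row` and `class_frequencies` and the toy data are hypothetical.

```python
from collections import Counter


def null_fraction_per_row(rows):
    """Fraction of missing (None) entries in each row."""
    return [sum(value is None for value in row) / len(row) for row in rows]


def class_frequencies(labels):
    """Share of the dataset taken up by each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy data, invented for illustration only.
rows = [
    [1.0, 2.0, 3.0],     # complete row
    [None, 2.5, None],   # partially missing row
    [None, None, None],  # entirely missing row
]
labels = ["cat", "cat", "cat", "cat", "dog"]

null_fractions = null_fraction_per_row(rows)
frequencies = class_frequencies(labels)
# The least frequent class is the natural candidate for an imbalance warning.
rarest_class = min(frequencies, key=frequencies.get)
```

In practice you would let Datalab run these checks for you; the sketch is only meant to show what kind of signal each issue type summarizes.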

Additionally, Datalab can now optionally:

Assess the value of data points in your dataset using KNN-Shapley scores as a measure of data_valuation.
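
For readers unfamiliar with KNN-Shapley, the sketch below implements the closed-form recursion from the data valuation literature for a single labeled query point and an unweighted k-NN classifier. It is an assumption-laden illustration, not Datalab's code: the data_valuation check works on your dataset without a separate query set and may aggregate or rescale scores differently. The function name `knn_shapley_single_query` and the toy data are hypothetical.

```python
import math


def knn_shapley_single_query(train_points, train_labels, query_point, query_label, k):
    """Shapley value of each training point for one labeled query point,
    under an unweighted k-NN classifier (closed-form recursion)."""
    n = len(train_points)
    # Rank training points from nearest to farthest from the query.
    order = sorted(range(n), key=lambda i: math.dist(train_points[i], query_point))
    # 1.0 where the ranked point's label agrees with the query label.
    match = [1.0 if train_labels[i] == query_label else 0.0 for i in order]

    values_by_rank = [0.0] * n
    # The farthest point starts the recursion.
    values_by_rank[n - 1] = match[n - 1] / n
    # Walk from the second-farthest point back to the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at this position
        values_by_rank[pos] = values_by_rank[pos + 1] + (
            (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original training-point order.
    values = [0.0] * n
    for pos, idx in enumerate(order):
        values[idx] = values_by_rank[pos]
    return values


# Toy 1-D example, invented for illustration only.
train_points = [(0.0,), (1.0,), (2.0,)]
train_labels = [1, 0, 1]
values = knn_shapley_single_query(train_points, train_labels, (0.1,), 1, k=1)
```

A useful sanity check is that the values sum to the k-NN utility of the full training set for that query, and that the nearby point with the wrong label receives a negative value.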

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Expanded Datalab Support for New ML Tasks

With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks.
This release introduces the task parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.

from cleanlab import Datalab
lab = Datalab(..., task="regression")

The tasks currently supported are:

classification (default): Includes all previously supported issue-checking capabilities based on pred_probs, features, or a knn_graph, and the new features introduced earlier.
regression (new):
- Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated regression tutorial.
- Find other issues utilizing features or a knn_graph.
multilabel (new):
- Detect label errors in multilabel classification datasets using pred_probs exclusively. Explore the updated capabilities in our multilabel tutorial.
- Find various other types of issues based on features or a knn_graph.

Improved Object Detection Dataset Exploration

New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.
Learn how to leverage some of these functions in our object detection tutorial.

Other Major Improvements

Rescaled Near Duplicate and Outlier Scores:
- What matters for all cleanlab issue scores is not their absolute magnitude but how they rank the data points from most to least severe instances of the issue. Based on user feedback, the near duplicate and outlier scores now span a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
Consistency in counting label issues:
- cleanlab.dataset.health_summary() now returns the same number of issues as cleanlab.classification.find_label_issues() and cleanlab.count.num_label_issues().
Improved handling of non-iid issues:
- The non-iid issue check in Datalab now handles pred_probs as input.
Better reporting in Datalab:
- Simplified Datalab.report() now highlights only detected issue types. To view all checked issue types, use Datalab.report(show_all_issues=True).
Enhanced Handling of Binary Classification Tasks:
- Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors.
Experimental Functionality:
- cleanlab now offers experimental functionality for detecting label issues in span categorization tasks with a single class, enhancing its applicability in natural language processing projects.
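
On the rescaled near duplicate and outlier scores mentioned above: a rescaling can change the range of values without changing the ranking, as long as it is strictly increasing. The sketch below demonstrates this with a hypothetical transform (a square root); it is not the rescaling cleanlab applies, and the scores are invented. It assumes the cleanlab convention that lower scores indicate more severe issues.

```python
import math


def rank_order(scores):
    """Indices of data points sorted from most to least severe (lowest score first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Invented raw scores bunched near zero, which are hard to read at a glance.
raw_scores = [0.001, 0.0004, 0.02, 0.9, 0.0]

# Hypothetical strictly increasing rescaling; NOT the transform cleanlab uses.
rescaled_scores = [math.sqrt(score) for score in raw_scores]

# The values are spread out, but the ordering of data points is identical.
same_ranking = rank_order(raw_scores) == rank_order(rescaled_scores)
```

This is why code that only sorts or ranks data points by these scores is unaffected by the change, while code that compares scores against hard-coded thresholds may need to be revisited.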

New Contributors

We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:

@smttsp made their first contribution in #867
@abhijitpal1247 made their first contribution in #856
@01PrathamS made their first contribution in #893
@mglowacki100 made their first contribution in #796
@gibsonliketheguitar made their first contribution in #831
@kylegallatin made their first contribution in #885
@ryansingman made their first contribution in #919
@R-Peleg made their first contribution in #948

Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.

Change Log

Significant changes in this release include:

Update FAQ section in docs by @tataganesh in #869; @elisno in #913
Improve Object Detection module by @Steven-Yiran in #840, #877; @aditya1503 in #883, #969, #968
Clearer documentation/tutorials/readme by @jwmueller in #851, #931, #981, #983, #1001, #978, #994, #1010; @01PrathamS in #893; @elisno in #878, #1007, #992, #1015, #1016; @huiwengoh in #984; @sanjanag in #936; @tataganesh in #916; @ulya-tkch in #954
CI updates by @aditya1503 in #864; @elisno in #879, #961, #963, #965, #1008, #975, #1011, #1012, #1013, #1014; @jwmueller in #852, #865; @tataganesh in #900; @anishathalye in #956; @sanjanag in #1009
Docs system updates by @elisno in #880, #881, #958, #959, #960, #964
Add Null Issue Manager by @abhijitpal1247 in #856; @tataganesh in #927, #917
Add Data Valuation Issue Manager by @coding-famer in #850, #925
Extend non-iid issue check to run if only pred_probs are provided by @abhijitpal1247 in #857; @tataganesh in #896, #897
Add Underperforming Group Issue Manager by @tataganesh in #838, #907; @elisno in #990
Add Class Imbalance issue type to Datalab defaults by @tataganesh in #912, #933; @jwmueller in #924, #934; @elisno in #940
Add regression task to Datalab by @mglowacki100 in #796; @elisno in #902
Add multilabel task to Datalab by @tataganesh in #929
702 - Shorten Refs of classes and functions in Docs by @gibsonliketheguitar in #831
Update near duplicate issues and sets by @ryansingman in #919; @elisno in #895
Rescale near duplicate scores by @elisno in #943
Rescale outlier scores by @elisno in #953
List comprehension to numpy ops for efficiency by @tataganesh in #844
Reduce memory usage of filter.find_label_issues() by @kylegallatin in #885
Updates to tests by @aditya1503 in #945; @elisno in #985, #998
Refactor Datalab functionality by @elisno in #971, #1006
Minor fixes for Datalab by @elisno in #997, #999, #1000, #1003, #1005, #979
Drop Python 3.7 support and add Python 3.11 support by @elisno in #980
Add a show_all_issues optional argument to Datalab.report() by @elisno in #970
Single Class Span Classification Support by @Steven-Yiran in #982
Ensure near-predicted labels are not flagged as label issues by @aditya1503 in #950
PR template added and gitignore improved by @smttsp in #867
Update label issue count in dataset.health_summary() by @ulya-tkch in #875
Update segmentation.ipynb by @R-Peleg in #948
Refactor batching logic in cleanlab.segmentation.filter.find_label_issues by @elisno in #918

For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.
