Release v2.0.0 - Data-centric AI Ready · cleanlab/cleanlab · GitHub
v2.0.0 - Data-centric AI Ready
73f947e
If you liked cleanlab v1.0.1, v2.0.0 will blow your mind! 💥🧠
cleanlab 2.0 adds powerful new workflows and algorithms for data-centric AI, dataset curation, auto-fixing label issues in data, learning with noisy labels, and more. Nearly every module, method, parameter, and docstring has been touched by this release.
If you're coming from 1.0, here's a migration guide.
A few highlights of new functionalities in cleanlab 2.0:
- rank every data point by label quality
- find label issues in any dataset
- train any classifier on any dataset with label issues
- find overlapping classes to merge and/or delete at the dataset level
- compute an overall dataset health score
For an in-depth overview of what cleanlab 2.0 can do, check out this tutorial.
To help you get started with 2.0, we've added:
Change Log
This list is non-exhaustive! Assume every aspect of the API has changed.
Module name changes or moves:
- classification.LearningWithNoisyLabels class --> classification.CleanLearning class
- pruning.py --> filter.py
- latent_estimation.py --> count.py
- cifar_cnn.py --> experimental/cifar_cnn.py
- coteaching.py --> experimental/coteaching.py
- fasttext.py --> experimental/fasttext.py
- mnist_pytorch.py --> experimental/fmnist_pytorch.py
- noise_generation.py --> benchmarking/noise_generation.py
- util.py --> internal/util.py
- latent_algebra.py --> internal/latent_algebra.py
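As a rough aid for updating imports, the module moves listed above can be expressed as a lookup table. The sketch below is not part of cleanlab: `MODULE_MOVES` and `update_module_paths` are hypothetical names, the table covers only a subset of the moves, and it rewrites only dotted references such as `cleanlab.pruning`. It does not handle `from cleanlab import pruning`, and it does not rename functions.

```python
import re

# Old (1.x) module name -> new (2.0) module path, relative to the cleanlab package.
# Hypothetical helper table built from a subset of the moves in the change log.
MODULE_MOVES = {
    "pruning": "filter",
    "latent_estimation": "count",
    "noise_generation": "benchmarking.noise_generation",
    "util": "internal.util",
    "latent_algebra": "internal.latent_algebra",
    "cifar_cnn": "experimental.cifar_cnn",
    "coteaching": "experimental.coteaching",
    "fasttext": "experimental.fasttext",
}

# Match "cleanlab.<old_module>" as a whole dotted name. Longer names are tried
# first so that "latent_estimation" is never shadowed by a shorter alternative.
_OLD_PATH = re.compile(
    r"\bcleanlab\.("
    + "|".join(sorted(MODULE_MOVES, key=len, reverse=True))
    + r")\b"
)


def update_module_paths(source: str) -> str:
    """Rewrite dotted 1.x module paths in `source` to their 2.0 locations."""
    return _OLD_PATH.sub(lambda m: "cleanlab." + MODULE_MOVES[m.group(1)], source)


old_code = "from cleanlab.pruning import get_noise_indices\nimport cleanlab.util\n"
print(update_module_paths(old_code))
```

Running the rewrite twice gives the same result as running it once, because none of the new paths start with an old module name directly after `cleanlab.`. Renamed functions (for example `get_noise_indices`) still have to be updated by hand.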
Module Deletions:
- removed polyplex.py
- removed models/ --> (moved content to experimental/)
New module created:
- rank.py
  - moved all ranking and ordering functions from pruning.py/filter.py to here
- dataset.py
  - brand new module supporting methods for dealing with data-level issues
- benchmarking.py
  - Future benchmarking modules go here. Moved noise_generation.py here.
Method name changes:
- pruning.get_noise_indices() --> filter.find_label_issues()
- count.num_label_errors() --> count.num_label_issues()
Methods added:
- rank.py adds
  - two ranking functions to rank data based on label quality for the entire dataset (not just examples with label issues)
    - get_self_confidence_for_each_label()
    - get_normalized_margin_for_each_label()
- filter.py adds
  - two more methods added to filter.find_label_issues() (select the method using the filter_by parameter)
    - confident_learning, which has been shown to work very well and may become the default in the future, and
    - predicted_neq_given, which is useful for benchmarking a simple baseline approach, but underperforms the other filter_by methods
- classification.py adds
  - CleanLearning.get_label_issues()
    - for a canonical one line of code, use: CleanLearning().fit(X, y).get_label_issues()
    - no need to compute predicted probabilities in advance
  - CleanLearning.find_label_issues()
    - returns a dataframe with label issues (instead of just a mask)
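To make the two ranking scores and the predicted_neq_given baseline concrete, here is a dependency-free sketch. It is not cleanlab's implementation: the function names are illustrative, and the formulas are assumptions about the usual definitions (self-confidence as the predicted probability of the given label, normalized margin as that probability minus the largest probability among the other classes, rescaled to [0, 1]).

```python
def self_confidence(labels, pred_probs):
    """Score each example by the predicted probability of its given label."""
    return [probs[label] for label, probs in zip(labels, pred_probs)]


def normalized_margin(labels, pred_probs):
    """Score = p(given label) - max p(any other class), rescaled from [-1, 1] to [0, 1]."""
    scores = []
    for label, probs in zip(labels, pred_probs):
        best_other = max(p for k, p in enumerate(probs) if k != label)
        scores.append((probs[label] - best_other + 1) / 2)
    return scores


def predicted_neq_given(labels, pred_probs):
    """Baseline filter: flag an example when the argmax class differs from its label."""
    flags = []
    for label, probs in zip(labels, pred_probs):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        flags.append(predicted != label)
    return flags


def rank_lowest_first(scores):
    """Return example indices ordered from lowest score (most suspect) to highest."""
    return sorted(range(len(scores)), key=scores.__getitem__)


# Toy data: 3 examples, 3 classes. The third example is labeled 0,
# but the model assigns most probability to class 2.
labels = [0, 1, 0]
pred_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
]

print(rank_lowest_first(self_confidence(labels, pred_probs)))
print(rank_lowest_first(normalized_margin(labels, pred_probs)))
print(predicted_neq_given(labels, pred_probs))
```

Both scores are lower for more suspicious labels, so sorting in ascending order puts the likeliest label issues first. The baseline only yields a yes/no flag per example, which is why it is mainly useful as a point of comparison.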
Naming conventions changed in method names, comments, parameters, etc.
- s --> labels
- psx --> pred_probs
- label_errors --> label_issues
- noise_mask --> label_issues_mask
- label_errors_bool --> label_issues_mask
- prune_method --> filter_by
- prob_given_label --> self_confidence
- pruning --> filtering
Parameter re-ordering:
- re-ordered (labels, pred_probs) parameters to be consistent (in that order) in all methods.
- re-ordered parameters (e.g. frac_noise) in filter.find_label_issues()
Parameter changes:
- in order_label_issues()
  - param: sorted_index_method --> rank_by
- in find_label_issues()
  - param: sorted_index_method --> return_indices_ranked_by
  - param: prune_method --> filter_by
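When migrating call sites, the renames above can be applied mechanically to keyword arguments. The sketch below is a hypothetical helper, not something cleanlab provides: `translate_kwargs`, `COMMON_RENAMES`, and `PER_FUNCTION_RENAMES` are illustrative names. It renames argument names only; argument values (for example a ranking method passed as a string) are left untouched and may also need updating.

```python
# Renames that apply everywhere, taken from the naming-convention list.
COMMON_RENAMES = {
    "s": "labels",
    "psx": "pred_probs",
}

# Renames that depend on which function is being called.
PER_FUNCTION_RENAMES = {
    "order_label_issues": {
        "sorted_index_method": "rank_by",
    },
    "find_label_issues": {
        "sorted_index_method": "return_indices_ranked_by",
        "prune_method": "filter_by",
    },
}


def translate_kwargs(function_name, legacy_kwargs):
    """Map 1.x keyword-argument names to their 2.0 names for one function."""
    renames = {**COMMON_RENAMES, **PER_FUNCTION_RENAMES.get(function_name, {})}
    translated = {}
    for name, value in legacy_kwargs.items():
        new_name = renames.get(name, name)
        if new_name in translated:
            # e.g. both "s" and "labels" were passed; refuse to guess which wins.
            raise ValueError(f"argument {new_name!r} given more than once")
        translated[new_name] = value
    return translated


legacy_call = {
    "s": [0, 1],
    "psx": [[0.9, 0.1], [0.2, 0.8]],
    "prune_method": "prune_by_class",
    "sorted_index_method": "normalized_margin",
}
print(translate_kwargs("find_label_issues", legacy_call))
```

Passing arguments by keyword after translation also sidesteps the parameter re-ordering in 2.0, since positional calls written for 1.x may no longer line up with the new (labels, pred_probs) order.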
Global variables changed:
- filter.py
  - Only require 1 example to be left in each class
  - MIN_NUM_PER_CLASS = 5 --> MIN_NUM_PER_CLASS = 1
  - enables cleanlab to work for toy-sized datasets
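The effect of a per-class floor can be illustrated with a small sketch. This is not cleanlab's code: `cap_removals` is a hypothetical function showing the general idea that flagged examples are only removed while at least a minimum number of examples remains in their class, which is why a floor of 5 can block all filtering on a toy dataset.

```python
from collections import Counter

MIN_NUM_PER_CLASS = 1  # value in 2.0 according to the change log (was 5)


def cap_removals(labels, flagged, min_per_class=MIN_NUM_PER_CLASS):
    """Return the flagged indices that can be removed while keeping
    at least `min_per_class` examples in every class.

    `flagged` is assumed to be ordered from most to least suspect.
    """
    class_counts = Counter(labels)
    removed_per_class = Counter()
    to_remove = []
    for index in flagged:
        label = labels[index]
        remaining = class_counts[label] - removed_per_class[label]
        if remaining > min_per_class:
            removed_per_class[label] += 1
            to_remove.append(index)
    return to_remove


# Toy dataset: three examples of class 0, two of class 1, all flagged.
labels = [0, 0, 0, 1, 1]
flagged = [0, 1, 2, 3, 4]

print(cap_removals(labels, flagged, min_per_class=1))
print(cap_removals(labels, flagged, min_per_class=5))
```

With a floor of 1, one example per class survives; with a floor of 5, no class in this five-example dataset is large enough for anything to be removed.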
Dependencies added
- pandas=1.0.0
Way-too-detailed Change Log
- convert readme to markdown for pypi release. by @cgnorthcutt in #126
- Add EditorConfig by @anishathalye in #129
- Major API change. Introducing Cleanlab 2.0 by @cgnorthcutt in #128
- Standardize code style to Black by @anishathalye in #107
- Redirect RTD site to docs.cleanlab.ai by @weijinglok in #130
- Redirect RTD site to docs.cleanlab.ai: Part 2 by @weijinglok in #132
- Add image classification tutorial and streamline docs CI/CD by @weijinglok in #127
- remove redundant text by @jwmueller in #134
- Remove extra slashes in docs relative path by @weijinglok in #135
- Fix docs TOC for v2.0 by @weijinglok in #136
- Add label quality scoring functions and user API to choose the method by @JohnsonKuan in #131
- Change cleanlab version in Image Tutorial by @weijinglok in #138
- Utilites -> internal submodule refactor by @jwmueller in #141
- Fix NumPy deprecation warning by @anishathalye in #142
- Remove unnecessary print statement by @anishathalye in #145
- Add explanation that estimators must be clonable by @anishathalye in #146
- Update doc site quickstart page to reflect v2.0 API by @weijinglok in #143
- Fix sklearn estimator cloning by @anishathalye in #144
- Update default label quality scoring method to self_confidence by @JohnsonKuan in #147
- Allow n-dim data in LearningWithNoisyLabels by @anishathalye in #148
- Improve user-control by @jwmueller in #149
- Enable use of find_label_issues_kwargs for hyper-parameter search by @JohnsonKuan in #152
- Add fix and test for sklearn GridSearchCV with LearningWithNoisyLabels by @JohnsonKuan in #153
- Add tutorial for tabular data classification by @weijinglok in #151
- Minor tutorial edits by @jwmueller in #155
- Add Python 3.10 to CI by @anishathalye in #160
- Add development guide by @anishathalye in #164
- Add text classification tutorial by @weijinglok in #154
- Add audio tutorial to doc site by @weijinglok in #165
- Add overview for computing out-of-sample predicted probabilities with cross-validation to doc site by @weijinglok in #166
- Add CI check that .ipynb outputs are empty by @anishathalye in #169
- Add CI check for trailing newlines in notebooks by @anishathalye in #170
- Improve image tutorial accuracy and finding better label errors by @weijinglok in #167
- Remove unnecessary version warning by @anishathalye in #162
- Add test to check examples are found by cleanlab by @weijinglok in #172
- Various tutorial improvements by @jwmueller in #173
- Deploys docs only if triggered by master branch by @weijinglok in #175
- added LearningWithNoisyLabels.find_label_issues instance method by @jwmueller in #157
- Add note on EditorConfig to development guide by @anishathalye in #176
- CleanLearning = Machine Learning with cleaned data by @cgnorthcutt in #177
- Simple fix to Issue 158 (and potentially other issues) by @cgnorthcutt in #178
- Update docs README by @weijinglok in #180
- Polish the APIs and file-structure to prepare for 2.0 release by @jwmueller in #181
- Make EditorConfig match Jupyter for notebooks by @anishathalye in #179
- Add figure to out-of-sample pred proba via cv tutorial by @weijinglok in #183
- Move rest of example_models/ -> experimental/ by @jwmueller in #184
- Fix typo in test by @anishathalye in #186
- Add more ergonomic method to skip notebooks by @anishathalye in #185
- Add link checking for all Markdown files by @anishathalye in #187
- Add link checking for compiled docs by @anishathalye in #188
- Minor improvements to count docstring by @cgnorthcutt in #190
- Add a GitHub icon and link below the docs' project title by @weijinglok in #192
- Add self.labels as attribute of FastTextClassifier by @JohnsonKuan in #194
- Move noise_generation into benchmarking module by @anishathalye in #196
- Remove import of internal package by @anishathalye in #195
- more specific filter-warning by @jwmueller in #193
- Remove deprecated functions by @anishathalye in #197
- Docs cleanup by @anishathalye in #189
- Introducing the new Dataset Module for cleanlab 2.0 by @cgnorthcutt in #182
- Readme reformat by @jwmueller in #198
- Returns DataFrame type from CleanLearning functions by @jwmueller in #199
- Migration guide for v2 by @jwmueller in #200
- set verbose default false. fix order of printing by @cgnorthcutt in #204
- Add from . import dataset to init by @cgnorthcutt in #205
- Make minor doc tweaks by @anishathalye in #203
- Fix error in docstring. missing item in tuple by @cgnorthcutt in #206
- fix broken printing of matrices by @cgnorthcutt in #207
- Add in-depth tutorial [WIP] by @weijinglok in #208
- Revise migration guide by @anishathalye in #209
- Tutorial header-levels decreased by @jwmueller in #210
- Dark plots by @jwmueller in #211
- bug fixes and jupyter notebook support added by @cgnorthcutt in #212
- Add spaces at the end of doc side nav toc by @weijinglok in #213
- Clarify and fix several docstrings. by @cgnorthcutt in #214
- Configure display setting of Dataframe in 2.0 tutorial by @JohnsonKuan in #215
- Formats dataset health tutorial by @jwmueller in #216
- final tutorial edits. dataset docstrings imp by @cgnorthcutt in #217
- Deploy docs when new release is tagged by @weijinglok in #219
- Hardcode v1.0.1 hyperlink in in doc site by @weijinglok in #221
- Make dataset tutorial runnable on website docs, improve pulldown formatting by @jwmueller in #220
- Final docs polishing patches by @jwmueller in #223
- Bump version to 2.0.0 by @jwmueller in #222
- fix black formatting compliance by @cgnorthcutt in #224
- Update readme links for v2.0 by @jwmueller in #225
- Remove unneeded alt text by @anishathalye in #228
- Add tutorial toc page by @weijinglok in #230
- Proofreading README.md by @calebchiam in #226
- Generalize text tutorial to multiclass datasets by @calebchiam in #229
- Make small fixes by @anishathalye in #235
- Address tutorial points of confusion by @jwmueller in #233
New Contributors
- @JohnsonKuan made their first contribution in #131
- @calebchiam made their first contribution in #226
Full Changelog: v1.0.1...v2.0.0