[MRG+2] Add float32 support for Linear Discriminant Analysis #13273
Conversation
sklearn/discriminant_analysis.py
Outdated
@@ -485,9 +485,10 @@ def fit(self, X, y):
raise ValueError("unknown solver {} (valid solvers are 'svd', "
"'lsqr', and 'eigen').".format(self.solver))
if self.classes_.size == 2:  # treat binary case as a special case
self.coef_ = np.array(self.coef_[1, :] - self.coef_[0, :], ndmin=2)
my_type = np.float32 if (X.dtype == np.float32) else np.float64
This line is necessary when X.dtype is int32 or int64.
Should we keep the number of bits short, i.e. perform the cast from int32 to float32, or from int32 to float64?
ping @glemaitre
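To make the two options concrete, here is a small NumPy-only illustration (the array is made up; only the resulting dtypes and the precision note matter):

import numpy as np

X_int32 = np.arange(6, dtype=np.int32).reshape(3, 2)

# Option A: keep the number of bits short (int32 -> float32)
X_short = X_int32.astype(np.float32)
# Option B: promote to double precision (int32 -> float64)
X_double = X_int32.astype(np.float64)

print(X_short.dtype, X_double.dtype)  # float32 float64
# Note: float32 represents integers exactly only up to 2**24, so option A can
# lose precision for large int32 values, while option B is always exact.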
sklearn/discriminant_analysis.py
Outdated
@@ -485,9 +485,10 @@ def fit(self, X, y):
raise ValueError("unknown solver {} (valid solvers are 'svd', "
"'lsqr', and 'eigen').".format(self.solver))
if self.classes_.size == 2:  # treat binary case as a special case
self.coef_ = np.array(self.coef_[1, :] - self.coef_[0, :], ndmin=2)
my_type = np.float32 if (X.dtype in [np.float32, np.int32]) else np.float64
You should already have called check_X_y and/or as_float_array above. Then you can just use X.dtype to specify the dtype of the arrays you allocate.
Indeed, and the check_X_y call at line 430 probably needs to be changed to dtype=[np.float32, np.float64].
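A rough sketch of what the two comments above suggest (the helper name is illustrative, and the dtype list uses the order that casts integer input to float64, the convention the thread converges on below):

import numpy as np
from sklearn.utils import check_X_y

def _validate_for_lda(X, y):
    # float32/float64 inputs are kept unchanged; any other dtype (e.g. int32,
    # int64) is cast to the first entry of the list, i.e. np.float64.
    return check_X_y(X, y, ensure_min_samples=2,
                     dtype=[np.float64, np.float32])

X = np.arange(8, dtype=np.int32).reshape(4, 2)
y = np.array([0, 0, 1, 1])
X, y = _validate_for_lda(X, y)

# Downstream allocations can then simply reuse the validated dtype:
coef = np.zeros((1, X.shape[1]), dtype=X.dtype)
print(X.dtype, coef.dtype)  # float64 float64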
@@ -295,6 +296,27 @@ def test_lda_dimension_warning(n_classes, n_features):
"min(n_features, n_classes - 1).")
assert_warns_message(FutureWarning, future_msg, lda.fit, X, y)

@pytest.mark.parametrize("data_type, expected_type",[
(np.float32, np.float32), (np.float64, np.float64), (np.int32, np.float32), (np.int64, np.float64)])
pep8 violations, please check.
No additional comments on top of those @agramfort and I made above.
sklearn/discriminant_analysis.py
Outdated
@@ -427,7 +427,8 @@ def fit(self, X, y):
Target values.
"""
# FIXME: Future warning to be removed in 0.23
X, y = check_X_y(X, y, ensure_min_samples=2, estimator=self)
X = as_float_array(X)
This line is necessary to get coefficients of type float32 output when X is of type int32. Removing this line gives coefficients of type float64.
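For context, a quick check of as_float_array's dtype mapping (assuming the current sklearn.utils behaviour, where small-width integer types are promoted to float32 and wider types to float64):

import numpy as np
from sklearn.utils import as_float_array

print(as_float_array(np.ones((2, 2), dtype=np.int32)).dtype)    # float32
print(as_float_array(np.ones((2, 2), dtype=np.int64)).dtype)    # float64
print(as_float_array(np.ones((2, 2), dtype=np.float32)).dtype)  # float32 (unchanged)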
This is strange because the next check_X_y will convert to float anyway. It seems unnecessary, doesn't it?
Oh I see why. So X will be converted to float64, and you expect it to be converted to float32 because X is of type int32. I would say that there is no real reason for that: int32 can default to float64. I am fine with that.
@massich what is the mechanism in the other estimators that you modified?
There was no mechanism set. I have a vague memory that we talked about it long ago, but I can't recall that we agreed on anything. To me it makes sense that if someone deliberately creates the data in 32 bits, they want to keep it in 32 bits. But if such a user exists, they will complain to us, or would have done so already. So I have no objection to defaulting to float64.
An int32 is a "very big" int, hence it makes sense to convert it to a float64. People who want to save memory with ints use int16. Indeed, an int covers a larger span of exactly representable integers than a float of the same memory cost.
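A quick NumPy illustration of that precision argument: float32 stops representing consecutive integers exactly beyond 2**24, which is well inside the int32 range, whereas float64 is exact for every int32 value.

import numpy as np

big = np.int32(2**24 + 1)
print(np.float32(big) == big)  # False: 16777217 rounds to 16777216.0 in float32
print(np.float64(big) == big)  # True: float64 represents all int32 values exactly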
CI is failing.
It looks good. Could you add an entry in the what's new? We changed the behavior of LDA.
sklearn/discriminant_analysis.py
Outdated
@@ -19,7 +19,7 @@
from .linear_model.base import LinearClassifierMixin
from .covariance import ledoit_wolf, empirical_covariance, shrunk_covariance
from .utils.multiclass import unique_labels
from .utils import check_array, check_X_y
from .utils import check_array, check_X_y, as_float_array
Suggested change:
- from .utils import check_array, check_X_y, as_float_array
+ from .utils import check_array, check_X_y
+ from .utils import as_float_array
@@ -296,6 +297,29 @@ def test_lda_dimension_warning(n_classes, n_features):
assert_warns_message(FutureWarning, future_msg, lda.fit, X, y)


@pytest.mark.parametrize("data_type, expected_type", [
(np.float32, np.float32), (np.float64, np.float64), (np.int32, np.float32),
Convention for int32 is casting to float64.
@thibsej Could you add an entry to the what's new?
@@ -296,6 +297,29 @@ def test_lda_dimension_warning(n_classes, n_features):
assert_warns_message(FutureWarning, future_msg, lda.fit, X, y)


@pytest.mark.parametrize("data_type, expected_type", [
(np.float32, np.float32), (np.float64, np.float64), (np.int32, np.float64),
(np.int64, np.float64)])
I find this one more readable:
@pytest.mark.parametrize("data_type, expected_type", [
    (np.float32, np.float32),
    (np.float64, np.float64),
    (np.int32, np.float64),
    (np.int64, np.float64)
])
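Pieced together from the snippets quoted in this thread, the test under discussion presumably looks roughly like this (the toy dataset and the choice of the svd solver are illustrative, not copied from the PR):

import numpy as np
import pytest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


@pytest.mark.parametrize("data_type, expected_type", [
    (np.float32, np.float32),
    (np.float64, np.float64),
    (np.int32, np.float64),
    (np.int64, np.float64)
])
def test_lda_dtype_match(data_type, expected_type):
    # Tiny two-class problem; the exact values do not matter for the dtype check.
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    y = np.array([1, 1, 1, 2, 2, 2])
    clf = LinearDiscriminantAnalysis(solver="svd")
    clf.fit(X.astype(data_type), y)
    assert clf.coef_.dtype == expected_type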
# Check value consistency between types
rtol = 1e-6
assert_allclose(clf_32.coef_, clf_64.coef_.astype(np.float32),
                rtol=rtol)
Do not use astype; assert_allclose should be able to handle it:
assert_allclose(clf_32.coef_, clf_64.coef_, rtol=rtol)
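A tiny sanity check (made-up arrays) that assert_allclose copes with mixed float32/float64 operands without any explicit cast:

import numpy as np
from numpy.testing import assert_allclose

a32 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
a64 = a32.astype(np.float64) + 1e-8  # small float64 perturbation
assert_allclose(a32, a64, rtol=1e-6)  # passes, no astype needed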
You should search for this section:

:mod:`sklearn.discriminant_analysis`
....................................

- |Enhancement| :class:`discriminant_analysis.LinearDiscriminantAnalysis` now
  preserves ``float32`` and ``float64`` dtypes. :issue:`8769` and
  :issue:`11000` by :user:`Thibault Sejourne <thibsej>`
LGTM, once the nitpicks are addressed.
LGTM once the tests pass.
@pytest.mark.parametrize("data_type, expected_type", [
    (np.float32, np.float32),
    (np.float64, np.float64),
    (np.int32, np.float64),
Going from int32 to float64 is not what was done in the isotonic regression isofused branch this morning. We should be consistent here.
I am +1 for int32 -> float64. It was the semantic for check_array.
@ogrisel was also for this conversion. I would think that we should correct the isotonic regression then.
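For reference, the check_array semantic being referred to, as I understand it: when dtype is given as a list, an input whose dtype is already in the list is kept, and anything else is cast to the first entry.

import numpy as np
from sklearn.utils import check_array

X_int32 = np.ones((3, 2), dtype=np.int32)
X_f32 = np.ones((3, 2), dtype=np.float32)

print(check_array(X_int32, dtype=[np.float64, np.float32]).dtype)  # float64 (cast to first entry)
print(check_array(X_f32, dtype=[np.float64, np.float32]).dtype)    # float32 (kept as-is)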
+1. Merging. This is a net improvement. We'll nitpick later :D
We should almost have a common test.
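Such a common test does not exist at this point; a purely hypothetical sketch of what it could check (the helper name and the restriction to coef_ are invented for illustration, not scikit-learn API):

import numpy as np
from sklearn.base import clone


def check_dtype_preservation(estimator, attributes=("coef_",)):
    # Hypothetical common check: fitting on float32 (resp. float64) data
    # should produce fitted arrays of the same precision for the listed
    # attributes.
    rng = np.random.RandomState(0)
    X = rng.rand(20, 3)
    y = rng.randint(0, 2, size=20)
    for dtype in (np.float32, np.float64):
        est = clone(estimator).fit(X.astype(dtype), y)
        for name in attributes:
            assert getattr(est, name).dtype == dtype, name

A real version would probably need per-estimator attribute lists, since attributes derived from y (class priors, for instance) legitimately stay float64.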
Thanks a lot. @thibsej welcome to the family of sklearn contributors.
…On Wed, Feb 27, 2019, 10:11 Gael Varoquaux ***@***.***> wrote:
Merged #13273 <#13273> into master.
…learn#13273)
* [skip ci] Empty commit to trigger PR
* Add dtype testing
* Fix: dtype testing
* Fix test_estimators[OneVsRestClassifier-check_estimators_dtypes]
* TST refactor using parametrize + Add failing test for int32
* Fix for int32
* Fix code according to review + Fix PEP8 violation
* Fix dtype for int32 and complex
* Fix pep8 violation
* Update whatsnew + test COSMIT
…cikit-learn#13273)" This reverts commit 2c98ed6.
Reference Issues/PRs
This PR works on #11000 by preserving the dtype in LDA.
cross reference: #8769 (comment)
What does this implement/fix? Explain your changes.
Any other comments?
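For readers landing here later, a minimal usage sketch of the behaviour this PR adds (toy data; the coefficient dtype now follows the input dtype):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]],
             dtype=np.float32)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.coef_.dtype)  # float32: preserved instead of being upcast to float64

clf64 = LinearDiscriminantAnalysis().fit(X.astype(np.float64), y)
print(clf64.coef_.dtype)  # float64, as before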