This is not the final optimization. At a minimum, we should use AVX masks here too.
But this change is complex enough already, so the rest will go into follow-up PR(s).
I also noticed that the Both_val 8-bit and 16-bit cases are slow.
Vectorization is not engaged for them; that is a separate issue from the AVX work.
With the extra dispatcher, the inlining decisions are different. The dispatcher is now inlined into the exported functions, along with the scalar implementation. The vector implementations are tail-called, and signature variations are unlikely to prevent that.
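For readers who want to picture the call shape, here is a rough sketch of that layout. It is not the code from this PR: `max_value_i32`, `max_value_dispatch`, `max_value_vector`, the `g_use_avx2` flag, and the 8-element threshold are all hypothetical names and values chosen for illustration, and the "vector" function is a plain loop standing in for real AVX2 code so the sketch builds anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Hypothetical feature flag; a real library would set it once from CPUID.
static bool g_use_avx2 = false;

// Stand-in for the vector implementation (assumption: a real one would use
// AVX2 intrinsics). Kept out of line so the call below can be a tail call.
SKETCH_NOINLINE std::int32_t max_value_vector(const std::int32_t* first,
                                              const std::int32_t* last) {
    std::int32_t best = *first;
    for (const std::int32_t* p = first + 1; p != last; ++p) {
        if (*p > best) {
            best = *p;
        }
    }
    return best;
}

// Dispatcher: small enough to be inlined into the exported function.
static inline std::int32_t max_value_dispatch(const std::int32_t* first,
                                              const std::int32_t* last) {
    // Hypothetical threshold: only bother with the vector path for >= 8 elements.
    if (g_use_avx2 && last - first >= 8) {
        // Call in tail position with a matching signature, so the compiler
        // is free to emit a jump instead of call + return.
        return max_value_vector(first, last);
    }

    // Scalar implementation, inlined here together with the dispatcher.
    std::int32_t best = *first;
    for (const std::int32_t* p = first + 1; p != last; ++p) {
        if (*p > best) {
            best = *p;
        }
    }
    return best;
}

// Exported entry point; precondition: the range is non-empty.
std::int32_t max_value_i32(const std::int32_t* first, const std::int32_t* last) {
    assert(first != last);
    return max_value_dispatch(first, last);
}
```

Whether the compiler actually emits a tail call depends on the optimization level and calling convention, so treat the comment in the dispatcher as the intent rather than a guarantee.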
Resolves #2803
Benchmark results
Before:

After: