[inductor] refine loop split logic #128812
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128812
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 2ad22b1 with merge base 740d1eb. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR aims to improve parallelization by collapsing the vectorized loop (#122281). For a case like the one below, the parallel level is only `2`, and the vectorized loop cannot be collapsed because the inner `x1` dimension is split into a vectorized main loop and a scalar tail loop:

```
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
        tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
    }
    #pragma omp simd simdlen(8)
    for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L))
    {
        auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
        out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
    }
}
```

After this PR, we generate code like the following. The inner loop now covers the full extent of `x1`, with range checks selecting the main and tail parts, so the two loops collapse into `2 * 12500 = 25000` parallel iterations instead of `2`:

```
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L))
    {
        if (x1 >= 0 && x1 < 199984)
        {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
        }
        if (x1 >= 199984 && x1 < 199985)
        {
            auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
            out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
        }
    }
}
```

### Highlight

For the reduction case, there is a side effect. In the case below, we vectorize along the `x1` dim and reduce along the `x2` dim:

```
#pragma omp for
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
            }
            [&]
            {
                __at_align__ std::array<float, 8> tmpbuf;
                tmp_acc0_vec.store(tmpbuf.data(), 8);
                #pragma GCC unroll 8
                for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                }
            }();
        }
    }
    #pragma omp simd simdlen(4)
    for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17L*x2) + (306L*x0))];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            out_ptr1[static_cast<int64_t>(x0 + (39L*x1))] = tmp_acc0;
        }
    }
}
```

After collapsing, the loop order becomes `x1 -> x2 -> x1_tail_part`, so we need a `tmp_acc_arr` to hold the reduction result for `x1_tail_part`. For the `reduction_stores`, we also need to check the value of `x1`, as we do in the `loopbody`, since the `reduction_stores` happen between the `x1` and `x2` loops:
```
#pragma omp for collapse(2)
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L))
    {
        {
            // need an array to hold the acc result for the tail part
            float tmp_acc0_arr[8];
            for (int i = 0; i < 8; i++)
            {
                tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity();
            }
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
                    {
                        for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                        {
                            auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17L*x2) + (306L*x0))];
                            tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0);
                        }
                    }
                }
            }
            // reduction stores
            if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
            {
                [&]
                {
                    __at_align__ std::array<float, 8> tmpbuf;
                    tmp_acc0_vec.store(tmpbuf.data(), 8);
                    #pragma GCC unroll 8
                    for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                    {
                        out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                    }
                }();
            }
            if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
            {
                for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)];
                }
            }
        }
    }
}
```
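For reference, a reduction that exercises this path might look like the sketch below. The `(39, 18, 17)` shape and the `amax` over the middle dim are inferred from the bounds and strides of the generated kernel above, so treat this as an illustration rather than the exact case behind that kernel:

```python
# Hypothetical repro: max over the middle dim of a (39, 18, 17) float tensor.
# The trailing dim (17) is not a multiple of the vector width (8), so the
# vectorized x1 loop has a scalar tail that needs its own accumulator once
# the x0 and x1 loops are collapsed.
import torch

def reduction_fn(x):
    return x.amax(dim=1)

x = torch.randn(39, 18, 17)
compiled = torch.compile(reduction_fn)
torch.testing.assert_close(compiled(x), reduction_fn(x))
```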
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @voznesenskym @penguinwu @EikanWang @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @peterbell10
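For completeness, the first (non-reduction) kernel in the description can arise from a pointwise op whose output is a padded copy of its input, with only a tiny outer dim to parallelize over. A hypothetical sketch, with shapes read off the strides in that kernel rather than taken from #122281:

```python
# Hypothetical repro: copy a (2, 199985) int64 tensor into a (2, 209985)
# padded output. Without collapsing, only the outer dim of size 2 can be
# distributed across threads.
import torch
import torch.nn.functional as F

def pointwise_fn(x):
    return F.pad(x, (0, 10000))  # pad the last dim on the right

x = torch.randint(0, 10, (2, 199985), dtype=torch.int64)
compiled = torch.compile(pointwise_fn)
torch.testing.assert_close(compiled(x), pointwise_fn(x))
```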
Overall LGTM now except for a small nit on the code simplification.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#128812
Approved by: https://github.com/jgong5