```python
from prefix_sum import prefix_sum_cpu, prefix_sum_cuda

# input is a torch.cuda.IntTensor, num_elements is an integer;
# allocate the output array on the same CUDA device, e.g.
# output = torch.zeros((num_elements,), dtype=torch.int, device=torch.device('cuda'))
prefix_sum_cuda(input, num_elements, output)

# the CPU version works the same way, except that both
# input and output are torch.IntTensor
prefix_sum_cpu(input, num_elements, output)
```
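For reference, an exclusive scan writes at each index the sum of all elements strictly before it. A minimal pure-Python sketch of the semantics the extension is expected to reproduce (the function name `exclusive_scan` is illustrative, not part of the package):

```python
def exclusive_scan(xs):
    """Serial exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1]."""
    out, running = [], 0
    for x in xs:
        out.append(running)  # sum of everything before x
        running += x
    return out

print(exclusive_scan([3, 1, 7, 0, 4]))  # [0, 3, 4, 11, 11]
```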
## Original README
My implementation of parallel exclusive scan in CUDA, following the NVIDIA GPU Gems 3 chapter "Parallel Prefix Sum (Scan) with CUDA", whose abstract is quoted below.
> Parallel prefix sum, also known as parallel Scan, is a useful building block for many
> parallel algorithms including sorting and building data structures. In this document
> we introduce Scan and describe step-by-step how it can be implemented efficiently
> in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through
> more advanced techniques to obtain the best performance. We then explain how to
> scan arrays of arbitrary size that cannot be processed with a single block of threads.
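To make the "basic naïve algorithm" concrete: the naïve parallel scan (often called Hillis–Steele) runs ⌈log₂ n⌉ passes, and in the pass with offset *d* every element at index *i* ≥ *d* adds the element *d* positions to its left. A hedged pure-Python sketch where each list comprehension stands in for one parallel step (this models the algorithm's data flow, not the repository's actual CUDA kernel):

```python
def naive_inclusive_scan(xs):
    """Hillis-Steele scan: O(log n) steps, O(n log n) total additions."""
    a, offset = list(xs), 1
    while offset < len(a):
        # one "parallel" step: every position reads its neighbor `offset` to the left
        a = [a[i] + (a[i - offset] if i >= offset else 0) for i in range(len(a))]
        offset *= 2
    return a

def naive_exclusive_scan(xs):
    # convert to an exclusive scan: shift right and prepend the identity (0)
    return [0] + naive_inclusive_scan(xs)[:-1]

print(naive_exclusive_scan([3, 1, 7, 0, 4]))  # [0, 3, 4, 11, 11]
```

The work-efficient variant the chapter builds up to replaces this O(n log n)-addition scheme with an up-sweep/down-sweep over a balanced tree, doing only O(n) additions.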