Releases: NVIDIA/warp
v1.10.1
Warp v1.10.1
Warp v1.10.1 is a bugfix release following v1.10.0. For a complete list of changes, see the changelog.
Highlights
This is primarily a bugfix release with no major new features. Key fixes include:
- Module reuse with module="unique": Fixed kernels using @wp.kernel(module="unique") to properly reuse existing module objects when the kernel is defined multiple times, avoiding unnecessary module creation overhead.
- Kernel-local arrays: Fixed several issues with arrays created using wp.zeros() inside kernels, including .ptr access, indexing for subarrays, and accepting single integers for the shape parameter.
- Custom gradients: Fixed a code-generation ordering bug that could prevent custom gradient functions (@wp.func_grad) from compiling when used with nested function calls.
- FEM improvements: Fixed invalid reads when using wp.fem.TemporaryStore during tape capture and resolved reference cycles in wp.fem.Temporary and wp.fem.ShapeBasisSpace.
Announcements
Upcoming removals
The following feature is deprecated and will be removed in v1.11 (planned for January 2026):
- graph_compatible parameter in jax_callable(): The boolean graph_compatible flag has been deprecated in favor of the new graph_mode parameter, which accepts GraphMode enum values. Use GraphMode.JAX, GraphMode.WARP, or GraphMode.NONE instead.

# Deprecated (v1.10.1, will be removed in v1.11)
callable = wp.jax_experimental.jax_callable(func, graph_compatible=True)

# Use instead
from warp.jax_experimental import GraphMode
callable = wp.jax_experimental.jax_callable(func, graph_mode=GraphMode.JAX)
Platform support
- Python 3.8: We plan to drop support for Python 3.8 (end-of-life since October 2024) starting with v1.11.
- CUDA Toolkit: Starting with v1.11, the default pre-built wheels published on PyPI will be built with CUDA Toolkit 12.9 instead of 12.8. This does not change driver requirements but enables new compiler options to control the tradeoff between kernel compilation speed and runtime performance. We plan a second transition to CUDA Toolkit 13.x in mid-2026.
Acknowledgments
We thank the following contributors:
- @mehdiataei for fixing loop unrolling with wp.static() expressions that prevented certain code patterns from compiling correctly.
v1.10.0
Warp v1.10.0
Warp v1.10 expands JAX integration with automatic differentiation support and multi-device jax.pmap() compatibility. The tile programming model has been enhanced with axis-specific reductions, component-level indexing, and convenience functions for creating tiles.
Performance has been significantly improved in several areas: BVH operations now support in-place rebuilding for CUDA graphs and configurable leaf sizes, built-in function calls from Python are up to 70× faster, and additional sparse matrix and FEM operations can now be captured in CUDA graphs.
Additional usability improvements include negative indexing and slicing for arrays, atomic bitwise operations, and new built-in functions including error functions and type casting.
Important: This release removes the warp.sim module (deprecated since v1.8), which has been superseded by the Newton physics engine. See the Announcements section below for migration guidance and other upcoming changes.
For a complete list of changes, see the full changelog.
New features
JAX automatic differentiation (experimental)
Warp now supports experimental automatic differentiation with JAX, allowing kernels to participate in JAX automatic differentiation workflows. This feature is contributed by @mehdiataei and builds on earlier work by @jaro-sevcik. It enables computing gradients through Warp kernels using jax.grad() by passing enable_backward=True to jax_kernel().
Key capabilities include:
- Single and multiple output kernels: Compute gradients for kernels with one or more output arrays
- Static input auto-detection: Scalar inputs are automatically treated as static (non-differentiable) arguments
- Vector and matrix arrays: Arrays of composite types like wp.vec2 or wp.mat22 are fully supported
- Multi-device execution: Compatible with jax.pmap() for distributed forward and backward passes across multiple GPUs
import jax
import warp as wp
from warp.jax_experimental import jax_kernel

@wp.kernel
def my_kernel(a: wp.array(dtype=float), out: wp.array(dtype=float)):
    i = wp.tid()
    out[i] = a[i] ** 2.0

# Enable automatic differentiation
jax_func = jax_kernel(my_kernel, num_outputs=1, enable_backward=True)

# Compute gradients through the kernel
grad_fn = jax.grad(lambda a: jax.numpy.sum(jax_func(a)[0]))
gradient = grad_fn(input_array)  # gradient: [2*a[0], 2*a[1], ...]

This feature is experimental and has some current limitations. See the JAX Automatic Differentiation documentation for complete examples, usage details, and limitations.
Multi-device JAX support with jax.pmap()
Warp now properly supports jax.pmap() and jax.shard_map() for multi-device parallel execution, thanks to fixes contributed by @chaserileyroberts. Previously, device targeting issues prevented Warp callables from working correctly within these JAX primitives: JAX would invoke callbacks from multiple threads targeting different devices, but Warp would always execute on the default device. The fix extracts device ordinals from XLA FFI and adds thread synchronization for concurrent callbacks, so each callback runs on the device JAX intended. This enables efficient data-parallel workflows across multiple GPUs.
In-place BVH rebuilding with CUDA graph support
A new wp.Bvh.rebuild() method enables rebuilding BVH hierarchies in-place without allocating new memory. This complements the existing refit() method and is particularly useful when primitive distributions change significantly.
CUDA graph capture: Unlike creating a new BVH, rebuild() reuses existing buffers, making it safe to capture in CUDA graphs. Previously captured graphs that include queries on the BVH remain valid after rebuilding, enabling high-performance repeated updates without graph re-capture overhead.
Construction algorithms: On CUDA devices, in-place rebuild supports "lbvh" only. On CPU, "sah" and "median" are supported. Defaults are chosen automatically based on the device.
Tile programming enhancements
The tile programming model has been enhanced with new capabilities to make tile-based computations more expressive and convenient:
Axis-specific reductions
The tile-reduction functions wp.tile_reduce() and wp.tile_sum() now support an optional axis parameter, enabling reductions along a specific dimension of a tile rather than reducing the entire tile to a single value. This enhancement brings NumPy-like axis semantics to tile operations.
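As a plain-Python reference for the axis semantics (a minimal sketch that does not use Warp; the helper name reduce_axis is hypothetical), reducing a 4×8 tile along axis 0 collapses the rows and leaves one value per column, while axis 1 leaves one value per row:

```python
def reduce_axis(rows, axis):
    """Sum a 2D list of numbers along one axis (0 = down columns, 1 = across rows)."""
    if axis == 0:
        return [sum(col) for col in zip(*rows)]
    if axis == 1:
        return [sum(row) for row in rows]
    raise ValueError("axis must be 0 or 1 for a 2D tile")


# A 4x8 tile holding 0..31 in row-major order.
tile = [[float(r * 8 + c) for c in range(8)] for r in range(4)]

col_sums = reduce_axis(tile, axis=0)  # shape (4, 8) -> (8,)
row_sums = reduce_axis(tile, axis=1)  # shape (4, 8) -> (4,)
print(col_sums)
print(row_sums)
```

This only models the shapes and values involved; it says nothing about how Warp distributes the reduction across threads.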
import numpy as np
import warp as wp

@wp.kernel
def tile_reduce_axis(x: wp.array2d(dtype=float), y: wp.array(dtype=float)):
    a = wp.tile_load(x, shape=(4, 8), storage="shared")
    # Sum along axis 0, reducing shape from (4, 8) to (8,)
    b = wp.tile_sum(a, axis=0)
    wp.tile_store(y, b)

x = wp.array(np.arange(32).reshape(4, 8), dtype=float)
# x = [[ 0.  1.  2.  3.  4.  5.  6.  7.]
#      [ 8.  9. 10. 11. 12. 13. 14. 15.]
#      [16. 17. 18. 19. 20. 21. 22. 23.]
#      [24. 25. 26. 27. 28. 29. 30. 31.]]
y = wp.zeros(8, dtype=float)
wp.launch_tiled(tile_reduce_axis, dim=(1,), inputs=[x], outputs=[y], block_dim=32)
# y = [48. 52. 56. 60. 64. 68. 72. 76.] (column sums)

Component-level indexing
Tiles of composite types (vectors, matrices, quaternions) now support component-level indexing and assignment. You can directly index into individual components using extended indexing syntax:
- Vector components: tile[i][1] extracts the second component of a vector at position i
- Matrix elements: tile[i][1, 1] accesses the element at row 1, column 1 of a matrix at position i
This provides more convenient and expressive syntax for working with structured data in tiles.
Creating tiles filled with a constant value
The new wp.tile_full() function provides a convenient way to create tiles initialized with a constant value, similar to NumPy's np.full():
# Create an 8x8 tile filled with 3.14
tile = wp.tile_full(shape=(8, 8), value=3.14, dtype=float)

New example
The new example_tile_mcgp.py example demonstrates tile-based Monte Carlo methods by implementing a walk-on-spheres algorithm for solving Laplace's equation on volumetric domains.
Performance improvements
Built-in function calls from Python
Calling Warp built-in functions from Python scope (e.g., wp.normalize(), wp.transform_identity(), matrix arithmetic like mat * mat) is now significantly faster thanks to optimizations in overload resolution. Previously, each function call would iterate through all overloads, attempt argument binding, and pack parameters into C types until finding a match. Now, Warp caches the resolved overload and parameter packing strategy based on argument types using @functools.lru_cache, eliminating redundant resolution overhead on subsequent calls.
In microbenchmarks, repeated wp.mat44 multiplication at Python scope is up to 70× faster (~570 μs → ~8 μs), while operations like wp.transform_identity() see 3-4× speedups (~100 μs → ~30 μs). The magnitude of improvement varies by operation complexity, with greater gains for operations requiring more expensive overload resolution.
Breaking change: As part of this optimization, support for passing lists, tuples, and other non-Warp array arguments to built-in functions has been removed. Calls like wp.normalize([1.0, 2.0, 3.0]) must now be written as wp.normalize(wp.vec3(1.0, 2.0, 3.0)). This simplifies the function call path and removes expensive sequence-flattening logic that was incompatible with efficient caching.
Configurable BVH leaf size
wp.Bvh and wp.Mesh now expose tunable leaf_size and bvh_leaf_size parameters, respectively, allowing users to control the number of primitives stored in each leaf node for performance optimization. The optimal leaf size depends on the query workload:
- Intersection queries (ray casting, AABB overlap): Smaller leaf sizes (e.g., 1) are generally optimal, reducing unnecessary primitive checks
- Closest point queries: Larger leaf sizes (e.g., 4-8) can improve performance by checking more primitives together and reducing traversal overhead
- Mixed workloads: Moderate values (e.g., 4) provide a balanced trade-off
Behavior change: The default leaf_size for wp.Bvh has changed from 4 (hardcoded) to 1, optimizing for intersection queries which are more common. wp.Mesh retains a default bvh_leaf_size of 4 as a compromise between intersection and closest-point query performance. Users performing primarily closest-point queries may benefit from explicitly setting larger leaf sizes.
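To get a feel for the trade-off, the following sketch uses an idealized model that is an assumption of this example rather than a description of Warp's BVH: a perfectly balanced binary tree in which every leaf holds exactly leaf_size primitives. The helper bvh_shape() is hypothetical.

```python
import math


def bvh_shape(num_primitives, leaf_size):
    """Idealized balanced binary BVH: returns (leaf count, tree depth)."""
    leaves = math.ceil(num_primitives / leaf_size)
    depth = math.ceil(math.log2(leaves)) if leaves > 1 else 0
    return leaves, depth


for leaf_size in (1, 4, 8):
    leaves, depth = bvh_shape(1_000_000, leaf_size)
    # A query that reaches one leaf descends about `depth` levels
    # and then tests up to `leaf_size` primitives in that leaf.
    print(f"leaf_size={leaf_size}: leaves={leaves}, depth={depth}")
```

Larger leaves shorten the traversal but add primitive tests per visited leaf, so the best value depends on how many leaves a typical query touches. Benchmark on the real workload before changing the default.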
Sparse matrix operations with CUDA graphs
Sparse matrix operations in warp.sparse can now be captured in CUDA graphs for allocation-free execution. Operations like bsr_axpy(), bsr_assign(), and bsr_set_transpose() preserve matrix topology when using masked=True, while bsr_mm() adds a new max_new_nnz parameter that allows specifying an upper bound on new non-zero blocks for flexible graph capture when sparsity patterns vary within known bounds.
FEM operations with CUDA graphs
Building warp.fem geometry and function space partitions can now be captured in CUDA graphs by specifying upper bounds on partition sizes: max_cell_count and max_side_count for ExplicitGeometryPartition, and max_node_count for make_space_partition(). Additionally, building fields and restrictions is now synchronization-free by default.
Language enhancements
Array indexing and slicing improvements
Warp arrays now support negative in...
v1.9.1
Warp v1.9.1
Warp 1.9.1 is a bugfix release that follows our recent feature update. For a full list of changes, see the changelog.
Highlights
- GPU Compatibility: Support for older NVIDIA GPU architectures (Maxwell, Pascal, Volta) was unintentionally dropped in the pre-built wheels distributed for Warp 1.9.0 on PyPI. These architectures have been added back.
- Documentation Improvements: We have corrected the documentation for wp.mesh_query_aabb() and wp.mesh_query_aabb_next(), added a caveat concerning the use of __cuda_array_interface__ on a system with multiple GPUs, and fixed the labeling of built-in functions that were incorrectly labeled as differentiable.
- Corrected Slice Behavior: Empty slices (e.g. arr[i:i]) are now handled correctly at the Python scope, returning an empty array instead of raising an error.
- Tile Stability and Correctness: A critical memory management issue with shared tiles has been fixed to prevent unpredictable crashes and memory leaks. Additionally, functions like wp.copy() and wp.where() now work with tiles and compute correct gradients (adjoints).
- Tuple Type Hints: Resolved a TypeError that occurred when using modern tuple type hints (e.g., tuple[int, int]) with @wp.func-decorated functions on Python 3.9 and 3.10.
Announcements
Known limitations
- CPU Kernels on ARM: Launching CPU kernels on Linux ARM systems, such as NVIDIA Jetson Thor and Grace Hopper, may result in segmentation faults. A fix for this issue is planned for the v1.10 release. GPU kernels are not affected.
Upcoming removals
The following features have been deprecated in prior releases and will be removed in v1.10 (early November):
- warp.sim - Use the Newton engine.
- Constructing a wp.matrix() from column vectors - Use wp.matrix_from_rows() or wp.matrix_from_cols() instead.
- wp.select() - Use wp.where() instead (note: different argument order).
- wp.matrix(pos, quat, scale) - Use wp.transform_compose() instead.
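The argument-order difference between wp.select() and wp.where() is easy to get wrong during migration. The sketch below uses plain-Python stand-ins (select_like and where_like are hypothetical helpers, not Warp functions) and assumes the documented orders select(cond, value_if_false, value_if_true) and where(cond, value_if_true, value_if_false):

```python
def select_like(cond, value_if_false, value_if_true):
    """Stand-in with the argument order of the deprecated wp.select()."""
    return value_if_true if cond else value_if_false


def where_like(cond, value_if_true, value_if_false):
    """Stand-in with the argument order of wp.where()."""
    return value_if_true if cond else value_if_false


x = -2.5
old = select_like(x < 0.0, x, -x)  # false branch first
new = where_like(x < 0.0, -x, x)   # true branch first: swap the last two arguments
print(old, new)
```

Migrating a call therefore means swapping the second and third arguments, not just renaming the function.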
Platform support
- We plan to drop support for Intel macOS (x86-64) in a future release (tentatively planned for v1.10).
Acknowledgments
We thank the following contributors for their valuable contributions to this release:
- @RSchwan for a major contribution that fixed memory management issues with tiles and enabled functions like wp.copy() and wp.where() to work correctly with tile arguments (#777).
- @liblaf for reporting issues related to GPU architecture compatibility (#960, #966) and code generation for wp.map() (#953).
v1.9.0
Warp 1.9 ships with a rewritten marching cubes implementation, compatibility with the CUDA 13 toolkit, and new functions for ahead-of-time module compilation. The programming model has also been enhanced with more flexible indexing for composite types, direct IntEnum support, and the ability to initialize local arrays in kernels.
New Features
Differentiable marching cubes
A fully differentiable wp.MarchingCubes implementation, contributed by @mikacuy and @nmwsharp, has been added. This version is written entirely in Warp, replacing the previous native CUDA C++ implementation and enabling it to run on both CPU and GPU devices. The implementation also addresses a long-standing off-by-one bug (#324). For more details, see the updated documentation.
Functions for module compilation and loading
We have added wp.compile_aot_module() and wp.load_aot_module() for more flexible ahead-of-time (AOT) compilation.
These functions include a strip_hash=True argument, which removes the unique hashes from compiled module and function
names. This change makes it possible to distribute pre-compiled modules without shipping the original Python source code.
See the documentation on ahead-of-time compilation workflows for more details. In future releases, we plan to continue to expand Warp's support for ahead-of-time workflows.
CUDA 13 Support
CUDA Toolkit 13.0 was released in early August.
PyPI Distribution: Warp wheels on PyPI and NVIDIA PyPI will continue to be built with CUDA 12.8 to provide a transition period for users upgrading their CUDA drivers.
CUDA 13.0 Compatibility: Users requiring Warp compiled against CUDA 13.x have two options:
- Build Warp from source
- Install pre-built wheels from GitHub releases
Driver Compatibility: CUDA 12.8 Warp wheels can run on systems with CUDA 13.x drivers thanks to CUDA's backward compatibility.
Performance Improvements
Graph-capturable linear solvers
The iterative linear solvers in warp.optim.linear (CG, BiCGSTAB, GMRES) are now fully compatible with CUDA graph capture. This adds support for device-side convergence checking via wp.capture_while(), enabling full CUDA graph capture when check_every=0. Users can now choose between traditional host-side convergence checks or fully graph-capturable device-side termination.
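The host-side checking cadence can be illustrated with a small pure-Python conjugate gradient loop. This is a sketch under stated assumptions, not the warp.optim.linear implementation: the function, its return values, and the dense matvec are hypothetical, and only the check_every name mirrors the text above. Device-side termination (check_every=0 in Warp) is not modelled here.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iters=100, check_every=10):
    """Solve A x = b for symmetric positive-definite A given as a matvec callable.

    Returns (x, iterations run, number of convergence checks performed).
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)
    rr = sum(v * v for v in r)
    checks = 0
    for it in range(1, max_iters + 1):
        Ap = matvec(p)
        denom = sum(pi * api for pi, api in zip(p, Ap))
        if denom == 0.0:  # search direction vanished: nothing left to correct
            return x, it - 1, checks
        alpha = rr / denom
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = sum(v * v for v in r)
        # Inspect the residual only every `check_every` iterations.
        if it % check_every == 0 or it == max_iters:
            checks += 1
            if rr_new <= tol * tol:
                return x, it, checks
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iters, checks


def matvec(v):
    # Dense 2x2 SPD matrix [[4, 1], [1, 3]] applied to v.
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


x, iters, checks = conjugate_gradient(matvec, [1.0, 2.0], check_every=2)
print(x, iters, checks)
```

Checking less often reduces the number of residual inspections (each of which would be a host synchronization on a GPU) at the cost of possibly running a few extra iterations after convergence.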
Automatic tiling for sparse linear algebra
warp.sparse now supports arbitrary-sized blocks and can leverage tile-based computations for certain matrix types. The system automatically chooses between tiled and non-tiled execution using heuristics based on matrix characteristics (block sizes, sparsity patterns, and workload dimensions). Note that the heuristic for choosing between tiled and non-tiled variants is still being refined, and that it can be manually overridden by providing the tile_size parameter to bsr_mm or bsr_mv.
Automatic tiling for finite element quadrature
warp.fem.integrate now leverages tile-based computations for quadrature point accumulation, with automatic tile size selection based on workload characteristics. The system automatically chooses between tiled and non-tiled execution to optimize performance based on the integration problem size and complexity.
Programming Model Updates
Slice and negative indexing improvements for composite types
We have enhanced the support for slice operations and negative indexing across all composite types (vectors, matrices, quaternions, and transforms).
m = wp.matrix_from_rows(
    wp.vec3(1.0, 2.0, 3.0),
    wp.vec3(4.0, 5.0, 6.0),
    wp.vec3(7.0, 8.0, 9.0),
)
subm = m[:-1, 1:]
print(subm)
# [[2.0, 3.0],
#  [5.0, 6.0]]

Support for IntEnum and IntFlag inside kernels
It is now possible to directly reference IntEnum and IntFlag values inside Warp functions and kernels. Previously, workarounds involving wp.static() were required.
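For reference, this is how IntEnum and IntFlag members behave at plain Python scope, independent of Warp: they compare equal to their integer values and flags combine with bitwise operators. ContactFlags is a hypothetical flag set added only for illustration.

```python
from enum import IntEnum, IntFlag


class JointType(IntEnum):
    PRISMATIC = 0
    REVOLUTE = 1
    BALL = 2


class ContactFlags(IntFlag):  # hypothetical flag set for illustration
    NONE = 0
    STATIC = 1
    SELF = 2


joint_types = [1, 0, 1, 2]  # raw integers, as they would be stored in an int array
revolute = sum(1 for j in joint_types if j == JointType.REVOLUTE)
mask = ContactFlags.STATIC | ContactFlags.SELF
print(revolute, int(mask), bool(mask & ContactFlags.SELF))
```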
from enum import IntEnum
import warp as wp

class JointType(IntEnum):
    PRISMATIC = 0
    REVOLUTE = 1
    BALL = 2

@wp.kernel
def count_revolute_joints(
    joint_types: wp.array(dtype=JointType),
    counter: wp.array(dtype=int)
):
    tid = wp.tid()
    joint = joint_types[tid]
    # No longer requires wp.static(JointType.REVOLUTE.value)
    if joint == JointType.REVOLUTE:
        wp.atomic_add(counter, 0, 1)

Improved support for wp.array() views inside kernels
This enhancement allows kernels to create array views by accessing the ptr attribute of an array.
@wp.kernel
def kernel_array_from_ptr(arr_orig: wp.array2d(dtype=wp.float32)):
    arr = wp.array(ptr=arr_orig.ptr, shape=(2, 3), dtype=wp.float32)
    arr[0, 0] = 1.0
    arr[0, 1] = 2.0
    arr[0, 2] = 3.0

Additionally, these in-kernel views now support dynamic shapes and struct types.
Support for initializing fixed-size arrays inside kernels
It is now possible to allocate local arrays of a fixed size in kernels using wp.zeros(). The resulting arrays are allocated in registers, providing fast access and avoiding global memory overhead.
Previously, developers needed to create vectors to achieve a similar capability, e.g. v = wp.vector(length=8, dtype=float), but this came with various limitations.
@wp.kernel
def kernel_with_local_array():
    local_arr = wp.zeros(8, dtype=wp.float32)  # Allocated in registers
    # ... use local_arr

Indexed tile operations
Warp now provides three new indexed tile operations that enable more flexible memory access patterns beyond simple contiguous tile operations. These functions allow you to load, store, and perform atomic operations on tiles using custom index mappings along specified axes.
- wp.tile_load_indexed() - Load tiles with custom index mapping along a specified axis
- wp.tile_store_indexed() - Store tiles with custom index mapping along a specified axis
- wp.tile_atomic_add_indexed() - Perform atomic additions with custom index mapping along a specified axis
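The gather semantics of an indexed load can be modelled in plain Python. This is a minimal sketch that does not use Warp: load_indexed is a hypothetical helper, and the data values are chosen so that each entry encodes its own row and column.

```python
def load_indexed(data, indices, axis):
    """Gather rows (axis=0) or columns (axis=1) of a 2D list by index."""
    if axis == 0:
        return [list(data[i]) for i in indices]
    if axis == 1:
        return [[row[j] for j in indices] for row in data]
    raise ValueError("axis must be 0 or 1")


data = [
    [0.0, 0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2, 1.3],
    [2.0, 2.1, 2.2, 2.3],
]
indices = [0, 2]

rows = load_indexed(data, indices, axis=0)     # shape (2, 4): rows 0 and 2
columns = load_indexed(data, indices, axis=1)  # shape (3, 2): columns 0 and 2
print(rows)
print(columns)
```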
x = wp.array(
    [
        [0.77395605, 0.43887844, 0.85859792, 0.69736803],
        [0.09417735, 0.97562235, 0.7611397, 0.78606431],
        [0.12811363, 0.45038594, 0.37079802, 0.92676499],
    ],
    dtype=float,
)
indices = wp.array([0, 2], dtype=int)

@wp.kernel
def indexed_data_lookup(data: wp.array2d(dtype=float), indices: wp.array(dtype=int)):
    # [0 2] = tile(shape=(2), storage=shared)
    indices_tile = wp.tile_load(indices, shape=(2,))
    # [[0.773956 0.438878 0.858598 0.697368]
    #  [0.128114 0.450386 0.370798 0.926765]] = tile(shape=(2,4), storage=register)
    data_rows_tile = wp.tile_load_indexed(data, indices_tile, axis=0, shape=(2, 4))
    print(data_rows_tile)
    # [[0.773956 0.858598]
    #  [0.0941774 0.76114]
    #  [0.128114 0.370798]] = tile(shape=(3,2), storage=register)
    data_columns_tile = wp.tile_load_indexed(data, indices_tile, axis=1, shape=(3, 2))

wp.launch_tiled(indexed_data_lookup, dim=1, inputs=[x, indices], block_dim=2)

Fixed nested matrix component support
Warp now properly supports writing to individual matrix elements stored within struct fields. Previously, operations like struct.matrix[1, 2] = value would result in a compile-time error.
@wp.struct
class MatStruct:
    m: wp.mat44

@wp.kernel
def kernel_nested_mat(out: wp.array(dtype=MatStruct)):
    s = MatStruct()
    s.m[1, 2] = 3.0  # This now works correctly (no longer raises a WarpCodegenError)
    s.m[2][2] = 5.0  # This has also been fixed (used to silently fail)
    out[0] = s

Announcements
Known limitations
Early testing on NVIDIA Jetson Thor indicates that launching CPU kernels may sometimes result in segmentation faults. GPU kernel launches are unaffected. We believe this can be resolved by building Warp from source against LLVM/Clang version 18 or newer.
Upcoming removals
The following features have been deprecated in prior releases and will be removed in v1.10 (early November):
- warp.sim - Use the Newton engine.
- Constructing a wp.matrix() from column vectors - Use wp.matrix_from_rows() or wp.matrix_from_cols() instead.
- wp.select() - Use wp.where() instead (note: different argument order).
- wp.matrix(pos, quat, scale) - Use wp.transform_compose() instead.
Platform support
- We plan to drop support for Intel macOS (x86-64) in a future release (tentatively planned for v1.10).
Acknowledgments
We thank the following contributors for their valuable contributions to this release:
- @liblaf for fixing an issue with using warp.jax_experimental.ffi.jax_callable() with a function annotated with the -> None return type (#893).
- @matthewdcong for providing an updated version of NanoVDB compatible with CUDA 13 (#888).
- @YuyangLee for contributing an early prototype that helped shape the strip_hash=True option for the new ahead-of-time compilation functions (#661).
Full Changelog
For a curated list of all changes in this release, please see the v1.9.0 section in CHANGELOG.md.
v1.9.0rc1
v1.8.1
This patch release primarily contains bug fixes as expected.
However, to support the adoption of Warp by the MuJoCo MJX physics engine, it also includes new features and deprecations limited to the jax_experimental module. We are flagging this deviation from our standard versioning practices to ensure clarity. Normal versioning practices will resume with the next release.
Full Changelog
Deprecated
- This is the final release that will provide builds for or support the CUDA 11.x Toolkit and driver. Starting with v1.9.0, Warp will require CUDA 12.x or newer.
- Deprecate the graph_compatible boolean flag in jax_callable() in favor of the new graph_mode argument with GraphMode enum (#848).
Added
- Add documentation for creating and manipulating Warp structured arrays using NumPy (#852)
- Add documentation for wp.indexedarray() (#468).
- Support input-output aliasing in JAX FFI (#815).
- Support capturing jax_callable() using Warp via the new graph_mode parameter (GraphMode.WARP), enabling capture of graphs with conditional nodes that cannot be used as subgraphs in a JAX capture (#848).
Fixed
- Fix tape.zero() to correctly reset gradient arrays in nested structs (#807).
- Fix incorrect adjoints for div(scalar, vec), div(scalar, mat), and div(scalar, quat), and other miscellaneous issues with adjoints (#831).
- Fix a module-hashing issue for functions or kernels using static expressions that cannot be resolved at the time of declaration (#830).
- Fix a bug in which changes to wp.config.mode were not being picked up after module initialization (#856).
- Fix a bug where CUDA modules could get prematurely unloaded when conditional graph nodes are used.
- Fix compile time regression for kernels using matmul, Cholesky, and FFT solvers by upgrading to libmathdx 0.2.2 (#809).
- Fix potential uninitialized memory issues in wp.tile_sort() (#836).
- Fix wp.tile_min() and wp.tile_argmin() to return correct values for large tiles with low occupancy (#725).
- Fix codegen errors associated with adjoint of wp.tile_sum() when using shared tiles (#822).
- Fix driver entry point error for cuDeviceGetUuid caused by using an incorrect version (#851).
- Fix an issue that caused Warp to request PTX generation from NVRTC for architectures unsupported by the compiler (#858).
- Fix a regression where wp.sparse.bsr_from_triplets() ignored the prune_numerical_zeros=False setting (#832).
- Fix missing cloth-body contact in wp.sim.VBDIntegrator with handle_self_contact=False (#862).
- Fix a bug causing potential infinite loops in the color balancing calculation (#816).
- Fix box-box collision by computing the contact normal at the closest point of approach instead of at the center of the source box (#839).
- Fix the OpenGL renderer not correctly displaying colors for box shapes (#810).
- Fix a bug in OpenGLRenderer where meshes with different scale attributes were incorrectly instanced, causing them all to be rendered with the same scale (#828).
v1.8.0
Changelog
[1.8.0] - 2025-07-01
Added
- Add wp.map() to map a function over arrays and add math operators for Warp arrays (docs, #694).
- Add support for dynamic control flow in CUDA graphs, see wp.capture_if() and wp.capture_while() (docs, #597).
- Add wp.capture_debug_dot_print() to write a DOT file describing the structure of a captured CUDA graph (#746).
- Add the Device.sm_count property to get the number of streaming multiprocessors on a CUDA device (#584).
- Add wp.block_dim() to query the number of threads in the current block inside a kernel (#695).
- Add wp.atomic_cas() and wp.atomic_exch() built-ins for atomic compare-and-swap and exchange operations (#767).
- Add support for profiling GPU runtime module compilation using the global wp.config.compile_time_trace setting or the module-level "compile_time_trace" option. When used, JSON files in the Trace Event format will be written in the kernel cache, which can be opened in a viewer like chrome://tracing/ (docs, #609).
- Add support for returning multiple values from native functions like wp.svd3() and wp.quat_to_axis_angle() (#503).
- Add support for passing tiles to user wp.func functions (#682).
- Add wp.tile_squeeze() to remove axes of length one (#662).
- Add wp.tile_reshape() to reshape a tile (#663).
- Add wp.tile_astype() to return a new tile with the same data but a different data type (#683).
- Add support for in-place tile add and subtract operations (#518).
- Add support for in-place tile-component addition and subtraction (#659).
- Add support for 2D solves using wp.tile_cholesky_solve() (#773).
- Add wp.tile_scan_inclusive() and wp.tile_scan_exclusive() for performing inclusive and exclusive scans over tiles (#731).
- Support attribute indexing for quaternions on the right-hand side of expressions (#625).
- Add wp.transform_compose() and wp.transform_decompose() for converting between transforms and 4x4 matrices with 3D scale information (#576).
- Add various wp.transform syntax operations for loading and storing (#710).
- Add the as_spheres parameter to UsdRenderer.render_points() in order to choose whether to render the points as USD spheres using a point instancer or as simple USD points (#634).
- Add support for animating visibility of objects in the USD renderer (#598).
- Add wp.sim.VBDIntegrator.rebuild_bvh() to rebuild the BVH used for detecting self-contacts.
- Add damping terms for wp.sim.VBDIntegrator collisions, with strength controlled by Model.soft_contact_kd.
- Improve consistency of the wp.fem.lookup() operator across geometries and add filtering parameters (#618).
- Add two examples demonstrating shape optimization using warp.fem: fem/example_elastic_shape_optimization.py and fem/example_darcy_ls_optimization.py (#698).
- Add a py.typed marker file (per PEP 561) to the package to formally support static type checking by downstream users (#780).
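For readers unfamiliar with the distinction between the two scan variants added above, here is a plain-Python sketch of additive scans. It does not use Warp; the helper names are hypothetical, and an identity of 0 is assumed for the exclusive variant.

```python
from itertools import accumulate


def scan_inclusive(values):
    """out[i] = values[0] + ... + values[i]"""
    return list(accumulate(values))


def scan_exclusive(values):
    """out[i] = values[0] + ... + values[i-1], with out[0] = 0"""
    return [0] + list(accumulate(values))[:-1] if values else []


data = [3, 1, 4, 1, 5]
print(scan_inclusive(data))  # [3, 4, 8, 9, 14]
print(scan_exclusive(data))  # [0, 3, 4, 8, 9]
```

The exclusive scan is the inclusive scan shifted right by one position, which makes it a natural way to turn per-element counts into write offsets.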
Removed
- Remove wp.mlp() (deprecated in v1.6.0). Use tile primitives instead.
- Remove wp.autograd.plot_kernel_jacobians() (deprecated in v1.4.0). Use wp.autograd.jacobian_plot() instead.
- Remove the length and owner keyword arguments from wp.array() constructor (deprecated in v1.6.0). Use the shape and deleter keywords instead.
- Remove the kernel keyword argument from wp.autograd.jacobian() and wp.autograd.jacobian_fd() (deprecated in v1.6.0). Use the function keyword argument instead.
- Remove the outputs keyword argument from wp.autograd.jacobian_plot() (deprecated in v1.6.0).
Changed
- Deprecate the warp.sim module (planned for removal in v1.10). It will be superseded by the upcoming Newton library, a separate package with a new API. Migrating will require code changes; a future guide will be provided (current draft). See the GitHub announcement for details (#735).
- Deprecate the wp.matrix(pos, quat, scale) built-in function. Use wp.transform_compose() instead (#576).
- Improve support for tuples in kernels (#506).
- Return a constant value from len() where possible.
- Rename the internal function wp.types.type_length() to wp.types.type_size().
- Rename wp.tile_cholesky_solve() input parameters to align with its docstring (#726).
- Change wp.tile_upper_solve() and wp.tile_lower_solve() to use libmathdx 0.2.1 TRSM solver (#773).
- Skip adjoint compilation for wp.tile_matmul() if enable_backward is disabled (#644).
- Allow tile reductions to work with non-scalar tile types (#771).
- Permit data-type preservation with preserve_type=True when tiling a value across the block with wp.Tile() (#772).
- Make wp.sparse.bsr_[set_]from_triplets differentiable with respect to the input triplet values (#760).
- Expose new warp.fem operators: node_count, node_index, element_coordinates, element_closest_point.
- Change wp.sim.VBDIntegrator rigid-body-contact handling to use only the shape's friction coefficient, rather than averaging the shape's and the cloth's coefficients.
- Limit usage of the wp.assign_copy() hidden built-in to the kernel scope.
- Describe the distinction between inputs and outputs arguments in the Kernel documentation.
- Reduce the overhead of wp.launch() by avoiding costly native API calls (#774).
- Improve error reporting when calling @wp.func-decorated functions from the Python scope (#521).
Fixed
- Fix missing documentation for geometric structs (#674).
- Fix the type annotations in various tile functions (#714).
- Fix incorrect stride initialization in tiles returned from functions taking transposed tiles as input (#722).
- Fix adjoint generation for user functions that return a tile (#749).
- Fix tile-based solvers failing to accept and return transposed tiles (#768).
- Fix the Formal parameter space overflowed error during wp.sim.VBDIntegrator kernel compilation for the backward pass in CUDA 11 Warp builds. This was resolved by decoupling collision and elasticity evaluations into separate kernels, increasing parallelism and speeding up the solver (#442).
- Fix an issue with graph coloring on an empty graph (#509).
- Fix an integer overflow bug in the native graph coloring module (#718).
- Fix UsdRenderer.render_points() not supporting multiple colors (#634).
- Fix an inconsistency in the wp.fem module regarding the orientation of 2D geometry side normals (#629).
- Fix premature unloading of CUDA modules used in JAX FFI graph captures (#782).
v1.7.2.post1
Changelog
[1.7.2] - 2025-05-31
Added
- Add missing adjoint method for tile assign operations (#680).
- Add documentation for the fact that += and -= invoke wp.atomic_add() and wp.atomic_sub(), respectively (#505).
- Add a publications list of academic and research projects leveraging Warp (#686).
Changed
- Prevent and document that class inheritance is not supported for `wp.struct` (now throws `RuntimeError`) (#656).
- Warn when an incompatible data type conversion is detected when constructing an array using the `__cuda_array_interface__` (#624, #670).
- Relax the exact version requirement in `omni.warp` towards `omni.warp.core` (#702).
- Rename the "Kernel Reference" documentation page to "Built-Ins Reference", with each built-in now having annotations to denote whether they are accessible only from the kernel scope or also from the Python runtime scope (#532).
Fixed
- Fix an issue where arrays stored in structs could be garbage collected without updating the struct ctype (#720).
- Fix an issue with preserving the base class of nested struct attributes (#574).
- Allow recovering from out-of-memory errors during `wp.Volume` allocation (#611).
- Fix 2D tile load when source array and tile have incompatible strides (#688).
- Fix compilation errors with `wp.tile_atomic_add()` (#681).
- Fix `wp.svd2()` with duplicate singular values and improved accuracy (#679).
- Fix `OpenGLRenderer.update_shape_instance()` not having color buffers created for the shape instances.
- Fix text rendering in `wp.render.OpenGLRenderer` (#704).
- Fix assembly of rigid body inertia in `ModelBuilder.collapse_fixed_joints()` (#631).
- Fix `UsdRenderer.render_points()` erroring out when passed 4 points or less (#708).
- Fix `wp.atomic_*()` built-ins not working with some types (#733).
- Fix garbage-collection issues with JAX FFI callbacks (#711).
v1.7.1
Changelog
[1.7.1] - 2025-04-30
Added
- Add example of a distributed Jacobi solver using `mpi4py` in `warp/examples/distributed/example_jacobi_mpi.py` (#475).
Changed
- Improve `repr()` for Warp types, including adding `repr()` for `wp.array`.
- Change the USD renderer to use `framesPerSecond` for time sampling instead of `timeCodesPerSecond` to avoid playback speed issues in some viewers (#617).
- `Model.rigid_contact_tids` are now -1 at non-active contact indices, which allows retrieving the vertex index of a mesh collision; see `test_collision.py` (#623).
- Improve handling of deprecated JAX features (#613).
Fixed
- Fix a code generation bug involving return statements in Warp kernels, which could result in some threads in Warp being skipped when processed on the GPU (#594).
- Fix constructing `DeformedGeometry` from `wp.fem.Trimesh3D` geometries (#614).
- Fix `lookup` operator for `wp.fem.Trimesh3D` (#618).
- Include the block dimension in the LTO file hash for the Cholesky solver (#639).
- Fix tile loads for small tiles with aligned source memory (#622).
- Fix length/shape matching for vectors and matrices from the Python scope.
- Fix the `dtype` parameter missing for `wp.quaternion()`.
- Fix invalid `dtype` comparison when using the `wp.matrix()`/`wp.vector()`/`wp.quaternion()` constructors with literal values and an explicit `dtype` argument (#651).
- Fix incorrect thread index lookup for the backward pass of `wp.sim.collide()` (#459).
- Fix a bug where `wp.sim.ModelBuilder` adds springs with -1 as vertex indices (#621).
- Fix center of mass, inertia computation for mesh shapes (#251).
- Fix computation of body center of mass to account for shape orientation (#648).
- Fix `show_joints` not working with `wp.sim.render.SimRenderer` set to render to USD (#510).
- Fix the jitter for the `OgnParticlesFromMesh` node not being computed correctly.
- Fix documentation of `atol` and `rtol` arguments to `wp.autograd.gradcheck()` and `wp.autograd.gradcheck_tape()` (#508).
v1.7.0
Changelog
[1.7.0] - 2025-03-30
Added
- Support JAX foreign function interface (FFI) (docs, #511).
- Support Python/SASS correlation in Nsight Compute reports by emitting `#line` directives in CUDA-C code. This setting is controlled by `wp.config.line_directives` and is `True` by default (docs, #437).
- Support `vec4f` grid construction in `wp.Volume.allocate_by_tiles()`.
- Add 2D SVD `wp.svd2()` (#436).
- Add `wp.randu()` for random `uint32` generation.
- Add matrix construction functions `wp.matrix_from_cols()` and `wp.matrix_from_rows()` (#278).
- Add `wp.transform_from_matrix()` to obtain a transform from a 4x4 matrix (#211).
- Add `wp.where()` to select between two arguments conditionally using a more intuitive argument order (`cond`, `value_if_true`, `value_if_false`) (#469).
- Add `wp.get_mempool_used_mem_current()` and `wp.get_mempool_used_mem_high()` to query the respective current and high-water mark memory pool allocator usage (#446).
- Add `Stream.is_complete` and `Event.is_complete` properties to query completion status (#435).
- Support timing events inside of CUDA graphs (#556).
- Add LTO cache to speed up compilation times for kernels using MathDx-based tile functions. Use `wp.clear_lto_cache()` to clear the LTO cache (#507).
- Add example demonstrating gradient checkpointing for fluid optimization in `warp/examples/optim/example_fluid_checkpoint.py`.
- Add a hinge-angle-based bending force to `wp.sim.VBDIntegrator`.
- Add an example to show mesh sampling using a CDF (#476).
Changed
- Breaking: Remove CUTLASS dependency and `wp.matmul()` functionality (including batched version). Users should use tile primitives for matrix multiplication operations instead.
- Deprecate constructing a matrix from vectors using `wp.matrix()`.
- Deprecate `wp.select()` in favor of `wp.where()`. Users should update their code to use `wp.where(cond, value_if_true, value_if_false)` instead of `wp.select(cond, value_if_false, value_if_true)`.
- `wp.sim.Control` no longer has a `model` attribute (#487).
- `wp.sim.Control.reset()` is deprecated and now only zeros-out the controls (previously restored controls to initial `model` state). Use `wp.sim.Control.clear()` instead.
- Vector/matrix/quaternion component assignment operations (e.g., `v[0] = x`) now compile and run faster in the backward pass. Note: For correct gradient computation, each component should only be assigned once.
- `@wp.kernel` now has an optional `module` argument that allows passing a `wp.context.Module` to the kernel or, if set to `"unique"`, lets Warp create a new unique module just for this kernel. The default behavior of using the current module is unchanged.
- Default PTX architecture is now automatically determined by the devices present in the system, ensuring optimal compatibility and performance (#537).
- Structs now have a trivial default constructor, allowing for `wp.tile_reduce()` on tiles with struct data types.
- Extend `wp.tile_broadcast()` to support broadcasting to 1D, 3D, and 4D shapes (in addition to existing 2D support).
- `wp.fem.integrate()` and `wp.fem.interpolate()` may now perform parallel evaluation of quadrature points within elements.
- `wp.fem.interpolate()` can now build Jacobian sparse matrices of interpolated functions with respect to a trial field.
- Multiple `wp.sparse` routines (`bsr_set_from_triplets`, `bsr_assign`, `bsr_axpy`, `bsr_mm`) now accept a `masked` flag to discard any non-zero not already present in the destination matrix.
- `wp.sparse.bsr_assign()` no longer requires source and destination block shapes to evenly divide each other.
- Extend `wp.expect_near()` to support all vectors and quaternions.
- Extend `wp.quat_from_matrix()` to support 4x4 matrices.
- Update the `OgnClothSimulate` node to use the VBD integrator (#512).
- Remove the `globalScale` parameter from the `OgnClothSimulate` node.