A minor ISPC update that reverts blend-store support for structures, which caused a stability regression on SSE2/SSE4/PS4 targets.

ISPC release featuring sample-based profile-guided optimization, optimized dispatcher, new avx512gnr targets for Intel Granite Rapids, and numerous bug fixes and performance improvements. Based on a patched LLVM 20.1.8.

Compiler Switches:

Added --profile-sample-use=<file> flag to enable sample-based profile-guided optimization (PGO). When provided, the ISPC compiler loads sample profile data to guide optimization decisions during compilation. Use it together with the --sample-profiling-debug-info flag, which enables debug info suitable for sample-based profiling. Sample-based PGO can provide up to 30% performance gains thanks to aggressive loop unrolling, optimized memory access patterns, and specialized hot code paths guided by actual branch frequencies.
Added --[no-]internal-export-functions to control generation of internal (ISPC-callable) versions of exported functions. The flag is enabled by default. When disabled (--no-internal-export-functions), only external versions are generated and calling exported functions from ISPC code will result in a compilation error.
Added --stack-protector[=<level>] flag to enable Stack Smash Protection (SSP) for ISPC functions, providing runtime detection of stack buffer overflows. Supported levels:
- --stack-protector (equivalent to --stack-protector=on) enables stack protectors for functions vulnerable to stack smashing.
- --stack-protector=strong enables stack protectors for functions that contain arrays of any size or take addresses of local variables.
- --stack-protector=all enables stack protectors for all functions.
- --stack-protector=none disables stack protectors (default).
The default DWARF version has been updated to match the LLVM default (DWARF 5 on most platforms).

Behavioral Changes:

A new warning has been introduced when an exported function without the external_only attribute is called from ISPC code. This warning prepares for an upcoming behavior change in ISPC 1.30, where export functions will by default generate only external (C/C++-callable) versions instead of both internal and external versions. To address this warning, use a non-exported function for ISPC-to-ISPC calls, add the external_only attribute, or use the --no-internal-export-functions flag.

Language Changes:

soa<> types can now be used as struct members. Previously, soa<> members in structs were not supported by the grammar.
The compiler now assumes that all loops with non-constant conditions will make forward progress and eventually terminate. This enables additional optimizations. Infinite loops with constant conditions like for (;;) or while (1) are treated specially and do not have this assumption applied.

Dispatcher:

The dispatcher has been made more efficient by adding a caching mechanism and enabling LLVM optimization passes, making dispatch approximately 50% faster.

Targets:

New avx512gnr-x4, avx512gnr-x8, avx512gnr-x16, avx512gnr-x32, and avx512gnr-x64 targets have been added for Intel Granite Rapids processors. These targets support AVX-512 with AMX-FP16 capabilities.
The avx10.2 targets have been renamed to avx10.2dmr to reflect Diamond Rapids (DMR) codename alignment.
Fixed --opt=disable-zmm option to work correctly on avx512skx-x16 and avx512icl-x16 targets. This option avoids ZMM registers, which can be beneficial for workloads sensitive to frequency throttling on some processors.

Removed Targets:

The gen9-x8 and gen9-x16 GPU targets have been removed.

Deprecated Targets:

The sse2-i32x4 and sse2-i32x8 targets are now deprecated and will be removed in a future release.

Predefined Macros:

New predefined macros ISPC_TARGET_HAS_FP16_SUPPORT and ISPC_TARGET_HAS_FP64_SUPPORT have been added following the consistent naming convention used by other target capability macros. The old macro names ISPC_FP16_SUPPORTED and ISPC_FP64_SUPPORTED remain available for backward compatibility but are now deprecated.
The ISPC_TARGET_AVX10_2 macro has been replaced with ISPC_TARGET_AVX10_2DMR to match the target renaming.

Performance:

Optimized popcnt (population count) implementation for AVX512ICL and newer targets, achieving up to 3.5x speedup.
Improved code generation for avx512-x16 and avx10.2-x16 targets with ~10% improvement in geomean on benchmarks. This includes better shuffle instruction generation and improved optimization pass ordering that prevents suboptimal masked load transformations blocking SROA registerization.
Improved masked store promotion to blend stores for structures, providing up to 53% improvement on targets without hardware masked stores (such as NEON and SSE4).
Fixed inefficient loop code generation when using unsigned loop counters.
Fixed incorrect loop full unroll behavior that caused partial unrolling for loops with unknown trip counts.
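
The difference between a masked store and a blend store can be sketched in scalar C. This is only an illustrative model, not the compiler's actual transformation: LANES, masked_store, and blend_store are hypothetical names, and the mask is modelled as one byte per lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* hypothetical gang width for this sketch */

/* Masked store: write only the active lanes, one conditional per lane. */
void masked_store(int32_t *dst, const int32_t *src, const uint8_t *mask) {
    assert(dst && src && mask);
    for (size_t i = 0; i < LANES; ++i) {
        if (mask[i]) {
            dst[i] = src[i];
        }
    }
}

/* Blend store: read the old contents, select per lane, write every lane back.
   This is only legal when all LANES elements of dst may be read and written. */
void blend_store(int32_t *dst, const int32_t *src, const uint8_t *mask) {
    assert(dst && src && mask);
    int32_t merged[LANES];
    for (size_t i = 0; i < LANES; ++i) {
        merged[i] = mask[i] ? src[i] : dst[i]; /* per-lane select */
    }
    for (size_t i = 0; i < LANES; ++i) {
        dst[i] = merged[i]; /* one unconditional full-width store */
    }
}
```

Both functions leave the destination in the same state. The blend form replaces per-lane branches with a load, a select, and a full-width store, which is why it helps on targets without hardware masked stores.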

Build System:

Optimized stdlib compilation by implementing a width family system that reduces bitcode duplication, reducing ISPC binary size by approximately 30%. This also allows adding new targets to ISPC with minimal increase in binary size.

Bug Fixes:

Fixed crashes when casting SOA (slice) pointers to non-SOA pointer types.
Fixed handling of enum negation in constant folding.
Fixed slice pointer handling in pointer-to-integer casts.
Fixed type checking of expressions wrapped by TypeCastExpr.
Fixed indexing into function call results that return pointer types.
Fixed uniform bool return values that could incorrectly return 255 instead of 1.
Fixed shuffle-related optimization issues.
Fixed enum fields missing from generated C/C++ headers.
Fixed VNNI intrinsic validation on SKX target.
Fixed rounding operations for float16 on SSE2 targets by adding emulation.
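
To see why a uniform bool returned as 255 rather than 1 matters to C/C++ callers, consider the sketch below. It is an illustration under assumptions, not ISPC code: canonical_bool and count_flags are hypothetical helpers, and the flags are modelled as raw bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Collapse any nonzero "true" byte (for example 0xFF) to the canonical value 1. */
uint8_t canonical_bool(uint8_t raw) {
    return (uint8_t)(raw != 0);
}

/* Sum of flags. The result equals the number of true flags only if every
   flag is stored as exactly 0 or 1. */
unsigned count_flags(const uint8_t *flags, unsigned n) {
    assert(flags != 0);
    unsigned total = 0;
    for (unsigned i = 0; i < n; ++i) {
        total += flags[i];
    }
    return total;
}
```

Caller code that compares a result against 1, or sums flags as above, silently breaks when true is encoded as 255, even though a plain truthiness test still passes.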

New Example:

Added an AMX (Advanced Matrix Extensions) example demonstrating tile matrix operations.

Experimental RISC-V Support:

Initial support for the RISC-V 64-bit (riscv64) architecture has been added with RISC-V Vector Extension (RVV) ISA, introducing the rvv-x4 target for 4-wide vectorization. This support is experimental and not included in official ISPC binaries. To use it, build ISPC from source with the RISCV_ENABLED=ON CMake option or use pre-release binaries. Feedback and contributions are welcome.

Recommended versions of Runtime Dependencies when targeting GPU:

Linux:

Intel(R) Graphics Compute Runtime
https://github.com/intel/compute-runtime/releases/tag/25.13.33276.16
Level Zero Loader
https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2
Threading Building Blocks (TBB)

Alternatively, you can use a validated gfx driver stack supporting Intel Arc(TM)
available at https://dgpu-docs.intel.com/driver/installation.html

Windows:

Intel(R) Graphics Windows(R) DCH Drivers 32.0.101.8250
https://www.intel.com/content/www/us/en/download/785597/869290/intel-arc-graphics-windows.html
Level Zero Loader
https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2
OpenCL(TM) Offline Compiler (OCLOC)
https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html
(this is needed for AoT compilation on Windows only)
Supported GPU platforms: Intel(R) Arc Graphics, 11th-13th Gen Intel(R) Core
processor graphics

Components revisions used in GPU-enabled build:

KhronosGroup/SPIRV-LLVM-Translator@6dd8f2a
intel/vc-intrinsics@b980474
https://github.com/oneapi-src/level-zero/commit/v1.20.2
llvm/llvm-project@87f0227 (llvmorg-20.1.8) +
patches from llvm_patches folder

A minor ISPC update with bug fixes:

Fixed enum fields missing from generated headers.
Fixed boolean return value representation in exported functions.

A minor ISPC update with bug fixes:

Fixed compiler crash when indexing with pointers returned by function calls
Fixed compiler crash when assigning null pointers to struct members

ISPC release with enhanced struct operator support, the ability to use ISPC as a library with JIT support, simplified integration with Python via nanobind wrappers, enhanced standard library functions, and numerous stability and performance improvements. Based on a patched LLVM 20.1.8.

Language Changes

Struct operator overloading has been extended. Added support for overloading unary (++, --, -, !, ~), binary (*, /, %, +, -, >>, <<, ==, !=, <, >, <=, >=, &, |, ^, &&, ||), and assignment (=, +=, -=, *=, /=, %=, <<=, >>=, &=, |=, ^=) operators for struct types.
Integer literal rules are now stricter:
- Limits the number of [uUlL] symbols (e.g., ulll, uul, lulu are no longer valid).
- The value modification suffix ([kMG]) must precede the type modification suffix ([uUlL]).
- Like C/C++, lL and Ll suffixes are no longer allowed (mixing cases to form LL).
NaN to bool conversion now matches C/C++ behavior.
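
For reference, the C behavior that the NaN to bool conversion now matches can be shown in a few lines. This is a plain C sketch; float_to_bool is a hypothetical helper name.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* In C, a scalar converts to false only if it compares equal to 0.
   NaN compares unequal to every value, including 0, so it converts to true. */
bool float_to_bool(float x) {
    return (bool)x;
}

/* Documents the expected results of the conversion. */
void check_float_to_bool(void) {
    assert(float_to_bool(NAN));   /* NaN is true */
    assert(!float_to_bool(0.0f)); /* zero is false */
    assert(!float_to_bool(-0.0f)); /* negative zero is false */
    assert(float_to_bool(1.5f));  /* any other value is true */
}
```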

ISPC as a Library

ISPC can now be used as a C++ library (libispc) for embedding ISPC compilation directly into applications. It also provides CMake configuration files for easy integration into other CMake projects. The library includes experimental Just-In-Time (JIT) compilation for runtime code generation and execution. See the documentation and the new simple_lib and simple_jit examples demonstrating the ISPCEngine API.

Python Integration and Nanobind Support

ISPC can now generate nanobind wrappers for ISPC modules, enabling easy and lightweight integration with Python. The generated wrappers can be built into native Python modules and imported into Python code. The --nanobind-wrapper=<filename> option enables this feature.
Three new examples show ISPC integration with Python:
- point_transform_ctypes - calling an ISPC function via ctypes with three different input types.
- point_transform_nanobind - using nanobind to wrap an ISPC function for geometric transformations on 2D points, integrating with NumPy for high-performance vectorized processing.
- attention - single-head attention as used in transformer networks, featuring:
  - Multiple matrix multiplication methods (GOTO-based, tiled)
  - Memory pool management for efficient intermediate storage
  - Task-based parallelism for multi-core scaling
  - Softmax with optimized memory access patterns

Float16

Added a new command-line option --include-float16-conversions to include float16 conversion functions in the compiled module. Useful for targets without native float16 conversion instructions, such as x86 prior to AVX2. Disabled by default.

Standard Library Changes

select now supports unsigned integer types uint8, uint16, uint32, and uint64, as well as uniform short vectors.
Added new functions: isinf, isfinite, srgb8_to_float.
Short vector standard library functions have been moved to short_vec.isph. They are no longer implicitly available and must be included explicitly.
Added short vector type support for: fmod, isnan, rsqrt_fast, clamp.
Added an include/intrinsics directory with SSE intrinsic headers, useful for porting intrinsics-based code to ISPC.

Performance

Optimized shuffle/shift/rotate and reduce_equal on AVX-512 targets, with up to 90% speedup.

New Targets

Added CPU targets for AMD Zen4 and Zen5 architectures.

Crash Fixes

Fixed crashes related to atomic type handling.
Resolved nested foreach assertion failures.
Fixed variable scoping crashes in single-statement blocks.
Addressed unhandled atomic type crashes.
Fixed crashes related to signed keyword usage.
Resolved crash from extra index access in aggregate initialization.
Fixed crashes from templates without arguments.
Fixed crashes in short vector initialization and indexing.
Fixed type signature mismatch crash with uniform bool<N> parameters.

Function Call Enhancements

Added struct access support on function call returns.
Improved handling of functions returning pointer types.
Enhanced lvalue handling for function return expressions.

Ecosystem Improvements and Community Engagement

Modernized the ISPC website with a refreshed look and structure.
Open-sourced the ISPC VS Code plugin (https://github.com/ispc/vscode-ispc). Though it has not been maintained for some time, we are committed to improving it and releasing a new version. Community contributions are welcome.
We have set up a Discord server to better connect with our community and make it easier for ISPC users to support one another. Join the ISPC Discord server: https://discord.com/invite/9e7E7sFe2D

Recommended versions of Runtime Dependencies when targeting GPU

Linux:

Intel(R) Graphics Compute Runtime https://github.com/intel/compute-runtime/releases/tag/25.13.33276.16
Level Zero Loader https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2
Threading Building Blocks (TBB)

Alternatively, you can use a validated gfx driver stack supporting Intel® Arc™ available at https://dgpu-docs.intel.com/driver/installation.html

Windows:

Intel(R) Graphics Windows(R) DCH Drivers 32.0.101.6083 https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html
Level Zero Loader https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2
OpenCL™ Offline Compiler (OCLOC) https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html (this is needed for AoT compilation on Windows only)
Supported GPU platforms: Intel(R) Arc Graphics, 11th-13th Gen Intel(R) Core processor graphics

Components revisions used in GPU-enabled build:

KhronosGroup/SPIRV-LLVM-Translator@6dd8f2a1
intel/vc-intrinsics@b980474c
oneapi-src/level-zero@d7a44e0 (v1.20.2)
llvm/llvm-project@87f0227 (llvmorg-20.1.8) + patches from llvm_patches folder

This release introduces AVX10.2 support, extended standard library coverage for small integer types, full support for element-wise functions on short vectors in the standard library, and numerous bug fixes. It is based on a patched LLVM 20.1.4.

New targets

New targets have been added for platforms supporting Intel® Advanced Vector Extensions 10.2: avx10.2-x4, avx10.2-x8, avx10.2-x16, avx10.2-x32, and avx10.2-x64.

Standard library

Cross-lane operations - broadcast, rotate, shift, and shuffle - are now supported for unsigned types.
Reduction functions now support signed and unsigned int8 and int16 types.
Support for packed_load and packed_store has been extended to include: int8, int16 (signed and unsigned), float16, float, and double.
The cube root function cbrt has been added to the standard library for float and double types.
Dot product functionality has been enhanced with mixed signedness support for 16-bit integers. The following input combinations are now supported: u16 x u16 (unsigned x unsigned), i16 x i16 (signed x signed), u16 x i16 (mixed signedness). For consistency with other naming conventions, the function dot2add_i16_packed has been renamed to dot2add_i16i16_packed.
Support for short vector types has been added for the following element-wise functions: min, max, abs, round, floor, ceil, trunc, rcp, rcp_fast, sqrt, rsqrt, sin, asin, cos, acos, tan, atan, exp, log, atan2, pow, and cbrt.
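
As a rough scalar model of what the packed load and store operations do, the C sketch below compacts active lanes into contiguous memory and expands them back. It illustrates the idea for int16 only. The function names, the byte-per-lane mask, and the explicit width parameter are assumptions made for this sketch and are not the ISPC standard library signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed store model: write values of active lanes contiguously to dst and
   return how many values were written. */
size_t packed_store_model(int16_t *dst, const int16_t *lanes,
                          const uint8_t *mask, size_t width) {
    assert(dst && lanes && mask);
    size_t out = 0;
    for (size_t i = 0; i < width; ++i) {
        if (mask[i]) {
            dst[out++] = lanes[i];
        }
    }
    return out;
}

/* Packed load model: read consecutive values from src into the active lanes
   and return how many values were consumed. Inactive lanes are untouched. */
size_t packed_load_model(const int16_t *src, int16_t *lanes,
                         const uint8_t *mask, size_t width) {
    assert(src && lanes && mask);
    size_t in = 0;
    for (size_t i = 0; i < width; ++i) {
        if (mask[i]) {
            lanes[i] = src[in++];
        }
    }
    return in;
}
```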

Language changes

The aligned(N) attribute is now available to specify the alignment of variables and struct types.
A bug was fixed where unsigned array indices or pointer arithmetic with unsigned offsets could result in overflow due to sign extension when promoting to pointer size. This issue is now resolved, and the compiler correctly handles unsigned integer indexing and pointer arithmetic.
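
The effect of sign extension on an unsigned index can be reproduced in plain C. The sketch below is an illustration only, with hypothetical function names. It contrasts the faulty promotion with the correct zero extension for a 32-bit index widened to a 64-bit offset.

```c
#include <assert.h>
#include <stdint.h>

/* Faulty promotion: the unsigned index is treated as signed before widening,
   so indices of 2^31 and above turn into negative offsets. The narrowing cast
   is implementation-defined before C23; mainstream compilers wrap in two's
   complement. */
int64_t promote_sign_extended(uint32_t index) {
    return (int64_t)(int32_t)index;
}

/* Correct promotion: zero extension keeps the full unsigned range. */
int64_t promote_zero_extended(uint32_t index) {
    return (int64_t)index;
}

/* Documents the difference for the first index that needs bit 31. */
void check_promotion(void) {
    assert(promote_zero_extended(0x80000000u) > 0);
    assert(promote_sign_extended(0x80000000u) < 0);
    assert(promote_zero_extended(7u) == promote_sign_extended(7u));
}
```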

Compiler Switches Behavior

The -dD and -dM flags are now supported, aiding in debugging the preprocessor and inspecting defined macros.

Template support bug fixes

Fixed instantiation of template functions when assigned to function pointers.
Improved implicit template argument deduction.
Fixed a crash occurring when a nested template function did not use a templated argument.

Performance improvements

Improved the performance of masked loads and stores for AVX-512 x32 and x64 targets by an order of magnitude (approximately 10x on microbenchmarks).
Improved packed_store_active2 on AVX2: ~65% speedup for int32 and ~45% speedup for int64.

Other bug fixes

Fixed a crash during integer division by ensuring it occurs only on active lanes, improving stability and performance.
Resolved crashes related to:
- Incomplete struct types
- Use of enum fields in structs
- Pointer declarations to function types
- Unsupported binary operations on pointer types
- Casting to unsized arrays in malformed code
- Accessing array elements through pointers
- Structure member access within pointer arithmetic
Improved compiler warnings for incomplete types.
Corrected address calculations involving unsigned indices in array accesses and pointer arithmetic.
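
Why division must be restricted to active lanes can be sketched with a scalar model. This is an assumption-laden illustration, not the code ISPC generates: masked_div is a hypothetical name and the mask is one byte per lane. Inactive lanes often hold a zero divisor precisely because the program guarded the division with a condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Divide only in active lanes. Inactive lanes keep their previous result and
   their divisor is never used, so a zero there cannot trap. */
void masked_div(int32_t *result, const int32_t *num, const int32_t *den,
                const uint8_t *mask, size_t width) {
    assert(result && num && den && mask);
    for (size_t i = 0; i < width; ++i) {
        if (mask[i]) {
            assert(den[i] != 0); /* active lanes must have a valid divisor */
            result[i] = num[i] / den[i];
        }
    }
}
```

Performing the division unconditionally on every lane would divide by the zeros parked in inactive lanes, which on x86 raises a hardware exception.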

Ecosystem improvements

ISPC is now supported by GitHub's Linguist, enabling proper syntax highlighting for .ispc files on GitHub.
ISPC syntax support has been added to the following editors, thanks to community contributions:
- CudaText - Alexey-T/CudaText#5944
- ECoder - SpartanJ/ecode#436
If you are integrating ISPC with Python, we recommend using nanobind. Examples are available, and we plan to generate nanobind-compatible headers in a future release.

Recommended versions of Runtime Dependencies when targeting GPU

Linux:

Intel(R) Graphics Compute Runtime
https://github.com/intel/compute-runtime/releases/tag/25.13.33276.16
Level Zero Loader
https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2
Threading Building Blocks (TBB)

Alternatively, you can use a validated gfx driver stack supporting Intel® Arc™
available at https://dgpu-docs.intel.com/driver/installation.html

Windows:

Intel(R) Graphics Windows(R) DCH Drivers 32.0.101.6083
https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html
Level Zero Loader
https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2
OpenCL™ Offline Compiler (OCLOC)
https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html
(this is needed for AoT compilation on Windows only)
Supported GPU platforms: Intel(R) Arc Graphics, 11th-13th Gen Intel(R) Core
processor graphics

Components revisions used in GPU-enabled build:

KhronosGroup/SPIRV-LLVM-Translator@6dd8f2a1
intel/vc-intrinsics@b980474c
oneapi-src/level-zero@d7a44e0 (v1.20.2)
llvm/llvm-project@ec28b8f (llvmorg-20.1.4) + patches from llvm_patches folder

ISPC release featuring improved ARM support, new "generic" targets that simplify ISPC's internal design and streamline the addition of new targets, improved code generation across x86 and ARM, and multiple stability fixes. This release is based on a patched LLVM 18.1.8.

ARM Support Changes:

The --arch=arm flag, which previously mapped to ARMv7 (32-bit), now maps to ARMv8 (32-bit). There are no changes to --arch=aarch64, which continues to map to ARMv8 (64-bit).
The CPU definitions for the ARMv7 architecture have been removed: cortex-a9 and cortex-a15.
New CPU definitions have been introduced, including cortex-a55, cortex-a78, cortex-a510, and cortex-a520, along with support for new Apple devices.
New double-pumped targets have been introduced: neon-i16x16 and neon-i8x32.
Dot product operations are now supported using native ARM instructions (sdot/udot).
Performance on ARMv8 has been improved by an average of 13%.

Generic Targets:

In this release, generic targets were introduced in ISPC. Their main goal is to simplify ISPC target management and serve as the foundation for hardware-specific targets, requiring only selective tuning when performance expectations are not met.

ARM targets have been refactored to use generic targets as a baseline, resulting in cleaner code and improved performance. This change also makes it easier to add support for new architectures, such as RISC-V or any other LLVM-supported target.

Generic targets can also be used as standalone targets in cases where no native target exists with the required width for a particular CPU (e.g., a 32-wide target for SSE4). This can be done by specifying the following options in ISPC:

--target=generic-i1x32 --cpu=penryn

A complete list of all generic targets and the architectures they support can be found in the output of:

ispc --support-matrix

Code Generation:

The -O1 optimization pipeline has been further optimized for size: loop unrolling and function inlining have been adjusted accordingly.
Improved generated code for the count_leading_zeros and count_trailing_zeros functions by producing native instructions (e.g. vplzcntq).
Improved generated code for masked load/stores for int8/int16 types on AVX512 by generating native instructions (vmovdqu8, vmovdqu16).
Improved code generation when returning structs from functions by eliminating unnecessary mov instructions.

Language Changes:

Enhanced support for LLVM intrinsics when the --enable-llvm-intrinsics flag is used, including support for intrinsics with no arguments and overloaded intrinsics.
Added user-visible macro definitions for the LLVM version that ISPC is based on.
The __attribute__((deprecated)) attribute can now be applied to functions, generating a warning when the function is called.

Deprecated Targets:

The KNL (avx512knl-x16) target has been removed.

Compiler Switches Behavior:

The --darwin-version-min option has been added to specify the minimum deployment target version for macOS and iOS applications. This addresses a new linker behavior introduced in Xcode 15.0, which issues a warning when no version is provided.
The --nocpp command-line flag is now deprecated and will be removed in a future release.

Dispatch Behavior:

The behavior of user programs when no supported ISA is detected in the auto-dispatch code has changed. Instead of raising the SIGABRT signal, the system will now raise SIGILL. This affects users who rely on SIGABRT in their signal handlers for error handling or recovery. Such users must update their code to handle SIGILL instead. This change improves predictability and removes the dispatcher's reliance on the C standard library.
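
Applications that previously caught SIGABRT for this case could migrate along the lines of the C sketch below. It uses only standard signal handling. The function and variable names are hypothetical, and how an application recovers after the signal is outside the scope of these notes.

```c
#include <assert.h>
#include <signal.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t unsupported_isa_seen = 0;

/* Handler for SIGILL. It only records the event. A production handler should
   terminate or jump out instead of returning, because returning from a SIGILL
   caused by a faulting instruction typically re-executes that instruction. */
static void on_sigill(int signo) {
    (void)signo;
    unsupported_isa_seen = 1;
}

/* Register the handler for SIGILL (previously this would have been SIGABRT).
   Returns 0 on success and -1 on failure. */
int install_sigill_handler(void) {
    return signal(SIGILL, on_sigill) == SIG_ERR ? -1 : 0;
}

/* Reports whether the handler has fired. */
int isa_failure_detected(void) {
    return unsupported_isa_seen != 0;
}
```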

Bug Fixes:

Fixed a crash for functions returning pointers.
Fixed incorrect values for some predefined macros.
Fixed a crash when using sizeof as a global variable initializer.
Fixed function template overload resolution issues.
Fixed incorrect behavior in short vector casts inside templates.
Fixed incorrect zero handling in the ldexp standard library function.

Recommended versions of Runtime Dependencies when targeting GPU:

Linux:

Intel(R) Graphics Compute Runtime https://github.com/intel/compute-runtime/releases/tag/24.35.30872.22
Level Zero Loader https://github.com/oneapi-src/level-zero/releases/tag/v1.17.28
Threading Building Blocks (TBB)

Alternatively, you can use a validated gfx driver stack supporting Intel® Arc™ available at https://dgpu-docs.intel.com/driver/installation.html

Windows:

Intel(R) Graphics Windows(R) DCH Drivers 32.0.101.6083
https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html
Level Zero Loader
https://github.com/oneapi-src/level-zero/releases/tag/v1.17.28
OpenCL™ Offline Compiler (OCLOC)
https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html
(this is needed for AoT compilation on Windows only)
Supported GPU platforms: Intel(R) Arc Graphics, 11th-13th Gen Intel(R) Core processor graphics

Components revisions used in GPU-enabled build:

KhronosGroup/SPIRV-LLVM-Translator@43fb73fe
intel/vc-intrinsics@4f5bc1bb
oneapi-src/level-zero@c1f6e28 (v1.17.28)
https://github.com/llvm/llvm-project/commit/3b5b5c1 (llvmorg-18.1.8) + patches from llvm_patches folder

Automatically updated trunk artifacts

A minor ISPC update with a fix for --vectorcall calling convention mode on Windows.

A minor ISPC update with several bug fixes:

Fixed broken --vectorcall calling convention mode on Windows.
Fixed build error on FreeBSD.
Removed in-memory ISPC headers (/core.isph, /stdlib.isph) from dependencies for -M switch for Windows binaries.
Fixed linker errors on Windows for multi-target compilation.

Releases: ispc/ispc

=== v1.29.1 === (19 December 2025)

=== v1.29.0 === (17 December 2025)

=== v1.28.2 === (24 September 2025)

=== v1.28.1 === (21 August 2025)

=== v1.28.0 === (13 August 2025)

=== v1.27.0 === (15 May 2025)

=== v1.26.0 === (6 February 2025)

trunk-artifacts

=== v1.25.3 === (8 November 2024)

=== v1.25.2 === (2 November 2024)
