A high-performance tensor library built on WebGPU, designed for both eager and lazy execution with automatic CPU/GPU device management.
What are Compute Shaders?
Compute shaders are GPU programs that run massively parallel computations. Unlike graphics shaders that render pixels, compute shaders can perform arbitrary calculations on large datasets. WebGPU exposes this power through a modern, cross-platform API that works in browsers and native environments.
Performance Note: For small matrices (< 1K elements), CPU is often faster due to GPU setup overhead. For large matrices (> 100K elements), GPU parallelism dominates. Our library handles both seamlessly.
TypeScript: Full type safety with device-aware types
API Examples
Eager Execution (Current)
```typescript
import { Tensor } from "./src/tensor.ts";

// Create tensors on CPU or GPU
const a = Tensor.fromArray([1, 2, 3, 4, 5], { device: "gpu" });
const b = Tensor.fromArray([2, 3, 4, 5, 6], { device: "gpu" });

// Method chaining works - operations execute immediately
const result = await a.add(b).mul(2); // Each step creates a new tensor
const cpuResult = await result.to("cpu"); // Transfer back
console.log(cpuResult.toArray()); // [6, 10, 14, 18, 22]
```
Compiled Execution (Planned)
```typescript
import { Tensor, compile } from "./src/index.ts";

// Same chaining syntax, but with major optimizations:
const fusedOp = compile((x: Tensor, y: Tensor) => {
  return x.add(y).mul(2).relu(); // Fused into a single kernel
});

// Compiled mode advantages:
// ✅ Kernel fusion: single compute shader instead of 3 separate ones
// ✅ In-place operations: dynamic buffer allocator minimizes memory usage
// ✅ Auto cleanup: intermediate tensors destroyed at closure end
// ✅ Reuse: first call compiles, subsequent calls are blazing fast
const result = fusedOp(a, b); // Much faster than eager
```
Design Decisions
WebGPU Async Nature: WebGPU operations are inherently async, but we don't always await intermediate steps since the runtime automatically queues and awaits necessary operations. This allows for better performance through automatic batching.
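The queueing idea above can be sketched in plain TypeScript. This is an illustrative model only, not the library's implementation: the `LazyTensor` class and its `resolve()` method are hypothetical names, and real batching happens on the GPU queue rather than in an array of closures.

```typescript
// Minimal sketch of automatic queueing (illustrative, not the library's code):
// each method records an op instead of executing it; resolving the final
// tensor flushes the whole queue at once, which is what enables batching.
class LazyTensor {
  constructor(
    private data: number[],
    private queue: Array<(v: number[]) => number[]> = [],
  ) {}

  add(other: number[]): LazyTensor {
    return new LazyTensor(this.data, [
      ...this.queue,
      (v) => v.map((x, i) => x + other[i]),
    ]);
  }

  mul(k: number): LazyTensor {
    return new LazyTensor(this.data, [
      ...this.queue,
      (v) => v.map((x) => x * k),
    ]);
  }

  // One flush executes every queued op back-to-back.
  async resolve(): Promise<number[]> {
    return this.queue.reduce((v, op) => op(v), this.data);
  }
}

new LazyTensor([1, 2, 3])
  .add([4, 5, 6]) // queued, not executed
  .mul(2)         // queued, not executed
  .resolve()      // single flush runs both ops
  .then((v) => console.log(v)); // [10, 14, 18]
```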
Syntax Choices:
Tensor.fromArray() for explicit construction
.to(device) for clear device transfers
Method chaining with automatic queueing
Device-aware TypeScript types prevent cross-device errors at compile time
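The device-aware typing mentioned above can be approximated with a generic parameter over the device string. The names below (`TypedTensor`, `add`) are illustrative assumptions, not the library's actual API; the point is that mismatched devices become a compile-time error.

```typescript
// Sketch of device-aware typing (illustrative names, not the real API).
type Device = "cpu" | "gpu";

interface TypedTensor<D extends Device> {
  device: D;
  data: number[];
}

// Both arguments must share the same device type parameter, so mixing
// a "cpu" tensor with a "gpu" tensor fails at compile time rather than
// crashing at runtime.
function add<D extends Device>(
  x: TypedTensor<D>,
  y: TypedTensor<D>,
): TypedTensor<D> {
  return { device: x.device, data: x.data.map((v, i) => v + y.data[i]) };
}

const cpuA: TypedTensor<"cpu"> = { device: "cpu", data: [1, 2, 3] };
const cpuB: TypedTensor<"cpu"> = { device: "cpu", data: [4, 5, 6] };
const sum = add(cpuA, cpuB); // OK: both on "cpu"
// add(cpuA, gpuTensor);     // Type error: "gpu" is not assignable to "cpu"
console.log(sum.data); // [5, 7, 9]
```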
Platform Support
Currently using Deno exclusively due to its excellent built-in WebGPU support (--unstable-webgpu). Future plans include bundling for web browsers and Node.js.
deno task dev # Run basic examples
deno run --unstable-webgpu --allow-all examples/basic_add.ts
deno run --unstable-webgpu --allow-all examples/performance_comparison.ts
Run Tests
deno task test  # Run all tests
deno test --unstable-webgpu --allow-all tests/tensor_test.ts
Contributing
We need help with several areas:
🔧 API Improvements
Better TypeScript support: Current device typing could be more ergonomic
Shape broadcasting: Automatic shape compatibility
Error handling: Better error messages and recovery
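For contributors interested in the shape-broadcasting task, here is one possible starting point: a NumPy-style shape-compatibility check. This is a hedged sketch, not existing library code; the function name `broadcastShapes` is an assumption.

```typescript
// Hypothetical helper for the "shape broadcasting" task: computes the
// broadcast result shape the NumPy way, aligning shapes from the trailing
// dimension and treating size-1 dimensions as stretchable.
function broadcastShapes(a: number[], b: number[]): number[] {
  const out: number[] = [];
  const len = Math.max(a.length, b.length);
  for (let i = 1; i <= len; i++) {
    const da = a[a.length - i] ?? 1; // missing leading dims count as 1
    const db = b[b.length - i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`Shapes [${a}] and [${b}] are not broadcastable`);
    }
    out.unshift(Math.max(da, db));
  }
  return out;
}

console.log(broadcastShapes([5, 1, 3], [4, 3])); // [5, 4, 3]
```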
🐛 Bug Hunting
Memory leaks: GPU buffer cleanup
Edge cases: Empty tensors, large arrays, device switching
Performance regressions: Benchmark against baselines
⚡ New Kernels
Easy starter tasks! Copy src/kernels/add.ts to implement:
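As a rough guide to what a new element-wise kernel might look like, here is a WGSL sketch for a `mul` kernel. The binding layout and workgroup size are assumptions about how `src/kernels/add.ts` is structured; check the actual file before copying this.

```wgsl
// Hypothetical element-wise mul kernel (binding layout is an assumption;
// mirror whatever src/kernels/add.ts actually uses).
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  // Guard against the final partially-filled workgroup.
  if (i < arrayLength(&out)) {
    out[i] = a[i] * b[i];
  }
}
```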