WebGPU Meets Zero-Knowledge: Can Browser-Based Proving Finally Work?

zkSecurity integrated WebGPU compute shaders with StarkWare's Stwo prover and hit 5x speedups on constraint polynomial evaluation.

Zero-knowledge proofs have a performance problem. The cryptographic math that lets you prove something without revealing it is computationally brutal, and running that math in a browser on consumer hardware has felt like a distant stretch goal. But recent work from zkSecurity suggests WebGPU might actually close the gap.

The team integrated WebGPU compute shaders with StarkWare's Stwo prover and walked away with numbers worth paying attention to: a 5x improvement for constraint polynomial evaluation and 2x for the overall proving pipeline. Not theoretical benchmarks: actual speedups on a production-grade ZK system.



WGSL compute shader implementation of the Number Theoretic Transform algorithm with butterfly operations for zero-knowledge proof generation.

Why Browser Proving Matters

Client-side proof generation is the holy grail for privacy-focused applications. If a user generates their ZK proof locally, their private data never touches a server. The whole point of zero-knowledge is keeping secrets secret, and shipping your secrets to the cloud kind of defeats the purpose.

The catch is that proof generation involves heavy polynomial arithmetic, and lots of it. The Number Theoretic Transform, essentially an FFT over finite fields, dominates the workload. In ZK protocols, NTT handles polynomial interpolation and evaluation. These operations are highly parallel, or at least should be, which makes them a natural fit for GPU acceleration. The open question has always been whether WebGPU could deliver on that promise in practice.
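To make "an FFT over finite fields" concrete, here is a minimal Python sketch of a naive NTT and its inverse. The toy prime 257, the generator 3, and all function names are assumptions chosen for illustration. This is not Stwo's field or code, and the O(n^2) loops are the definition, not an optimized algorithm.

```python
# Toy parameters (assumptions for illustration, not Stwo's field).
P = 257        # prime with P - 1 = 2^8, so power-of-two roots of unity exist
GENERATOR = 3  # a primitive root modulo 257


def root_of_unity(n):
    """Return a primitive n-th root of unity mod P (n must divide P - 1)."""
    assert (P - 1) % n == 0
    return pow(GENERATOR, (P - 1) // n, P)


def naive_ntt(coeffs):
    """Evaluate the polynomial at every n-th root of unity (O(n^2) work)."""
    n = len(coeffs)
    w = root_of_unity(n)
    return [
        sum(c * pow(w, i * j, P) for j, c in enumerate(coeffs)) % P
        for i in range(n)
    ]


def naive_intt(evals):
    """Interpolate back to coefficients using the inverse root and 1/n."""
    n = len(evals)
    w_inv = pow(root_of_unity(n), P - 2, P)  # Fermat inverse
    n_inv = pow(n, P - 2, P)
    return [
        n_inv * sum(e * pow(w_inv, i * j, P) for j, e in enumerate(evals)) % P
        for i in range(n)
    ]


# Multiply (1 + 2x) by (3 + x): evaluate, multiply pointwise, interpolate.
a = naive_ntt([1, 2, 0, 0])
b = naive_ntt([3, 1, 0, 0])
product = naive_intt([x * y % P for x, y in zip(a, b)])
print(product)  # [3, 7, 2, 0], i.e. 3 + 7x + 2x^2
```

Evaluation and interpolation are the two directions of the same transform, which is why speeding up the NTT speeds up both.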


Technical Context

For readers who want the technical details, three topics shape these solutions: the NTT implementation itself, the GPU memory hierarchy, and the limitations of WGSL.

Building an NTT From Scratch

zkSecurity's post walks through building an NTT implementation in WGSL from scratch, using the Cooley-Tukey algorithm with its characteristic butterfly operations. They start with a naive single-threaded version and systematically add parallelism and workgroup memory optimizations.
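For readers who have not seen the butterfly structure, the following is a rough single-threaded Python sketch of an iterative Cooley-Tukey NTT. It is not zkSecurity's WGSL code. The toy prime 257, the generator 3, and the function names are assumptions made for the example.

```python
# Toy parameters (assumptions for illustration, not Stwo's field).
P = 257
GENERATOR = 3  # primitive root modulo 257


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def ntt(coeffs):
    """Iterative decimation-in-time NTT; len(coeffs) must be a power of two."""
    n = len(coeffs)
    assert n & (n - 1) == 0 and (P - 1) % n == 0
    bits = n.bit_length() - 1
    a = [0] * n
    for i, c in enumerate(coeffs):
        a[bit_reverse(i, bits)] = c % P  # reorder the input once up front

    half = 1
    while half < n:  # log2(n) stages
        w_step = pow(GENERATOR, (P - 1) // (2 * half), P)
        for start in range(0, n, 2 * half):
            w = 1
            for k in range(half):
                # One butterfly: combine a pair of values with a twiddle factor.
                u = a[start + k]
                v = a[start + k + half] * w % P
                a[start + k] = (u + v) % P
                a[start + k + half] = (u - v) % P
                w = w * w_step % P
        half *= 2
    return a
```

Each stage performs n/2 butterflies, so the total work is on the order of n log n instead of the n^2 of direct evaluation.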

Here's the counterintuitive part: the single-threaded GPU version actually runs slower than single-threaded Rust. GPUs want thousands of threads doing simple work in lockstep, not one thread chugging through sequential logic. A CPU has a few dozen powerful cores optimized for complex sequential tasks. A GPU has thousands of simpler execution units that only shine when you feed them parallel workloads. The payoff comes when you restructure the algorithm to let the hardware do what it does best.
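One way to picture that restructuring: within a stage, every butterfly touches its own pair of elements, so each one can be handed to a separate thread. The Python sketch below simulates this by mapping a hypothetical thread id to a butterfly and running the ids in an arbitrary order. The names, the toy prime 257, and the generator 3 are assumptions, and a real shader would use a barrier or a separate dispatch between stages.

```python
# Toy parameters (assumptions for illustration, not Stwo's field).
P = 257
GENERATOR = 3


def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def butterfly(buf, tid, half):
    """Work for one simulated GPU thread: exactly one butterfly."""
    block, k = divmod(tid, half)
    lo = block * 2 * half + k
    hi = lo + half
    w = pow(GENERATOR, (P - 1) // (2 * half) * k, P)
    u, v = buf[lo], buf[hi] * w % P
    buf[lo], buf[hi] = (u + v) % P, (u - v) % P  # no other thread uses lo/hi


def ntt_by_stages(coeffs, order=None):
    """NTT where each stage is n/2 independent 'threads'.

    `order` is an optional function that permutes the thread ids, used here
    to show that the execution order inside a stage does not matter.
    """
    n = len(coeffs)
    bits = n.bit_length() - 1
    buf = [0] * n
    for i, c in enumerate(coeffs):
        buf[bit_reverse(i, bits)] = c % P

    half = 1
    while half < n:
        tids = list(range(n // 2))
        if order is not None:
            tids = order(tids)
        for tid in tids:  # on a GPU these would run concurrently
            butterfly(buf, tid, half)
        half *= 2  # the stage boundary plays the role of a barrier
    return buf
```

Calling `ntt_by_stages(coeffs, order=lambda ids: ids[::-1])` gives the same result as the default order, which is the property a parallel dispatch relies on.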

The Memory Hierarchy Problem

The article digs into the GPU memory hierarchy, which matters enormously here. WebGPU exposes three levels of memory through WGSL address spaces: private memory per thread, workgroup memory shared by the threads in the same workgroup, and global storage buffers. Getting data into the right level at the right time is where the real optimization happens. Workgroup memory lets threads cooperate on butterfly operations without hammering global memory on every access.
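The idea behind the workgroup optimization can be sketched in Python with a simple counting model. Here a hypothetical workgroup copies one tile into a local list, runs every stage that fits inside the tile, and writes it back once. The tile size, the access counter, the toy prime 257, and all names are assumptions for illustration, not measurements or code from the zkSecurity post.

```python
# Toy parameters (assumptions for illustration, not Stwo's field).
P = 257
GENERATOR = 3


def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def run_stage(buf, half):
    """Apply one butterfly stage to buf in place."""
    for tid in range(len(buf) // 2):
        block, k = divmod(tid, half)
        lo = block * 2 * half + k
        hi = lo + half
        w = pow(GENERATOR, (P - 1) // (2 * half) * k, P)
        u, v = buf[lo], buf[hi] * w % P
        buf[lo], buf[hi] = (u + v) % P, (u - v) % P


def ntt_tiled(coeffs, tile):
    """Return (result, modeled global-memory accesses). tile must divide n."""
    n = len(coeffs)
    bits = n.bit_length() - 1
    global_buf = [0] * n
    for i, c in enumerate(coeffs):
        global_buf[bit_reverse(i, bits)] = c % P
    accesses = 0

    # Phase 1: early stages only pair elements inside an aligned tile, so a
    # workgroup can do them in fast local memory.
    for base in range(0, n, tile):
        local = global_buf[base:base + tile]  # one global read per element
        half = 1
        while half < tile:
            run_stage(local, half)
            half *= 2
        global_buf[base:base + tile] = local  # one global write per element
        accesses += 2 * tile

    # Phase 2: later stages pair elements from different tiles, so they work
    # on global memory: every element is read and written once per stage.
    half = tile
    while half < n:
        run_stage(global_buf, half)
        accesses += 2 * n
        half *= 2
    return global_buf, accesses
```

Under this model a 16-element transform costs 128 global accesses with a tile of 2, 96 with a tile of 4, and 32 when the whole input fits in one tile, while the output is identical in every case.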

But the biggest lesson from their profiling is familiar to anyone who's touched GPU compute: the computation itself often isn't your bottleneck. CPU-to-GPU buffer copies are. Data transfer overhead can eclipse the gains from parallel execution if you're not careful. The fix is restructuring algorithms to keep data on the GPU longer and minimize round trips. In short, it's the usual advice: do more work per dispatch and batch aggressively.

WGSL's Rough Edges

WGSL throws some curveballs too. There are no native 64-bit integers, which matters when you're doing modular arithmetic on large field elements. You have to decompose values into 32-bit pieces and manage the carries yourself. A runtime-sized array can only appear as the last member of a struct in a storage buffer (or as the buffer's entire contents), which constrains how you structure your data. Not dealbreakers, but friction you need to plan for when porting existing proving code.
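As an illustration of the decomposition, here is one common way to build a 32x32 to 64-bit multiply out of 16-bit halves so that no intermediate value exceeds 32 bits. It is written in Python for readability. The function name is hypothetical, and this is a generic schoolbook technique, not necessarily the approach used in the zkSecurity shaders.

```python
def mul_wide_u32(a, b):
    """Return (hi, lo) 32-bit words of a * b for 0 <= a, b < 2^32.

    Every intermediate value stays below 2^32, mirroring what a shader
    limited to u32 arithmetic would have to do.
    """
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16

    # Four 16x16 partial products, each below 2^32.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Middle column (bits 16..31) plus the carry coming out of `ll`.
    mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF)  # below 3 * 2^16

    lo = ((mid & 0xFFFF) << 16) | (ll & 0xFFFF)
    hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16)
    return hi, lo


hi, lo = mul_wide_u32(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(hi), hex(lo))  # 0xfffffffe 0x1
```

The same pattern extends to additions with carry and to wider field elements, at the cost of more instructions per field operation.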

Where This Fits in the Ecosystem

This work targets StarkWare's Stwo prover, which implements the Circle STARK protocol over a 31-bit Mersenne prime field. The field choice is deliberate—Mersenne primes allow faster modular reduction, and 31 bits fits nicely into hardware that likes 32-bit operations. Stwo already supports CPU, SIMD, and CUDA GPU backends. WebGPU and WASM are next on the roadmap.
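The faster reduction follows from the identity 2^31 ≡ 1 (mod 2^31 - 1): the high bits of a product can be folded onto the low bits with a shift and an add instead of a division. A minimal Python sketch follows. The function names are assumptions, and on a GPU the 62-bit product would live in two 32-bit words; Python integers hide that detail here.

```python
M31 = (1 << 31) - 1  # the Mersenne prime 2^31 - 1


def reduce_m31(x):
    """Reduce 0 <= x < 2^62 modulo 2^31 - 1 using shifts and adds only."""
    x = (x & M31) + (x >> 31)  # fold once: result is below 2^32
    x = (x & M31) + (x >> 31)  # fold again: result is at most 2^31 - 1
    return 0 if x == M31 else x  # M31 itself represents zero


def mul_m31(a, b):
    """Multiply two field elements (each below 2^31 - 1)."""
    return reduce_m31(a * b)


print(mul_m31(M31 - 1, M31 - 1))  # 1, since (-1) * (-1) = 1 in the field
```

With a generic 31-bit prime, the same reduction would need a division or a Barrett or Montgomery step, which is the cost the Mersenne choice avoids.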

StarkWare recently rebranded Stwo as S-two and positioned it as the foundation for client-side proving. Their benchmarks claim it outperforms competing zkVMs by 10-30x on operations like Keccak hashing. The target is instant local proving with compact proofs verified on-chain: privacy-preserving transactions, verifiable AI inference, digital identity without revealing your actual identity. In November 2025, S-two went live on Starknet mainnet, replacing the original Stone prover.

Ingonyama has also been working on GPU acceleration for Stwo through their ICICLE library, reporting 3.25-7x improvements over the SIMD backend using CUDA. The WebGPU approach trades some raw performance for portability—you lose access to CUDA-specific optimizations, but you gain the ability to run in any browser on any platform without asking users to install anything.

The Maintenance Challenge

zkSecurity notes the maintenance challenge ahead: porting proving code to WGSL, keeping it updated as upstream implementations evolve, running additional audits. They suggest the ecosystem needs a production-ready library of common shader routines—something like ICICLE but targeting WebGPU specifically.


Is It Worth Your Time?

If you work on cryptographic applications in the browser, this is required reading. The NTT walkthrough alone is a solid tutorial on compute shader optimization patterns that apply well beyond ZK. The progression from naive single-threaded code to optimized parallel implementation with workgroup memory is a useful template for any compute-heavy WGSL work.

For the rest of us, it's a signal that WebGPU compute is maturing into a serious tool for non-graphics workloads. The ZK space is pushing hard on client-side proving, and WebGPU is how they plan to get there.


Further Reading