
WebGPU Compute Benchmark - Online GPU Performance Test


Test your GPU's compute performance with real WebGPU compute shaders

Frequently Asked Questions

WebGPU is a modern graphics and compute API for the web, providing low-level access to your GPU. This benchmark runs real compute shaders (matrix multiplication and vector operations) directly on your GPU using WebGPU, measuring raw computational throughput in GFLOPS (billions of floating-point operations per second). It's a genuine GPU workload — not a synthetic browser test.

GFLOPS = Giga (Billion) Floating-Point Operations Per Second. It measures how many billion math operations your GPU can perform each second.

Rough reference ranges for the tiled matrix multiplication test:
• Integrated laptop GPU: 200–800 GFLOPS
• Mid-range dedicated GPU: 2,000–8,000 GFLOPS
• High-end GPU (RTX 4080/4090): 15,000–40,000+ GFLOPS
• Mobile phone GPU: 50–400 GFLOPS
Note: WebGPU may not reach peak theoretical performance due to API overhead and browser limitations.
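
As a rough illustration of where a GFLOPS figure comes from, the sketch below converts a timed matrix multiplication into GFLOPS. It assumes square n × n matrices and the conventional count of 2n³ floating-point operations (one multiply and one add per inner-loop step). The function name `matmulGflops` is illustrative and not taken from this tool's source.

```typescript
// Hypothetical helper: convert a timed n x n matrix multiply into GFLOPS.
// Assumes the conventional operation count of 2 * n^3 floating-point operations.
function matmulGflops(n: number, elapsedSeconds: number): number {
  if (n <= 0 || elapsedSeconds <= 0) {
    throw new RangeError("n and elapsedSeconds must be positive");
  }
  const flops = 2 * n * n * n;
  return flops / elapsedSeconds / 1e9;
}

// Example: a 1024 x 1024 multiply that took 10 ms.
console.log(matmulGflops(1024, 0.01).toFixed(1)); // 214.7
```

Under these assumptions, halving the measured time doubles the reported GFLOPS, which is why timing noise has a large effect on small matrix sizes.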

Several factors can make WebGPU benchmark scores lower than those from native tools (such as CUDA or OpenCL benchmarks):
• Browser overhead: WebGPU runs inside a browser sandbox with additional safety checks.
• Shader compilation: WGSL shaders are compiled by the browser, potentially with different optimizations.
• Power management: The operating system or browser may run the benchmark on a low-power GPU or throttle GPU performance to save energy, especially on laptops.
• API maturity: WebGPU is still evolving; performance may improve with browser updates.
For the most accurate comparison, always use the same browser version and close other GPU-intensive applications.

Desktop:
• Chrome 113+ (fully supported)
• Edge 113+ (fully supported)
• Opera 99+ (fully supported)
• Firefox: Nightly builds, with the dom.webgpu.enabled preference turned on in about:config
• Safari: experimental support in Safari 17+ (must be enabled as a feature flag)

Mobile:
• Chrome for Android 121+
• Samsung Internet 23+
If your browser doesn't support WebGPU, this tool will show a notification with upgrade instructions.
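
A support check of this kind can be sketched as follows. This is a minimal example, not this tool's actual code: it relies on the standard `navigator.gpu` entry point and its `requestAdapter()` method, while the `NavigatorLike` and `GpuLike` types and the function names are hypothetical stand-ins so the sketch compiles without WebGPU type definitions.

```typescript
// Minimal structural types (hypothetical) so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type SupportStatus = "ok" | "no-api" | "no-adapter";

// Synchronous check: is the WebGPU entry point exposed at all?
function hasWebGPUApi(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined;
}

// Full check: the API can exist while no usable adapter is available,
// in which case requestAdapter() resolves to null.
async function checkWebGPUSupport(nav: NavigatorLike): Promise<SupportStatus> {
  if (!nav.gpu) return "no-api";
  const adapter = await nav.gpu.requestAdapter();
  return adapter === null ? "no-adapter" : "ok";
}

// In a page (showNotice is a hypothetical UI callback):
//   checkWebGPUSupport(navigator as NavigatorLike).then(showNotice);
```

Separating "no API" from "no adapter" lets a page show different guidance: upgrade the browser in the first case, check GPU drivers or browser settings in the second.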

Tiled Matrix Multiplication: Uses workgroup shared memory (tiles) to minimize global memory access. This is the optimized approach and typically achieves much higher GFLOPS — it better reflects peak compute capability.

Naive Matrix Multiplication: Each thread directly reads from global memory without tiling. Performance is limited by memory bandwidth — this reflects real-world performance when data access patterns aren't optimized.

Vector Addition: A simple element-wise operation (C = A + B). This is entirely memory-bandwidth-bound and measures how fast your GPU can move data, reported in GB/s rather than GFLOPS.
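
To show how a GB/s figure for vector addition can be derived, here is a small sketch. It assumes 32-bit floats, counts two reads and one write per element, and treats 1 GB as 10⁹ bytes. These are common conventions, but the exact accounting used by this tool is not stated here, and the function name is illustrative.

```typescript
// Hypothetical helper: effective bandwidth of C = A + B over n f32 elements.
// Assumes 4-byte floats and counts two reads plus one write per element.
function vectorAddGBps(n: number, elapsedSeconds: number): number {
  if (n <= 0 || elapsedSeconds <= 0) {
    throw new RangeError("n and elapsedSeconds must be positive");
  }
  const bytesMoved = 3 * n * 4;
  return bytesMoved / elapsedSeconds / 1e9; // 1 GB taken as 1e9 bytes
}

// Example: 16,777,216 elements processed in 1 ms.
console.log(vectorAddGBps(1 << 24, 0.001).toFixed(1)); // 201.3
```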

To get the most accurate results:
• Close other browser tabs and GPU-intensive applications before testing.
• Run with higher iteration counts (10–20) for more stable averages.
• Plug in your laptop — on battery, the GPU is often throttled.
• Use the largest matrix size your GPU can handle comfortably for more meaningful measurements.
• Allow a warmup run (the first run includes shader compilation overhead; subsequent runs are faster).
• Disable browser extensions that might interfere with GPU access.
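
The warmup and iteration-count tips can be sketched as a small summary routine. This is an illustrative example only: it assumes per-run timings in milliseconds have already been collected, drops a configurable number of warmup runs, and reports the mean, median, and minimum of the rest. The names `summarizeTimings` and `TimingSummary` are hypothetical.

```typescript
interface TimingSummary {
  mean: number;
  median: number;
  min: number;
  samples: number;
}

// Drop the first `warmupRuns` timings (they include shader compilation),
// then summarize the remaining runs.
function summarizeTimings(timesMs: number[], warmupRuns = 1): TimingSummary {
  const kept = timesMs.slice(warmupRuns);
  if (kept.length === 0) {
    throw new RangeError("need at least one run after warmup");
  }
  const sorted = [...kept].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = kept.reduce((sum, t) => sum + t, 0) / kept.length;
  return { mean, median, min: sorted[0], samples: kept.length };
}

// The first run (48 ms) is treated as warmup and excluded.
const exampleSummary = summarizeTimings([48, 10, 12, 11]);
console.log(exampleSummary.mean, exampleSummary.median, exampleSummary.min); // 11 11 10
```

The median is less sensitive than the mean to an occasional slow run caused by background activity, which is one reason to collect 10 to 20 iterations rather than a handful.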

Running this benchmark is safe for your hardware. It runs short, controlled compute workloads, typically just a few milliseconds each, and does not run sustained stress tests that could cause overheating. Your browser and GPU driver also enforce time limits on GPU operations (usually on the order of 1–2 seconds), automatically terminating any excessively long shader execution. The workload is similar to what a complex web application or game might demand from your GPU.
Understanding GPU Compute Performance
Compute vs. Graphics

While graphics rendering focuses on pixels and triangles, GPU compute uses the same hardware for general-purpose parallel calculations — from AI inference to scientific simulations. WebGPU exposes both capabilities to the browser.

Memory Bandwidth Matters

Many compute workloads are bottlenecked by how fast data can move between GPU memory and compute units, not by raw calculation speed. Our naive matrix multiply and vector add tests specifically measure this aspect.
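
One common way to reason about this bottleneck is the roofline model. It is brought in here purely as an illustration and is not something this benchmark computes. In that model, attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (floating-point operations per byte moved). The hardware figures in the sketch are made up.

```typescript
// Roofline-style estimate (illustrative; the hardware figures below are hypothetical).
function attainableGflops(
  peakGflops: number,
  bandwidthGBps: number,
  flopsPerByte: number
): number {
  return Math.min(peakGflops, bandwidthGBps * flopsPerByte);
}

// Vector add on f32: 1 FLOP per element, 12 bytes moved (two reads, one write).
const vectorAddIntensity = 1 / 12;

// Hypothetical GPU: 10,000 GFLOPS peak compute, 400 GB/s memory bandwidth.
console.log(attainableGflops(10000, 400, vectorAddIntensity).toFixed(1)); // 33.3
console.log(attainableGflops(10000, 400, 64)); // 10000
```

In the first call the estimate is set by bandwidth, far below the compute peak. In the second, the workload performs enough arithmetic per byte that the compute peak becomes the limit.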

Workgroup Optimization

Tiled algorithms use fast on-chip shared memory (workgroup memory) to reduce global memory traffic. A well-optimized tiled implementation can be 5–20× faster than a naive approach on the same hardware.
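
The tiling idea can be sketched as a WGSL compute shader embedded in TypeScript. This is a minimal illustration, not the shader this benchmark ships: it assumes square, row-major f32 matrices whose size n is a multiple of the tile size, and the binding layout and the names `TILE_SIZE`, `tiledMatmulWgsl`, and `workgroupsPerAxis` are hypothetical.

```typescript
// Tile edge length; it sets @workgroup_size and the workgroup array sizes below.
const TILE_SIZE = 16;

// WGSL sketch of a tiled n x n matrix multiply (row-major f32).
// Assumption: n is a multiple of TILE_SIZE, so no bounds checks are included.
const tiledMatmulWgsl = `
const TILE : u32 = ${TILE_SIZE}u;

@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;
@group(0) @binding(3) var<uniform> n : u32;

// Workgroup (on-chip shared) memory: one tile of A and one tile of B.
var<workgroup> tileA : array<f32, ${TILE_SIZE * TILE_SIZE}>;
var<workgroup> tileB : array<f32, ${TILE_SIZE * TILE_SIZE}>;

@compute @workgroup_size(${TILE_SIZE}, ${TILE_SIZE})
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(local_invocation_id) lid : vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  var acc = 0.0;
  for (var t = 0u; t < n / TILE; t = t + 1u) {
    // Each invocation loads one element of each tile from global memory.
    tileA[lid.y * TILE + lid.x] = a[row * n + t * TILE + lid.x];
    tileB[lid.y * TILE + lid.x] = b[(t * TILE + lid.y) * n + col];
    workgroupBarrier(); // wait until both tiles are fully loaded
    for (var k = 0u; k < TILE; k = k + 1u) {
      acc = acc + tileA[lid.y * TILE + k] * tileB[k * TILE + lid.x];
    }
    workgroupBarrier(); // do not overwrite tiles that are still being read
  }
  c[row * n + col] = acc;
}
`;

// Number of workgroups to dispatch along each axis for an n x n output.
function workgroupsPerAxis(n: number): number {
  if (n <= 0 || n % TILE_SIZE !== 0) {
    throw new RangeError("n must be a positive multiple of TILE_SIZE");
  }
  return n / TILE_SIZE;
}

console.log(workgroupsPerAxis(1024)); // 64
```

With a pipeline built from this shader, the dispatch would be `pass.dispatchWorkgroups(workgroupsPerAxis(n), workgroupsPerAxis(n))`. Each value loaded into a tile is reused by 16 invocations, so global memory loads per output element drop by roughly a factor of the tile size compared with a naive kernel, before accounting for hardware caches.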