A browser-based chat interface that generates text with Liquid AI's LFM2-MoE model, running all 8.3 billion parameters locally via WebGPU with no server backend.
An 8.3-billion-parameter language model is running in your browser tab right now. Or it will be, about thirty seconds after you open this demo. Liquid AI's LFM2-MoE is a Mixture-of-Experts model that carries 8.3B total parameters but activates only 1.5B per token, and in this Hugging Face Space it downloads its quantized weights and generates text entirely client-side via WebGPU. No server round-trip, no API key, no prayer to a distant GPU cluster. You type a prompt, the model loads its weights into your GPU's memory, and tokens start appearing. It feels like a magic trick whose punchline is that there's no trick.
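That initial download is the thirty seconds in question. A back-of-envelope sketch of the weight footprint, assuming 4-bit quantization (my arithmetic, not a figure published by the Space):

```javascript
// Back-of-envelope weight footprint: 8.3B parameters at 4-bit quantization
// is roughly half a byte per parameter (ignoring quantization scales and
// zero-points, which add a small overhead).
const totalParams = 8.3e9;
const bytesPerParam = 0.5; // 4-bit weights
const gib = (totalParams * bytesPerParam) / 2 ** 30;
console.log(gib.toFixed(2) + " GiB"); // prints "3.87 GiB"
```

Roughly four gibibytes of weights to pull into GPU memory, which is why the quantization level matters for laptop-class hardware.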
The collaboration behind it pairs two people worth following. Maxime Labonne is Head of Post-Training at Liquid AI and the author of the LLM Engineer's Handbook. He's responsible for much of LFM2's post-training pipeline: large-scale SFT, custom DPO with length normalization, iterative model merging, and RL with verifiable rewards. Joshua Lochner, the creator of Transformers.js at Hugging Face, contributed the WebGPU inference plumbing that makes the whole thing possible in a browser window. Transformers.js v3 brought WebGPU support that can hit up to 100x faster than WASM, and this demo is a clean proof of that promise applied to a genuinely exotic architecture.
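The client-side flow can be sketched with the Transformers.js v3 pipeline API. This is a minimal sketch of the pattern, not the Space's actual code: the model id and `dtype` below are illustrative assumptions, and the function simply bails out where WebGPU is unavailable.

```javascript
// Minimal sketch of client-side generation with the Transformers.js v3
// pipeline API. The model id and dtype are illustrative assumptions,
// not read from the Space's source.
async function generateInBrowser(prompt) {
  // Bail out where WebGPU isn't available (e.g. Node, older browsers).
  if (typeof navigator === "undefined" || !navigator.gpu) return null;

  const { pipeline } = await import("@huggingface/transformers");

  // The first call downloads the quantized ONNX weights into the browser
  // cache and uploads them to GPU memory; later calls reuse both.
  const generator = await pipeline(
    "text-generation",
    "LiquidAI/LFM2-8B-A1B-ONNX", // hypothetical ONNX repo id
    { device: "webgpu", dtype: "q4" } // 4-bit weights for a laptop GPU budget
  );

  const out = await generator([{ role: "user", content: prompt }], {
    max_new_tokens: 128,
  });
  // For chat-style input, generated_text holds the full conversation;
  // the last message is the model's reply.
  return out[0].generated_text.at(-1).content;
}
```

A production page would fall back to `device: "wasm"` rather than returning null, at the cost of the very speedup this demo exists to show off.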
And that architecture is the interesting part. LFM2 isn't a standard transformer. It's a hybrid built from double-gated short-range convolution blocks (what Liquid AI calls Linear Input-Varying operators) interleaved with grouped-query attention. The MoE variant stacks experts on top of this already unusual backbone. The result is a model whose KV cache footprint stays small enough that quantized variants fit in the memory budget of a laptop GPU, yet whose output quality is competitive with dense models two to three times its active size. The source is right there in the Space's file tree if you want to trace how the ONNX model gets loaded and dispatched to WebGPU compute shaders. Try prompting it with something that tests reasoning or multilingual capability and see where it surprises you.
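The total-versus-active parameter gap comes from the MoE routing itself: a router scores the experts per token and only the top few run. A toy sketch of top-k routing with scalar "experts" (illustrative only; LFM2-MoE's actual router, expert count, and k are not taken from the source):

```javascript
// Toy sketch of top-k Mixture-of-Experts routing. Real experts are MLPs
// over vectors; scalars keep the routing logic visible.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the k experts with the highest router probability and mix their
// outputs, weighted by renormalized router scores. Only those k experts
// execute, which is why active parameters stay far below total parameters.
function moeForward(x, experts, routerLogits, k) {
  const probs = softmax(routerLogits);
  const topK = probs
    .map((p, i) => [p, i])
    .sort((a, b) => b[0] - a[0])
    .slice(0, k);
  const norm = topK.reduce((acc, [p]) => acc + p, 0);
  return topK.reduce((acc, [p, i]) => acc + (p / norm) * experts[i](x), 0);
}

// Example: 4 scalar "experts", route the token to the top 1.
const experts = [(x) => 2 * x, (x) => x + 1, (x) => -x, (x) => x * x];
console.log(moeForward(3, experts, [0.1, 2.0, 0.3, 0.2], 1)); // → 4 (expert 1: 3 + 1)
```

Scaling the same idea up, 8.3B parameters sit in memory but only the routed ~1.5B participate in each token's forward pass.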
- Live Demo: https://huggingface.co/spaces/LiquidAI/LFM2-MoE-WebGPU
- Source Code: https://huggingface.co/spaces/LiquidAI/LFM2-MoE-WebGPU/tree/main
- Author(s):
  - Joshua Lochner (X, GitHub, Hugging Face, LinkedIn)
  - Maxime Labonne (X, GitHub, Hugging Face, LinkedIn)