Blog

webgpu ai nlp

NLP in the browser with WebGPU

Turns out you can actually run transformer models in the browser now.

In 2018 I created Nooshub, a news reader that groups similar content like Google News does, but with RSS sources. The original goal was to group articles across languages by similarity, but I ended up training a separate model for each language. In the years since, NLP technology has advanced quickly: there are now multilingual transformer models that can do exactly this, and they even have a better understanding of the context of words. The problem with these transformer models, at least for me, is hosting, which usually needs a GPU to be fast enough, while the old embedding models can run inference on a normal CPU quickly and without issues. A server with a GPU is much more expensive than a normal server, so my idea was: why not just run these models on the client side using WebGPU? That would also make hosting easier. Well, it turns out that with the latest release of transformers.js that just landed, it is possible to do exactly that!
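A minimal sketch of what that looks like, assuming transformers.js v3 (published as `@huggingface/transformers`) - the input headlines are made up, and this is browser code, so treat it as an illustration rather than a drop-in snippet:

```javascript
import { pipeline } from "@huggingface/transformers";

// Prefer WebGPU when the browser exposes it, otherwise fall back to wasm/CPU.
const device = navigator.gpu ? "webgpu" : "wasm";

// Downloads the ONNX weights on first use and caches them in the browser.
const extractor = await pipeline(
  "feature-extraction",
  "Xenova/paraphrase-multilingual-MiniLM-L12-v2",
  { device },
);

// One embedding vector per headline: mean pooling over tokens,
// normalized to unit length so cosine similarity is just a dot product.
const headlines = [
  "Apple announces new M-series chips",
  "Apple stellt neue M-Chips vor",
];
const output = await extractor(headlines, { pooling: "mean", normalize: true });
const embeddings = output.tolist(); // array of 384-dimensional float arrays
```

The nice part is that the exact same code path serves both cases: the `device` option is the only thing that changes between the GPU and CPU runs.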

I did a little test setup clustering tech news on an M1 Pro MacBook with 32 GB RAM, using the multilingual model Xenova/paraphrase-multilingual-MiniLM-L12-v2. In Safari, which apparently cannot use WebGPU on macOS 15, inference takes about 30sec on wasm/CPU - too slow to be usable. In Firefox with WebGPU it takes about 5sec - still slow for an actual product, but a big step forward! I don't think it is too far-fetched to imagine a future where client devices have the built-in power to run these small models with reasonable performance.

To make the comparison with the existing product a little fairer, I also tested the smaller monolingual model Xenova/all-MiniLM-L6-v2 - it was about twice as fast for the same task: 18sec without GPU and <3sec with GPU.
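The grouping step itself needs no library at all: since the embeddings come back normalized, cosine similarity reduces to a dot product, and a single greedy pass is enough to form rough clusters. A sketch of that idea - the helper names and the threshold are mine, and a real product would want something smarter than first-member comparison:

```javascript
// Dot product of two equal-length vectors; with unit-length embeddings
// this is exactly the cosine similarity.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Greedy clustering: assign each vector to the first cluster whose first
// member is at least `threshold` similar, otherwise start a new cluster.
// Returns arrays of indices into `vectors`.
function cluster(vectors, threshold = 0.8) {
  const clusters = [];
  for (let i = 0; i < vectors.length; i++) {
    const match = clusters.find(c => dot(vectors[c[0]], vectors[i]) >= threshold);
    if (match) match.push(i);
    else clusters.push([i]);
  }
  return clusters;
}
```

The threshold is the knob that decides how aggressively stories get merged; in practice it has to be tuned per model, since different embedding models spread their similarity scores differently.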

Access to something as powerful as the GPU on the client side, while transformer models keep getting more efficient, could hint at a future with more specialized models running locally instead of a frontier-model-for-everything approach. WebGPU is already widely supported, at about 80% coverage according to caniuse, but of course not everybody with theoretical WebGPU support also has a powerful GPU.

If you want to play around yourself in your browser, I created a little repo.