Flash-moe: How to Stream 397B MoE Models from SSD on MacBook Pro


Flash-moe runs a roughly 400-billion-parameter model on a MacBook Pro by streaming weights from SSD, turning a server-rack problem into a local one with pure C and Metal shaders. The engine streams Qwen3.5-397B-A17B from disk at 4.4 tokens per second and produces production-quality output, including tool calling, on consumer hardware, bringing massive-model inference within reach of individual developers.

Threads user, in response to flash-moe

Flash-moe tackles the extreme memory requirements of large language models by streaming 209GB of model weights from SSD through a custom Metal compute pipeline. The project was built in 24 hours, with Claude Code autonomously running experiments, and it gives developers access to 397B-parameter models without a server rack.

Project Repository

Project link:
https://github.com/danveloper/flash-moe

How It Works

Flash-moe exploits the memory hierarchy of Apple Silicon Macs. Instead of loading the entire model into RAM, it pages MoE expert weights in from disk as they are needed, and its Metal compute pipeline uses hand-tuned shaders for maximum GPU utilization. This approach cuts both latency and memory footprint compared to Python frameworks.
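The repository's loader isn't excerpted here, but paging weights in from disk on demand maps onto a standard POSIX pattern: memory-map the weight file once, then hint the kernel about which expert tensors to prefetch and which to evict. Below is a minimal sketch in plain C, assuming a hypothetical flat weight file with page-aligned per-expert offsets; flash-moe's actual loading code may well differ.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the entire weight file read-only; pages fault in from SSD on
 * first touch instead of being loaded up front. */
static void *map_weights(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }
    *len = (size_t)st.st_size;
    void *base = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd); /* the mapping stays valid after close */
    return base;
}

/* Hypothetical per-expert layout metadata: byte offset and size of one
 * expert's tensor in the file, assumed page-aligned here because
 * madvise requires a page-aligned address. */
static const void *prefetch_expert(const char *base, size_t off, size_t size)
{
    madvise((char *)base + off, size, MADV_WILLNEED); /* prefetch from SSD */
    return base + off;
}

static void evict_expert(const char *base, size_t off, size_t size)
{
    madvise((char *)base + off, size, MADV_DONTNEED); /* drop cold pages */
}

Whether flash-moe relies on mmap, explicit reads into pinned buffers, or Metal's shared-storage buffers is an implementation detail of its C source; the general point stands that the OS page cache plus explicit hints let a 209GB file behave like a working set small enough for 48GB of RAM.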

flash-moe repository homepage

The engine runs on a MacBook Pro with 48GB of unified memory and a fast SSD. As is typical for Mixture-of-Experts models, only 17B parameters are active at any time, which is what makes streaming feasible on consumer hardware.
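The "A17B" suffix captures how: in each MoE layer a small router scores every expert for the current token and only the top few are run, so only their weights need to be resident. Here is a minimal sketch of generic top-k routing in C; it illustrates the mechanism, not flash-moe's actual code:

#include <math.h>
#include <stddef.h>

/* Pick the top_k highest-scoring experts for one token. The router's
 * logits give one score per expert; only the chosen experts' weights
 * have to be streamed in, while the rest stay cold on SSD. */
static void route_top_k(const float *logits, size_t n_experts,
                        size_t top_k, size_t *chosen)
{
    for (size_t k = 0; k < top_k; k++) {
        size_t best = 0;
        float best_score = -INFINITY;
        for (size_t e = 0; e < n_experts; e++) {
            int taken = 0;
            for (size_t j = 0; j < k; j++)
                if (chosen[j] == e) { taken = 1; break; }
            if (!taken && logits[e] > best_score) {
                best_score = logits[e];
                best = e;
            }
        }
        chosen[k] = best;
    }
}

Because each token touches only about 17B of the 397B parameters, and consecutive tokens tend to reuse hot experts, streaming from SSD plus caching can reach usable speeds rather than being pinned to raw read bandwidth.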

Community Insights

Community reactions highlight the significance of this approach. One user notes that flash-moe turns memory hierarchy into a product. Another observes that only active parameters need to be loaded, enabling large models on limited RAM.

Community discussions about flash-moe's technical approach

Flash-moe demonstrates that massive models can run locally by rethinking hardware constraints. It opens new possibilities for on-device AI without data-center-scale infrastructure.

The Verdict

Flash-moe changes the economics of local AI by streaming weights from SSD, making 400B-parameter models accessible on consumer hardware. The project shows how a deep understanding of Apple Silicon's memory hierarchy can yield order-of-magnitude improvements, and it lets developers experiment with models of this scale without cloud costs.

About the author

Hairun Wicaksana

Hi, I'm just another vibecoder from Southeast Asia, currently based in Stockholm. I build startup experiments while staying close to the KTH Innovation startup ecosystem, focusing on AI tools, automation, and fast product experiments, and sharing the journey of turning ideas into working software.
