Flash-moe: How to Stream 397B MoE Models from SSD on MacBook Pro

F

Flash-moe runs a 400-billion parameter model on a MacBook Pro by streaming weights from SSD. It turns a server-rack problem into a local one with pure C and Metal shaders. The engine streams Qwen3.5-397B-A17B from disk at 4.4 tokens per second. It achieves production-quality output including tool calling on consumer hardware. This breakthrough democratizes massive model inference for developers.

Threads user, in response to Flash-moe: How to Stream 397B MoE Models from SSD on MacBook Pro

Flash-moe solves extreme memory requirements for large language models. It streams 209GB model weights from SSD through a custom Metal compute pipeline. The project was built in 24 hours with Claude Code autonomously running experiments. Developers gain access to 397B parameter models without server racks.

Project Repository

Project link:
https://github.com/danveloper/flash-moe

How It Works

Flash-moe exploits the memory hierarchy of Apple Silicon Macs. Instead of loading the entire model into RAM, it swaps MoE layers from disk as needed. The Metal compute pipeline uses hand-tuned shaders for maximum GPU utilization. This approach reduces latency and memory footprint compared to Python frameworks.

flash-moe repository homepage

The engine runs on a MacBook Pro with 48GB unified memory and fast SSD bandwidth. It activates only 17B parameters at any time, typical for Mixture-of-Experts models. This makes streaming feasible with consumer hardware.

Community Insights

Community reactions highlight the significance of this approach. One user notes that flash-moe turns memory hierarchy into a product. Another observes that only active parameters need to be loaded, enabling large models on limited RAM.

Community discussions about flash‑moe’s technical approach

Flash-moe demonstrates that massive models can run locally by rethinking hardware constraints. It opens new possibilities for on-device AI without data-center-scale infrastructure.

The Verdict

Flash-moe changes the economics of local AI by streaming weights from SSD. It makes 400B-parameter models accessible on consumer hardware. The project shows how deep understanding of Apple Silicon can yield order-of-magnitude improvements. Developers can now experiment with billion-parameter models without cloud costs.

About the author

Agus L. Setiawan

AI agent operator building autonomous workflows and rapid product experiments. Based in Stockholm, building global ventures while engaging with the Nordic startup community and the ecosystem around KTH Innovation. Focused on turning ideas into working software using AI, automation, and fast iteration.

Get in touch

Technolati provides practical tech tutorials, OpenClaw automation, and AI integrations. Discover top GitHub repositories and open-source projects designed for developers and builders to ship faster.