I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse engineer correctly based on the papers. I hope that the releases this week start a renaissance of the use of MoE as baseline academic models.
- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping

X: https://x.com/deepseek_ai/status/1894211757604049133
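For context on what "all-to-all" means here: in expert-parallel MoE, each rank routes its tokens to whichever ranks host the selected experts, then gathers the results back. DeepEP implements this with custom fused NVLink/RDMA kernels; purely as a mental model (not DeepEP's code), here is roughly what the dispatch step looks like when composed from stock NCCL point-to-point calls. The function name and all buffers/counts are hypothetical:

```cuda
#include <cuda_bf16.h>
#include <nccl.h>

// Hypothetical baseline MoE token dispatch as an all-to-all built from NCCL
// point-to-point calls. send_counts[p] = tokens this rank routes to rank p,
// recv_counts[p] = tokens rank p routes to us (counts exchanged beforehand);
// offsets are in tokens, each token is `hidden` bf16 values.
void dispatch_tokens(const __nv_bfloat16* send_buf, const size_t* send_offsets,
                     const size_t* send_counts,
                     __nv_bfloat16* recv_buf, const size_t* recv_offsets,
                     const size_t* recv_counts,
                     size_t hidden, int nranks,
                     ncclComm_t comm, cudaStream_t stream) {
    ncclGroupStart();  // batch all sends/recvs so NCCL can progress them together
    for (int peer = 0; peer < nranks; ++peer) {
        ncclSend(send_buf + send_offsets[peer] * hidden,
                 send_counts[peer] * hidden, ncclBfloat16, peer, comm, stream);
        ncclRecv(recv_buf + recv_offsets[peer] * hidden,
                 recv_counts[peer] * hidden, ncclBfloat16, peer, comm, stream);
    }
    ncclGroupEnd();
}
```

The point of DeepEP is that this generic pattern leaves throughput and latency on the table, which is exactly what the tuned kernels and FP8 dispatch path above address.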
You gotta love these guys; they're really pushing the open-source frontier for all of us. Thanks for sharing.
Open AI™ (with a space)
Kind of ironic that DeepSeek is more Open than ChatGPT
They do it for their own reasons, but OpenAI are straight up liars: they are neither open, nor do they give a fuck about humanity.
OpenAyyyy, I swear babe, I'm gonna open it up any day now. Yeah, for that greater good or whatever it is you keep yappin' about.
I hope you're reading this Sam Altman:
Make Open AI open.
Or else you'll lose to the ecosystem.
Now it includes the highly anticipated PTX! Of course, I don't understand it, but I've already clicked the star and even the fork button, which basically means I've mastered it, right? I feel incredibly powerful right now...
Is the PTX that everyone was looking forward to included this time?
Yes, there's some in the csrc/kernels directory. Search for 'asm' to find uses of it.
> the PTX that everyone was looking forward to
Can someone explain for the rest of us why this is so important?
The PTX instructions they mentioned in the tech report presumably correspond to the code here?
"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."
this might help: https://x.com/main_horse/status/1894215779521794058/photo/1
Round 2 of open source releases from an actual "Open AI™" company and licensed under MIT.
Once again, DeepSeek is more open than the $157B+ one that is claiming to be "Open".
Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.
The objective is to not be squeezed in the middle of the race to zero.