

I've been trying to play with this in ik_llama.cpp, and it's a temperamental model. It feels deep-fried, like it wants to be smart if it would just stop looping or mangling its own think template.
It works great in 24GB of VRAM, though. I'm getting around 16 tok/s at longish context, with 15 experts on the GPU and the rest offloaded to CPU.
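For anyone who wants to try the same split, here's roughly the shape of my command. Treat it as a sketch: the binary name, model path, and context size are placeholders, and I'm reading "15 experts on the GPU" as the expert tensors of the first 15 layers, which you'd tune for your model's layer count. The `-ot`/`--override-tensor` flag takes a `regex=buffer` pair, and as far as I can tell the first pattern that matches a tensor wins, so the catch-all CPU rule goes last.

```
# Load everything on GPU by default (-ngl 99), then override the MoE
# expert tensors: layers 0-14 stay on CUDA0, the rest go to system RAM.
# Model path, context size, and the 0-14 range are placeholders.
./llama-server -m model.gguf \
  -c 32768 -ngl 99 -fa \
  -ot "blk\.([0-9]|1[0-4])\.ffn_.*_exps=CUDA0" \
  -ot "ffn_.*_exps=CPU"
```

Attention and the dense tensors are small enough to stay on GPU regardless; it's the expert FFN weights that blow the VRAM budget, which is why they're the only thing the regexes touch.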
Here, we are safe. Here, we are free.