Depends on the version you’re running.
robber@lemmy.ml (OP) to LocalLLaMA@sh.itjust.works • Relevance of GPU driver version for inference performance
1 · 29 days ago
I see. When I run the inference engine containerized, will the container be able to run its own version of CUDA or use the host's version?
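A minimal sketch of what I mean, assuming a PyTorch-based image with pynvml installed: the CUDA runtime reported is whatever the image bundles, while the driver version always comes from the host.

```python
# Sketch: compare the CUDA runtime bundled in the container image with the
# driver provided by the host. Assumes a PyTorch image with pynvml installed.
import torch
import pynvml

# The CUDA runtime ships inside the image (e.g. via the PyTorch wheel).
print("CUDA runtime (container):", torch.version.cuda)

# The driver is passed through from the host by the NVIDIA container toolkit,
# so this version is always the host's, regardless of the image.
pynvml.nvmlInit()
print("Driver version (host):", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()
```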
robber@lemmy.ml (OP) to LocalLLaMA@sh.itjust.works • Relevance of GPU driver version for inference performance
2 · 29 days ago
Thank you for taking the time to respond.
I've used vLLM for hosting a smaller model which fit on two of my GPUs, and it was very performant, especially for multiple concurrent requests. The major drawback for my setup was that it only supports tensor parallelism across 2, 4, 8, etc. GPUs, and data parallelism slowed inference down considerably, at least for my cards (a minimal launch sketch follows at the end of this comment). exllamav3 is the only engine I'm aware of that supports 3-way TP.
But I’m fully with you in that vLLM seems to be the most recommended and battle-tested solution.
I might take a look at how I can safely upgrade the driver, and once I can afford a fourth card I'll switch back to vLLM.
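For reference, roughly how the vLLM setup looked; the model name and sampling settings are placeholders, and the 2-way tensor parallel size reflects the power-of-two restriction mentioned above.

```python
# Minimal vLLM launch sketch. Model name and sampling settings are placeholders;
# tensor_parallel_size=3 was not accepted in my setup, hence 2-way TP here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model",   # placeholder HF model id
    tensor_parallel_size=2,        # 2, 4, 8, ... worked; 3 did not
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```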
robber@lemmy.ml (OP) to LocalLLaMA@sh.itjust.works • Relevance of GPU driver version for inference performance
2 · 30 days ago
I use the proprietary ones from Nvidia; they're at 535 on oldstable IIRC, but there are much newer ones available.
I use 3x RTX 2000E Ada. It's a rather new, quite power-efficient GPU, manufactured by PNY.
As inference engine I use exllamav3 with tabbyAPI. I like it very much because it supports 3-way tensor parallelism, making it a lot faster for me than llama.cpp.
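Since tabbyAPI exposes an OpenAI-compatible endpoint, any OpenAI client can talk to it. A minimal sketch, where the port, API key and model name are placeholders for whatever is set in your own config:

```python
# Hypothetical client sketch for a tabbyAPI server (OpenAI-compatible API).
# Base URL, API key and model name are placeholders; adjust to your config.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed local tabbyAPI endpoint
    api_key="your-tabby-api-key",         # placeholder token
)

resp = client.chat.completions.create(
    model="your-exl3-quant",              # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```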
robber@lemmy.ml to Asklemmy@lemmy.ml • Looking for a movie about a guy who gets their brian transfered to this like construction worker thing. I can't remember the name.
6 · 1 month ago
That brian typo really gave me a chuckle. Hope you found the movie you were looking for.
Wikipedia states the UI layer is proprietary, is that true?
The country’s official app for COVID immunity certificates or whatever they were called was available on F-Droid at the time.
robber@lemmy.ml (OP) to LocalLLaMA@sh.itjust.works • Magistral-Small-2509 by Mistral has been released
2 · 2 months ago
Too bad they've only been dropping dense models recently. Also kind of interesting, since with Mixtral back in the day they were way ahead of their time.
A review from earlier this year didn’t sound too bad.
Edit: as pointed out, the review seems to be about the previous version of the phone.
robber@lemmy.ml (OP) to LocalLLaMA@sh.itjust.works • Qwen3-Next with 80b-a3b parameters is out
5 · 2 months ago
I'd add that memory bandwidth is still a relevant factor, so the faster the RAM, the faster the inference will be. I think this model would be a perfect fit for a Strix Halo or a >= 64GB Apple Silicon machine when aiming for CPU-only inference. But mind that llama.cpp does not yet support the qwen3-next architecture.
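To make the bandwidth point concrete, a back-of-the-envelope ceiling for decode speed on an a3b model; all numbers below are rough assumptions, and real-world throughput will be noticeably lower.

```python
# Rough upper bound: each generated token has to stream the active weights
# through memory at least once, so tokens/s <= bandwidth / bytes_per_token.
# All figures are ballpark assumptions for illustration.

bandwidth_bytes_s = 256e9     # ~256 GB/s, roughly Strix Halo-class memory
active_params = 3e9           # ~3B active parameters in an 80b-a3b MoE model
bytes_per_weight = 0.55       # ~4.4 bits per weight for a typical Q4-ish quant

bytes_per_token = active_params * bytes_per_weight
ceiling = bandwidth_bytes_s / bytes_per_token
print(f"theoretical decode ceiling: ~{ceiling:.0f} tokens/s")
```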
robber@lemmy.ml to Selfhosted@lemmy.world • Been seeing a lot of posts about replacing Spotify and such, so I wrote up a guide on how I did just that
3 · 2 months ago
One reason could be that the audience on Lemmy has a left-ish bias and there's a political component to the Spotify exodus.
Edit: don’t get me wrong, I love seeing content and engagement on here.
SFTPGo is such an awesome project, never had any problems with it.
robber@lemmy.ml to LocalLLaMA@sh.itjust.works • Current best local models for tool use?
1 · 5 months ago
Some people on another discussion platform were praising the new Mistral Small models for agentic use. I haven't been able to try them myself yet, but at 24b params a quantized version would certainly fit in your 24GB (rough math at the end of this comment).
Thanks for the tip about kobold, didn’t know about that.
And yeah, I can understand that building your own rig might feel overwhelming at first, but there’s tons of information online that I’m sure will help you get there!
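As for the rough math behind the 24GB claim, a back-of-the-envelope estimate with approximate numbers:

```python
# Why a quantized 24B model fits in 24 GB of VRAM: rough estimate only.
params = 24e9              # ~24B parameters
bits_per_weight = 5.0      # a Q4_K_M-class quant averages roughly 4.5-5 bits
kv_and_overhead_gb = 4.0   # context cache + runtime overhead, ballpark

weights_gb = params * bits_per_weight / 8 / 1e9
total_gb = weights_gb + kv_and_overhead_gb
print(f"weights ~{weights_gb:.0f} GB, total ~{total_gb:.0f} GB of 24 GB")
```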
Alright thanks! I found it somewhat difficult to find information about the hardware requirements online, but yeah, maybe I just have to try it.
Thanks, that looks cool, I'll definitely try it and report back. Do you happen to know what the hardware requirements are? I have access to 64GB of RAM and 48GB of VRAM across 3 RTX 2000E Ada GPUs.
robber@lemmy.ml to Selfhosted@lemmy.world • Best way to get IPv4 connectivity to my self-hosted services
5 · 5 months ago
I'll add Pangolin to the list; it's a self-hosted alternative to Cloudflare Tunnel.
It really depends on how much you enjoy setting things up for yourself and how much it hurts you to give up control over your data with managed solutions.
If you want to do it yourself, I recommend taking a look at ZFS and its RAIDZ configurations, snapshots and replication capabilities. It’s probably the most solid setup you will achieve, but possibly also a bit complicated to wrap your head around at first.
But there are a ton of options as beautifully represented by all the comments.
Given that Google generated more than 250 billion U.S. dollars in ad revenue in 2024, I’d say they must be pretty effective.
Source