| GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA H200 NVL | 141 GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
| NVIDIA RTX PRO 6000 Blackwell | 96 GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
| NVIDIA RTX 5090 | 32 GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
| AMD RADEON 9070 XT | 16 GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
| AMD RADEON 9070 | 16 GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
| AMD RADEON 9060 XT | 16 GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |
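
The derived columns are just the listed price divided by each spec; a quick Python sketch of the arithmetic, using the figures from the table above:

```python
# Recompute the table's derived columns: €/GB, €/(TB/s) and €/TFLOP16.
# name: (vram_gb, price_eur, bandwidth_tb_s, tflop16) -- values as listed above.
cards = {
    "NVIDIA H200 NVL":               (141, 36284, 4.89,   1671),
    "NVIDIA RTX PRO 6000 Blackwell": (96,  8450,  1.79,   126.0),
    "NVIDIA RTX 5090":               (32,  2299,  1.79,   104.8),
    "AMD RADEON 9070 XT":            (16,  665,   0.6446, 97.32),
    "AMD RADEON 9070":               (16,  619,   0.6446, 72.25),
    "AMD RADEON 9060 XT":            (16,  382,   0.3223, 51.28),
}

for name, (vram, price, bw, tflops) in cards.items():
    print(f"{name:31s} €/GB={price / vram:5.1f}  "
          f"€/TB/s={price / bw:6.0f}  €/TFLOP16={price / tflops:5.1f}")
```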

This post is part “hear me out” and part asking for advice.

Looking at the table above, AI GPUs are a pure scam, and it would make much more sense (at least on paper) to use gaming GPUs instead, either through a Frankenstein build of PCIe switches or a high-bandwidth network.

So my question is whether somebody has built a similar setup and what their experience has been: what is the expected overhead/performance hit, and can it be made up for by having just way more raw performance for the same price?

  • brucethemoose@lemmy.world · 2 hours ago

    Be specific!

    • What models size (or model) are you looking to host?

    • At what context length?

    • What kind of speed (token/s) do you need?

    • Is it just for you, or for many people? How many? In other words, should the serving be parallel?

    In other words, it depends, but the sweet-spot option for a self-hosted rig, OP, is probably:

    • One 5090 or A6000 ADA GPU. Or maybe 2x 3090s/4090s, underclocked.

    • A cost-effective EPYC CPU/Mobo

    • At least 256 GB DDR5

    Now run ik_llama.cpp, and you can serve Deepseek 671B faster than you can read without burning your house down with H200s: https://github.com/ikawrakow/ik_llama.cpp

    It will also do for dots.llm, Kimi, and pretty much any of the mega MoEs du jour.

    But there’s all sorts of niches. In a nutshell, don’t think “How much do I need for AI?” But “What is my target use case, what model is good for that, and what’s the best runtime for it?” Then build your rig around that.

    • PeriodicallyPedantic@lemmy.ca · 6 minutes ago

      Is Nvidia still a de facto requirement? I’ve heard of non-Nvidia support being added to Ollama and the like, but I haven’t found robust comparisons on value.

    • TheMightyCat@ani.social (OP) · 1 hour ago

      My target model is Qwen/Qwen3-235B-A22B-FP8. Ideally at its maximum context length of 131K, but I’m willing to compromise. I find it hard to give a concrete t/s answer; let’s put it around 50. At max load there would probably be around 8 concurrent users, but those situations will be rare enough that optimizing for a single user is probably more worthwhile.

      My current setup is already: Xeon w7-3465X, 128 GB DDR5, 2x 4090

      It gets nice enough performance loading 32B models completely into VRAM, but I am skeptical that a similar system can run a 671B at anything faster than a snail’s pace. I currently run vLLM because it has higher performance with tensor parallelism than llama.cpp, but I shall check out ik_llama.cpp.

      • brucethemoose@lemmy.world · 58 minutes ago

        Ah, here we go:

        https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF

        Ubergarm is great. See this part in particular: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF#quick-start

        You will need to modify the syntax for 2x GPUs. I’d recommend starting with an f16/f16 K/V cache at 32K (to see if that’s acceptable, as then there’s no dequantization compute overhead), and try not to go lower than q8_0/q5_1 (as the V cache is more amenable to quantization).
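
        A minimal launch sketch for that setup (assuming ik_llama.cpp's llama-server with llama.cpp-style flags; the GGUF filename is a placeholder and exact flag spellings can differ between builds, so check --help):

        ```python
        import subprocess

        # Sketch only: flag names follow llama.cpp / ik_llama.cpp conventions
        # (--cache-type-k, --cache-type-v, --tensor-split, --ctx-size); verify
        # against `llama-server --help` for your build before relying on them.
        cmd = [
            "./llama-server",
            "--model", "Qwen3-235B-A22B-IQ4_KS.gguf",  # placeholder quant filename
            "--ctx-size", "32768",        # start at 32K as suggested above
            "--cache-type-k", "f16",      # K cache; q8_0 is the suggested floor
            "--cache-type-v", "f16",      # V cache; q5_1 is the suggested floor
            "--n-gpu-layers", "99",       # offload as many layers as will fit
            "--tensor-split", "1,1",      # spread weights over the 2x 4090s
        ]
        subprocess.run(cmd, check=True)
        ```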

      • brucethemoose@lemmy.world · 49 minutes ago

        Qwen3-235B-A22B-FP8

        Good! An MoE.

        Ideally at its maximum context length of 131K, but I’m willing to compromise.

        I can tell you from experience that all Qwen models are terrible past 32K. What’s more, to go over 32K you have to run them in a special “mode” (YaRN) that degrades performance under 32K. This is particularly bad in vLLM, as it does not support dynamic YaRN scaling.
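
        For reference, YaRN is applied statically via a rope_scaling entry in the model config (format as documented on the Qwen3 model cards; treat the exact keys as something to double-check for your transformers/vLLM version):

        ```python
        # Static YaRN scaling per the Qwen3 docs: a factor of 4 stretches the native
        # 32,768-token window to ~131K, but it is applied even to short prompts,
        # which is the under-32K degradation mentioned above.
        rope_scaling = {
            "rope_type": "yarn",                       # older configs use "type"
            "factor": 4.0,                             # 32768 * 4 = 131072 tokens
            "original_max_position_embeddings": 32768,
        }
        ```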

        Also, you lose a lot of quality with FP8/AWQ quantization unless the model is native FP8 (like DeepSeek). Exllama and ik_llama.cpp quants are much higher quality, and their low-batch performance is still quite good. In addition, vLLM has no good K/V cache quantization (its FP8 destroys quality), while llama.cpp’s is good and exllama’s is excellent, which makes vLLM less than ideal for >16K contexts. Its niche is highly parallel, low-context serving.

        My current setup is already: Xeon w7-3465X, 128 GB DDR5, 2x 4090

        Honestly, you should be set now. I can get 16+ t/s with high context Hunyuan 70B (which is 13B active) on a 7800 CPU/3090 GPU system with ik_llama.cpp. That rig (8 channel DDR5, and plenty of it, vs my 2 channels) should at least double that with 235B, with the right quantization, and you could speed it up by throwing in 2 more 4090s. The project is explicitly optimized for your exact rig, basically :)

        It is poorly documented, though. The general strategy is to keep the “core” of the LLM on the GPUs while offloading the less compute-intensive experts to RAM, and it takes some tinkering. There’s even a project to try and calculate it automatically:

        https://github.com/k-koehler/gguf-tensor-overrider
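
        In llama.cpp/ik_llama.cpp terms, that split is done with the tensor-override option; a rough sketch of the idea (the regex and layout are illustrative, not a tested recipe), as extra arguments appended to the server invocation:

        ```python
        # Keep attention and shared tensors on the GPUs, pin the per-layer MoE expert
        # tensors (ffn_*_exps) to system RAM. The flag is -ot / --override-tensor in
        # recent llama.cpp and ik_llama.cpp builds; verify with --help.
        override_args = [
            "-ngl", "99",                          # offload all layers by default...
            "-ot", r"blk\..*\.ffn_.*_exps.*=CPU",  # ...then force expert weights to CPU/RAM
        ]
        ```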

        ik_llama.cpp can also use special GGUFs that regular llama.cpp can’t take, for faster inference in less space. I’m not sure if one for 235B is floating around Hugging Face; I will check.


        Side note: I hope you can see why I asked. The web of engine strengths/quirks is extremely complicated, heh, and the answer could be totally different for different models.

  • enumerator4829@sh.itjust.works · 5 hours ago

    Well, a few issues:

    • For hosting or training large models you want high bandwidth between GPUs. PCIe is too slow; NVLink has literally an order of magnitude more bandwidth. See what Nvidia is doing with NVLink and what AMD is doing with Infinity Fabric. It’s only available if you pay the premium, and if you need the bandwidth, you are most likely happy to pay.
    • Same thing as above, but with memory bandwidth. The HBM chips in an H200 will run circles around the GDDR garbage they hand out to the poor people with filthy consumer cards. By the way, your inference and training are most likely bottlenecked by memory bandwidth, not available compute (see the rough estimate after this list).
    • Commercially supported cooling of gaming GPUs in rack servers? Lol. Good luck getting any reputable hardware vendor to sell you that, and definitely not at the power densities you want in a data center.
    • TFLOP16 isn’t enough. Look at the 4- and 8-bit tensor numbers; that’s where the expensive silicon is used.
    • Nvidia’s licensing agreements basically prohibit gaming cards in servers. No one will sell them to you at any scale.
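
    To see why memory bandwidth usually dominates, a back-of-the-envelope estimate (assumes a memory-bound, single-stream decode where every active weight is read once per token; real numbers vary with batch size and cache behaviour):

    ```python
    # Rough ceiling on single-stream decode speed: tokens/s ~ bandwidth / bytes read per token.
    def max_tokens_per_s(bandwidth_tb_s: float, active_params_b: float, bytes_per_param: float) -> float:
        bytes_per_token = active_params_b * 1e9 * bytes_per_param
        return bandwidth_tb_s * 1e12 / bytes_per_token

    # Example: a ~22B-active MoE (like Qwen3-235B-A22B) at 8 bits per weight.
    print(max_tokens_per_s(4.89, 22, 1.0))    # H200 NVL -> ~222 t/s ceiling
    print(max_tokens_per_s(0.6446, 22, 1.0))  # RX 9070  -> ~29 t/s ceiling
    ```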

    For fun, home use, research or small time hacking? Sure, buy all the gaming cards you can. If you actually need support and have a commercial use case? Pony up. Either way, benchmark your workload, don’t look at marketing numbers.

    Is it a scam? Of course, but you can’t avoid it.

    • TheMightyCat@ani.social (OP) · 4 hours ago

      • I know that more bandwidth is better, but I wonder how it scales. I can only test my own setup, which is less than optimal for this purpose (PCIe 4.0 x16 and no P2P), but it goes as follows: a single 4090 gets 40.9 t/s while two get 58.5 t/s using tensor parallelism, tested on Qwen/Qwen3-8B-FP8 with vLLM (see the arithmetic after this list). I am really curious how this scales over more than 2 PCIe 5.0 cards with P2P, which all the cards listed here except the 5090 support.
      • The theory goes that yes, the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s. Whether this actually works in practice, I don’t know.
      • I don’t need to build a datacenter; I’m fine with building a rack myself in my garage. And I don’t think that requires higher volumes than just purchasing at different retailers.
      • I intend to run at FP8, so I wanted to show that instead of FP16, but it’s surprisingly difficult to find the numbers. Only the H200 datasheet clearly displays FP8 Tensor Core performance; the RTX PRO 6000 datasheet keeps it vague, mentioning only “AI TOPS”, which they define as effective FP4 TOPS with sparsity; and they didn’t even bother writing a datasheet for the 5090, only quoting 3352 AI TOPS, which I suppose is FP4 then. The AMD datasheets only list FP16 and INT8 matrix throughput, and whether INT8 matrix is equivalent to FP8 I don’t know. So FP16 was the common denominator for all the cards, the only way I could find to avoid comparing apples with oranges.
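
      A quick sketch of the two figures referenced above (the scaling efficiency of the 2x 4090 test and the aggregate-bandwidth comparison; it just restates numbers already in this thread):

      ```python
      # Tensor-parallel scaling efficiency from the 1x vs 2x 4090 test above.
      single, dual = 40.9, 58.5            # t/s on Qwen3-8B-FP8 with vLLM
      print(dual / (2 * single))           # ~0.72 -> ~72% efficiency over PCIe 4.0, no P2P

      # Aggregate bandwidth for the price of one H200 NVL (figures from the table).
      h200_price, rx9070_price, rx9070_bw = 36284, 619, 0.6446
      n_cards = h200_price // rx9070_price
      print(n_cards, n_cards * rx9070_bw)  # 58 cards, ~37 TB/s combined vs 4.89 TB/s
      ```
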
      • non_burglar@lemmy.world · 3 hours ago

        I don’t need to build a datacenter; I’m fine with building a rack myself in my garage.

        During the last GPU mining craze, I helped build a 3-rack mining operation. GPUs are unregulated pieces of power-sucking shit from a power-management perspective. You do not have the power capacity to do this on residential service, even at 300 amps.

        Think of a microwave’s behaviour: yes, a 1000 W microwave pulls between 700 and 900 W while cooking, but the startup load is massive, sometimes almost 1800 W, depending on how cheap the thing is.

        GPUs also behave like this, but not at startup. They spin up load predictively, which means the hardware demands more power to get the job done; it doesn’t scale down the job to save power. Multiply that by 58 RX 9070s. Now add cooling.

        You cannot do this.

        • TheMightyCat@ani.social (OP) · 2 hours ago

          Thanks. While I would still like to know the performance scaling of a cheap cluster, this does answer the question: pay way more for high-end cards like the H200 for greater efficiency, or pay less and have to deal with these issues.

  • AreaKode@lemmy.world · 6 hours ago

    “AI”, in its current form, is a scam, and Nvidia is making the most of the grift. They are now worth more than any other company in the world.

      • AreaKode@lemmy.world · 5 hours ago

        LLMs are experimental, alpha-level technologies. Nvidia showed investors how fast their cards could compute this information. Now investors can just tell the LLM what they want, and it will spit out something that probably looks similar to what they want. But Nvidia is going to sell as many cards as possible before the bubble bursts.

          • AreaKode@lemmy.world · 3 hours ago

            Any time you need to do a shitload of basic math, a GPU will beat a CPU every time.

            • iopq@lemmy.world · 3 hours ago

              You can design algorithms specifically to mess up parallelism by branching a lot. For example, if you want your password hashes to be GPU-resistant.
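
              A toy sketch of that idea (not a real KDF; production schemes like scrypt and Argon2 get GPU resistance mainly from memory hardness, but data-dependent branching serves the same goal of wasting a GPU's lockstep parallelism):

              ```python
              import hashlib

              def branchy_hash(password: bytes, salt: bytes, rounds: int = 100_000) -> bytes:
                  """Toy GPU-unfriendly hash: each round branches on the previous digest,
                  so SIMD lanes hashing different passwords constantly diverge."""
                  state = hashlib.sha256(salt + password).digest()
                  for _ in range(rounds):
                      if state[0] & 1:                 # data-dependent branch #1
                          state = hashlib.sha256(state + password).digest()
                      elif state[1] % 3 == 0:          # data-dependent branch #2
                          state = hashlib.sha512(state).digest()[:32]
                      else:
                          state = hashlib.sha256(state[::-1] + salt).digest()
                  return state

              print(branchy_hash(b"hunter2", b"salt").hex())
              ```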

      • metaStatic@kbin.earth · 5 hours ago

        ML has been sold as AI and honestly that’s enough of a scam for me to call it one.

        But I also don’t really see end users getting scammed, just venture capital, and I’m OK with this.

        • AreaKode@lemmy.world · 3 hours ago

          Correct. Pattern recognition plus prompting tuned toward a positive-looking result, even if the answer isn’t entirely true. If it’s close enough to the desired pattern, it gets pushed.

  • Mordikan@kbin.earth · 5 hours ago

    The AI cards prioritize compute density over frame rate and the like, so you can’t directly compare price points between them like that without including that data. You could cluster gaming cards, though, using NVLink or AMD’s Infinity Fabric. You aren’t going to get anywhere near the same performance, and you are really going to rely on quantization to make it work, but depending on your self-hosting use case you probably don’t need a $30,000 card.

    It’s not a scam, but it’s also something you probably don’t need.

  • Quik@infosec.pub · 5 hours ago

    For your personal use, you probably shouldn’t get an “AI” GPU. If you start needing a terabyte of VRAM, and heat, space, and energy start becoming real problems, reconsider.

  • atzanteol@sh.itjust.works · 6 hours ago

    Looking at the table above, AI GPUs are a pure scam

    How much more power are your gaming GPUs going to use? How much more space will they use? How much more heat will you need to dissipate?

    • TheMightyCat@ani.social (OP) · 6 hours ago

      Well, a scam for self-hosters; for datacenters it’s different, of course.

      I’m looking to upgrade to my first dedicated server build, coming from only SBCs, so I’m not sure how much of a concern heat will be, but space and power shouldn’t be an issue (within reason, of course).

        • TheMightyCat@ani.social (OP) · 2 hours ago

          While I would still say it’s excessive to respond with “😑”, I was too quick in waving these issues away.

          Another commenter explained that residential power physically can’t supply enough for that many high-end GPUs, which is why even for self-hosters the AI cards could be worth it.

  • BrightCandle@lemmy.world · 5 hours ago

    Initially a lot of AI was trained on lower-class GPUs, and none of these special AI cards/blades existed. The problem is that the workloads are quite large and hence require a lot of VRAM to work on, or you split them and pay enormous latency penalties going across the network. Putting it all into one giant package costs a lot more, but it also performs a lot better, because AI is not an embarrassingly parallel problem that can easily be split across many GPUs without penalty. So the goal is often to reduce the number of GPUs you need to get a result quickly enough, and that brings its own set of problems of power density in server racks.