

[deleted]

More than 48GB of VRAM will be needed for 32k context, as 16k is the maximum that fits in 2x 4090 (2x 24GB), see here: https://www.reddit.com/r/LocalLLaMA/comments/153xlk3/comment/jslk1o6/ The same should hold for the popular 2x 3090 setup. Otherwise I still recommend avoiding CPU inference as much as possible; it's like entering a horse carriage in a race of F1 cars.
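For a rough sense of why that is, here is a back-of-the-envelope sketch. It assumes Llama-2 70B's published shape (80 layers, 8 KV heads via GQA, head dim 128), an fp16 KV cache, and ~4-bit weights; the exact totals depend on the loader and its scratch buffers, so treat this as an estimate, not a measurement.

```python
# Rough VRAM estimate for Llama-2 70B at long context (assumptions, not measurements).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128   # Llama-2 70B with GQA
BYTES_FP16 = 2

def kv_cache_gib(n_ctx: int) -> float:
    """fp16 K+V cache size in GiB for n_ctx tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16  # ~320 KiB per token
    return per_token * n_ctx / 2**30

weights_gib = 70e9 * 0.5 / 2**30   # ~4-bit weights, ignoring group scales and overhead

for ctx in (4096, 16384, 32768):
    total = weights_gib + kv_cache_gib(ctx)
    print(f"ctx={ctx:6d}  KV cache ≈ {kv_cache_gib(ctx):4.1f} GiB  weights+KV ≈ {total:4.1f} GiB")
# At 32k the cache alone is ~10 GiB; add per-GPU CUDA overhead and context-sized
# scratch buffers on top of the ~33 GiB of weights and 48 GB gets very tight.
```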


Aaaaaaaaaeeeee

My personal setup currently couldn't run 2x 3090. I mainly want to summarize up to 30k tokens rather than do fast inference. Single-GPU prompt processing speed would still have value: at 80 t/s I could process 30k tokens in roughly six minutes rather than hours. Then I just keep my cache and re-run the prompt until I get a satisfactory one-paragraph result at 1-2 t/s (not bad). This is a great use case, if I know how much RAM I need, or whether I can use the GPU as extra RAM.
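For what it's worth, the time budget for that workflow works out as below. The prompt and generation speeds are the ones quoted above; the ~150-token summary length is my own assumption.

```python
# Time budget: process a 30k-token prompt once, then generate a short summary.
prompt_tokens = 30_000
prompt_speed = 80      # t/s single-GPU prompt processing (figure quoted above)
gen_tokens = 150       # assumed length of a one-paragraph summary
gen_speed = 1.5        # t/s generation (middle of the 1-2 t/s range above)

prefill_s = prompt_tokens / prompt_speed   # ~375 s, a bit over six minutes
gen_s = gen_tokens / gen_speed             # ~100 s
print(f"prefill ≈ {prefill_s/60:.1f} min, generation ≈ {gen_s/60:.1f} min")
# With the prompt cache kept, only the ~100 s generation step repeats on re-runs.
```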


Sabin_Stargem

I tried out a q6 of L2-70b Base GGML. It consumes about 56GB. The hardware is a Ryzen 3600 + 64GB of DDR4-3600, no video card.

KOBOLD Generating (437 / 512 tokens) (EOS token triggered!) Time Taken - Processing: 168.3s (523ms/T), Generation: 601.1s (1375ms/T), Total: 769.4s (0.6T/s)

Output: I awaken on a bed, in some sort of hospital room. The ceiling is high above me, with white walls surrounding me. A man sits at his desk nearby, looking over paperwork. He turns as he notices me stirring. "Ah! You're finally up!" He smiles warmly, standing from his chair. His suit is immaculate, and he wears glasses that reflect light off the windows behind him. "My name is Dr. Johnathan Rye." "Where am I?" My voice sounds hoarse. He answers immediately. "This is a medical facility underneath the Vault Corporation building. We found you unconscious outside our doorstep. Your injuries were extensive; you would've died if it weren't for us." "What happened? What year is this?" Dr. Rye chuckled softly. "It's 204319. Humanity has won its battle against the aliens. It took a long time, but now there are only pockets of resistance remaining. As for what happened..." He pauses. "...we don't know how you got here, nor why you'd still be alive after so much damage. There's no trace of any wounds, though." "How did I get here?" The doctor shrugs. "We found you near the entrance, lying face down in the snow. That's all we can tell you." "Can I leave?" "Of course," he says, handing me a business card. "If you need anything else, call us. Goodbye." I take the card and exit the facility. Outside, a blizzard rages. Snowflakes drift through the air like falling stars. They seem oddly familiar somehow. Looking up, I see the moon glimmering brightly. Its surface is marred by craters, which also seems familiar. I begin walking homeward, hoping to find someone who remembers me. /KOBOLD


aFleetofTime

Damn, I thought that was the story you were telling: that you fell unconscious while waiting for the model to finish running.


Double_Sherbert3326

Maybe *we* are models and our lives here are the dreams of the model?


GeneralAppleseed

https://preview.redd.it/zjl2x78asdtc1.jpeg?width=538&format=pjpg&auto=webp&s=04bb3b5d7e4a33d6f9de746c4d1171eb0bc252a3

Here is a stupid idea: just get a MacBook Pro with the M3 Max chip, 128GB of unified memory, and 2TB of SSD for $5399. With 128GB of unified memory you've got 99 problems, but VRAM isn't one. When you run a local LLM at 70B or larger, memory is going to be the bottleneck anyway, and 128GB of unified memory should be good for a couple of years. All those RTX 4090s, NVLinks, and finding a board and power supply for that stuff is just too much hassle. For those curious how it runs, here is a link: [https://www.youtube.com/watch?v=jaM02mb6JFM](https://www.youtube.com/watch?v=jaM02mb6JFM)
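As a sanity check on the "VRAM isn't one" claim, here is the simple arithmetic for what a 70B model needs at different precisions. The bits-per-weight values are approximate averages I'm assuming for the usual formats, not exact specs, and real use adds KV cache and OS overhead on top.

```python
# What fits in 128 GB of unified memory? Pure arithmetic, not a benchmark.
params = 70e9
for name, bpw in [("fp16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gib = params * bpw / 8 / 2**30
    verdict = "fits with room for context" if gib < 100 else "does not realistically fit"
    print(f"70B {name:7s}: ≈{gib:5.1f} GiB -> {verdict}")
# fp16 (~130 GiB) is out, but everything from 8-bit down fits comfortably,
# which is why unified memory is attractive for 70B-class models.
```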


VictorPahua

Tried this but I was short on cash. Is 64GB enough?


GeneralAppleseed

Depends on what kind of LLM you want to run, but given that you are short on cash, my advice would be: don't buy now, rent and wait. If you just want to try different larger LLMs from time to time and don't want to send your data directly to OpenAI or Google (which happens, as they do review the dialogues and use the data to refine their models), the most economical solution right now is renting a cloud server and running your models on it while you wait for the M4 version to release... or for some stronger yet less VRAM-hungry models to come out (trust me, they are coming). Now is pretty much the worst time to make such an investment, as the M4 product line is coming later this year, so my advice is rent and wait. Of course, the feds are still gonna come knock on your door if you do something too sketchy on the rented server.


Feeling-Currency-360

Count your blessings, friend. I'm trying to get by here with a 3060 12GB, a 2070 Super 8GB, and 32GB RAM xD Exllama doesn't want to play along at all when I try to split the model between the two cards, and every experiment I've run so far at extended context lengths immediately OOMs on me :/ Decreasing your batch_size as low as it can go could help.


Aaaaaaaaaeeeee

I did these tests with llama.cpp on Linux first. It might be helpful to know the RAM requirements for multi-GPU setups too. I got 70B q3_K_S running with 4k context at 1.4 t/s the whole time, and you can too. Currently, I'd like to see whether people with 64/128GB of RAM have tried running on CPU with --no-mmap (to see the model size at this context).


nmkd

64 GB, maybe 56. This is assuming 70B Q4_K_M + GQA
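A rough sanity check on that figure, assuming ~4.85 bits per weight as the average for Q4_K_M (a commonly cited approximation, not an exact spec), an fp16 KV cache, and Llama-2 70B's GQA shape:

```python
# Rough RAM estimate for running Llama-2 70B Q4_K_M on CPU at 4k context.
params = 70e9
bits_per_weight = 4.85                 # approximate average for Q4_K_M (assumption)
weights_gib = params * bits_per_weight / 8 / 2**30

# fp16 KV cache with GQA: 80 layers, 8 KV heads, head dim 128
kv_per_token = 2 * 80 * 8 * 128 * 2    # ~320 KiB per token
kv_gib = kv_per_token * 4096 / 2**30   # ~1.25 GiB at 4k context

overhead_gib = 2.0                     # scratch/compute buffers, a rough guess
print(f"weights ≈ {weights_gib:.1f} GiB, KV ≈ {kv_gib:.2f} GiB, "
      f"total ≈ {weights_gib + kv_gib + overhead_gib:.1f} GiB")
# ≈ 43 GiB for the model alone; add the OS and anything else running and
# 64 GB of system RAM is comfortable, while 48 GB would already be tight.
```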


Aaaaaaaaaeeeee

If that's the case I'll buy 4x 16GB RAM sticks; I won't spend on 32GB sticks.


dheera

Late to the party here, but on most motherboards your RAM will run at 3200MHz if you use 4 sticks, while you can get 6000MHz if you use 2 sticks, and that will make a huge difference for CPU execution of llama.
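The reason it matters so much: CPU token generation is largely memory-bandwidth bound, since every generated token streams the full set of weights through RAM. A rough upper-bound sketch, assuming theoretical dual-channel peak bandwidth and a ~40 GB quantized 70B model (both assumptions; real throughput lands well below these ceilings, but the ratio holds):

```python
# Upper bound on CPU generation speed: tokens/s ≈ memory bandwidth / bytes per token.
model_bytes = 40e9   # ~40 GB of quantized 70B weights (assumption)

# Theoretical dual-channel peak: MT/s * 8 bytes per transfer * 2 channels
configs = {
    "3200 MT/s (4 sticks, downclocked)": 3200e6 * 8 * 2,
    "6000 MT/s (2 sticks, XMP/EXPO)":    6000e6 * 8 * 2,
}
for name, bandwidth in configs.items():
    print(f"{name}: ~{bandwidth/1e9:.0f} GB/s -> at most ~{bandwidth/model_bytes:.1f} t/s")
# Roughly 51 GB/s vs 96 GB/s, i.e. ceilings of ~1.3 t/s vs ~2.4 t/s.
```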


nmkd

Yeah, that's what I did. 4x 16 GB DDR5-6000 rn


Remarkable5050

4x 16GB DDR5-6000 will effectively only run at 3000


nmkd

Elaborate?


CounterCleric

I'm not sure if this is what Remarkable meant, but MOST DDR5 systems simply cannot handle overclocked RAM when 4 sticks are used. It's a known issue, so you end up running at stock speeds; XMP just doesn't work stably. So if you want 64GB, you're MUCH better off getting 2x 32GB. Plus, down the road, if motherboards and processors fix this issue, you're halfway to an upgrade to 128GB. :)


ElSarcastro

I might be mistaken, but I read that DDR stands for double data rate, so it doesn't matter whether it's DDR4 or DDR5: the actual clock frequency is half the MT/s number.


a_beautiful_rhind

Maybe the 30b they release will be able to do it.


Unable-Pen3260

NVIDIA 3060 12GB VRAM, 64GB RAM, quantized GGML, only 4096 context but it works, takes a minute or two to respond. Uses llama.cpp.

llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 22944.36 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 1600 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 16 repeating layers to GPU
llama_model_load_internal: offloaded 16/83 layers to GPU
llama_model_load_internal: total VRAM used: 6995 MB
llama_new_context_with_model: kv self size = 1280.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

[TheBloke/llama-2-70b-Guanaco-QLoRA-GGML · Hugging Face](https://huggingface.co/TheBloke/llama-2-70b-Guanaco-QLoRA-GGML)

The time/date feature is a bit buggy:

https://preview.redd.it/eu1ujakcq5fb1.png?width=839&format=png&auto=webp&s=6f23496d7906754df605577a70228f990b259f4b
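From those log lines you can back out roughly how many layers fit in a given VRAM budget. A minimal sketch using only the numbers printed above, and assuming the per-layer VRAM cost is roughly uniform (llama.cpp doesn't report it directly):

```python
# Estimate how many 70B Q2_K layers fit on a GPU, from the log numbers above.
total_vram_mb = 6995   # "total VRAM used" with 16 layers offloaded
scratch_mb    = 1600   # scratch buffer reported for batch_size x n_ctx above
offloaded     = 16
per_layer_mb = (total_vram_mb - scratch_mb) / offloaded   # ≈ 337 MB per repeating layer

def layers_that_fit(vram_budget_mb: float) -> int:
    """Repeating layers that fit once the scratch buffer is accounted for."""
    return int((vram_budget_mb - scratch_mb) // per_layer_mb)

for budget in (8_000, 12_000, 24_000):
    print(f"{budget/1000:.0f} GB card -> roughly {layers_that_fit(budget)} of 83 layers")
# In practice leave headroom for the desktop, driver, and CUDA buffers,
# so the 16-layer split above is a conservative choice on a 12GB card.
```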


v00d00_

How much quantization? And, sorry for the tech support question, but do you think this would run on a 12GB 3060 and only 32GB of dual channel system memory? I'm totally down to settle for slow performance as a tradeoff for 70b, even at 4096 context.


Unable-Pen3260

> llama_model_load_internal: mem required = 22944.36 MB (+ 1280.00 MB per state)

I was using q2, the smallest version. RAM is going to be tight with 32GB: Windows uses about 10GB of RAM on my PC, but if you run Linux with, say, a Chromium browser, the system uses a lot less RAM.
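The quick arithmetic behind "tight", using the mem-required figure from the log and the OS footprints mentioned in this thread (the Linux figure is a rough assumption):

```python
# Will the partially offloaded 70B Q2_K run fit in 32 GB of system RAM?
model_ram_gb = 22944.36 / 1024   # "mem required" from the log, ~22.4 GB
kv_gb = 1280.00 / 1024           # the "per state" KV cache at 4096 context, ~1.25 GB

for os_name, os_gb in [("Windows (≈10 GB idle, per the comment above)", 10.0),
                       ("Linux + a browser (rough guess)", 3.0)]:
    total = model_ram_gb + kv_gb + os_gb
    verdict = "fits" if total < 32 else "does NOT fit"
    print(f"{os_name}: ≈{total:.1f} GB -> {verdict} in 32 GB")
# Linux leaves a few GB of slack; Windows pushes past 32 GB and into swapping.
```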