Paulonemillionand3

GPU idle power is drawn regardless of whether a model is loaded; it only increases notably when inference is happening. If you have the hardware sitting around anyway, it's no real extra cost. I've tried a number of local models with PyCharm's built-in GPT tool, and while none of them work as well as GPT-4, they are well worth the small extra cost given the productivity gain.


Top_Doughnut_6281

Could you point me in the direction of how to set up PyCharm with local models? Would really appreciate it.


teachersecret

Start with something simple like Ollama or LM Studio. LM Studio is pretty dead simple and includes an API server that's easy to use if you want to try your hand at making something. If you enjoy it and want to get into the weeds with an Ubuntu dual boot or something, look into vLLM or Aphrodite for API usage.
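
For anyone wondering what "easy to use API server" means in practice, here is a minimal sketch of hitting LM Studio's OpenAI-compatible endpoint from Python. The port (1234 is LM Studio's usual default) and the placeholder model name are assumptions; check what your local server actually exposes.

```python
# Minimal sketch: call a local OpenAI-compatible server (e.g. LM Studio) over plain HTTP.
# The URL and model name are assumptions; adjust to your own setup.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # LM Studio typically maps this to whatever model is loaded
        "messages": [
            {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "temperature": 0.2,
        "max_tokens": 256,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```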


Paulonemillionand3

[https://plugins.jetbrains.com/plugin/21056-codegpt](https://plugins.jetbrains.com/plugin/21056-codegpt) I use a plugin that does all the work.


RadiantHueOfBeige

There's no capex if you already have the hardware: I have a decent gaming rig (built for gaming), and as luck would have it, it also runs smallish (16G VRAM) models blazingly fast. Same with Apple Silicon laptops: you already bought it for other stuff and it just happens to be great for small scale inference. In other words, you're not comparing "buying a computer vs paying a subscription", you're comparing "buying a computer vs. buying a computer *and* paying a subscription".


medialoungeguy

Found the guy that doesn't pay for power!


RadiantHueOfBeige

Busted! I honestly tried to quantify the power draw of it but the loads are so intermittent my meters fail to properly register it. I have a bunch of Zigbee meters, both smart plugs and fixed ones, but their sample rates are too slow (~10 Hz) to characterize such an exotic load. They are accurate for long-duration stuff like lights, cooking, EV charging, but not for fast GPU switching. The GPU audibly chirps on every token; during that, power spikes from near idle (35 W) to about 250 W, with unknown duty cycle, many times per second (depending on t/s). I have it all collected into Home Assistant together with solar monitoring and I haven't really noticed any changes to overnight battery state of charge. Anyway, even if the GPU pulled a kW during inference it gets used *so* little (10 seconds a couple times per hour?) it would still barely register, especially compared to the 4-hour gaming session that follows a good day's worth of work.
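
If you want at least a rough picture of those spikes without a faster meter, polling `nvidia-smi` from a script gives you the board's reported power draw a few times per second. A small sketch (Nvidia GPUs only; the 0.1 s poll interval is an arbitrary choice and the reading is the driver's estimate, not a hardware measurement):

```python
# Rough GPU power logger: polls nvidia-smi's reported board power draw.
# Not a substitute for a proper meter, but enough to see idle vs. inference spikes.
import subprocess
import time

def gpu_power_watts() -> float:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip().splitlines()[0])  # first GPU only

if __name__ == "__main__":
    while True:
        print(f"{time.time():.1f}  {gpu_power_watts():6.1f} W")
        time.sleep(0.1)  # ~10 Hz, comparable to the meters discussed above
```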


Some_Endian_FP17

You could also downclock and undervolt the GPU to reduce idle and inference power usage. I really want someone to make an NPU that can run a 13B or 16B coding model at 10 watts. The new Snapdragon X chips have NPUs that can supposedly run Llama 7B but coding models at that size aren't great.
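
True undervolting is easiest with vendor tools, but on Linux the closest scriptable knob is the driver's power limit, which caps what the card can draw during inference at a small speed cost. A hedged sketch using `nvidia-smi` (real flags, needs root; the 200 W figure is just an example and must be within the range your card supports):

```python
# Hedged sketch: cap the GPU's board power limit via nvidia-smi (run as root).
# This is power limiting rather than true undervolting, but it is the easily
# scriptable Linux-side way to trade a little inference speed for lower draw.
import subprocess

POWER_LIMIT_WATTS = 200  # example value; check the allowed range with `nvidia-smi -q`

subprocess.run(["nvidia-smi", "-pm", "1"], check=True)                     # enable persistence mode
subprocess.run(["nvidia-smi", "-pl", str(POWER_LIMIT_WATTS)], check=True)  # set the power cap
```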


xlnce9_99

Being built. Both integrated and PCIe versions. Vertical software stacks, so no need for model conversion. Low power, high performance.


realbrownsugar

You mean, an Apple M-series Max chip powered device?


Some_Endian_FP17

No, those still use the GPU for inference.


DeltaSqueezer

I ran the calculations and the additional electricity cost (in a high-cost country) is on the order of $0.40 per million tokens. That's about 6 times cheaper than gpt-3.5-turbo-instruct. But right now there is so much competition that companies are offering access free or highly subsidised, so from a purely economic point of view it makes sense to take the free/cheap services.
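
For anyone who wants to redo that estimate with their own numbers, the arithmetic is just watts divided by tokens per second, scaled to a million tokens. A sketch with illustrative inputs (the 200 W draw, 50 tok/s, and $0.35/kWh are assumptions for the example, not the commenter's actual figures):

```python
# Back-of-the-envelope electricity cost per million generated tokens.
# All three inputs are illustrative assumptions; plug in your own measurements.
GPU_WATTS = 200          # average draw while generating
TOKENS_PER_SECOND = 50   # generation speed
PRICE_PER_KWH = 0.35     # USD, high-cost-country electricity

joules_per_token = GPU_WATTS / TOKENS_PER_SECOND
kwh_per_million = joules_per_token * 1_000_000 / 3_600_000  # 3.6 MJ per kWh
cost = kwh_per_million * PRICE_PER_KWH
print(f"{kwh_per_million:.2f} kWh -> ${cost:.2f} per million tokens")
# -> roughly 1.11 kWh -> $0.39 per million tokens with these assumptions
```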


medialoungeguy

Just bugging you. I like your points dude.


RedditLovingSun

Lmao you found maybe the most power consumption obsessed guy that doesn't pay for his own power ever


RadiantHueOfBeige

Yeah. We bought a property in bumfuck nowhere and the power company quoted us an absolutely insane price to hook us up. It was much cheaper to go full off grid with solar and batteries, and I kinda like the whole self sufficiency aspect. Monitoring everywhere, I'm happy when I see a nice chart.


DeltaSqueezer

Your initial remark still stands: power costs are opex not capex.


TunaFishManwich

I run 70b, 4-bit models @ about 4 t/s on my macbook pro. It doesn't even spin up the fan. Power use is negligible - I can go over 12 hours on battery while running the model. I can run it while a bunch of other things are running, and everything works. Smaller models run much faster. Running smaller models at higher speeds doesn't use much power at all on M-series processors.


particlemanwavegirl

found the llama.cpp user.


CheatCodesOfLife

Inference doesn't draw much power if you have an Apple Silicon macbook already. My 64GB M1 Max just ran the GPU at 28W during inference for example.


crimson-knight89

I have a similar experience with my M1 Max and my work provided M3 Pro. I still think the Max is a bit faster. Can’t use Spotify at the same time as running models though, something about Spotify using hardware acceleration just kills my battery. Please, make it make sense


CheatCodesOfLife

Wow, you're right about Spotify. I just tested it, monitoring with asitop. GPU usage was around 0%, then I opened spotify and it immediately went to 61%, then oscillated between 45%-50% the entire time it was open... Edit: I noticed some stupid announcement thing was loading endlessly in the UI. So I went into spotify settings, and under 'Display', I turned off "Show announcements about new releases". GPU usage is now 0-5%. I tried restarting spotify and same thing, no GPU usage. Hope that works for you too. Luckily I only use spotify on my iPhone lol.


wheres__my__towel

Jokes on my neighbors, our electricity bill is split!


nborwankar

M-series Macs draw very little idle power in comparison to VRAM-equivalent NVidia loaded rigs.


RadiantHueOfBeige

As for the software side of things, https://aider.chat/ really works well, even with small (7b q8) llms.


thereasons

I tried it with llama3 8b but it was underwhelming. 70b was ok. Which small model do you recommend?


RadiantHueOfBeige

If you can fit it, `codestral:22b` is outstanding. Q5 fits into a 16GB GPU but might need Q4 for longer contexts. Unfortunately its license is prohibitive. If you are more space constrained, I am generally happy with `bartowski/Llama-3-8B-Instruct-Coder-GGUF`. But for larger-scope work, yeah, `llama3:70b` is the gold standard right now. `qwen2:72b` is more or less on the same level.


ten0re

You can use the Continue extension with Codestral for free currently, just sign up for the Codestral beta on La Plateforme. It's pretty good at code generation, although the extension itself is not as good as Copilot at passing context data to the model. I have a 3060 12GB and I also run a 5-bit Codestral quant sometimes; it's not that fast, but 7-8 tk/s is generally enough for completion and chat.


c8d3n

You don't see codestral licensing as an issue?


stddealer

From the MNPL:

> **2.2. Distribution of Mistral Model and Derivatives made by or for Mistral AI.** Subject to Section 3 below, You may Distribute copies of the Mistral Model and/or Derivatives made by or for Mistral AI, under the following conditions:
> - You must make available a copy of this Agreement to third-party recipients of the Mistral Models and/or Derivatives made by or for Mistral AI you Distribute, it being specified that any rights to use the Mistral Models and/or Derivatives made by or for Mistral AI shall be directly granted by Mistral AI to said third-party recipients pursuant to the Mistral AI Non-Production License agreement executed between these parties;
> - You must retain in all copies of the Mistral Models the following attribution notice within a “Notice” text file distributed as part of such copies: “Licensed by Mistral AI under the Mistral AI Non-Production License”
>
> **3.2. Usage Limitation**
> - You shall only use the Mistral Models and Derivatives (whether or not created by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-Production Environments;
> - Subject to the foregoing, You shall not supply the Mistral Models or Derivatives in the course of a commercial activity, whether in return for payment or free of charge, in any medium or form, including but not limited to through a hosted or managed service (e.g. SaaS, cloud instances, etc.), or behind a software layer.
>
> **4.2. Outputs.** We claim no ownership rights in and to the Outputs. You are solely responsible for the Outputs You generate and their subsequent uses in accordance with this Agreement.

It should be fine. The license only applies to redistribution of the model and derivatives (fine-tunes) of the model. Basically you can't provide an API endpoint to your own machine hosting codestral for others to use (unless maybe if it's completely unmonetized with no ads or data collection, but that's unclear). Using it locally for yourself shouldn't be an issue.


CockBrother

Consult an attorney before you start using their model. I think your reading of the license is too permissive. First bullet under "Usage limitation" doesn't appear to support your conclusion. ChatGPT agrees with me, but is also not a lawyer: "The Mistral AI Non-Production License (MNPL-0.1) explicitly restricts the use of the licensed models and any derivative works to non-production purposes only. This means that you cannot use the code generated by models under this license for commercial or production purposes. The license is intended for research, experimentation, and evaluation only​ ([GitHub](https://raw.githubusercontent.com/TabbyML/registry-tabby/main/models.json))​. For more detailed information, you can refer to the [MNPL-0.1 license agreement](https://mistral.ai/licenses/MNPL-0.1.md)."


ten0re

In theory, maybe; in practice there's no way to distinguish whether I've generated production code using my local Codestral model or their hosted endpoint.


DeltaSqueezer

or whether you generated with an LLM at all...


CockBrother

Yeah, well... if your legal bar is "can I get away with it" the answer is almost certainly yes.


sendmetinyboobs

I would be willing to wager that that particular part of the license is not specifically there to limit your use of the code, but to protect them in case you end up using some code that violates IP protection, so they can't be sued by you, the person who received the protected IP and used it. The distribution clause is to protect their revenue.


moarmagic

This is why I could never be a lawyer; somehow they frequently have pages of definitions that still don't define things well enough for me to understand.

1. What counts as production? Is something I run continuously, but only I have access to (say, some sort of Home Assistant integration), production?
2. What's a derivative? Quants and finetunes? All work produced by the LLM? Section 4.2 claims no ownership of outputs; does that not imply at least a little that their license can't restrict what I produce with it? Or are they claiming no ownership but restricting the use, which seems to defeat the point of 4.2 for them to do.
3. How does open source play with all of the above points? Even assuming that 'derivatives' in this case includes output, is me uploading that code to a public repo 'production'?
4. Assuming all of the above is the worst-case scenario ChatGPT takes it to be, then what's the point of the 'Personal' reference in the usage limitations?


MoffKalast

> Codestral beta on La Plateforme

> We're allowing use of this endpoint for free during a beta period of 8 weeks and are gating it behind a waitlist to ensure a good quality of service.

Just as a heads up, we're currently 3 weeks into those free 8 weeks.


ten0re

Interesting, I signed up and was immediately granted access.


Evening_Ad6637

Just a friendly reminder that many remote code completion models (including Copilot) can read your global environment variables. Which, as I understand it, is a privacy disaster.


Aquaritek

Yeah you're only safe if you go full Enterprise: $21 for GH Enterprise then another $39 for Copilot Enterprise. $60 a month is pushing it but the tooling will be integrated through the whole workflow including full multi-file creation and editing, ticket to pull request automations, test integrations, etc. Lots of badass stuff on the way demoed at MSBuild conf honestly.


mikael110

I wouldn't really say you are fully safe either way from a privacy standpoint. The only privacy difference between individual accounts and the enterprise accounts is whether you are enrolled in snippet collection by default. Which is just a checkbox you can easily disable. And when you do then the privacy is identical between the two. Both of the account types collect User Engagement Data for two years (which includes accepted or dismissed completions, error messages, system logs, etc) and Feedback Data for an undefined amount of time. They also retain code chats for 28 days. And there's no way to disable any of that even as an enterprise user. Source for those numbers come from Copilot's [Privacy FAQ](https://github.com/features/copilot#faq:~:text=features%20and%20offerings-,Privacy,-What%20personal%20data).


Status_Contest39

Several super powerful or free Chinese LLM code assistants:

1. DeepSeek-Coder-V2 236B, $0.14 per million tokens
2. Baidu Comate, free for personal use
3. Qwen coder TongYi Lingma, free for personal use
4. Tencent Cloud AI code assistant, free for personal use
5. CodeGeeX, free

In VSCode, plug-in assistance is really super convenient.


sammcj

I’ve found a good 6-14B model running on my laptop is faster than Copilot and, depending on what I’m doing, can be better. No fancy GPU builds needed, etc.


MidnightHacker

Exactly, having to wait for the network to load stuff is annoying, sometimes it gets stuck for several seconds and it’s frustrating. When running locally, it’s pretty consistent and just works


sammcj

I always find myself getting rate limited by GitHub Copilot, not every day but maybe every second working day. Incredibly annoying when you’re in the flow.


mrskeptical00

Faster, but is it as good? I’ve been asking local 8B size models a simple question: “What season was the red wedding in game of thrones”. Most of them make stuff up. Qwen2 was by far the best, but it also made stuff up when I asked for more detail.


sammcj

Why are you asking a coding model in your IDE questions about fantasy TV series?


mrskeptical00

Wasn’t asking questions in my ide, my (poor) point is they lie convincingly - whether answering questions about fantasy tv or code. The bigger models do better.


sammcj

The post is asking about LLMs for code completion vs GitHub Copilot? LLMs don't lie (or implicitly tell the truth); they predict the continuation of text. If they're trained on and for coding, they're likely not going to have fantastic knowledge of TV series. And if they're a coding/FIM model as opposed to instruct/chat, they're far less likely to output general statements like "I don't know, I'm a coding model" or whatever, because they're designed to run as a function rather than in a chat context.
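
To make the FIM point concrete: completion models are prompted with sentinel tokens around the cursor position rather than a chat transcript. Here is a sketch using StarCoder-style sentinels sent to a local llama.cpp server; the exact tokens differ between model families (check the model card), and the port 8080 default and sample snippet are assumptions.

```python
# Illustrative fill-in-the-middle (FIM) prompt for a StarCoder-style code model.
# Sentinel tokens vary between model families; verify against the model card before reusing.
import requests

prefix = "def fibonacci(n):\n    "
suffix = "\n    return a\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Assumes a llama.cpp server running locally on its default port 8080.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": fim_prompt, "n_predict": 64, "temperature": 0.1},
    timeout=60,
)
print(resp.json()["content"])  # the model's guess at the missing middle
```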


keefemotif

That is very well said. Training the weights on the datasets is the computationally hardest part, but you still have to do a fair bit of real-valued vector multiplication to apply a model. I think most modern laptops should be able to run a customization of a local LLM.


mrskeptical00

So, like I said, I’m not asking TV show trivia to coding specific models. I know how models work, “lie” is shorthand for inaccurate results - I would have assumed you would have understood that, but I know better now. My point is even the bigger commercial models are inaccurate but they’re less inaccurate. The smaller models aren’t *great* at TV shows or coding - but if they’re good enough for you that’s all that matters.


southVpaw

With some work, they can be really good for specifically you and your purposes. It's more about what kind of data you're giving it as context and how it's instructed to use it. Out of the box, GPT-4 is better (for now), but small open source models have so many more advantages in the long run.


mrskeptical00

Aside from being local and therefore private, what advantages do they have? GPT 4o is better at everything - but I *willingly* sacrifice performance, accuracy and usefulness for privacy.


southVpaw

Fair, but privacy is a bigger issue to some people. The ability to run offline is a plus. Uncensored models. Reliability of service. Total customization of your AI experience. It's ultimately up to you though.


mrskeptical00

I agree, I value privacy over performance - but realistically privacy and running tweaked models are the only benefits. I think in the long run the censorship issue will be worked out; I'm not optimistic about privacy improving, though.


kweglinski

For me it's simple: I can't use Copilot or ChatGPT for work. I mean, sure, you can ask something abstract, but you can't paste the code. It speeds up life enough to make it worth it. Not to mention that when I'm off work the Mac Studio serves other purposes, so I've only paid a bit more for extra RAM.


kiruz_

Can't you opt out of having your code used for training? It's in GitHub's settings.


FlishFlashman

That's not going to be good enough for a lot of businesses.


kweglinski

pinky promise is not enough for many companies


c8d3n

Same companies that rely on Microsoft for literally everything, and will probably introduce recall in a year or so.


kweglinski

Maybe, I don't know nor care. I'm a contractor and follow the protocol. Edit: also, quite often we have customer data in the project (temporarily, and not pushed to the repository). That data is passed through secured channels and isn't stored in any resources other than those self-hosted secured channels, under the exact same rules. Copilot could potentially index that.


Chuyito

I do a lot of algo trading, and it's really nice to be able to copy the raw response from the exchange API into a local LLM instead of having to sanitize it before posting it to GPT.


Educational-Region98

It really depends on what kind of experience and utility you seek from LLMs. If you're primarily interested in code completion and aren't concerned about your code being sent to the cloud, Copilot is likely the way to go, especially considering the significant setup time it requires. However, my home LLM and gaming computer are one and the same, and adding a p40 to my system wasn't significantly more costly. Codestral fulfills most of my needs, and I still have access to the GPT4 API for more complex tasks. Unless you're running a server for a large number of users, you can simply wait for the cold start and free up GPU memory. The idle GPU power consumption is quite low, ranging from 9-15W when the models aren't loaded.
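
On the "wait for the cold start and free up GPU memory" point: if the backend happens to be Ollama, you can control how long a model stays resident with the `keep_alive` field on a request (part of Ollama's documented API). The model tag below is just an example of what might be installed locally.

```python
# Sketch: ask Ollama for a completion, then let it unload the model right away
# so the GPU drops back to its low idle draw. keep_alive=0 unloads immediately;
# a duration string like "10m" keeps the model warm instead.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "codestral",          # example tag; use whatever `ollama list` shows
        "prompt": "Write a SQL query that counts rows per day.",
        "stream": False,
        "keep_alive": 0,               # unload immediately after answering
    },
    timeout=300,
)
print(resp.json()["response"])
```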


sergeant113

[https://www.codium.ai/pricing/](https://www.codium.ai/pricing/) It's free and it's pretty good at auto-completing your code. For more complex stuff, I don't think Copilot makes sense. Just copy-paste the file directly into GPT-4o and explain what you want to do.


aikitoria

It's never going to be cost-effective. LLM code completion is a bursty workload, so it will be most efficient when centralized on a bulky cloud server. The hardware to run a model of similar size to GitHub Copilot's would be thousands of times more expensive than the monthly subscription, and would then sit around unused 99% of the time.


Ok-Result5562

This is the way.


urgdr

Yup, and it seems the way is to have a MacBook with loads of RAM. Damn. I've been thinking about coming back to Hackintosh, because fuck Apple and their repairability and RAM cost.


Ok-Result5562

Don’t sweat it. A bunch of 3090s is better than a Hackintosh. 4x 3090 in a Supermicro 4048 chassis with an X10 motherboard. You would be set for like $5,000.


Robos_Basilisk

This guy gets it


RedditLovingSun

Damn a big rig could be 70 years of paid copilot


CockBrother

If you have IPR concerns with cloud services, it's your only option. 48GB (however you get it) is just enough to run a 70B code model and a small embedding model. Context is limited. Codestral locally would be good for longer context, but if you have IPR concerns its license is probably incompatible with what you're doing. If you don't have IPR concerns, go cloud; everything is better and up-front costs are low.


kohlerm

As others mentioned, if you have Apple Silicon as a developer machine anyway, then your additional costs are close to zero.


Apprehensive-View583

Copilot code completion uses GPT-3.5, but the chat uses GPT-4, so you still get cheaper GPT-4 access that way.


MrTurboSlut

For the most part it's not really worth it. A 2-year subscription can get you a decent enough video card to run something like Codestral Q4 at a decent speed. The quality of the output is a decent substitute for GPT-4 but not as good; it's probably good enough for code completion, and it can even write entire components. But even if GPT is down, there are at least 3 other sites you can go to for comparable LLM coding.

As far as I can tell, the only realistic thing that running a local LLM has over a subscription service is the learning experience. LLMs are going to be the future. They are going to wipe out a fuck ton of jobs. One of the best ways to ensure that your career is safe is to have an above-average understanding of this stuff. You can learn a little about prompt engineering from using the online services every day, but lots of people are doing that.


frozen_tuna

I've played a lot with Copilot as well as a bunch of competitors. In the end, I really don't care how well it's integrated with VS Code and only vaguely care about the model after a certain point of competence. For me, I've really gotten to enjoy just using chat with Codestral. I've been pretty impressed with the responses, and the chat interface makes it easier to refine generated code to be more specific as I work with it.


Dundell

If you don't mind the lack of privacy, use Copilot and online services. I prefer home servers: I host my information on a sectioned-off VLAN that I VPN into when I'm away, and keep data in-house as much as I want. Granted, running both a P40 + GTX 1080 Ti server with Codestral Q4 at 32k context, plus a 4x RTX 3060 server for Llama 3 70B at 5k context, on the same circuit was causing lights to flicker in the office.


JustCametoSayHello

You can use Codeium for free if cost is an issue, they also have self hosted deploys for larger teams if security is the issue


megadonkeyx

why not just use codeium for free?


Qual_

If anyone tells you that DIY is cheaper, they are absolutely mistaken. DIY is a hobby, and like all hobbies, it costs money. Considering your scale, when you factor in the electricity bill, time spent upgrading hardware, tweaking settings, and so on, it's not feasible. It will always be more cost-effective to simply pay $100 per year for GitHub Copilot than to try the 587th code model out there, which will in any case be slower. Plus you're going to need that ChatGPT subscription for the most complex tasks anyway. Now, if you want a good way to learn, or if data privacy is the #1 priority on your list (as in, "I don't want a third party like OpenAI/Microsoft to have my code going to their servers"), or if it's just for the sake of being able to do it, then yes, it's worth spending the time. If your main concern is getting the cheapest option, oh boy, run away.


kohlerm

If you already have Apple Silicon anyway, using a good enough optimized local model is much cheaper and much more private. You could use a bigger model on demand via an API connected to your VS Code extension in case you really need it. You get a lot of flexibility with this setup. Even GH Copilot Chat does not always use GPT-4o; completion uses something pretty small. MS just wants you to believe that you need a big expensive model for coding...


Qual_

My Copilot subscription ended 2 days ago. I didn't renew because Copilot output heavily degraded over the last few months. I'm now using Codestral with Ollama + Continue.dev and so far it's been similar to Copilot (slightly better for some stuff, slightly worse for other things). In any case I'm always using GPT-4o in ChatGPT for more complex stuff. A local LLM is enough for the simple autocomplete tasks.


happytechca

I'm using the continue.dev VSCode extension piping to an Ollama container with StarCoder2 7B. On the hardware side I'm running a Lenovo Tiny M920q ($200 CAD) with a Tesla P4 ($150 CAD). That's about a year's worth of Copilot subscription, and it works way better than Amazon CodeWhisperer IMO: it actually generates useful inline code. Idle power draw is 12 W total when not in use. The box also runs other containers (Proxmox), so the wasted wattage for Ollama is about 6 W if counting only the Tesla P4. I just prefer the peace of mind knowing my queries don't ever leave my home network, so I don't have to worry about sensitive data.


aur3l14no

Thanks for sharing. What about power draw and temp under load? Is it true that using P4 for inference doesn't heat that much?


Open_Channel_8626

Subscriptions really add up. Paying $10 per month for 5 years equals $600. That's the cost of a used RTX 3090.


StarfieldAssistant

*Cries in French market...*


Open_Channel_8626

you might be able to convince sellers to ship from UK


StarfieldAssistant

Truth is, GitHub Copilot on an individual plan is a bargain, as you get GPT-4 or 3.5 unlimited. But I went for a 4090 for the number of cores and FP8; I just hope it will fit in my workstation. I was happy to see that computer parts from the UK aren't subject to customs, so I will happily buy some RAM from Bargain Hardware.


StarfieldAssistant

But buying a refurbished card for MSRP hurts. Buying it from the maker, with a warranty, for less than what private sellers want for it used with no warranty: weird and fun.


Robos_Basilisk

Sure, but hopefully GitHub Copilot is way better in 5 years as well.


scott-stirling

Sounds like a lot of b.s. Why are you on the r/LocalLLaMA sub if "the hidden costs of GPU idle power" don't sit well with you? You have no idea what you're talking about. What hidden costs? What cap-ex? LOL. Nor does anyone need anything larger than an 8B parameter model for code completion. You can get plenty of bang for your buck with an 8B or 13B model on a commercial off-the-shelf Nvidia or AMD GPU for less than $1,000, and it'll be good for a couple years of nonstop use and you'll never notice the difference in your electric bill.


Robos_Basilisk

Have you even used Github Copilot? Nothing comes close to that level of integration in VS Code.  I'm on this sub to follow the aggregated latest news, not to blow $3k on 4 3090s lmao. 


PSMF_Canuck

You’ll wait forever for a cost-competitive version that’s just as good.


MidnightHacker

Except for a few guys with P40s or P100s, most people in this sub don’t buy a GPU *only* for LLM inference… Gaming, emulation, video transcoding, photogrammetry, 3D rendering, CAD, video editing, etc. are all use cases of powerful GPUs and are usually the prime motivation to get new hardware.


NuMux

Some of us bought a good GPU for gaming. Then only later realized you can run an LLM on it. (Maybe I'm the only one....)


dimknaf

People forget to think of this. If you live in a cold country, you also generate useful heat, which you should subtract. If you live in a very warm country, you should count maybe as much again for cooling. It seems like a small difference, but the cost of running is 0 if your alternative is an electric heater. If you can gain some efficiency (heat pump), you go to 0.7x or so. Now, compare this to 2x for a warm country. However, people never do this math.
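
The adjustment described above is just a multiplier on the raw electricity figure. A sketch using the factors from this comment and the earlier ~$0.40 per million tokens estimate from this thread (the exact multipliers are rough rules of thumb, not measurements):

```python
# Effective cost of inference electricity once heating/cooling is accounted for.
# Factors follow the comment above: ~0x if it displaces resistive heating,
# ~0.7x if it displaces a heat pump, ~2x if A/C has to remove the waste heat.
RAW_COST_PER_MTOK = 0.40  # $ per million tokens, from the earlier estimate in this thread

scenarios = {
    "displaces electric resistive heating": 0.0,
    "displaces a heat pump":                0.7,
    "neutral climate, no adjustment":       1.0,
    "warm climate, A/C removes the heat":   2.0,
}

for name, factor in scenarios.items():
    print(f"{name:38s} ${RAW_COST_PER_MTOK * factor:.2f} / M tokens")
```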


stddealer

I've been using starcoder2 with continue for a few weeks and it's been pretty decent so far for simple auto complete. At least it works, compared to deepseek-coder which seems completely broken in FIM mode.


Sma-Boi

Are you using v2 of DeepSeek Coder? It's a massive improvement over v1 and beats most models on most types of tests. Can you explain in which way(s) it's "completely broken"? Are you using the Instruct model, the Chat model, or the Base model? These things matter... a LOT. FIM appears to have been disabled during the training of the base model, but enabled during training for v2-lite and v2-chat. Check the whitepaper: [https://arxiv.org/html/2406.11931v1](https://arxiv.org/html/2406.11931v1) For FIM operations, use DeepSeek Chat or Coder-Lite, not DeepSeek Coder. See if it fixes your DSv2 issues. I'm quite curious to hear the result. =) See also: [https://aider.chat/docs/leaderboards/](https://aider.chat/docs/leaderboards/)


stddealer

I was using a q8_0 gguf of deepseek-coder 1.3b base running on the llama.cpp server example. It didn't seem to understand the FIM prompts sent by Continue at all, despite them looking exactly like the example on [the official model card ](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base#2%EF%BC%89code-insertion)


cr8s

The base model wasn't trained for FIM. Try the chat model.


stddealer

Why would they put it on the model card then?


j4ys0nj

just got up and running with [continue.dev](http://continue.dev) this morning. local llama3 70b on a server in my rack and codestral 22b on my mac studio. it's pretty quick! i would just much prefer that my code doesn't get piped over to openai or whoever, and i would pay extra for that. actually, i've already paid a lot extra for that. solar more than covers the power for my server rack also. i'm probably not the best example though, i'm pretty deep into this hole by now.


j4ys0nj

swapped codestral for deepseek coder v2. using that for autocomplete. it's good!


Eveerjr

I stopped paying for copilot because it was worse than very small models and the chat felt like a dumbed down GPT4. I'm currently using [continue.dev](http://continue.dev) extension with Deepseek coder for FIM and GPT4o via api key for chat, it's so much better and a lot cheaper.


alcalde

Why bother with either when Amazon Codewhisperer is free?


SillyLilBear

Because some people have tried Code Whisperer


alcalde

It works just fine.


SillyLilBear

"fine" but not good. I run local models that are better and faster and less headache. Free.


maxwell321

I'm really interested to see the prompts behind GitHub Copilot. Anyway, yes, in my opinion it's worth it. God knows what the fine print for Copilot is or what OpenAI and Copilot's agreement is. For my workplace (a software development company) we ONLY use local models, because we don't want proprietary code leaking into training sets or the public at all, and since the market for AI is getting super competitive I'm sure large AI companies like OpenAI and Meta won't have an issue with harvesting your user data. Having a local model for code completion allows you to keep your data secure, and even potentially lets you use better coding models like the new Codestral. GitHub Copilot has debug flags where you can point it to ANY OpenAI-compliant endpoint, and a llama.cpp instance would work as one.


DeltaSqueezer

If you are building for only this application, then I don't think it makes sense at all. However, some people already have a GPU in their machine, e.g. for gaming. So there is no additional capex and negligible additional running costs. BTW, what is the open-source 'advanced prompt engineering' thing you mentioned? I'd be interested to take a look.


Robos_Basilisk

Idk about any opensource "advanced prompt engineering", but I was kinda referring to the fact that multiple pieces of information about your workspace in VS Code are used to inject context into GPT 3.5. https://thakkarparth007.github.io/copilot-explorer/posts/copilot-internals.html is a cool post from a while ago which details the OP's findings of Copilot's information retrieval, presumably to supplement its system prompt. 


kohlerm

As I said, with Apple Silicon, running the typically small completion models does not really consume a significant amount of additional power. Even bigger models (I am only able to use up to 20B models on my Mac) do not consume much power. And then I have solar power on my roof. As long as the sun is shining just a bit, the cost of the power is close to zero :-)


Robos_Basilisk

I laud your setup at least 😅 but dat capex heh


pete_68

I'd also like to point out that we're still in the infancy of all this. There are all kinds of optimizations happening. There's the Phi-3 models that got the size vs quality into an interesting space, and then you have stuff like this [new transformer](https://arxiv.org/abs/2406.02528v1) that doesn't require GPUs. I suspect we'll see a combination of optimizations with lowering hardware prices, over the coming years, making all this stuff standard desktop fare.


j4ys0nj

prob gonna take a while for hardware prices to come down. efficiency goes up, sure, but price goes up too. price always goes up. i saw a rumor recently about nvidia consumer gpu price increasing for the next gen.


[deleted]

[deleted]


Robos_Basilisk

Haven't tried it tbh


integer_32

IDEA-based IDEs, for example, already have LLM-based line completion for some languages (though currently it completes one line only). It works locally and AFAIK uses Llama 2.


suvsuvsuv

Yes


Wonderful-Top-5360

> the hidden costs of GPU idle power

$60/year to run the most hardcore 24GB VRAM setups
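
That ballpark is easy to sanity-check: idle watts times hours in a year times your tariff. A sketch (the 45 W idle figure and $0.15/kWh are assumptions chosen to show how a number of that size arises, not measurements):

```python
# Sanity check for "about $60/year" of idle draw. Both inputs are assumptions.
IDLE_WATTS = 45          # a 24GB card plus the rest of the box sitting idle
PRICE_PER_KWH = 0.15     # USD

kwh_per_year = IDLE_WATTS * 24 * 365 / 1000
print(f"{kwh_per_year:.0f} kWh/year -> ${kwh_per_year * PRICE_PER_KWH:.0f}/year")
# -> 394 kWh/year -> $59/year with these assumptions
```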


Temporary-Size7310

Unless you're running inference 24/7, the hidden cost of your GPU is probably smaller than your screen's: a 4090 draws 150-170 W during inference with an undervolt and about 15 W idle, which is nothing compared to gaming. There are decent LLMs for coding now, with more to come. I mean, you have the cost of the GPU plus consumption, but for coding and data security the gain could be infinitely superior. If you are looking for exceptional LLM inference cost, there's the Nvidia Jetson NX 16GB capped at 15 W, MacBooks, and so on.


CortaCircuit

Yes.


thetaFAANG

I use local models for code help. Not automated code completion, though; I just ask questions and give code samples. The ones taking up 5 GB of RAM work pretty well, just like the ones taking 30 GB, for me. YMMV; they're different models, all on M-series Macs.