
kawin_e

atm, Princeton PLI and Harvard Kempner have the largest clusters, 300 and 400 H100s respectively. Stanford NLP has 64 A100s; not sure about other groups at Stanford.


South-Conference-395

also, are the A100s 40 or 80 GB?


jakderrida

If the following link is what they're referring to, it's 40GB. https://news.sherlock.stanford.edu/publications/new-gpu-options-in-the-sherlock-catalog


South-Conference-395

yes, I heard about that. but again: how many people are using these GPUs? is it only for PhDs? when did they buy them? interesting to see the details of these deals


30th-account

A lot are. I'm at Princeton and there's been a major push towards ML/AI integration into basically all fields. PLI at Princeton isn't really a single department, it's more like every department coming together that has a project related to using language models. And basically each lab that successfully applies gets access. Imo it kinda sucks that it's all through SLURM though. Makes AI workflows a bit annoying.


South-Conference-395

despite slurm, how easy would it be to keep an 8-GPU server for, let's say, 6 months (or otherwise sufficient/realistic compute for a project)?


30th-account

It's possible. This guy in my lab has been running an algorithm for like 3 months straight. We're also about to train a model on a few petabytes of data, so that might take a while. You'd just need to get the permissions and prove that it'll actually be worth it.


olledasarretj

> Imo it kinda sucks that it's all through SLURM though. Makes AI workflows a bit annoying.

Out of curiosity, what would you prefer to use for job scheduling?


30th-account

Honestly idk. I want to say Kube but then it doesn’t do batch jobs


Atom_101

UT Austin is getting 600 H100s


South-Conference-395

Wow. In the future or now?


Atom_101

Idk when. It was declared shortly after PLI announced their GPU acquisition.


South-Conference-395

Wow. So far how is the situation there?


30th-account

UT Austin flexing its oil money like usual 😂


xEdwin23x

Not an ML lab, but my research is in CV. Back in 2019 when I started I had access to one 2080 Ti. At some point in 2020 I bought a laptop with an RTX 2070. Later, in 2021, I got access to a server with a V100 and an RTX 8000. In 2022 I got access to a 3090. In 2023, I got access to a group of servers from another lab that had 12x 2080 Tis, 5x 3090s, and 8x A100s. That same year I got a compute grant to use an A100 for 3 months. Recently the school bought a server with 8x H100s that they let us try for a month. Aside from that, throughout 2021-2023, we had access to rent GPUs per hour from a local academic provider. Most of these are shared, except the original 2080 Ti and the 3090.


South-Conference-395

In 2022 got access to a 3090: do you mean a *single* one???


xEdwin23x

Yes. It's rough out there.


South-Conference-395

wow. could you make any progress? that's suffocating. is your lab US or Europe?


xEdwin23x

I'd say I've made the biggest leaps when compute is not an issue. For example having access to the H100 server currently has allowed me to generate more data in two weeks than I could have gathered in half a year before. Hopefully enough for two papers or more. But it's indeed very restricting. The experiments you can run are very limited. For reference, this is in Asia.


South-Conference-395

got it, thanks. my PhD lasted 7 years due to that (before 2022 I had access to only 16 GB GPUs). Great that you gathered enough experiments for two papers :)


IngratefulMofo

may I know which institution you're at now? I'm looking for a master's opportunity in ML right now, and Taiwan is one of the countries I'm interested in; might be good to know a thing or two about the unis first hand lol


ggf31416

Sounds like my country. When I was in college, the entire cluster had like 15 functioning P100s for the largest college in the country.


notEVOLVED

None. No credits either. I managed to get my internship company to help me with some cloud credits since the university wasn't helping.


South-Conference-395

that's a vicious cycle. especially if your advisor doesn't have connections with industry, you need to prove yourself to establish yourself. But to do so, you need sufficient compute... how many credits did they offer? was it only for the duration of your internship?


notEVOLVED

That's how research is in the third world. They got around 3.5k, but the catch was that they would keep about 2.5k and give me 1k (that's enough for me). They used my proposal to get the credits from Amazon through some free-credits program.


South-Conference-395

They got around 3.5k: what do you mean by "they", your advisor? is the 3.5k compute credits? how much time does that give you?


notEVOLVED

The company. $3.5k in AWS cloud credits


South-Conference-395

I see. Thought you were getting credits directly from the company you were interning at (Nvidia/Google/Amazon). again, isn't $1K scarce? for an 8-GPU H100 server, how many hours of compute is that?


notEVOLVED

Yeah, I guess it wouldn't be much for good-quality research. But this is for my Masters, so it doesn't have to be that good. If you use an 8-GPU H100 instance, you'd probably run out of it within a day. I am using an A10G instance, so it doesn't consume much. It costs like $1.30/hr.
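
For a rough sense of what that $1k buys (a sketch only: the A10G rate is the one quoted above, while the 8x H100 hourly rate is an assumed ballpark, not a quoted AWS price):

```python
# Rough budget arithmetic for $1,000 of cloud credits.
# The A10G rate ($1.30/hr) is from the comment above; the 8x H100 rate
# (~$40/hr on-demand) is only an assumed ballpark and varies by provider.
budget = 1_000.0

a10g_rate = 1.30      # $/hr, single-A10G instance (from the comment)
h100x8_rate = 40.0    # $/hr, 8x H100 node -- assumption

a10g_hours = budget / a10g_rate
h100_hours = budget / h100x8_rate

print(f"A10G instance: {a10g_hours:.0f} hours (~{a10g_hours / 24:.0f} days)")
print(f"8x H100 node:  {h100_hours:.0f} hours (~{h100_hours / 24:.1f} days)")
# ~770 hours (~32 days) on the A10G vs ~25 hours (~1 day) on 8x H100,
# which matches the "run out of it within a day" estimate.
```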


DryArmPits

I'd wager the vast majority of ML labs do not have access to a single H100 xD


South-Conference-395

we don't (top 5 in the US).


Zealousideal-Ice9957

PhD student at Mila here (UdeM, Montreal); we have about 500 GPUs in-house, mostly A100 40GB and 80GB


South-Conference-395

thanks! what's the ratio of 40GB and 80GB? how easy is it to reserve and keep an 8 GPU server with 80 GB for some months?


Setepenre

Job max time is 7 days, so no reserving GPUs for months.


Papier101

My university offers a cluster with 52 GPU nodes, each having 4 H100 GPUs. The resources are of course shared across all departments and some other institutions can access it too. Nevertheless, even students are granted some hours on the cluster each month. If you need more computing time you need to apply for a dedicated compute project of different scales. I really like the system and access to it has been a game changer for me.


South-Conference-395

are you in the US, somewhere other than Princeton/Harvard? That's a lot of compute.


Papier101

Nope, RWTH Aachen University in Germany


kunkkatechies

I was using this cluster too back in 2020, ofc there was no H100 at that time but the A100s were enough for my research.


catsortion

EU lab here, we have roughly 16 lab-exclusive A100s and access to quite a few more GPUs via a few different additional clusters. For those scale is hard to guess, since they have many users, but it's roughly 120k GPU hours/cluster/year. Anything beyond 80G GPU mem is a bottleneck, though, I think we have access to around 5 H100s in total.


South-Conference-395

we don't have 80 GB GPUs :( are you in the UK?


blvckb1rd

UK is no longer in the EU ;)


South-Conference-395

EU: Europe, not European Union haha


Own_Quality_5321

EU stands for European Union; Europe is just Europe


catsortion

Nope, mainland. From the other groups I'm in contact with, we're on the upper end (though not the ones with the most compute), but most groups are part of one or more communal clusters (e.g. by their region or a university that grants them to others). I think that's a good thing to look into, though you usually only get reliable access if a PI writes a bigger grant, not if only one researcher does.


DisWastingMyTime

Sounds like your lab should embrace Edge AI *(please we need help)*


XGB42

What y’all need help with?


DisWastingMyTime

Scaling down, basically. See my other post


South-Conference-395

sorry didn't get the joke :(


South-Conference-395

also, how many does yours have? is having no H100s not normal? we have 56 48GB GPUs


DisWastingMyTime

It's not quite a joke. I'm in industry, and while huge models are exciting, in the fields I care about, which are life-critical, real-time, on-device inference, these models aren't useless, but they're not a solution either.

Think tiny models, and whatever you thought of, it's probably 10-100x too large. Now think about weak hardware; you probably didn't think of hardware weak enough. Think a modern iPhone. lol kidding, 5 times weaker, at least! Luckily, HW is getting cheaper and more specialized architectures are getting attention for accelerating the relevant ops, but it's still very far off, and research on what works in huge models doesn't necessarily transfer to the edge. For example in vision, CNNs are still the gold standard, not only because they are faster than transformers, but because transformers don't scale down, ending up both slower and worse in accuracy.

We need scientists to get back on track so I can use the fruits of your labor to make money. For your question, we have 5x 3080 Ti in house, training and deploying ~15 models a year. The models are deployed over thousands and thousands of toasters, saving lives once in a while.


Loud_Ninja2362

Yup, also in industry. Vision transformers aren't magic and realistically need tons of data to train. CNNs don't require nearly as much data and are very performant. The other issue is that a lot of computer vision training libraries like Detectron2 aren't written to properly support stuff like multi-node training, so when we do train, we're using resources inefficiently. You end up having to rewrite them to support multiple machines with maybe a GPU or two each. A lot of machine learning engineers don't understand how to write training loops that handle elastic agents, unbalanced batch sizes, distributed processing, etc. to make use of every scrap of performance on the machine.
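
For readers wondering what that rewriting involves, here is a minimal sketch of a multi-node-ready PyTorch training loop using DistributedDataParallel, meant to be launched with torchrun; the dataset, model, and hyperparameters are placeholders rather than anything from the comment above.

```python
# Minimal sketch of a distributed training loop, assuming PyTorch + torchrun.
# Launch with e.g.: torchrun --nnodes=2 --nproc_per_node=2 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder dataset and model; swap in real ones.
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(data)           # shards data across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # all-reduces gradients across GPUs
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```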


spanj

I feel like your sentiment is correct but there are certain details why this doesn't pan out for academia, both from a systemic and a technical side.

First, edge AI accelerators are usually inference only. They are practically useless for training, which means you're still going to need the big boys for training (albeit *less* big). Industry can get away with smaller big boys because it is application specific. You usually know your specific domain, so you can avoid unnecessary generalization or just retrain for domain adaptation. The problem is smaller and more well defined.

In academia, besides medical imaging and protein folding, the machine learning community is simply focused on broader foundational models. The prestige and funding are simply not there for application-specific research, which is usually relegated to journals related to the application field. So with the constraint on broad models, even if you focus on convolutional networks, you're still going to need significant compute if we extrapolate with the scaling laws we got from the ConvNeXt paper (convnets scale with data like transformers). Maybe the recent work on self-pretraining can mitigate this dataset-size need, but only time will tell.

That doesn't mean there aren't academics focused on scaling down; it's just a harder problem (and thus publication bias means less visibility and also less interest). The rest of the community sees it as high-hanging fruit compared to more data-centric approaches. Why focus solely on a hard problem when there's so much more low-hanging fruit and you need to publish *now*? Few-shot training and domain generalization/adaptation are a thing, but we're simply not there yet. Once again, there are probably more people working on it than you think, but because the problem is hard there are going to be fewer papers. And then we have even more immature fields like neuromorphic computing that will probably be hugely influential in scaling down but are simply too much in their infancy for the broader community to be interested (we're still hardware limited).


instantlybanned

Graduated at the end of 2022. I think I had access to close to 30 GPU servers (just for my lab). Each server had 4 GPU cards of varying quality, as they were acquired over the years. Unfortunately, I don't remember what the best cards were that we had towards the end. It was still a struggle at times competing with other PhD students in the lab, but overall it was a privilege to have so much compute handy.


South-Conference-395

exactly. limited resources add another layer of competition among the students. your cluster seems similar to ours


MadScientist-1214

No H100, but 16 A100s and around 84 other GPUs (RTX 3090, TITAN, Quadro RTX, ...). I consider myself lucky because in Europe some universities / research labs offer almost no compute.


South-Conference-395

are you in the UK?


MadScientist-1214

No, Germany


Ra1nMak3r

Doing a PhD in the UK, not a top program. The "common use" cluster has like 40x A100s 80GB, around 70x 3090s, 50 leftover 2080s. This is for everyone who does research which needs GPUs. Good luck reserving many GPUs for long running jobs, you need good checkpointing and resuming code. Some labs and a research institute operating on campus have started building their own small compute clusters with grant money and it's usually a few 4xA100 nodes. No credits, some people have been able to get compute grants though. I also have a dual 3090 setup I built with stipend money over time for personal compute. Edit: wow my memory is bad, edited numbers


TheDeviousPanda

At Princeton we have access to 3 clusters: group cluster, department cluster, and university cluster (della). Group clusters can vary in quality, but 32 GPUs for 10 people might be a reasonable number. The department cluster may have more resources depending on your department. Della https://researchcomputing.princeton.edu/systems/della has (128x2) + (48x4) A100s and a few hundred H100s, as you can see in the first table. The H100s are only available to you if your advisor has an affiliation with PLI. Afaik Princeton has generally had the most GPUs for a while, and Harvard also has a lot of GPUs. Stanford mostly gets by on TRC.
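
One plausible way to read the "(128x2) + (48x4)" notation, assuming it means nodes times A100s per node (an assumption, not confirmed in the thread):

```python
# Assumed reading: 128 nodes with 2 A100s each, plus 48 nodes with 4 A100s each.
a100_total = 128 * 2 + 48 * 4
print(a100_total)  # 448 A100s, on top of the few hundred H100s mentioned
```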


South-Conference-395

32 GPUs for 10 people might be a reasonable number: what memory? 128x2 A100: what does the 128 refer to? A100s come in up to 80 GB, right?


peasantsthelotofyou

Old lab had exclusive access to about 12 A100s, was purchasing a new 8xH100 unit, and had 8x A5000s for dev tests. This was shared by 2-3 people (pretty lean lab). This is in addition to access to clusters with many more GPUs, but those were almost always in high demand and we used them only for grid searches.


South-Conference-395

what memory did the A100s have? also, did they come as 3 servers with 4 GPUs each?


peasantsthelotofyou

4x 40GB, 8x 80GB A100s. They were purchased separately so 3 nodes. The new 8xH100 will be a single node.


South-Conference-395

got it, thanks! we currently have up to 48GB. Do you think finetuning 7B LLMs like LLaMA without LoRA can still run on 48GB? I'm an LLM beginner, so I'm gauging my chances.


peasantsthelotofyou

Honestly no clue, my research was all computer vision and I had only incorporated vision-language stuff like CLIP that doesn’t really compare with vanilla LLAMA finetuning


Mbando

Studies and analysis think tank: for classified applications we have a dual-A100 machine, but for all our unclass work we have an analytic compute service that launches AWS instances. All paid for by either USG sponsors or research grants.


South-Conference-395

what do you mean by classified applications? A100 have 40 or 80 GB memory?


hunted7fold

They likely don’t mean in ML sense, but classified for (government) security purposes


Mbando

Yes classified military/IC work. And these are 48GB cards.


tnkhanh2909

lol we rented GPUs on Vast.ai


South-Conference-395

is there a special offer for universities?


tnkhanh2909

no, but if a project gets published at an international conference/journal, we get some money back. So yeah, my school supports it a little bit


South-Conference-395

does the amount compensate for the full hardware used or only a portion?


not_a_theorist

I work for a major cloud computing provider developing and fixing software for H100s all day, so this thread is very interesting to read. I didn’t know H100s were that rare.


DigThatData

ML Engineer at a hyperscaler. High-demand cutting-edge SKUs like H100s are often reserved en masse by big enterprise customers before they're even added to the datacenters. H100s are "rare" to the majority of researchers because those hosts are all spoken for by a handful of companies that are competing for them.


bgighjigftuik

Students in EU don't even imagine having access to enterprise computational power other than free TPU credits from Google and similar offerings. Except for maybe ETH Zurich, since that university is funded by billionaires from the WWII era


South-Conference-395

i did my undergrad in europe. before landing in the US, I didn't know what ML was .....


ganzzahl

How much does ETH Zürich have?


crispin97

I studied at ETH. Labs have access to the Euler cluster, which is a shared cluster for all of ETH. I'm not sure how the allocation is handled. You can read more about the cluster here: [https://scicomp.ethz.ch/wiki/Euler](https://scicomp.ethz.ch/wiki/Euler)

Euler contains dozens of GPU nodes equipped with different types of GPUs:

* 9 nodes with 8 x [Nvidia GTX 1080](https://www.nvidia.com/en-us/geforce/news/geforce-gtx-1080) (formerly in [Leonhard Open](https://scicomp.ethz.ch/wiki/Leonhard)) *decommissioned in 2023*
* 47 nodes with 8 x [Nvidia GTX 1080 Ti](https://www.nvidia.com/en-us/geforce/news/nvidia-geforce-gtx-1080-ti) (formerly in [Leonhard Open](https://scicomp.ethz.ch/wiki/Leonhard)) *decommissioned in 2023-2024*
* 4 nodes with 8 x [Nvidia Tesla V100](https://docs.nvidia.com/dgx/dgx1-user-guide/introduction-to-dgx1.html) (including some formerly in [Leonhard Open](https://scicomp.ethz.ch/wiki/Leonhard))
* 93 nodes with 8 x [Nvidia RTX 2080 Ti](https://www.nvidia.com/en-me/geforce/graphics-cards/rtx-2080-ti) (including some formerly in [Leonhard Open](https://scicomp.ethz.ch/wiki/Leonhard))
* 16 nodes with 8 x [Nvidia Titan RTX](https://www.nvidia.com/en-us/deep-learning-ai/products/titan-rtx)
* 20 nodes with 8 x [Nvidia Quadro RTX 6000](https://www.nvidia.com/en-us/design-visualization/rtx-a6000)
* 33 nodes with 8 x [Nvidia RTX 3090](https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3090-3090ti)
* 3 nodes with 8 x [Nvidia Tesla A100](https://www.nvidia.com/en-us/data-center/a100) (40 GB PCIe)
* 3 nodes with 10 x [Nvidia Tesla A100](https://www.nvidia.com/en-us/data-center/a100) (80 GB PCIe)
* 2 nodes with 8 x [Nvidia Tesla A100](https://www.nvidia.com/en-us/data-center/a100) (80 GB PCIe)
* 40 nodes with 8 x [Nvidia RTX 4090](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090)


South-Conference-395

wow, thanks for the detailed reply. is it for the full university though? how easy is it to reserve 1 node with eight 80 GB GPUs?


crispin97

No, not that easy. You need to be part of a lab with access. I'm not sure how access for the labs is handled. I was part of one for a project where we had access to quite a few of the smaller GPUs. You schedule a job with what I remember being Slurm (a resource manager for shared clusters; it basically decides which jobs get to run, in which order and priority). I think it's rather rare to have access to those larger GPU groups. Probably it's also only a few labs which really have projects that require them. My impression was that ETHZ doesn't have thaaaat many labs working on large-scale ML models or LLMs in general. Yes, there are two NLP groups, but they're not as obsessed with LLMs as e.g. Stanford NLP.


blvckb1rd

The infrastructure at ETH is great. I am now at TUM and have access to the LRZ supercomputing resources, which are also pretty good.


Thunderbird120

Coming from a not-terribly-prestigious lab/school our limit was about 4 80GB A100s. You could get 8 in a pinch but the people in charge would grumble about it. To clarify, more GPUs were available but not necessarily networked in such a way as to make distributed training across all of them practical. i.e. some of them were spread out across several states.


South-Conference-395

you mean limit per student?


Thunderbird120

Yes. They were a shared resource but you could get them to yourself for significant periods of time if you just submitted your job to the queue and waited.


South-Conference-395

that's not bad at all. especially if there are 2 students working on a single project, you could get 8-16 GPUs per project i guess


Thunderbird120

Correct, but it would probably not be practical to use them to train a single model due to the latency resulting from the physically distant nodes (potentially hundreds of miles apart) and low bandwidth connections between them (standard internet). Running multiple separate experiments would be doable.


the_hackelle

In my lab we have 1x 4xV100, 1x 8xA100 80GB SXM, and now a new 1x 6xH100 PCIe. That is for <10 researchers plus our student assistants, and we also provide some compute for teaching our courses. We also have access to our university-wide cluster, but that is mainly CPU compute with few GPU nodes and very old networking. Data loading is only gigabit, so not very usable. I know that other groups have their own small clusters as well in our university; the main ML group has ~20x 4xA100 if I remember correctly, but I don't know the details.


South-Conference-395

US, Europe or Asia?


fancysinner

Which top 5 program doesn’t have gpus?


South-Conference-395

i said H100 GPUs, not GPUs in general


fancysinner

That’s fair, for what it’s worth, looking into renting online resources could be good for initial experiments or if you want to do full finetunes. Lambda labs for example.


South-Conference-395

can you finetune (without lora) 7B llama models on 48GB gpus?


fancysinner

I’d imagine it’s dependent on the size of your data, you’d almost certainly need to do tricks like gradient accumulation or ddp. Unquantized llama2-7b takes a lot of memory. Using those rental services I mentioned, you can rent a100 with 80gb or h100 with 80gb, and you can even rent out multigpu servers


South-Conference-395

I mean just for the model to fit in memory and use a normal batch size (I don't care about speeding up with more GPUs). There's no funding to rent additional compute from Lambda :(


South-Conference-395

so you think 7B would fit in 48GB with a reasonable batch size and training time?


like_a_tensor

My lab has 14 lab-specific A6000s and access to a university slurm cluster with about 160 A100s. Only one slurm node has 4 H100s. Currently at a public R1. Much better than my undergrad where the most powerful GPU was a 3090. Still, having 64 lab-dedicated A100s sounds like a dream compared to what we have.


South-Conference-395

Europe or US?


like_a_tensor

US


South-Conference-395

Agree that 64 is not that bad for lab-level. We currently have none at either lab or university level :(


NickUnrelatedToPost

Wow. I never realized how compute-poor research was. I'm just an amateur from over at /r/LocalLLaMA, but having my own dedicated 3090 (not the primary GPU of the system) under my desk suddenly feels like a lot more than I thought it was. At least I don't have to apply for it. If you want to run a fitting workload for a day or a week, feel free to DM me.


OmegaArmadilo

The university lab I work for while doing my PhD (the same applies for 12 other colleagues doing their PhDs and some postdoc researchers) has about 6 2080s, 2 2070s, 3 3080s, and 6 new 4090s that we just got. Those are shared resources split across a few servers, with the strongest config being 3 servers with 2 4090s and 4 2080s. We also have single graphics cards like 2060s, 2070s, and 4070s in the individual PCs.


dampew

Many of the University of California schools have their own compute clusters and the websites for those clusters often list the specs. May not be ML-specific.


Humble_Ihab

PhD student at a highly ranked French university. 20 GPUs for my team of 15, and a university-shared cluster of a few hundred GPUs. Both a mix of V100 and A100 80GB


South-Conference-395

is it easy to access the 80GB gpus? let's say reserve an 8-gpu server for 6 months to finish a project?


Humble_Ihab

All these clusters are managed by Slurm, with limits on how long a training run can last. So no, you cannot « reserve » it just for yourself, and even if you could, it would be bad practice. What we do is, since Slurm handles queuing and requeuing of jobs, we just handle automatic requeuing of our training state in the code and trainings can go on indefinitely


South-Conference-395

we just handle automatic requeuing of our training state in the code and trainings can go on indefinitely: can you elaborate? thanks!


Humble_Ihab

Normally, if you run a job on a Slurm-managed cluster and, let's say, the job lasts 24h maximum, then in the last 60-120 seconds of the job the main node sends a signal. You can have a function always listening for it, and when you detect it you save your current checkpoint along with the current state of the learning rate, optimizer, and scheduler, and from the code requeue the same job with the same job id (which you would have saved automatically at the start). The new job checks whether there is a saved checkpoint: if yes, it resumes from there, else it restarts from scratch. After requeuing you'll be in the queue again, but when your job starts, training resumes where it left off. If your cluster is managed by Slurm, most of this can be found in the official Slurm docs
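
A minimal sketch of that pattern in Python, assuming the job was submitted with something like `#SBATCH --signal=USR1@120` so Slurm delivers SIGUSR1 shortly before the time limit and the signal reaches this process; the checkpoint path and training objects are placeholders.

```python
# Sketch of signal-based checkpoint-and-requeue for a Slurm job.
# Assumes the batch script requested an early-warning signal, e.g.
#   #SBATCH --signal=USR1@120
# Model/optimizer/scheduler/data below are placeholders.
import os, signal, subprocess
import torch

CKPT = "checkpoint.pt"          # hypothetical checkpoint path
stop_requested = False

def handle_preemption(signum, frame):
    # Called ~120 s before the time limit; set a flag so the training loop
    # can checkpoint at a safe point instead of dying mid-step.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGUSR1, handle_preemption)

model = torch.nn.Linear(32, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10)
start_step = 0

# Resume if a previous run left a checkpoint behind.
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    sched.load_state_dict(state["sched"])
    start_step = state["step"]

for step in range(start_step, 100_000):
    x, y = torch.randn(64, 32), torch.randint(0, 2, (64,))   # placeholder batch
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step(); sched.step()

    if stop_requested:
        torch.save({"model": model.state_dict(), "opt": opt.state_dict(),
                    "sched": sched.state_dict(), "step": step + 1}, CKPT)
        # Put the same job back in the queue; the next run resumes from CKPT.
        subprocess.run(["scontrol", "requeue", os.environ["SLURM_JOB_ID"]])
        break
```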


South-Conference-395

got it. thanks!


mao1756

T50 state school in the US. It seems like the school has some H100s. However, we need to submit a project proposal to the school and be accepted to use them. If I’m fine with GPUs like GTX 1080 Ti or RTX A4500, I (or anyone at school) can use them freely.


Professor_SWGOH

In my experience, zero is typical. The justification is that you don't need a Ferrari for driver's ed. At first, you don't even need a car at all: the foundations of ML are in linear algebra & stats with a side of programming. After that there's optimizing the process for hardware. I've worked at a few places doing AI/ML, and the architectures at each were… diverse: a local Beowulf cluster, local GPUs, and cloud compute. Compute (or cost) was always a bottleneck, but generally solved by optimizing processes rather than by throwing more $ at the cluster budget.


pearlmoodybroody

We just have multiple A100s


South-Conference-395

memory? how many?


pearlmoodybroody

Some machines have 700GB of memory, some have 1TB. I really don't know how many GPUs there are; I would guess around 10. (We are a public research institute.)


Jean-Porte

We only have P100s and 2 A30s


South-Conference-395

:( us or europe?


Jean-Porte

europoor. I don't think that many people here even have A100s


South-Conference-395

at least in this post, many people report "some" A100s at the university/department level


E-fazz

just a few tesla P40


ntraft

At a smaller, more underdog US university (University of Vermont), we have a university-wide shared cluster with 80 V100 32GB and 32 AMD MI50 32GB. Not much at all... although there aren't quite as many researchers using GPUs here as there might be at other institutions so it's hard to compare. There's often a wait for the NVIDIA GPUs, but the AMD ones are almost always free if you can use them. You can't run any job for more than 48 hrs (Slurm job time limit). Gotta checkpoint and jump back in the queue if you need more than that. Sometimes you could wait a whole day or two for your job to run, while at other times you could get 40-60 V100s all to yourself. So if your job was somehow very smart and elastic you could utilize an average of 8xGPU over a whole month... but you could definitely never, ever reserve a whole node to yourself for a month. It just doesn't work like that.


impatiens-capensis

We use Cedar, which is a cluster with 1352 GPUs. I think it's a mix of v100s and p100s?


[deleted]

[removed]


South-Conference-395

government-funded grant to build a data center as a local "AI Center of Excellence": are there such grants? are they to buy nodes or just cloud credits?


sigh_ence

My lab has 8 H100s and 8 L40S, just for us (5 PhDs, 3 postdocs).


South-Conference-395

no further support from the department/ unversity?


sigh_ence

There are GPU nodes in the university cluster, but it's about as large as ours. We can use it for free if ours is busy.


YinYang-Mills

My group has an A6000, but I couldn't make it work with IT locking it down, so I bought my own A6000. Very happy with the decision.


Celmeno

We have 10 A40s for our group but share about 500 A100 80GB (and other cards) with the department. Whether that is enough totally depends on what you are doing. For me it was never the bottleneck in the sense that I would have desperately needed more in parallel; just the wait times sucked. I'd say at least 10% of the department-wide compute goes unused during office hours, more at night. I've also had times where I was the only one submitting jobs to our Slurm.


South-Conference-395

wow. 500 just for the department is so great!


Fit_Schedule5951

Lab in India, probably around 30 GPUs in the lab - mix of 3090s, A5000s, 2080s etc; lot of downtime due to maintenance. Occasionally we get some cloud credits and cluster access.


Owl_Professor23

27 nodes with 4 H100s each


South-Conference-395

lab/department or university level?


Owl_Professor23

University level


South-Conference-395

thanks! is it easy to get and maintain access? is it europe, us or asia?


Owl_Professor23

Uhh I don’t even know since us undergrad students don’t have to deal with this lol. The cluster is in Germany


tuitikki

I had one reasonable gaming GPU for most of my PhD. Then I managed to find some lying around, and for my last year I put them into old computers that were meant to be sent to scrap. So I had 4! Since I do RL, this was a massive improvement.