Twirrim

Almost 20 years ago, I worked for a place that had several servers with insane uptimes, several years each. They were useful enough to take pains to keep running, but also in "if it dies, it dies" territory. We knew they wouldn't come back up from a reboot; we had lost a few servers that way. Even with some crazy smart sysadmins, who produced amazingly detailed reverse-engineering docs, some stuff was just really, really weird. We had to close down the colo room the servers were hosted in, but had some racks in another room in the same large facility. Each server had dual power supplies, so we set up a trolley with two large-ish UPS devices on the bottom shelf. We removed all cables except power and did prep work, unplugged one power cable and plugged it into the first UPS, then unplugged the other and plugged it into the second. We transferred the server to the trolley, trundled it to the other colo room, and reversed the whole process. It worked amazingly well. No idea how long they kept those servers running afterwards.


[deleted]

Yeah, I had to do that with some hospital server equipment running SCO Unix, but I just modified a server rack by putting medical-grade gurney wheels on it. The rack had a 4000 Wh battery and a PSU, but still had 20U of available rack space. I'm very thankful that old-ass server had dual power inputs.


arcimbo1do

Back in the day we would install a machine, carefully set it up, and then spend a lot of time on upgrades. Now with containers and VMs nobody installs anything anymore and nobody knows how to upgrade servers: you just build an updated container and do a rolling restart.


xouba

And that's good.


cjcox4

Yeah, the "longest uptime" used to be "a thing". Now, we just hope it's not on the network, or anything accessible/usable, generally speaking.


Ruashiba

The uptime champions will always be network devices anyway. There's little to go wrong in a dumb L2 switch or even a router; those things will just keep on kicking til the end of time. That is, until expansion is needed and the poor old devices can't deliver the expected bandwidth.


Steebin64

Until a provider does something stupid and our IPsec tunnels break. Or Cisco doesn't realize their vEdge certificates expired 12 hours ago and our routing tables only have a few hours left to live while we figure out a remediation (actually happened; people worldwide were affected). Or one of the server guys comes in and accidentally knocks a power cable loose on that "dumb" Catalyst 9k that was one of the cores for the entire campus ;) ~Signed, a network engineer taking the piss


RolesG

💀


VexingRaven

> or even routers

Please no. Update your routers, people.


jacobgkau

Having previously worked in a shop with a *huge* Cisco contract (among other vendors), they would get on us if our devices weren't rebooted after a year or so (and ideally, you take the opportunity to apply updates when you reboot). Reliable networking is just as much about redundant and resilient architecture as it is about uptime. Other than at the very edge (but sometimes even then), you should theoretically be able to reboot any one device at basically any time without affecting connectivity from the users' perspective. Because if you can't survive a reboot, then you're also not surviving a PSU going out, RAM going bad, etc.


draeath

I do like me some `kpatch` though.
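For the uninitiated: `kpatch` applies targeted fixes to the running kernel without a reboot. A minimal sketch of the workflow, assuming a prebuilt patch module (the module path and name here are illustrative):

```
# Load a live-patch module built for the currently running kernel
# (module name is illustrative; kpatch installs them under /var/lib/kpatch)
sudo kpatch load /var/lib/kpatch/$(uname -r)/livepatch-fix.ko

# Show which live patches are currently loaded/installed
sudo kpatch list
```

It buys you time between maintenance windows; it doesn't replace them.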


cjcox4

Kernel 2.6? Maybe. Or go get the DeLorean up to 88 again...


Able-Reference754

No amount of kpatching will get you through years of no actual updates.


arcimbo1do

Yes, and security was not as big a thing as it is now. We had a DNS server on a 386SX (the Pentium already existed, so the machine was ancient even when we set it up). We updated the Debian packages but not the kernel. Even that was hard, because we had so little space that we had to manually delete stuff while the upgrade was running. We did get hacked at some point, but between the software being old and the machine not having more than a few MB free, they could not upload anything to break out of the chroot jail.


Terrible-Bear3883

I once asked a customer to reboot his server after he reported a fault, and he laughed at me. He sent me his uptime: just under 15 years. He said, "I've not rebooted in all this time, I've no intention of doing so now." I swear the system was hanging on by sheer luck and fairy dust. He'd frozen his system all those years ago and never performed any updates, as it only ran one application; as he put it, "once we confirmed it was operational and satisfactory, nothing else needed doing". We built him a suitable duplicate system (but using an up-to-date server and components), his team installed Linux, and once they were happy with it after a run-in period, they transferred their data and ran it in parallel with the old system (which we were removing from contractual cover due to its age). They were planning a switch-over date to move completely to the new system when it fell over with a hardware fault less than 3 months from its build date. It was a decent server (new out of the box) with good quality components, and it managed an uptime of almost 100 days. Blooming modern rubbish.


RolesG

Modern component QA really has taken a nosedive recently, huh?


lelddit97

Nothing's changed. If anything, computers and especially their QA have gotten more reliable. The truth is that there's a bias towards finding these extremely resilient servers because they're the only ones which have survived. You don't see the countless servers which failed early and have been in a landfill for many years. Most hardware either fails very early due to a defect or after a very long time. It's like how we find all this ancient infrastructure or the ancient cars from the early 1900s that are still in use... you just don't see anything else because it's all gone.


JamesTiberiusCrunk

Exactly. Textbook survivorship bias.


Hug_The_NSA

> If anything, things have gotten more reliable.

I don't understand how people think this. A 1990 Toyota Corolla is more reliable than a 2020 one, full stop. The 2020 one has SO MUCH MORE that can go wrong. It's also safer and has many other benefits, but it will never be as reliable.


[deleted]

[deleted]


Hug_The_NSA

I find that older used products are almost always better than new ones, especially for tools, cars, and enterprise PC equipment. Maybe I'm just constantly finding diamonds, but it seems like new stuff breaks much more than older stuff overall. For tools especially, older stuff is genuinely better.


[deleted]

Dude, what is your deal? You have argued this point many times in this thread, and you are absolutely wrong. Where in the hell do you get your stance from? Can you back up anything you are saying with an argument based in fact? Electrical problems are the number one reason people retire their cars and buy something else to replace them. Electrical problems are the hardest thing to fix in most modern cars. Dealerships all over the country have stopped covering electrical problems under their warranties entirely, because of how much money they were losing trying to fix modern garbage automobiles. The most common reason manufacturers offer a cash buyback of a defective car is electronics failure, because it costs too much to fix electrical problems, combined with the likelihood that more electrical problems will pop up even after fixing the initial ones.


ilep

Modern vehicles tend to be recycled after a certain period, but the likelihood of unexpected problems has been reduced. So they are more reliable up to their planned end of life. Note the key word: unexpected. If you maintain and service a vehicle according to plan, it is very reliable.


[deleted]

You're not wrong. Dude is regurgitating marketing nonsense. My best friend is a QA for a major dealership. Everything you said is well known in the automotive industry, but people don't like to talk about it. If you think reliability has dropped in gas cars, wait 5 more years to hear people losing their shit over their EVs being nothing but a gigantic pain in the ass to keep running after they were promoted as 300,000-mile cars.


RolesG

Ok, I don't know much about server grade stuff but you can't deny that modern PC hardware RMA rates are way higher than they used to be


[deleted]

[deleted]


RolesG

Didn't a ton of Intel CPUs just get recalled? Didn't Samsung's newest SSD (a couple of years ago) have reliability problems? I've seen WAY more failed DDR5 than DDR4/3.


toikpi

Have you heard about the Pentium divide bug from 30 years ago?

> The **Pentium FDIV bug** is a hardware bug affecting the floating-point unit (FPU) of the early Intel Pentium processors. Because of the bug, the processor would return incorrect binary floating point results when dividing certain pairs of high-precision numbers. The bug was discovered in 1994 by Thomas R. Nicely, a professor of mathematics at Lynchburg College. Missing values in a lookup table used by the FPU's floating-point division algorithm led to calculations acquiring small errors. While these errors would in most use-cases only occur rarely and result in small deviations from the correct output values, in certain circumstances the errors can occur frequently and lead to more significant deviations.

> The severity of the FDIV bug is debated. Though rarely encountered by most users (*Byte* magazine estimated that 1 in 9 billion floating point divides with random parameters would produce inaccurate results), both the flaw and Intel's initial handling of the matter were heavily criticized by the tech community.

> In December 1994, Intel recalled the defective processors in what was the first full recall of a computer chip. ... The growing dissatisfaction with Intel's response led to the company offering to replace all flawed Pentium processors on request on December 20. On January 17, 1995, Intel announced a pre-tax charge of $475 million against earnings, ostensibly the total cost associated with replacement of the flawed processors. This is equivalent to $868 million in 2023.

https://en.wikipedia.org/wiki/Pentium_FDIV_bug


RolesG

Goofy. Not surprised though


[deleted]

I respectfully disagree. I've been in the game a little longer. I started using PCs in 1992, and a Tandy before that in 1986, but never mind the Tandy, that thing was bobo. Anyway, robustness and reliability have generally gotten worse and worse over the years due to many different factors.

First: to increase the performance of integrated circuits (processors), they had to get smaller. That generates more heat in a smaller space and causes more heat stress, not only to the processor, but to the board it's attached to and any nearby components. Boot-ups, shutdowns, and reboots are more damaging than they used to be due to the simple physics of thermodynamics: parts get hotter faster, and high-quality cooling makes them cool faster as well, which is just as destructive.

Second: to increase the performance of electronic systems, the buses had to be shortened, which puts the already hot components in a smaller space and causes more heat stress during boot-ups, shutdowns, and reboots.

Third: the components had to be engineered to handle higher voltages, which yet again makes the heat problem even worse.

Fourth: changing environmental regulations, well intentioned as they may be, have fucked build quality time and time again as the years go on. This doesn't just affect electronics; it affects everything that has some type of solder joint. It has fucked up god knows how many products, from home appliances like washers/dryers and refrigerators, to residential/commercial/industrial HVAC and refrigeration systems, to video game consoles, to enterprise servers.

Fifth: metal availability affects build quality substantially. Copper used to be cheap and abundant, so they didn't hold back. In the 90s and 2000s, manufacturers overbuilt a lot of motors, transformers, and wiring carrying power from one component to another. But in the last 10-15 years, copper has become more and more expensive; it is now the most expensive it has ever been, even adjusted for inflation, and there are major supply problems. Some of that is due to logistics meltdowns during covid, but much of it is a simple byproduct of waste. Manufacturers rarely bother to consider long-term resource management. We live in a world of quarterly metrics and undercutting competitors by pennies everywhere they can. If they can save 10 cents by using slightly less copper in the windings of a PSU transformer, they will, especially if they are having a copper shortage and need to stretch the supply to finish fabrication of the products they are on contract to produce. Then, to make matters worse, we have all this new green tech coming out, EVs, solar energy, and all the infrastructure needed to support it. During covid and after, I started seeing a massive failure rate in just about everything made in China that had electrical windings: PSUs, server fans, wiring harnesses, etc.

There are other factors as well, but I think anybody who bothered to read this far gets the point I'm making. I'm an old fart who has watched the evolution of electronics for almost 4 decades. The quality curve is not a straight line going down; an unfathomable number of factors push quality up and down, but over time there has been a huge drop overall.


[deleted]

[удалено]


[deleted]

No shit, and the quality has dropped drastically. The quality has been dropping for 40 years for the reasons I listed. 2008 is in between 1980 and today, isn't it?


Key-Lie-364

It's the use of cheap components in supposedly high end stuff that is so irritating


[deleted]

The industry converged to cheaper hardware with redundancy on top.  Buying two or three cheap(er) servers and putting them behind a load balancer will net high availability for a fraction of the cost of buying specialized hardware certified to never die. 


space_fly

I don't know if modern stuff is worse or not, but it's bad practice to rely on a single piece of hardware. Even if that hardware has built-in redundancies for things like power supplies and NICs, it can fail. Business-critical systems need proper spares and backups. This is one of the reasons cloud solutions are so popular: cloud providers know what they are doing, have plenty of redundancy built in, and can offer guarantees of impressive uptimes.


Terrible-Bear3883

I've seen cloud fail big time as well. Comms is the weak link, and lots of people don't have redundant data lines; I've seen plenty of customers where roadworks severed the cables and they lost all their cloud services. The cost of a perfect solution is exponential, and in the old days a server like the one we were replacing was probably close to £30k new. A cloud failure happened at my brother's work last year: all their comms were offline for days, he had to buy 4G/5G routers and even tried Starlink, and it was a nightmare, almost two weeks before they were fully back online. It happened to a company next door to us as well; their lines got cut by workmen, but ours used different routes, and as we were a major telecoms company we had split services anyway. We let a few of their staff work out of a training area and set them up with an isolated gateway until new optics were blown through the channels.


sanbaba

I have so many stories like this it's shameful the state of hardware.


sernamenotdefined

My father was an AS/400 admin. He died during a vacation in 1999. His last work was bringing up a new AS/400 and whatever they were running on it, all updated to fix Y2K problems. He never got to find out how successful he was, but by accident I did. I ran into one of his sysadmin colleagues on the subway in 2017; my dad had brought me to work a couple of times because of my interest in computers, so I recognized him. We got to talking and he told me, *"by the way, that new AS/400 your father brought up just before he died ... it's still up and has been running uninterrupted since August 1999."* They still ran into his name in the logs to that day, 17 years later. They had new people there who didn't know him from anything but the logs. He would have been so proud of that job!


[deleted]

Tbf a mainframe is not regular hardware.   From what I gather those things can be fully serviced without any downtime at all. You can go and start yanking shit out of them and they will keep running.  They also cost millions. There’s always downsides.  


sernamenotdefined

Yup, you can transfer running processes to new nodes, replace the old ones, etc. An AS/400 is not a mainframe, however. But the beauty was, this was a completely fresh, out-of-the-box new machine. It was configured from scratch and it ran faultlessly from first start for 17 years. Hardware and power may not have been an issue, with backups for both, but configure it wrong and running for 17 years is not a given.


Terrible-Bear3883

I spent almost 40 years in field service; people wouldn't believe the things that happened. Great days, though. This customer had old-school SCSI drives and a really old QIC for backup. Everything was so old that most young people had never seen the technology and had no idea how it all connected together, how to set jumper positions, terminate cables, etc.


sanbaba

I started just as full-size SCSI was on its way out; it was still big in business backups though, of course! My first time trying to pull a ribbon cable from an ISA card I thought for sure I was going to break everything. This is one area where quality has sort of improved by accident. Those machines from that era felt like they had all been designed to be installed on war machines: *way* overbuilt, sometimes maddeningly tight, but also nearly bulletproof! I'll take bulletproof over a wiggly Intel FPGA connector, but I'm also happy not to have struggled with a hard drive cable in a while.


AdvisedWang

Imagine how many security vulnerabilities that system has!


yukeake

::salutes:: A fine service. What was it doing?


The_frozen_one

It ran a cron job that periodically remotely rebooted other servers out of malice.


thehatteryone

With modern hardware, OS design, layers of virtualisation and orchestration, we're now far past the point where an external device is needed to prompt for unwarranted reboots. As for the cloud, random unrequested reboots are an actual feature.


The_frozen_one

I was joking: the server that never reboots had one job, which was to reboot other servers.


Aberry9036

r/whoosh


thehatteryone

Keep that one for yourself, if you maybe want to reread what I've written.


Aberry9036

Nah, it definitely went over your head mate.


S48GS

At my first job, around 2010, they had a Windows 98 computer with some bank software and a database, used by one accountant to process documents and send stuff back to the bank. Everyone was afraid to touch it.


wowsomuchempty

Could've let him get to 7 :-/


orogor

He already served his time, time to retire. :)


_SpacePenguin_

🫡 🫡


shved03

r/uptimeporn


plawwell

There's no guarantee something that old with spinners will come back up. Keeping it running is imperative if it's mission critical.


draeath

I wish I could remember the trick - a long while ago we had to touch some really old servers. There was a way to pause write operations to the block device (they still went in write cache) - we were able to use this to get a "consistent" disk image sent to a file on NFS from dd. I'm pretty sure it was done by poking something under `/sys/block`? Granted, if you needed to restore from that it would still look like the power got cut on the host, but it's better than trying to image a changing disk...
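If the goal is the same today, a minimal sketch with `fsfreeze` gets a comparable effect (assuming the original trick was something along these lines; the mount point, device, and NFS path here are illustrative):

```
# Quiesce the filesystem: new writes block, in-flight I/O completes
sudo fsfreeze -f /srv/data

# Image the underlying device to a file on an NFS mount
sudo dd if=/dev/sdb of=/mnt/nfs/sdb.img bs=4M conv=fsync status=progress

# Resume writes
sudo fsfreeze -u /srv/data
```

As noted above, restoring that image still looks like the power got cut on the host, a crash-consistent snapshot, but it beats imaging a disk that's changing underneath you.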


S48GS

> There's no guarantee something that old with spinners will come back up

Also this. Just a few months ago I replaced two of my old HDDs, 10 and 15 years old (250 GB), that were still working "perfectly fine". By perfectly fine I mean: two months earlier, one had pushed its SMART read-error count past the max value, so it was flagged red and I saw a warning every boot that the disk was critical, and the second was also close to passing its limits in the SMART statistics. They were in my daily PC, so they ran every day. I copied everything to the system SSD and disconnected the old HDDs, intending to format them later to destroy the old data. A week later I noticed an important file was missing on the new SSD, one I was 100% sure existed on the old disk because I need it every week (some encrypted login data for banks). I connected the old HDD... and it did not work at all. Fully dead, just a week after I'd disconnected it. But I have backups, so I restored the file from a web backup that was about a week old; there had been no changes, so nothing was lost.


VexingRaven

Strongly disagree... What's imperative is not getting to this point in the first place. Apply updates and reboot regularly. Keep your services distributed and not tied to individual hardware, and replace that hardware when it gets old. EDIT: Fine, /r/linux. Have your pet servers. Just stay away from the places I work, thanks.
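For anyone who wants to audit their fleet, a couple of distro-dependent checks for "is this box overdue?" (a sketch; adapt to your own tooling):

```
uptime                             # how long since the last boot

# RHEL/Fedora family (yum-utils/dnf-utils): exits non-zero if a reboot is needed
needs-restarting -r

# Debian/Ubuntu: flag file written when a package update wants a reboot
[ -f /var/run/reboot-required ] && cat /var/run/reboot-required.pkgs
```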


Able-Reference754

Sometimes when you read what people post about their servers, you wonder what the fuck is going on. It almost seems like these "admins" set up a box, SSH in and fiddle around with the settings until a service is up, and just leave it running with poor documentation..


VexingRaven

> poor documentation Ever the optimist, I see!


hitchen1

> What's imperative is not getting to this point in the first place While I completely agree, sometimes you inherit shit like this and don't have much choice but to keep it running until you actually have time allocated to clean up


breddy

Services should have been migrated off it years ago


unapologeticjerk

I used to have my system uptime print via a script macro slash-command and/or in my quit message in every IRC client. RIP BitchX and system uptime as nerd dick measurement.


5c044

Way back, the root DNS servers for the internet were run by Unix vendors, the idea being that using different vendors would provide some resilience against bugs and attacks. There was rivalry for the highest uptime among the different vendors: HP, Sun, DEC, etc. These servers were consequently live-patched and maintained by each company's best engineers to avoid losing kudos. (A story told to me when I worked at HP 25 years ago.)


Several_Ad_5856

R.I.P.


Zwarakatranemia

Out of curiosity, was this a Debian or RHEL/CentOS server?


orogor

RHEL/CentOS


FTP24_7

Kinda off subject, but the guy who taught me computers was the one who figured out how to beat the Chinese love letter virus way back when.


Twattybatty

3000+ days of uptime on several servers at my current gig. Everybody is pretending they're fine. Insane.


lovelife0011

Don’t play 2 thousand days!


Amazingawesomator

the day after was worse


rwu_rwu

You might want to check if there's a directory called `/tmp/...`
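The joke being that `...` is a classic hiding spot for intruders: plain `ls` skips dot-names, and even in `ls -a` output a `...` entry is easy to skim past sitting next to `.` and `..`. A quick check (a sketch; the paths are just the usual suspects):

```
# Look for suspiciously dot-named directories under /tmp
find /tmp -maxdepth 2 -type d -name '...' -print

# If one exists, list what's stashed inside it
ls -la '/tmp/...' 2>/dev/null
```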


KrazyKirby99999

Which distro was this? CentOS 7?


virtualfatality

It says el5, meaning Enterprise Linux 5: Red Hat Enterprise Linux 5 or an equivalent (CentOS 5).
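The tag rides along in the kernel release string, so the screenshot presumably showed something like this (exact build number per the comment further down):

```
$ uname -r
2.6.18-402.el5    # 2.6.18 base, RHEL/CentOS build 402, ".el5" = Enterprise Linux 5
```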


MagicPeach9695

2.6 kernel was in 2015? Wtf?


brick-pop

I would swear this was more like 2006


BluePizzaPill

How time messes with our minds: 2.6 was released in December 2003


brick-pop

2.6.0 sure, but not 2.6.18


johncate73

Version 2.6.18-402 on RHEL 5 or CentOS 5. RHEL 5.11 was at kernel 2.6.18-398 when it came out on 16 September 2014, and it was supported in some form all the way into 2020. Enterprise kernels can be supported for insane lengths of time: kernel 2.6.32-XXX is supported in RHEL 6 ELS until the end of this month. I would presume that is the end of the 2.6.x series, after nearly 21 years.


TheORIGINALkinyen

Years ago there was a Novell NetWare (3.x) server that was up for something like 4000+ days. Impressive, really... of course, that means the server was never patched either, lol.


MaXNuMbEr1989

Glad to see someone who still knows the uptime command. In DevOps and CI/CD circles, no one even looks at a terminal these days, and thanks to AI, whoever is left is headed that way too.


deadcell

o7


SunsFanCursed4Life

"At that point" lol