Here's an example of why good working hardware suddenly stops working correctly without an obvious explanation

Ever had that experience where you didn’t change anything (that you knew of) and yet some piece of equipment started behaving unexpectedly? Well, here’s my story for today.

Over the last few years Microsoft has mucked about with Windows to fix issues, in particular networking issues related to SMB 1.0 and the EternalBlue issue ( https://en.wikipedia.org/wiki/EternalBlue ), and I have had some connectivity issues. At one point, in an attempt to solve these issues, I rushed out to buy a low cost batch of hardware to throw together a machine to be a network server. (Running Proxmox for those curious.) Although I was in a hurry, the machine went together pretty well, and worked for what I needed it for at the time.

Eventually the machine’s purpose became redundant, and I decommissioned it. It was working fine when I powered it off. I then carried it downstairs for temporary storage because I needed the room clear for temporary [human] guests. One thing led to another, and it was left out of use for longer than intended, probably about 4 months.

Today was the day to fetch the machine, do a BIOS upgrade to install the newer AGESA (it’s an AMD 2700X CPU), and then reformat it and set it up for its next life. I went down, retrieved it from under the pile of networking gear that had since accumulated on top of it, and carried it upstairs. I hooked it up, and powered it on… the lights came on, but nobody was home. I couldn’t get any video display at all.

What the heck could have happened to this poor machine?? I scratched my head for a bit while trying to guess what diagnostic step to take. After numerous power cycles, and rechecking and reseating all cables, still no luck. I resolved to open the beast up and see if I could find any loose cables or anything else obvious. I didn’t see anything obvious inside the machine. Time to check for the non-obvious…

After much poking and prodding I finally saw something I didn’t like. On the back of a standard computer case there is a cut out for what is known as the IO shield. This is a standardized rectangular shape and the IO shield usually comes with each motherboard because it is unique to the layout of the connectors of the motherboard. On this machine, I recall struggling a bit to fit the motherboard to the IO shield while being in a hurry when I assembled it. Tolerances are tight and it’s not unusual for there to be some difficulty doing this, so it didn’t register any alarm at the time.

So here’s what I think happened. There is a grounding tab on the IO shield that is meant to sit on top of the connector housing a pair of USB ports. I think that tab actually sat outside the case and nearly inside the USB connector. This would have been a consequence of my struggle to make it fit. Somehow, at the time, and for the weeks while I used the machine, it seemingly didn’t pose an issue. (I wasn’t using those USB ports for anything, so I wouldn’t have had the opportunity to notice this problem from trying to insert something into the port.) I suspect that when I decommissioned the machine and carried it downstairs, the motion caused the whole case to flex a little bit. (As I said, it was a cheap build, so it’s a cheap case and not as rigid as I might otherwise like.) The flexing must have worked the tab into the port so that it came in contact with the pins, causing a short. Of course, since I didn’t set the machine up for use after I took it downstairs, I didn’t detect the problem. It didn’t surface until I brought it back up today to work on it.

If this whole story has any moral, it’s simply that you can have what appears to be a perfectly working system suddenly go wrong for reasons that can be really difficult to detect. Loose connections and connectors, a little wiggle or flex, can sometimes expose a fault no one could have predicted. It can be easy to think “but I didn’t do anything” when the cause and effect are really hard to connect to each other.

The resolution to this particular PC is that once I got it booting, I upgraded the BIOS and it is functioning well. There don’t seem to be any lasting effects to the temporary problems it suffered.

Wow great troubleshooting! I can’t tell you how many times I have inadvertently damaged something in the excitement of assembly to get to the final product.

It is crazy how a sequence of seemingly harmless actions can take something from working to broken. Haha. Glad it is working for you now!

Great job! I am glad the fix was easy. I hate updating the BIOS…

Every time I update my BIOS, I have to reapply all my memory speeds and re-enable the TPM for BitLocker. Last time, BitLocker still complained. Luckily I had printed out a hard copy of the very long BitLocker recovery code or I would have been locked out of my system. So I approach BIOS upgrades with trepidation too.

Another important thing to check is the blue smoke level.
If you let the blue smoke out it will stop working every time, and you can’t put the blue smoke back in.
:grin::+1:

Many years ago I worked for a PC manufacturer, in a tech support role. We never updated the BIOS on our (own-brand) systems, because we’d got fed up with the engineers not being able to explain why an update had just bricked a key system. :roll_eyes: :-1:

Yes, I have never manually updated a BIOS on a computer before either.

Now, my HP laptop has a program that automates such things. So, I think that has been updated before - but it does it all on its own.

Great story. It sounds all too familiar from many, many situations I have dealt with. Most of the time it seems that moving a system slightly dislodges the RAM, and it almost always needs to be re-seated. I even had a PC years ago where the RAM had to be re-seated every time the power was turned off. Glad you found the cause and were able to correct it.

In the past, the BIOS didn’t do much, so there wasn’t much cause to update a known good/working system.

These days, with the complexity of CPUs and support chips (USB, Thunderbolt, PCI (NVMe), and even RAM compatibility with all those power and timing tweaks) you can get a better experience (and yes, I suppose, possibly a worse one) by updating the BIOS/UEFI.

One thing that frequently gets addressed is the CPU itself. The BIOS can load new microcode into the CPU so that it can address incompatibilities and even add new features. (Intel added new abilities to address Spectre and Meltdown, for example.) These days it is also possible for the OS to load the microcode (Windows and Linux can do this, I believe). I wouldn’t rely on the OS to do it, or to do it correctly, though… it’s basically CPU firmware, and should be handled by the motherboard firmware (the BIOS/UEFI) IMHO.
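
If you're curious which microcode revision your CPU is actually running after a BIOS update, here is a minimal sketch for Linux. It assumes an x86 system where `/proc/cpuinfo` has a `microcode` line for each logical CPU. The function name `microcode_revisions` is just something I made up for illustration, and it only reports the revision; it can't tell you whether the BIOS or the OS loaded it.

```python
import os
import re


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Return the distinct microcode revisions found in /proc/cpuinfo-style text."""
    # Assumption: x86 Linux, where each logical CPU has a line such as
    # "microcode<TAB>: 0x...". Other platforms may not report this field.
    return set(re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE))


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as f:
            revisions = microcode_revisions(f.read())
        # Normally every core reports the same revision, so this prints one value.
        print("microcode revision(s):", ", ".join(sorted(revisions)) or "not reported")
    else:
        print("no /proc/cpuinfo here (probably not Linux)")
```

Run it before and after the BIOS update. If the revision changes, the new firmware shipped newer microcode.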

Certainly with Ryzen 3, there were still things being addressed by the AGESA (the microcode and firmware for the support chips in AMD systems) months after the CPUs were released. It seems to have finally stabilized now, but I did notice improvements in RAM support over the various fixes over the first three months or so.
