Hunting the ghost in the machine

OR: How hard can it be to detect a failing SSD?

TL;DR
I had an SSD that was causing trouble with my NAS, freezing the system out of nowhere. No logs, no errors, nothing.

Turns out it was apparently a known issue in the firmware (Release 3B2QGXA7) of the Samsung 980 PRO SSD I used for the OS.

The Story

If you monitor all your devices closely, keep the metrics of all your drives in view with monitoring solutions like CheckMK and SMART, and use more or less modern file systems that handle SSDs with ease, how difficult can it be to detect a failing SSD?

The answer is: very! And that’s thanks to several factors that were not all entirely in my hands.

But let’s start at the beginning.

Service Down

The problems started at the end of October last year, with my monitoring starting to panic because the NAS, and with it my primary media server, was unreachable. The system is a custom-built server running Ubuntu, so there was no well-known manufacturer I could pin the issue on.

The only output I was getting was an error from the AMD GPU: ffmpeg was having an issue addressing something. At first I thought that was weird, took a photo of the error for documentation, and proceeded to reboot the system. Here comes the first weird issue: no input was possible anymore, nothing.

OK, hard reboot. The system comes up again normally, and I start checking logs. Here it also started to get weird: there were no entries indicating a hardware error, nor any message that could explain such an outage.

I noted the issue in my GitLab for further investigation and to track all information.

It gets worse…

… and fast. The second outage took about two weeks to happen, same pattern. Then the intervals quickly shortened to less than a single day. There were always ffmpeg errors on the console regarding the AMD GPU, so I re-installed the GPU drivers, but to no avail: the outages continued. At least the ffmpeg errors were gone now. (Yay!)

Starting to suspect a hardware failure, I began moving payload and logs from the system SSD to the ZFS pool, which runs seven HDDs and an SSD as a cache drive. Suddenly the system stayed alive for a week, only to go down again after I moved another bigger directory to the ZFS pool. The timing was impeccable: move the big folder (around 20 GB), and five minutes later the system is dead in the water again…

(Panic sets in…) THIS IS A FAILING DRIVE! (… Confusion arises …) But which one?!

Check the metrics and self-tests

Thankfully, I had installed and configured the monitoring so that all device metrics are retrieved and stored every minute. This way I can see the remaining lifetime of the two SSDs in this server. Although I was pretty certain it was not the cache drive (a Crucial 500 GB M.2, sadly no longer made), as the behavior would have been different, I checked everything again.

This led me to check all devices with smartctl:

smartctl -a <DEVICE>
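
Besides reading the attributes, you can also trigger the drive’s built-in self-tests and read back the results. A quick sketch (note that self-tests on NVMe drives need a reasonably recent smartmontools and a drive that supports them; run times vary):

smartctl -t short <DEVICE>      # start a short self-test in the background
smartctl -l selftest <DEVICE>   # read the results once it has finished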

Only to find that everything was alright. All drives were OK, and I knew SMART was working, thanks to a failed HDD at an earlier point where the error was flagged correctly. ZFS was also not reporting any problems with its drives, leaving the system drive, a Samsung 980 PRO, as the only culprit.
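
For reference, a quick way to run the health check across all drives in one go, plus the ZFS health summary. This is just a sketch; the device globs /dev/sd? and /dev/nvme?n1 are assumptions, so adapt them to your system:

for dev in /dev/sd? /dev/nvme?n1; do
    sudo smartctl -H "$dev"     # overall SMART health verdict per drive
done
sudo zpool status -x            # prints "all pools are healthy" if ZFS is fine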

CheckMK and SMART

Just in case you have a similar setup: if you want to monitor all drives with SMART and the CheckMK SMART plugin, you will have to adjust the check command in the plugin you place on the server, otherwise not all drives are monitored (see the sketch below).
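
I won’t reproduce the whole plugin here, but the gist is to replace the automatic device discovery with an explicit device list. A minimal sketch, where the plugin path, section name, and device names are all assumptions from my setup; match them to what your CheckMK version actually expects:

#!/bin/sh
# Assumed location: /usr/lib/check_mk_agent/plugins/smart
echo "<<<smart>>>"
# Hard-code the drives instead of relying on auto-discovery,
# otherwise some of them are silently skipped:
for dev in /dev/sda /dev/sdb /dev/nvme0n1; do
    smartctl -A "$dev"
done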

Why, how, who?

I started searching for similar issues with the Samsung drives and found something immediately. The firmware installed on the drive (Release 3B2QGXA7) was known to cause the drive to enter read-only mode without further explanation. This would also explain why I did not have any logs: they could not be written anymore.
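
You can confirm the firmware revision currently on the drive with smartctl (assuming the system drive shows up as /dev/nvme0n1):

sudo smartctl -i /dev/nvme0n1 | grep -i firmware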

Alternatively, you can use fwupd to check the currently installed versions:

fwupdmgr get-devices

This should give you a comprehensive list of all devices in your system. You may have to install fwupd first, as it might not come pre-installed.
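
On Debian/Ubuntu the package is simply called fwupd. Once installed, you can also ask it whether newer firmware has been published, though whether your drive is covered depends on the vendor shipping updates via the LVFS:

sudo apt install fwupd     # if it is not already present
fwupdmgr refresh           # fetch the latest firmware metadata
fwupdmgr get-updates       # list devices with pending firmware updates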

Next I tried to figure out how to update the drive’s firmware, as a newer firmware is available that should fix the problem. And guess what: it’s not supported on Linux.

The solution, or so I hope

Since I don’t like defective drives in my servers, I elected to replace it with one from a manufacturer I trust. The new SSD is a Kingston NV3 (500 GB, M.2 NVMe), which I already use elsewhere. Instead of copying the existing OS from the old drive to the new one, I (or better, we, as I got some help) decided to install a fresh OS and re-install the missing services. All configurations were on the ZFS pool anyway, as I had started moving them earlier.

Since the change, the system has been running stable and at full load again, making me hopeful this was the definitive solution.
