5

I have a Windows Server 2022 machine with two 2TB Samsung 990 Pro SSDs. One of them has a weird problem: it disappears from time to time. Every 2 months or so, the disk in question no longer exists: neither diskpart nor Get-PhysicalDisk (in PowerShell) lists it anymore. The only remedy at that point is a complete power-down and restart; a simple restart from within the OS is not sufficient.

At first I thought it was an issue with the motherboard, so I got in touch with the manufacturer and -surprise!- they told me to make sure it wasn't a problem with the disk. After some back and forth, I decided to explore a potential issue with the disks, simply to avoid the hassle of replacing the mobo and then still have the problem.

Examining the disks was not so easy, because this is a Server Core installation, so there is no GUI. I was still able to do some analysis, which revealed a shocker: Microsoft's diskspd showed abysmal performance for both disks. Both read and write throughput are just below 50MiB/s, which is far below the specs of the 990 Pro.

So I now have several questions:

  • Are the two problems (the disk disappearing from time to time and the low speed) linked?
  • Could the speed problem be caused by the motherboard (it is an ASRock X570S PG Riptide)?
  • Could it be that the SSDs are counterfeit? And how can I check this?
  • Any suggestions on further analyzing this?

Clarification:

  • Server logs: nothing shows up in event viewer
  • Age of the drives: they're a year old and haven't been used intensively
  • Smart readings: This is the output I got from Samsung DC Toolkit:

Disk Number: 1:c | Model Name: Samsung SSD 990 PRO with Heatsink 2TB | Firmware Version: 0B2QJXG7

Bytes Description Value
0 Critical Warning 0x00
2:1 Composite Temperature 0x0142
3 Available Spare 0x64
4 Available Spare Threshold 0x0A
5 Percentage Used 0x02
47:32 Data Units Read 0x000000000000000000000000011BD521
63:48 Data Units Written 0x000000000000000000000000010D94FB
79:64 Host Read Commands 0x0000000000000000000000000DD8604F
95:80 Host Write Commands 0x0000000000000000000000001282EACA
111:96 Controller Busy Time 0x00000000000000000000000000009963
127:112 Power Cycle 0x00000000000000000000000000000020
143:128 Power On Hours 0x00000000000000000000000000001F93
159:144 Unsafe Shutdowns 0x00000000000000000000000000000014
175:160 Media and Data Integrity Errors 0x00000000000000000000000000000000
191:176 Number of Error Information Log Entries 0x00000000000000000000000000000000
195:192 Warning Composite Temperature Time 0x00040880
199:196 Critical Composite Temperature Time 0x00000000
201:200 Temperature Sensor 1 0x0142
203:202 Temperature Sensor 2 0x0149
205:204 Temperature Sensor 3 0x0000
207:206 Temperature Sensor 4 0x0000
209:208 Temperature Sensor 5 0x0000
211:210 Temperature Sensor 6 0x0000
213:212 Temperature Sensor 7 0x0000
215:214 Temperature Sensor 8 0x0000
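
For readers who want the numbers in plain units, here is a minimal Python sketch that decodes the hex values from the table above. It assumes the fields follow the standard NVMe SMART / Health Information log layout: temperatures in kelvin, and data units counted in thousands of 512-byte blocks. The dictionary keys and the helper name are made up for this illustration and are not part of the toolkit output.

```python
# Raw hex strings copied from the Samsung DC Toolkit output above.
# The key names are illustrative only; they are not toolkit identifiers.
raw = {
    "composite_temperature": "0x0142",
    "temperature_sensor_2": "0x0149",
    "available_spare": "0x64",
    "available_spare_threshold": "0x0A",
    "percentage_used": "0x02",
    "data_units_read": "0x000000000000000000000000011BD521",
    "data_units_written": "0x000000000000000000000000010D94FB",
    "power_cycles": "0x00000000000000000000000000000020",
    "power_on_hours": "0x00000000000000000000000000001F93",
    "unsafe_shutdowns": "0x00000000000000000000000000000014",
    "media_errors": "0x00000000000000000000000000000000",
}

# int(text, 16) accepts the "0x" prefix when the base is 16.
values = {name: int(text, 16) for name, text in raw.items()}

# Assumption: standard NVMe units, one data unit = 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def kelvin_to_celsius(kelvin):
    """Convert a temperature reported in kelvin to degrees Celsius."""
    return round(kelvin - 273.15, 2)


read_tb = values["data_units_read"] * BYTES_PER_DATA_UNIT / 1e12
written_tb = values["data_units_written"] * BYTES_PER_DATA_UNIT / 1e12

print("Composite temperature:", kelvin_to_celsius(values["composite_temperature"]), "C")
print("Sensor 2 temperature: ", kelvin_to_celsius(values["temperature_sensor_2"]), "C")
print("Available spare:      ", values["available_spare"], "%")
print("Percentage used:      ", values["percentage_used"], "%")
print("Data read:            ", round(read_tb, 2), "TB")
print("Data written:         ", round(written_tb, 2), "TB")
print("Power-on hours:       ", values["power_on_hours"])
print("Power cycles:         ", values["power_cycles"])
print("Unsafe shutdowns:     ", values["unsafe_shutdowns"])
print("Media errors:         ", values["media_errors"])
```

Under those assumptions the drive reports roughly 49 °C, 100% available spare, 2% used, about 9.5 TB read and 9.0 TB written, 8083 power-on hours (about 337 days), and 32 power cycles of which 20 were unsafe shutdowns.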
gwyers
  • 73

5 Answers

3

Update to the latest firmware. If the drive keeps misbehaving (poor performance, disappearing, and/or dropping out of the system), just RMA it.

NISMO1968
  • 1,583
3

The sudden disappearance is a known issue with 990 Pros. I also have this issue with my server and did some research. I put my findings and lots of references in this reddit post: https://www.reddit.com/r/homelab/comments/1hqew6m/comment/m9hxhw2/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Basically, it boils down to buggy PCI-E energy management. It seems like the disk goes to sleep and never wakes up again. The solution posted by many is to disable PCI-E energy management. If you had a GUI, you could do that through Samsung Magician. Otherwise, do it through the BIOS or the OS. I've tried different BIOS settings, without success so far. I keep updating the reddit post above with my findings.


Last update (hopefully): even after disabling all kinds of power-saving features in the EFI, I was not able to find stable settings. I gave up and ordered a Crucial T700 from the board's QVL. The system has been stable for almost three months now, since I installed the new drive on 2025-03-10. Before, it would crash at least once a month.

2

I have several findings to report:

  • I made a stupid mistake in the diskspd command line. This explains the low readings for speed.
  • I was able to reproduce the "disk disappearing" issue and now have a trace from Event Viewer.
  • The firmware seems to be the culprit indeed.

What I did was remove the SSDs from the server and plug them into a machine with a regular Windows 10 installation. This gave me access to a GUI and allowed me to run Samsung Magician and some other disk benchmark tools. They all showed a sequential read speed of around 6500 MB/s and a slightly lower write speed. I spent some time understanding the readings I had gotten from the diskspd command line when the disks were plugged into the server. After fixing that, I got a similar reading on the server itself. With that settled, the remaining question is whether I should worry about the gap between the measured 6500 MB/s and the official Samsung figure of 7450 MB/s. For the moment, I've decided to put that in the marketing blurb category.

While I was in Samsung Magician, it prompted me to upgrade the firmware (from 0B2QJXG7 to 4B2QJXD7, as suggested by telcoM). That seemed like a good idea, but not wanting to risk data loss, I first started copying data from the drive to another location on that PC. All the files are Hyper-V VHDs, so fairly large. The copy started with a 200GB virtual disk and was interrupted after about a minute with the same behavior I had seen before: the disk no longer exists, diskpart doesn't see it, and the only remedy is a complete power-down and restart; a simple restart from within the OS is not sufficient.

Knowing exactly when this occurred, I went through Event Viewer in detail and could see the whole sequence of events. The reason I didn't see it earlier is that most of it is logged as warnings:

  • It starts with a warning with Event ID 129 from stornvme: "Reset to device, \Device\RaidPort2, was issued."
  • This is followed by a series of warnings with Event ID 51: "An error was detected on device \Device\Harddisk1\DR1 during a paging operation."
  • After several of these (retries I suppose), there is a warning (not an error!) from NTFS with Event ID 50: "{Delayed Write Failed} Windows was unable to save all the data for the file XYZ. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere."
  • This pattern repeats once or twice and ends in an error from stornvme with Event ID 11: "The driver detected a controller error on \Device\RaidPort2."
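
Because most of this sequence is logged at warning level, filtering the System log by event ID rather than by severity makes it easier to spot. The following is a rough Python sketch that builds a wevtutil query for the four IDs listed above. It assumes Python is available on the machine; the function names are made up for this illustration, and the sketch only builds and prints the command rather than running it.

```python
# Event IDs taken from the sequence described above.
RESET_EVENT_IDS = (129, 51, 50, 11)


def build_xpath(event_ids):
    """Return an XPath filter that matches any of the given event IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return f"*[System[({clause})]]"


def build_command(event_ids=RESET_EVENT_IDS, max_events=50):
    """Return a wevtutil argument list that queries the System log."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{build_xpath(event_ids)}",  # filter by event ID, any severity
        "/f:text",                       # human-readable output
        f"/c:{max_events}",              # limit the number of events returned
        "/rd:true",                      # newest events first
    ]


# Print the command so it can be reviewed before it is run.
print(" ".join(build_command()))
```

On the affected Windows machine, the list returned by build_command() could be passed to subprocess.run(..., capture_output=True, text=True) to capture the matching events. Note that this filter matches on ID only, so events with the same IDs from other sources will also show up.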

So I went ahead and upgraded the firmware. That all went smoothly and I tried to reproduce the error by copying some large files and it seems to have disappeared for now.

It all leaves me with a sour taste: I thought these were premium drives and I paid premium prices.

gwyers
  • 73
0

The current firmware version for Samsung SSD 990 PRO series seems to be 4B2QJXD7. And apparently firmware versions older than 1B2QJXD7 had a rather bad bug that will seriously hurt the SSD lifetime:

https://www.tomshardware.com/news/samsung-990-pro-health-dropping-fast

https://www.tomshardware.com/news/samsung-990-pro-firmware-update-released-ssd-health

https://www.youtube.com/shorts/D7XgEfxPGuo

https://www.reddit.com/r/hardware/comments/10jkwwh/samsung_990_pro_ssd_with_rapid_health_drops/

When the initial fix was provided in version 1B2QJXD7, it at least stopped the drive from getting worse, but it did not repair degradation that had already happened before the update. The newer firmware versions may have provided more refined fixes, but unfortunately Samsung has apparently not released many details.

As far as I understand, your firmware version 0B2QJXG7 would be the one that's affected by this bug, and it looks like your SSDs are indeed deeply degraded. You probably should update the firmware ASAP, and perhaps try and contact Samsung support for a possible RMA, as this seems to be a known issue.

telcoM
  • 4,876
0

In case someone else lands here, like me, after searching for the 990 Pro:

I read elsewhere that someone had this same issue of the drive disappearing. He solved it by using Samsung Magician to set the drive to performance mode, and had no further issues afterwards.

Dave M
  • 4,494
oliwek
  • 1