
We have a PowerEdge R7525 server with an NVIDIA A16 graphics card, running Debian 11, but we are getting about 50% lower GPU performance than on our other servers. I suspect the cause is the missing "Above 4G Decoding" option in the BIOS. According to NVIDIA, this server should handle up to 3 A16 GPUs. Can anyone advise me on a workaround, or anything else, to harness the full power of this GPU?

Thank you very much in advance

Aotor

2 Answers


(I work for Dell) - specifically, I do a lot of optimization.

I think you're tracking a bit off course; "Above 4G decoding" is a feature left over from when BIOS PCIe memory enumeration was limited to 32 bits, which is no longer the case and hasn't been for quite some time. The addressing is now natively 64-bit.
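
If you want to confirm where the firmware actually placed the GPU BARs, a quick check looks something like this (10de is NVIDIA's PCI vendor ID; any address above 0xffffffff means the BAR landed above 4 GB, i.e. in 64-bit space):

# List the memory BARs of every NVIDIA device (PCI vendor ID 10de).
# Addresses above 0xffffffff mean the BAR was placed above the 4 GB boundary.
sudo lspci -v -d 10de: | grep -i 'memory at'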

But we have about 50% lower gpu performance than other servers.

I'm not sure what you mean by this. I may be reading too much into this, but this statement makes me think this may be your first foray into optimization in which case, awesome! It's a complicated but fascinating world. GPU performance can be measured in myriad different ways so this statement on its own doesn't narrow down what the problem is.

With regards to why you're seeing poor performance, this is an enormously complex question on which people write entire books. Some common mistakes I see people make particularly on AMD-based servers:

  • Failing to account for PCIe lane / proc alignment. Make sure whatever processes you're running against the GPU are assigned to the proc that owns the GPU's PCIe lanes rather than the distant proc (see the sketch after this list)
  • Failing to set NUMA nodes per socket (NPS) appropriately for the workload (this is unique to AMD systems like the R7525)
  • Failing to account for bottlenecks elsewhere. For example: I've had people see poor GPU performance but in reality part of their software was storage IO bound.
  • Maybe this is obvious, but try setting the BIOS profile to Performance. Setting it to a power-saver profile can lead to downclocking when you don't want it
  • Poorly aligned memory transfers
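
For the first point, a rough sketch of how to check and enforce the alignment (the PCI address and NUMA node below are examples, and ./my_gpu_workload stands in for whatever you actually run):

# Show how each GPU connects to the CPUs and which NUMA node it is local to.
nvidia-smi topo -m

# The same information straight from sysfs (0000:5a:00.0 is an example address).
cat /sys/bus/pci/devices/0000:5a:00.0/numa_node

# Pin the workload's CPU threads and allocations to the GPU's local node
# (node 1 is an assumption - use whatever the commands above report).
numactl --cpunodebind=1 --membind=1 ./my_gpu_workload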

Optimization is extremely workload specific. If this is the first time you've gone through it, I would focus my time on really understanding exactly how the data flows and where it might be bottlenecking. Try to identify things that seem out of place. For example: if you think GPU performance is low, what is the GPU's utilization? Is it at 100%? If it is close to 100%, I start to lean towards software problems. If it's not at 100%, why not? Are you not feeding it data fast enough? Is the card underpowered? Is the server overheating? Etc.
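
A low-effort way to watch that while your workload runs is plain nvidia-smi, sampled once a second:

# Utilization, SM clock, power draw and temperature, refreshed every second.
nvidia-smi --query-gpu=index,utilization.gpu,clocks.sm,power.draw,temperature.gpu --format=csv -l 1

# dmon gives a similar rolling per-GPU view.
nvidia-smi dmon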

Grant Curell

Have you been able to resolve this?

TL;DR: enabling SR-IOV seems to be required

We have noticed a similar, probably the same, issue with three A16 cards: basically, only one physical card was working.

One of the things we noticed was that the system complained about a memory overlap in the dmesg logs; only one of the cards was actually working, and the other two were not able to map memory.

This can then be checked with lspci; below is example output from when it does not work:

sudo lspci | grep NVIDIA | cut -d ' ' -f1 | xargs -I@ bash -c 'echo @; sudo lspci -v -s @ | grep non-prefetchable'
5a:00.0
        Memory at bd000000 (32-bit, non-prefetchable) [size=16M]
5b:00.0
        Memory at bf000000 (32-bit, non-prefetchable) [size=16M]
5c:00.0
        Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
5d:00.0
        Memory at c3000000 (32-bit, non-prefetchable) [size=16M]
c6:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
c7:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
c8:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
c9:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
de:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
df:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
e0:00.0
        Memory at <ignored> (32-bit, non-prefetchable)
e1:00.0
        Memory at <ignored> (32-bit, non-prefetchable)

Once we enabled SR-IOV, all of the cards got their memory mapped.

We had already been enabling the Intel IOMMU (AMD presumably has an equivalent) in the boot options, since all our other cards required it. So if you haven't enabled it before, you will probably need it with the A16 cards too.
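
On Debian 11 that means adding the option to the kernel command line in GRUB, roughly like this (a sketch; keep whatever options you already have, and note that on AMD platforms the IOMMU driver is generally enabled by default, so iommu=pt passthrough mode may be all that is needed):

# /etc/default/grub -- append the IOMMU options to the existing command line.
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# Apply and reboot.
sudo update-grub
sudo reboot

# After the reboot, check that the IOMMU actually came up.
dmesg | grep -i -e DMAR -e IOMMU -e AMD-Vi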

Some of the error logs we collected:

pnp 00:01: disabling [mem 0xff000000-0xffffffff disabled] because it overlaps 0000:e1:00.0 BAR 8 [mem 0x00000000-0x7ffffffff 64bit pref]
pci 0000:e1:00.0: BAR 8: no space for [mem size 0x800000000 64bit pref]
vfio 0000:ca:00.0: hardware reports invalid configuration, MSIX PBA outside of specified BAR