I have two machines connected to each other through two ConnectX-7 adapters.
When I run the ib_write_bw test, the average bandwidth starts at 395 Gb/s, which is very good, but it quickly drops below 250 Gb/s.
The CPU frequency decreases with each line shown in the test log.
Maybe you could try setting a more aggressive CPU frequency governor on both servers to prevent that?
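To check which governor each core is currently using, something along these lines might help. It is only a rough sketch: it assumes the standard Linux cpufreq sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor), and the function names list_governors and count_non_performance are made up for this example. The optional directory argument exists only so the functions can be pointed at another tree.

```shell
#!/bin/sh
# Sketch: inspect CPU frequency governors through sysfs.
# Assumes the standard Linux cpufreq layout; function names are hypothetical.

# Print "<cpu> <governor>" for every CPU that exposes cpufreq.
# $1: optional sysfs CPU root (defaults to the usual location).
list_governors() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}   # strip the root prefix -> cpuN/cpufreq/...
        cpu=${cpu%%/*}      # keep only the cpuN component
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}

# Print how many CPUs are NOT using the "performance" governor.
count_non_performance() {
    list_governors "$1" | awk '$2 != "performance" { n++ } END { print n + 0 }'
}
```

Run list_governors on both servers before starting ib_write_bw. If count_non_performance prints anything other than 0, you could try switching with `cpupower frequency-set -g performance` as root (assuming cpupower is installed on your distribution) and then rerun the test to see whether the bandwidth still drops.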
You can use the mlnx_tune utility to generate a report that might help you find the problem (I guess it is installed with Mellanox OFED), for example:
mlnx_tune --report
To improve performance on Mellanox hardware more generally, you can have a look at the Mellanox Support ToolKit, which gives you a bunch of tools to understand and address performance issues.
This sounds similar to an issue I was having. I asked and answered it here, if you want to try what worked for me: https://serverfault.com/a/1143987/110252