
I am trying to connect 3 HP z840 workstations using:

Mellanox ConnectX-3 VPI 40 / 56GbE Dual-Port QSFP Adapter MCX354A-FCBT
Mellanox SX6005 12-port Non-blocking Unmanaged 56Gb/s

Description of machines to be connected:

oak-rd0-linux (main node where I will run things from and where opensm is running)
oak-rd1-linux
oak-rd2-linux

I have installed the latest firmware on the cards and the latest MLNX_OFED driver that supports my cards (MLNX_OFED_LINUX-4.9-4.1.7.0-ubuntu20.04-x86_64). I am running Ubuntu 20.04 (Linux kernel 5.4.0-26-generic, as required by the MLNX_OFED driver).

How I installed MLNX_OFED:

sudo touch /etc/apt/sources.list.d/mlnx_ofed.list
sudo nano /etc/apt/sources.list.d/mlnx_ofed.list
# the following line is the content added to mlnx_ofed.list, not a command:
deb file:/home/user/infiniband/MLNX_OFED_LINUX-4.9-4.1.7.0-ubuntu20.04-x86_64/DEBS/UPSTREAM_LIBS ./
wget -qO - http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | sudo apt-key add -
apt-key list
sudo apt-get update
sudo apt-get install mlnx-ofed-all

I also downloaded HPC-X (hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1).

I start opensm in daemon mode with:

/etc/init.d/opensmd start

I run sudo ibdiagnet and it yields a clean summary (note: I cannot run ibdiagnet without sudo):

Running: ibdiagnet -r
----------
Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH" env variable)

Plugin Name                           Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1  Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1    Succeeded  Plugin loaded

Discovery
-I- Discovering ... 4 nodes (1 Switches & 3 CA-s) discovered.
-I- Fabric Discover finished successfully

-I- Discovered 4 nodes (1 Switches & 3 CA-s).

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- VS Capability GMP finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- VS Capability SMP finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- VS ExtendedPortInfo finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Port Info Extended finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Switch Info retrieving finished successfully

-I- Duplicated GUIDs detection finished successfully

-I- Duplicated Node Description detection finished successfully


Lids Check
-I- Lids Check finished successfully

Links Check
-I- Links Check finished successfully

Subnet Manager
-I- SM Info retrieving finished successfully

-I- Subnet Manager Check finished successfully


Port Counters
-I- Retrieving PMClassPortInfo ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Retrieving PMPortSampleControl ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Ports counters retrieving finished successfully

-I- Going to sleep for 1 seconds until next counters sample
-I- Time left to sleep ... 1 seconds.

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Ports counters retrieving (second time) finished successfully

-I- Ports counters value Check finished successfully

-I- Ports counters Difference Check (during run) finished successfully


Nodes Information
-I- Devid: 4099(0x1003), PSID: MT_1090120019, Latest FW Version:2.42.5000
-I- Devid: 51000(0xc738), PSID: EMC1260110021, Latest FW Version:9.3.8000
-I- FW Check finished successfully

Speed / Width checks
-I- Link Speed Check (Compare to supported link speed)
-I- Links Speed Check finished successfully

-I- Link Width Check (Compare to supported link width)
-I- Links Width Check finished successfully


Alias GUIDs
-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Alias GUIDs retrieving finished successfully

-I- Alias GUIDs finished successfully

Virtualization
-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Virtualization finished successfully

-I- Virtual ports retrieving finished successfully

-I- Virtual ports retrieving finished successfully

Partition Keys
-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Partition Keys retrieving finished successfully

-I- Partition Keys finished successfully

Temperature Sensing
-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Temperature Sensing finished successfully


Routing

-I- EXT switch info retrieving finished successfully

-I- PLFT is enabled on 0 switches.
-I- PLFT data retrieving finished successfully

-I- Adaptive Routing is enabled on 0 switches.
-I- AR data retrieving finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Unicast FDBS Info retrieving finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Multicast FDBS Info retrieving finished successfully

-I- Retrieving ... 4/4 nodes (1/1 Switches & 3/3 CA-s) retrieved.
-I- Dump SLVL Table finished successfully

-I- Load SLVL file.

Summary
-I- Stage                  Warnings  Errors  Comment
-I- Discovery              0         0
-I- Lids Check             0         0
-I- Links Check            0         0
-I- Subnet Manager         0         0
-I- Port Counters          0         0
-I- Nodes Information      0         0
-I- Speed / Width checks   0         0
-I- Alias GUIDs            0         0
-I- Virtualization         0         0
-I- Partition Keys         0         0
-I- Temperature Sensing    0         0
-I- Routing                0         0

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log

-I- ibdiagnet database file : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv
-I- LST file : /var/tmp/ibdiagnet2/ibdiagnet2.lst
-I- Network dump file : /var/tmp/ibdiagnet2/ibdiagnet2.net_dump
-I- Subnet Manager file : /var/tmp/ibdiagnet2/ibdiagnet2.sm
-I- Ports Counters file : /var/tmp/ibdiagnet2/ibdiagnet2.pm
-I- Nodes Information file : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info
-I- Alias guids file : /var/tmp/ibdiagnet2/ibdiagnet2.aguid
-I- VPorts file : /var/tmp/ibdiagnet2/ibdiagnet2.vports
-I- VPorts Pkey file : /var/tmp/ibdiagnet2/ibdiagnet2.vports_pkey
-I- Partition keys file : /var/tmp/ibdiagnet2/ibdiagnet2.pkey
-I- VL2VL file : /var/tmp/ibdiagnet2/ibdiagnet2.vl2vl
-I- PLFT file : /var/tmp/ibdiagnet2/ibdiagnet2.plft
-I- AR file : /var/tmp/ibdiagnet2/ibdiagnet2.ar
-I- Full AR file : /var/tmp/ibdiagnet2/ibdiagnet2.far
-I- Unicast FDBS file : /var/tmp/ibdiagnet2/ibdiagnet2.fdbs
-I- Multicast FDBS file : /var/tmp/ibdiagnet2/ibdiagnet2.mcfdbs
-I- SLVL Table file : /var/tmp/ibdiagnet2/ibdiagnet2.slvl

ibping also seems to run fine, although I am not sure whether these are good performance values.

ibstat | egrep "Port|Base"

(base) baird@oak-rd0-linux:~$ ibstat | egrep "Port|Base"
Port 1:
Base lid: 0
Port GUID: 0x0010e00001885689
Port 2:
Base lid: 1
Port GUID: 0x0010e0000188568a

On the server (oak-rd0-linux) I run: ibping -S -P 2 -d (I know that port 2 is the active one)
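The port choice above can also be read out of the ibstat output mechanically. Here is a minimal sketch, assuming the `Port N:` / `Base lid:` line layout shown above; `ports_with_lid` is a hypothetical helper name, and a non-zero base LID is only a heuristic for "the subnet manager has assigned this port a LID" (the `State:` field in the full ibstat output is the more direct indicator).

```shell
#!/bin/sh
# ports_with_lid: hypothetical helper, reads ibstat-style text on stdin and
# prints "<port> <lid>" for every port whose Base lid is non-zero.
ports_with_lid() {
    awk '
        # Remember the most recent "Port N:" header ("Port GUID:" does not match).
        /^[[:space:]]*Port [0-9]+:/ { port = $2; sub(":", "", port) }
        # Report the port when its base LID is non-zero.
        /^[[:space:]]*Base lid:/    { if ($3 + 0 != 0) print port, $3 }
    '
}

# Example (not run here, requires the InfiniBand tools):
#   ibstat | ports_with_lid
```

With the output shown above this should report port 2 with LID 1, which matches the `-P 2` used for the ibping server.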

I can then ibping from host1 and host2 with: ibping -P 1 1

Host1 (oak-rd1-linux):

baird@oak-rd1-linux:~$ sudo ibping -P 1 1
Pong from oak-rd0-linux.(none) (Lid 1): time 0.027 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.037 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.042 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.044 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.038 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.029 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.042 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.029 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.038 ms
^C
--- oak-rd0-linux.(none) (Lid 1) ibping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8028 ms
rtt min/avg/max = 0.027/0.036/0.044 ms

Host2 (oak-rd2-linux):

(base) baird@oak-rd2-linux:~$ sudo ibping -P 1 1
Pong from oak-rd0-linux.(none) (Lid 1): time 0.029 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.015 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.041 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.043 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.044 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.037 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.042 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.039 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.038 ms
Pong from oak-rd0-linux.(none) (Lid 1): time 0.040 ms
^C
--- oak-rd0-linux.(none) (Lid 1) ibping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9055 ms
rtt min/avg/max = 0.015/0.036/0.044 ms

This seems to be working fine. Assuming that all is good on the InfiniBand side, here is my problem:

I can run the tests with the Open MPI build that comes with HPC-X:

mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c

Hello, world, I am 1 of 2, (Open MPI v4.0.3rc4, package: Open MPI root@0e5a40994726 Distribution, ident: 4.0.3rc4, repo rev: v4.0.3rc4-6-g8b4a8cd34c, Unreleased developer copy, 148)
Hello, world, I am 0 of 2, (Open MPI v4.0.3rc4, package: Open MPI root@0e5a40994726 Distribution, ident: 4.0.3rc4, repo rev: v4.0.3rc4-6-g8b4a8cd34c, Unreleased developer copy, 148)

However, when I try to run across two hosts with:

mpirun -x LD_LIBRARY_PATH -np 2 -H oak-rd0-linux,oak-rd1-linux $HPCX_MPI_TESTS_DIR/examples/hello_c

I don't get any feedback: no errors and no output. It just seems to hang.
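One prerequisite that may be worth ruling out first: as far as I understand, mpirun by default starts its remote daemons over ssh, so a multi-host launch can stall if ssh to the other host waits for a password or host-key confirmation, or if the binary is not at the same path there. The following is a minimal sketch of such a check, not a confirmed diagnosis. `check_host` and the `SSH_CMD` override are hypothetical names introduced here for illustration, and the sketch assumes the test binary lives at the same path on every host.

```shell
#!/bin/sh
# check_host: hypothetical helper that verifies a host accepts
# non-interactive ssh and that the given binary is executable there.
# SSH_CMD is an illustrative override hook (defaults to ssh).
check_host() {
    host=$1
    binary=$2
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" test -x "$binary"; then
        echo "$host: ok"
    else
        echo "$host: FAILED (no passwordless ssh, or $binary missing)"
        return 1
    fi
}

# Example (not run here; assumes the same path on all hosts):
#   for h in oak-rd1-linux oak-rd2-linux; do
#       check_host "$h" "$HPCX_MPI_TESTS_DIR/examples/hello_c"
#   done
```

If both hosts report ok, the launch step itself is probably not where it stalls, and the next place to look would be the communication between the two processes.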

- Can someone please guide me on how to connect to and use the CPUs of my other hosts?
- Which utilities should I use to debug the issue?

I am a complete newbie at this and would greatly appreciate any help or suggestions. I am ready to provide any additional information, try out suggestions, etc. Cheers!

theenemy