
I'm running into problems getting an LACP trunk to operate properly on Ubuntu 12.04.2 LTS.

My setup is a single host connected with two 10 GbE interfaces to two separate Nexus 5548 switches, with vPC configured to enable multi-chassis LACP. The Nexus config follows Cisco guidelines, and the Ubuntu config follows https://help.ubuntu.com/community/UbuntuBonding

The server is connected to port Ethernet1/7 on each Nexus switch; both ports are configured identically and placed in Port-channel 15. Port-channel 15 is configured as vPC 15, and the vPC output looks good. These are simple access ports, i.e. no 802.1Q trunking involved.

Diagram:

    +----------+      +----------+      +----------+      +----------+
    | client 1 |------| nexus 1  |------| nexus 2  |------| client 2 |
    +----------+      +----------+      +----------+      +----------+
                           |                  |
                           |    +--------+    |
                           +----| server |----+
                           eth4 +--------+ eth5

When either link is down, both clients 1 and 2 are able to reach the server. However, when I bring the secondary link up, the client connected to the switch with the newly enabled link is unable to reach the server. See the following table for state transitions and results:

   port states (down by means of "shutdown")
     nexus 1 eth1/7        up     up    down   up
     nexus 2 eth1/7       down    up     up    up

   connectivity
    client 1 - server      OK     OK     OK   FAIL
    client 2 - server      OK    FAIL    OK    OK

Now, I believe I've isolated the issue to the Linux side. When in the up-up state, each Nexus uses its local link to the server to deliver the packets, as verified by looking at the MAC address table. On the server I can see the packets from each client arriving on the corresponding ethX interface (packets from client 1 on eth4, packets from client 2 on eth5) using tcpdump -i ethX, but when I run tcpdump -i bond0 I only see traffic from one of the hosts (in accordance with what I stated above).

I observe the same behaviour for ARP and ICMP (IP) traffic: ARP from a client fails when both links are up and works (along with ping) when one link is down; ping fails again when I re-enable the link (packets are still received on the eth interface, but not on bond0).
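For reference, this is roughly how I'm comparing per-slave and bond traffic (interface names as in my setup):

    # raw frames on each physical slave
    sudo tcpdump -e -n -i eth4 arp or icmp
    sudo tcpdump -e -n -i eth5 arp or icmp

    # what actually reaches the bond interface
    sudo tcpdump -e -n -i bond0 arp or icmp

    # LACP/aggregator state as seen by the bonding driver
    cat /proc/net/bonding/bond0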

To clarify, I'm setting up multiple servers in this configuration, and all show the same symptoms, so it doesn't appear to be hardware related.

So figuring out how to fix that is what I'm stuck on; Googling has not brought me any luck so far.

Any pointers are highly appreciated.

/etc/network/interfaces

    auto eth4
    iface eth4 inet manual
    bond-master bond0

    auto eth5
    iface eth5 inet manual
    bond-master bond0

    auto bond0
    iface bond0 inet static
    address 10.0.11.5
    netmask 255.255.0.0
    gateway 10.0.0.3
    mtu 9216
    dns-nameservers 8.8.8.8 8.8.4.4
    bond-mode 4
    bond-miimon 100
    bond-lacp-rate 1
    #bond-slaves eth4
    bond-slaves eth4 eth5
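For completeness, this sits on top of the standard bonding prerequisites from the UbuntuBonding guide linked above; roughly:

    sudo apt-get install ifenslave-2.6
    # ensure the bonding module is loaded at boot
    echo "bonding" | sudo tee -a /etc/modules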

A little further information:

/proc/net/bonding/bond0

    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: IEEE 802.3ad Dynamic link aggregation
    Transmit Hash Policy: layer2 (0)
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0

    802.3ad info
    LACP rate: fast
    Min links: 0
    Aggregator selection policy (ad_select): stable
    Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 1
    Actor Key: 33
    Partner Key: 1
    Partner Mac Address: 00:00:00:00:00:00

    Slave Interface: eth4
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 8
    Permanent HW addr: 90:e2:ba:3f:d1:8c
    Aggregator ID: 1
    Slave queue ID: 0

    Slave Interface: eth5
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 13
    Permanent HW addr: 90:e2:ba:3f:d1:8d
    Aggregator ID: 2
    Slave queue ID: 0

EDIT: Added config from Nexus

    vpc domain 100
      role priority 4000
      system-priority 4000
      peer-keepalive destination 10.141.10.17 source 10.141.10.12
      peer-gateway
      auto-recovery
    interface port-channel15
      description server5
      switchport access vlan 11
      spanning-tree port type edge
      speed 10000
      vpc 15
    interface Ethernet1/7
      description server5 internal eth4
      no cdp enable
      switchport access vlan 11
      channel-group 15
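For what it's worth, the vPC and MAC-table state mentioned above is checked with the usual NX-OS show commands, along these lines (output omitted):

    show vpc
    show port-channel summary
    show lacp neighbor
    show mac address-table vlan 11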

EDIT: Added results from a non-vPC port-channel on nexus1 for the same server, before and after an IP change (I changed the IP to influence the load-balancing algorithm). This is still using the same settings on the server.

      port states (down by means of "shutdown")
        nexus 1 eth1/7        up     up    down   up
        nexus 1 eth1/14      down    up     up    up <= port moved from nexus 2 eth1/7

      connectivity (server at 10.0.11.5, hashing uses Eth1/14)
        client 1 - server      OK     OK     OK   FAIL
        client 2 - server      OK     OK     OK   FAIL

The results after changing the IP are as predicted: bringing up the unused interface causes failures.

      connectivity (server at 10.0.11.15, hashing uses Eth1/7)
        client 1 - server      OK    FAIL    OK    OK
        client 2 - server      OK    FAIL    OK    OK
Tolli

2 Answers


The only LACP config I managed to get working in Ubuntu is this:

    auto bond0
    iface bond0 inet dhcp
      bond-mode 4
      bond-slaves none
      bond-miimon 100
      bond-lacp-rate 1
      bond-updelay 200
      bond-downdelay 200

    auto eth0
    iface eth0 inet manual
      bond-master bond0

    auto eth1
    iface eth1 inet manual
      bond-master bond0

That is, I set bond-slaves none on bond0 and attach each interface with bond-master instead of listing the slaves on the bond. I'm not sure exactly what the difference is, but I found this config worked for me.
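If you try this, something along these lines should apply the change and show whether both slaves land in the same aggregator (assuming the stock ifupdown tools; a reboot also works):

    sudo ifdown bond0 eth0 eth1
    sudo ifup eth0 eth1 bond0

    # both Slave Interface sections should report the same Aggregator ID
    grep -E "Slave Interface|Aggregator ID" /proc/net/bonding/bond0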

I don't have any issues with LACP in my setup, although this is with 1 GbE networking.

In addition, if you're still having problems, try plugging both cables into the same switch and configuring those ports for LACP, just to eliminate the possibility of issues with multi-chassis LACP.

hookenz

The problem is not on the Linux side but on the Nexus side, in how it behaves in a vPC configuration.

To configure vPC on the Nexus switches, you first need to connect the two switches and configure that link as the "peer-link".

In the normal situation, when both links from the switches to the server are up, traffic in VLAN 11 arriving for the vPC is dropped on the peer-link.

Only when one of the interfaces that is part of the vPC goes down is traffic in VLAN 11 allowed across the peer-link.

This is how vPC works on Nexus switches.

To solve this problem you can run FabricPath and make another connection between switches nexus-1 and nexus-2.
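For reference, the peer-link is typically defined along these lines (a sketch only; the port-channel number and member port are placeholders, not taken from the config above):

    interface port-channel10
      switchport mode trunk
      vpc peer-link

    interface Ethernet1/31
      switchport mode trunk
      channel-group 10 mode active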