20

As you all probably know, ipv4 route cache has been removed in 3.6 Linux kernel series, which had serious impact on multipath routing. IPv4 routing code (unlike IPv6 one) selects next hop in a round-robin fashion, so packets from given source IP to given destination IP don't always go via the same next hop. Before 3.6 the routing cache was correcting that situation, as next hop, once selected, was staying in the cache, and all further packets from the same source to the same destination were going through that next hop. Now next hop is re-selected for each packet, which leads to strange things: with 2 equal cost default routes in the routing table, each pointing to one internet provider, I can't even establish TCP connection, because initial SYN and final ACK go via different routes, and because of NAT on each path they arrive to destination as packets from different sources.

Is there any relatively easy way to restore normal behaviour of multipath routing, so that next hop is selected per-flow rather than per-packet? Are there patches around to make IPv4 next hop selection hash-based, like it is for IPv6? Or how do you all deal with it?

Eugene
  • 511

2 Answers2

9

"Relatively easy" is a difficult term, but you might

  1. set up routing tables for each of your links - one table per link, with a single default gateway
  2. use netfilter to stamp identical marks on all packets of a single stream
  3. use the ip rule table to route the packets via different routing tables depending on the mark
  4. use a multi-nexthop weighted route to balance the first-in-a-session packets over your gateways/links.

There has been a discussion at the netfilter mailing list on this topic where I am stealing the listings from:

1. Routing rules (RPDB and FIB)

ip route add default via <gw_1> lable link1
ip route add <net_gw1> dev <dev_gw1> table link1
ip route add default via <gw_2> table link2
ip route add <net_gw2> dev <dev_gw2> table link2

/sbin/ip route add default  proto static scope global table lb \
 nexthop  via <gw_1> weight 1 \
 nexthop  via <gw_2> weight 1

ip rule add prio 10 table main
ip rule add prio 20 from <net_gw1> table link1
ip rule add prio 21 from <net_gw2> table link2
ip rule add prio 50 fwmark 0x301 table link1
ip rule add prio 51 fwmark 0x302 table link2
ip rule add prio 100 table lb

ip route del default

2. Firewall rules (using ipset to force a "flow" LB mode)

ipset create lb_link1 hash:ip,port,ip timeout 1200
ipset create lb_link2 hash:ip,port,ip timeout 1200

# Set firewall marks and ipset hash
iptables -t mangle -N SETMARK
iptables -t mangle -A SETMARK -o <if_gw1> -j MARK --set-mark 0x301
iptables -t mangle -A SETMARK -m mark --mark 0x301 -m set !
--match-set lb_link1 src,dstport,dst -j SET \
          --add-set lb_link1 src,dstport,dst
iptables -t mangle -A SETMARK -o <if_gw2> -j MARK --set-mark 0x302
iptables -t mangle -A SETMARK -m mark --mark 0x302 -m set !
--match-set lb_link2 src,dstport,dst -j SET \
          --add-set lb_link2 src,dstport,dst

# Reload marks by ipset hash
iptables -t mangle -N GETMARK
iptables -t mangle -A GETMARK -m mark --mark 0x0 -m set --match-set
lb_link1 src,dstport,dst -j MARK --set-mark 0x301
iptables -t mangle -A GETMARK -m mark --mark 0x0 -m set --match-set
lb_link2 src,dstport,dst -j MARK --set-mark 0x302

# Defining and save firewall marks
iptables -t mangle -N CNTRACK
iptables -t mangle -A CNTRACK -o <if_gw1> -m mark --mark 0x0 -j SETMARK
iptables -t mangle -A CNTRACK -o <if_gw2> -m mark --mark 0x0 -j SETMARK
iptables -t mangle -A CNTRACK -m mark ! --mark 0x0 -j CONNMARK --save-mark
iptables -t mangle -A POSTROUTING -j CNTRACK

# Reload all firewall marks
# Use OUTPUT chain for local access (Squid proxy, for example)
iptables -t mangle -A OUTPUT -m mark --mark 0x0 -j CONNMARK --restore-mark
iptables -t mangle -A OUTPUT -m mark --mark 0x0 -j GETMARK
iptables -t mangle -A PREROUTING -m mark --mark 0x0 -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -m mark --mark 0x0 -j GETMARK

You might want to follow the netfilter mailing list discussion for some variations of the above.

the-wabbit
  • 41,352
9

If possible, upgrade to Linux Kernel >= 4.4 ....

Hash-based multipath routing has been introduced, which in many ways is better than the pre 3.6 behaviour. It is flow-based, taking a hash of the source and destination IPs (ports are ignored) to keep the path steady for individual connections. One downside is that I believe there were various algorithms/config modes available pre 3.6, but now you get what you're given!. You can use affect the choice of path by weight though.

If you are in my situation then you actually want the 3.6 >= behaviour < 4.4 but it is no longer supported.

If you do upgrade to >= 4.4 then this should do the trick, without all the other commands:

ip route add default  proto static scope global \
nexthop  via <gw_1> weight 1 \
nexthop  via <gw_2> weight 1

Alternatively by device:

ip route add default  proto static scope global \
 nexthop  dev <if_1> weight 1 \
 nexthop  dev <if_2> weight 1
bao7uo
  • 1,714