18

We have a system which is suffering from comms outages on a gigabit Ethernet network. The traffic load would only slightly stress a 100 Mb/s network, but there are gigabit switches, NICs and cables throughout, or so I am told by the customer who built the network we are plugging into.

We plugged in a laptop running Wireshark via a 100baseT hub and found that it reported lots of "Ethernet II" packets where the raw data, when displayed as ASCII, basically looks like this:

PUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

Naturally I immediately named this issue "Network PUU" and many giggles ensued. We're all in our forties or so, but I guess some of us never grow up (guilty!)

Anyway, more seriously, other perfectly valid packets were being corrupted by this data. IPv4 headers were getting bytes replaced with U bytes as well as there being data corruption which would cause the software to reject the data, even if the IP checksums didn't fail to match. We are pretty sure that this data spewing onto the network is causing the comms outages. What we don't know is where it might be coming from.

Has anyone ever seen this happen before? Did you solve it? Did you figure out where it came from?

====EDITED====

Added mention of the hub to the original description since, judging from the comments below, it is the most likely source of the corruption! The tool we used to try and find the network issue appears to have added a new and worse network issue.

AlastairG

4 Answers

18

Anyway, more seriously, other perfectly valid packets were being corrupted by this data. IPv4 headers were getting bytes replaced with U bytes as well as there being data corruption which would cause the software to reject the data, even if the IP checksums didn't fail to match.

It's surprising that bytes of just alternating bits (U is ASCII 0x55, or 01010101b) would make up valid Ethernet frames, let alone valid IP packets. If this corruption also creeps into otherwise intact frames or packets, the most likely cause is a faulty switch (bad buffer memory) or a faulty host (NIC or RAM).
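As a rough illustration of the point about 0x55, here is a minimal Python sketch that shows the bit pattern of "U" and measures the longest run of 0x55 bytes in a payload. The function names and the 16-byte threshold are hypothetical choices for this example, not something taken from the capture in the question.

```python
def longest_run_of_0x55(payload: bytes) -> int:
    """Return the length of the longest run of consecutive 0x55 ('U') bytes."""
    longest = current = 0
    for byte in payload:
        current = current + 1 if byte == 0x55 else 0
        longest = max(longest, current)
    return longest


def looks_like_puu(payload: bytes, min_run: int = 16) -> bool:
    """Flag payloads containing a suspiciously long run of 0x55 bytes.

    min_run is an arbitrary threshold chosen for this sketch.
    """
    return longest_run_of_0x55(payload) >= min_run


# 'U' is 0x55, i.e. strictly alternating bits.
print(format(ord("U"), "08b"))

# A made-up payload shaped like the one described in the question.
sample = b"P" + b"U" * 41
print(longest_run_of_0x55(sample), looks_like_puu(sample))
```

Run against payload bytes exported from a capture, something like this could help pick out the affected frames, although the display filters of the capture tool may well do the job more conveniently.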

If frame data is corrupted in transit on the cable, the FCS will almost certainly fail to verify, and the very next switch will drop that frame. If such a frame nevertheless crosses the network with a valid FCS, it must have been corrupted before that FCS was calculated, which points to a defective switch or host.
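The FCS argument can be sketched in Python. The Ethernet FCS is a CRC-32, and `zlib.crc32` uses the same CRC-32 parameters, so it serves as a stand-in here. The frame contents, the helper names and the corruption offsets are made up for illustration; real frames have proper headers and are checked in hardware.

```python
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 FCS (least significant byte first) to the frame."""
    return frame + zlib.crc32(frame).to_bytes(4, "little")


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC-32 over the body and compare it with the trailing FCS."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return zlib.crc32(body) == int.from_bytes(fcs, "little")


def corrupt(data: bytes, start: int, length: int) -> bytes:
    """Overwrite `length` bytes at `start` with 0x55 ('U')."""
    return data[:start] + b"\x55" * length + data[start + length:]


frame = bytes(range(64))  # placeholder frame contents, not a real packet

# Case 1: corruption on the wire, after the sender computed the FCS.
damaged_in_transit = corrupt(append_fcs(frame), 20, 4)

# Case 2: corruption inside a device, before the FCS is (re)calculated.
damaged_in_device = append_fcs(corrupt(frame, 20, 4))

print(fcs_ok(damaged_in_transit))  # receiver sees a mismatch and drops it
print(fcs_ok(damaged_in_device))   # looks valid, so it is forwarded
```

The second case is the one that matters here: the frame carries garbage but passes every FCS check downstream.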

You'll need to trace that traffic back to its source. If the source MAC address isn't valid, or can't be checked on intermediate (unmanaged) switches, you'll need to trace your way back along the cables.

Zac67
12

Sounds like you have a bad NIC. If the source MAC address is valid, you can find it by checking the switch MAC tables. If it is corrupted, you'll just have to start unplugging devices to find it.

Ron Trunk
3

That sounds as if you have a device (probably a 100 Mb/s switch) somewhere that can't cope with the traffic flow and starts corrupting packets when its internal buffers overflow.
(Or it just has bad RAM.)

It doesn't notice that the packets are corrupt and happily retransmits them with freshly calculated checksums. So the bad packets are accepted by other switches (the checksum is good, and switches don't care that the content is nonsense) and forwarded through the entire network.

It is actually worse than that:
Consider how switches learn which device (MAC address) is behind which port. Any packet destined for a MAC address that the switch hasn't learned yet is flooded to all switch ports (except the one it came in on). This effectively turns a packet for an unlearned MAC address into a temporary broadcast.
Because your switches will never learn these MAC addresses (after all, they are corruption, not real MAC addresses), they are ALL treated like broadcasts...
This essentially floods the whole network with undeliverable packets.
(And note that normal broadcast-storm mitigations don't work in this case. They only act on REAL broadcast packets, not on these learning floods.)
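The learning and flooding behaviour described above can be sketched as a toy model in Python. This is a simplified illustration, not how any particular switch is implemented: the class name, port numbers and MAC addresses are hypothetical, and real switches also age out table entries and treat multicast addresses specially.

```python
class LearningSwitch:
    """Toy model of a MAC-learning switch (no aging, no VLANs)."""

    def __init__(self, num_ports: int):
        self.ports = list(range(num_ports))
        self.mac_table = {}  # MAC address -> port it was last seen on

    def forward(self, in_port: int, src: str, dst: str) -> list:
        """Learn the source, then return the list of output ports."""
        self.mac_table[src] = in_port
        if dst in self.mac_table:
            out_port = self.mac_table[dst]
            # Never send a frame back out of the port it arrived on.
            return [] if out_port == in_port else [out_port]
        # Unknown destination: flood to every port except the ingress port.
        return [p for p in self.ports if p != in_port]


sw = LearningSwitch(4)
host_a, host_b = "02:00:00:00:00:0a", "02:00:00:00:00:0b"  # made-up hosts
garbage = "02:de:ad:be:ef:01"  # made-up stand-in for a corrupted destination

print(sw.forward(0, host_a, host_b))   # host_b unknown yet: flooded
print(sw.forward(1, host_b, host_a))   # host_a was learned: single port
print(sw.forward(0, host_a, garbage))  # never learned: flooded
print(sw.forward(0, host_a, garbage))  # and flooded again, every time
```

Because no device ever transmits from the garbage address, it never enters the table, so every frame addressed to it is flooded.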

The only way to troubleshoot this is to disable one switch at a time and see whether the problem goes away. If you can narrow it down to one switch, the culprit is either that switch itself or a device connected behind it.

Tonny
-1

The difference between a hub and a switch is that when a switch gets a collision, it either throws out the second packet or stores it and forwards it once the first packet finishes, whereas a hub merrily allows the collision to happen, replaces the contents of the packet with 10101... to indicate that there was a collision, and keeps sending that until both packets have finished.

The solution here is to get rid of hubs, as they are obsolete. They stopped making hubs before 1G was available, so a hub has to be 100M or slower. The 1G network standard does not support hubs.

For a little history, before there were hubs, there were repeaters. The difference between a repeater and a hub is that the repeater receives the analog signal, cleans it up slightly back into a nice square wave, and then retransmits it, whereas a hub actually looks at what is in the packet a little bit and tries to make sure the packet is well formed. However, neither of them does anything to fix collisions; they just let them happen. Repeaters and hubs are from back when Ethernet was considered to be an unbuffered bus on which only one device could speak at a time.

When Ethernet was a true bus (10BASE2 and 10BASE5), to start a packet, you transmitted start bits (10101...) until the first bit reached the furthest ends of the network, and if nobody else had interrupted you in the meantime, you continued your packet. If you get interrupted, you have a collision, and both parties back off and try again at a random time later. If one party doesn't abort, you have a late collision.

Your hub is turning your late collisions into all start bits. Possibly something in the path is not recognizing the packet as a late collision and, rather than dropping it, is reforming it as a valid packet. Or your promiscuous packet sniffer sees invalid packets as well as valid ones.
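The "back off and try again at a random time later" step mentioned above is, in classic half-duplex Ethernet, a truncated binary exponential backoff. Here is a minimal Python sketch of it; the constant and function names are chosen for this example, and the values (512 bit-time slots at 10 and 100 Mb/s, the exponent capped at 10, giving up after 16 attempts) are the commonly cited CSMA/CD parameters.

```python
import random

SLOT_BIT_TIMES = 512  # one slot time, in bit times, at 10/100 Mb/s
BACKOFF_LIMIT = 10    # the exponent stops growing after 10 collisions
MAX_ATTEMPTS = 16     # the frame is abandoned after 16 failed attempts


def backoff_slots(collision_count: int, rng=random) -> int:
    """Pick how many slot times to wait after the n-th collision on a frame.

    Returns a uniform random integer in [0, 2**k - 1], with k = min(n, 10).
    """
    if collision_count < 1:
        raise ValueError("collision_count starts at 1")
    if collision_count >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions: frame is dropped")
    k = min(collision_count, BACKOFF_LIMIT)
    return rng.randrange(2 ** k)


# Example: waiting time in bit times after the third collision.
wait = backoff_slots(3) * SLOT_BIT_TIMES
print(wait)
```

Because the waiting range doubles with each successive collision, two stations that keep colliding quickly become unlikely to pick the same slot again.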

Contrast this with a switch, which not only fixes collisions, but can support full duplex, where a packet is being transmitted while a different one is being received. The 100M standard supports switches that both do and do not support full duplex and this is negotiated between devices when the cable is plugged in. The 1G ethernet standard requires all devices to support full duplex, so a hub is not allowed on a 1G connection, and therefore, 1G hubs do not exist.

user10489