I have several computers running Red Hat Enterprise Linux 4 Update 4 that I use for data acquisition. Almost all services have been disabled, but something is still happening every ten minutes that causes packet loss. (Data acquisition happens on the second Ethernet port, which is attached via VLAN to an A/D box.) The odd thing is that it is not consistent across machines: not all of the computers have the problem, and after rebooting, a different subset of the group drops packets. On the ones that do have the problem, though, it is very consistent: every ten minutes, to the second.
How do I find out what is causing the loss? Thanks...
Common cause for problem.
This seems to be a really tricky problem, but are they all connected to the same network switch? If so it may be that the switch is having some kind of problems and maybe a firmware upgrade will help. You may also try a different brand of switch just for reference testing.
In all, try to figure out what common hardware may be causing this problem.
If possible - try to test the application on a node with a different OS version too, like Fedora 7.
One "feature" that can cause real confusion is that modern network cards have software-configurable MAC addresses. If the OS is installed on one node and the other nodes are clones of that node, the MAC address of the master node may have been replicated to the other nodes. I have experienced this "feature" under HP-UX, and the errors resulting from this misconfiguration can be really hard to trace back to the cause.
Just watch out for contention spots in your network - it may be that packets are dropped due to high load and that you have some part that is close to peak performance and then every 10 minutes a broadcast occurs that pushes it over the edge.
Sigh, it *IS* tricky...
We have removed the switch and cabled directly from the computer to the A/D box. (We're a bit disappointed by the performance of the Cisco 3750G we were using.) We've swapped ports on the A/D box, but the problem stays with the computer, so it really seems to be something in the kernel. All computers have been kickstarted with the same OS, same compilers, etc., and all had the same firmware upgrades, so they should all be the same. Our hands are tied on the OS - we can't change it...
Is there some kind of auditing or accounting package that will show me what the kernel is doing when the packets get dropped?
Thanks...
AH-HA!
OK, well, if anyone else runs into this same problem, the answer is... (drum roll, please!)
...IRQBALANCE...
Looks like it was assigning the eth1 interrupt to CPU 0 and for whatever crazy reason, something was happening on CPU 0 every ten minutes that got in the way. When we turn that off and make sure the interrupt is assigned to CPU 1, the problem goes away...
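For anyone who wants to try the same fix, here is a rough Python sketch of the bookkeeping involved, not the exact commands we ran. It finds the IRQ number for an interface in the text of /proc/interrupts and builds the hex CPU bitmask that /proc/irq/<N>/smp_affinity expects. The function names and the sample text are made up for illustration, and it assumes the irqbalance service has been stopped so the setting is not overwritten.

```python
# Hypothetical sample of /proc/interrupts output, for illustration only.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  0:      12345          0    IO-APIC-edge  timer
169:        100        200   IO-APIC-level  eth0
177:          5          9   IO-APIC-level  eth1
NMI:          0          0
"""


def find_irqs(interrupts_text, device):
    """Return the IRQ numbers whose /proc/interrupts line names `device`."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        # Skip the CPU header and non-numeric rows such as NMI.
        if not sep or not head.strip().isdigit():
            continue
        # Shared IRQs list several devices separated by commas.
        names = rest.replace(",", " ").split()
        if device in names:
            irqs.append(int(head.strip()))
    return irqs


def cpu_mask(cpu):
    """Hex bitmask selecting a single CPU (CPU 0 -> '1', CPU 1 -> '2')."""
    return format(1 << cpu, "x")


def affinity_path(irq):
    """Path of the affinity file for an IRQ."""
    return f"/proc/irq/{irq}/smp_affinity"
```

With the sample text, eth1 sits on IRQ 177 and the mask for CPU 1 is 2, so as root you would write that value into the file named by affinity_path(177). On a real box, pass the contents of /proc/interrupts instead of the sample.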
Hmm
Even if the problem is avoided by changing CPU affinity, it would be interesting to know what runs every 10 minutes to cause the problem. Any ideas?
Which NIC and Driver?
I'm seeing precisely the same behavior you describe.
In my case it's a ~50Mb/s multicast stream. Every 10 minutes (almost to the millisecond) the kernel doesn't process any incoming frames for 40ms.
The frames queue somewhere (on the NIC, I guess), then stream (burst!) in very quickly after the 40ms stoppage.
If the data rate is high enough, the frames at the tail end of the 40ms outage get lost.
I know the frames were sent, and sent on time, because I've deployed an Ethernet tap between the server and the switch. Sniffer on the tap sees the data, tcpdump on the server does not.
Many servers are displaying this behavior, but they're on different 10 minute schedules: one server misses at 1,11,21,31 minutes after the hour, another may be 7,17,27... and so on.
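In case it helps anyone compare notes, here is a minimal Python sketch of how stalls like this can be picked out of a list of packet arrival times (for example, timestamps exported from a capture on the server). It is only an illustration: the function names and the choice of threshold are my assumptions, not part of any tool.

```python
def find_gaps(timestamps, threshold):
    """Return (start, duration) for each gap between consecutive
    arrival times (in seconds) that is longer than `threshold`."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur - prev))
    return gaps


def phase(timestamp, period=600.0):
    """Offset of a timestamp within a repeating period (default 10 minutes)."""
    return timestamp % period
```

Running find_gaps over a capture with a threshold a bit below the stall length (say 0.03 for a 40ms outage) and then taking phase() of each gap start should show whether a given server always stalls at the same offset within its 10 minute cycle.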
All of those boxes are running Broadcom 5706 chipset with the bnx2 driver. Other systems with tg3 are running fine.
So Doug... What chipset and driver?
/cmm
The machines are HP ProLiant
The machines are HP ProLiant DL380 G4's, the chipset is:
PCI-ID 14e4:1648 Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
Running RHEL 4.4, kernel-2.6.9-42.EL, tg3 version 3.52-rh
FYI: Red Hat has identified this issue.
No permanent fix yet, but there are a couple of workarounds. For details, ask your Red Hat technical contacts.
Here is the answer
http://kbase.redhat.com/faq/FAQ_75_11786.shtm
And related answers
http://kbase.redhat.com/faq/FAQ_75_11787.shtm
http://kbase.redhat.com/faq/FAQ_75_11785.shtm
any workaround for this HP
Is there any workaround for these HP DL series (tg3) issues?
my workaround was...
My workaround on the ProLiant was to buy a dual gigabit network card from Intel...
any workaround on this NIC
Is there any workaround for this NIC issue?
Maybe these posts to LKML
Maybe these posts to LKML could help:
"Strange delays / what usually happens every 10 min?"
http://lkml.org/lkml/2007/11/13/138
See answer from Eric Dumazet, "Check /proc/sys/net/ipv4/route/secret_interval"
http://lkml.org/lkml/2007/11/13/183
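To follow up on that hint, here is a small Python sketch that reads the value and compares it with the period seen in the drops. The /proc path is the one quoted above; the function names are made up for this example. I believe the value is in seconds and commonly defaults to 600 on kernels of that era, but treat that as an assumption and check what your own machine reports.

```python
SECRET_INTERVAL_PATH = "/proc/sys/net/ipv4/route/secret_interval"


def parse_interval(text):
    """Parse the file contents (a single integer) into an int."""
    return int(text.strip())


def read_interval(path=SECRET_INTERVAL_PATH):
    """Return the configured interval, or None if the file cannot be read
    (for example on a kernel that does not expose this sysctl)."""
    try:
        with open(path) as f:
            return parse_interval(f.read())
    except OSError:
        return None


def matches_period(interval, observed_period, tolerance=1.0):
    """True if the interval is within `tolerance` seconds of the observed period."""
    return abs(interval - observed_period) <= tolerance
```

If read_interval() returns a value and matches_period(value, 600) is true, the timing at least lines up with the ten-minute drops. That is a correlation worth following up, not proof of the cause.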