On an in-house designed board based on AMD's Elan SC520 CDP board, using an Elan SC520 CPU (an x86-based microcontroller) and the 79C973 Ethernet controller, both from AMD, we see the following problem:
During heavy network load, Ethernet communication stops and the board is inaccessible from the network. After about a minute the board responds again and the system log shows the following messages:
Mar 10 15:29:13 elmo kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 10 15:29:13 elmo kernel: eth0: transmit timed out, status 06f3, resetting.
This problem only occurs with kernel version 2.4.27 and higher (both 2.4.28 and 2.4.29 were tested). Version 2.4.26 works without any problems.
After zooming in on the driver (pcnet32.c) in kernel version 2.4.29, I isolated the problem to the function pcnet32_watchdog. After I commented out the call to mii_check_media in that function, the communication errors disappeared and the board responded normally, except that it no longer detects changes in the link status (which is to be expected, since I commented out that check).
It looks like there is some sort of concurrency problem or race condition under heavy load, but I have not yet been able to verify this.
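To illustrate the kind of race I have in mind, here is a minimal user-space sketch. It is only a toy model, not code from pcnet32.c: it assumes the chip is reached through an address port and a data port, and that an interrupt handler using the same ports can run between the two accesses made by the periodic check. All names (toy_chip, toy_irq, read_reg_unlocked, read_reg_locked) and register values are hypothetical, and I have not confirmed that this is what actually happens in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a chip accessed through an address port and a data port.
 * Hypothetical names throughout; this is NOT the pcnet32 driver. */
#define NUM_REGS 64

struct toy_chip {
    uint16_t addr;            /* last index written to the address port */
    uint16_t regs[NUM_REGS];  /* register file visible through the data port */
};

static void write_addr(struct toy_chip *c, uint16_t index)
{
    c->addr = index % NUM_REGS;
}

static uint16_t read_data(const struct toy_chip *c)
{
    return c->regs[c->addr];
}

/* Stand-in for an interrupt handler that also uses the address port:
 * it selects register 0 and reads it, leaving the address port changed. */
static void toy_irq(struct toy_chip *c)
{
    write_addr(c, 0);
    (void)read_data(c);
}

/* Unprotected access: if the "interrupt" fires between selecting the
 * register and reading the data port, the read returns the wrong register. */
uint16_t read_reg_unlocked(struct toy_chip *c, uint16_t index, int irq_fires)
{
    write_addr(c, index);
    if (irq_fires)
        toy_irq(c);           /* address port now points at register 0 */
    return read_data(c);
}

/* Protected access: the "interrupt" is held off until the two-step
 * sequence has completed, so the read sees the intended register. */
uint16_t read_reg_locked(struct toy_chip *c, uint16_t index, int irq_pending)
{
    write_addr(c, index);
    uint16_t value = read_data(c);
    if (irq_pending)
        toy_irq(c);           /* delivered only after the critical section */
    return value;
}
```

With register 0 set to 0x1111 and register 34 set to 0x2222 (arbitrary values), read_reg_unlocked(&chip, 34, 1) returns the contents of register 0 instead of register 34, while read_reg_locked(&chip, 34, 1) returns the intended value. In a real driver the equivalent protection would be to hold the driver's lock with interrupts disabled (for example with spin_lock_irqsave) around the whole sequence, but whether that applies here is exactly what I have not been able to verify.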
Does anyone have a clue what could be wrong or what I could do to fix it?
Because our board is a dedicated embedded system (no keyboard or display), debugging is more difficult than on a normal PC, but it is not impossible.