I will be taking various machines down tonight for about fifteen minutes each to install new NICs with non-Intel chipsets. Since version 4.15.0, the Linux kernel has had a bug in the Intel E-1000 drivers that causes the cards to lock up when hardware offloading is used. Usually these lock-ups are transient, resulting in 2-3 second delays in data, but occasionally the cards will lock hard and require a drive to the co-lo facility to physically reset the machine.
Because the servers most affected are those carrying heavy traffic, the NFS server providing the home directories in particular, I will be replacing the NICs on all the NFS servers. This will affect virtually all of our services, but it will prevent long outages like the one we suffered Sunday morning from recurring.
I filed a bug report on this problem in April of this year. Canonical has offered me various kernels to try, but many of them either did not boot at all or were extremely unstable. At this point I feel it’s more cost-effective and less service-affecting just to replace the hardware.