There was some major weirdness this morning and afternoon. It started with the mail server not responding to NFS requests. Mail is a virtual machine on the Igloo physical host, so to reboot it I had to log in to Igloo. However, Ubuntu, in its infinite wisdom, has some system-wide scripts that run when you log in and, among other things, check for mail by looking at the local mail spool, which on all of the machines is NFS-mounted from mail. With NFS on mail not responding, the login on Igloo hung. At this time mail was still responding to IMAP, POP3, and SMTP, so it was still possible to use it via Thunderbird, but this broke shortly after I posted about it.
NFS used to have a timeout: when a server did not respond, if you were patient you would eventually get past it. But apparently the default is no longer to time out.
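For what it is worth, the behavior described here matches the difference between "hard" and "soft" NFS mounts on Linux: a hard mount retries forever, while a soft mount gives up after `retrans` retries of `timeo` tenths of a second. The sketch below is only an illustration, not what is running on these machines. It reads text in the format of /proc/mounts and lists the NFS mounts that would hang indefinitely. The sample host name and paths are made up.

```python
def hard_nfs_mounts(mounts_text):
    """Return mount points of NFS filesystems that are not soft-mounted.

    mounts_text is expected in /proc/mounts format:
    device  mount_point  fs_type  options  dump  pass
    """
    hang_prone = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type not in ("nfs", "nfs4"):
            continue  # only NFS mounts are of interest
        # Without the "soft" option, the client retries indefinitely.
        if "soft" not in options.split(","):
            hang_prone.append(mount_point)
    return hang_prone


# Hypothetical sample data, not taken from the real servers.
SAMPLE = """\
mail:/var/mail /var/mail nfs rw,hard,timeo=600,retrans=2 0 0
mail:/home /home nfs4 rw,soft,timeo=100,retrans=3 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

print(hard_nfs_mounts(SAMPLE))
```

On a live machine the same function could be fed the contents of /proc/mounts. Soft mounts trade the hang for possible I/O errors, so whether to use them for a mail spool is a judgment call.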
So I had to drive down to the co-lo facility to reboot that machine. I apologize that I did not hear the phone ring; I was sleeping heavily and late owing to being sick last night. I’m not sure what upset my stomach, but I upchucked in the middle of the night, and the upset stomach made it difficult for me to get to sleep for many hours.
When I got to the co-lo I could not reboot the physical host, even with the three-finger salute (Ctrl+Alt+Del). It hung on shutting down guests. I had to forcibly reboot it with the magic SysRq key, Alt+SysRq (Print Screen)+B. I had a newer kernel prepared, three releases newer than the one in service, and I knew it fixed some memory leaks among other things, so I thought I might as well install the new kernel on the physical hosts and mail while I was there.
This went OK on Igloo and Iglulik, but when I rebooted Ice, it would not come up. Strangely, Ice had swapped its drive letters: sda had become sdb and vice versa. I had not moved the drives. A while back I had changed the UUIDs to device names because the blkid program at the time was unreliable, leading to occasional failed reboots. Now that the machine was randomly swapping drive letters, I put it back to UUIDs so it doesn’t care about the drive letters. If blkid becomes a problem again I’ll probably switch to labels, which honestly makes more sense anyway.
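To make the UUID versus drive letter point concrete, here is a small sketch that flags fstab entries which refer to a disk by kernel device name (such as /dev/sda1) rather than by UUID= or LABEL=. Those are the entries that break when sda and sdb trade places. The sample fstab, including its UUID and label, is invented for illustration and is not the configuration on Ice.

```python
import re

# Kernel-assigned names like /dev/sda1, /dev/hdb or /dev/vda2.
DEVICE_NAME = re.compile(r"^/dev/(sd|hd|vd)[a-z]+\d*$")


def unstable_fstab_entries(fstab_text):
    """Return (device, mount_point) pairs that depend on drive letters."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line
        spec, mount_point = fields[0], fields[1]
        if DEVICE_NAME.match(spec):
            flagged.append((spec, mount_point))
    return flagged


# Hypothetical fstab contents for illustration only.
SAMPLE_FSTAB = """\
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/sda1  /         ext3  errors=remount-ro  0 1
UUID=0a1b2c3d-1111-2222-3333-444455556666  /home  ext3  defaults  0 2
LABEL=scratch  /scratch  ext3  defaults  0 2
/dev/sdb1  none      swap  sw  0 0
"""

for spec, mount_point in unstable_fstab_entries(SAMPLE_FSTAB):
    print(f"{spec} -> {mount_point}")
```

With the sample above, the root and swap entries are flagged while the UUID= and LABEL= entries pass, since those are looked up by what is on the disk rather than by the order in which the kernel found it.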
I will be rebooting some of the shell servers and other non-physical hosts tonight to upgrade the kernels on them. I’m also going to try to find the script that is checking for mail and eliminate it on the physical hosts so I can reliably get into them if mail goes down again.
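One possible way to hunt for the mail check is to search the login-related configuration for anything that mentions the mail spool. The sketch below is a guess at how that could look, not a known answer: the search strings (the usual spool paths and the `pam_mail` PAM module) and the directories in the usage note are assumptions about where such a check might live.

```python
import os

# Assumed search strings: common spool paths and the PAM mail module.
NEEDLES = ("/var/mail", "/var/spool/mail", "pam_mail")


def files_mentioning_mail(roots, needles=NEEDLES):
    """Return (path, line_number, line) for each line containing a needle."""
    hits = []
    for root in roots:
        # os.walk silently yields nothing for a directory that is missing.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", errors="replace") as handle:
                        for number, line in enumerate(handle, start=1):
                            if any(needle in line for needle in needles):
                                hits.append((path, number, line.strip()))
                except OSError:
                    continue  # unreadable file, move on
    return hits
```

It could be called as `files_mentioning_mail(["/etc/pam.d", "/etc/profile.d", "/etc/update-motd.d"])`, and each hit then checked by hand before anything is disabled.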