At some point libvirtd on igloo, the machine that hosts mail and a number of shell servers, failed. Libvirtd is the server-side virtualization management daemon; it is responsible for starting and stopping KVM/QEMU guests and for arranging their networking, storage, and system resources (it also supports Xen, but we aren’t using Xen here).
This affected a number of machines, including mail, and because every server NFS-mounts the mail spool from mail, the other servers were affected indirectly.
The message that igloo logged in syslog relating to libvirt was:
libvirtd[2271]: internal error: wrong nlmsg len
The “nlmsg” refers to a Netlink message, so it would appear that something went wrong in networking, libvirtd didn’t know how to handle it, and it crashed.
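For readers unfamiliar with Netlink, here is a rough sketch of what a length check on a Netlink message looks like. Every Netlink message starts with a 16-byte header whose first field, nlmsg_len, declares the total message length, and a receiver normally verifies that this value is plausible before trusting the rest. This is a generic illustration in Python modeled on the kernel's NLMSG_OK macro, not libvirt's actual code, and it is not confirmed which check in libvirt emits the message above. The helper make_msg is hypothetical and exists only to build sample inputs.

```python
import struct

# struct nlmsghdr from <linux/netlink.h>:
#   u32 nlmsg_len, u16 nlmsg_type, u16 nlmsg_flags, u32 nlmsg_seq, u32 nlmsg_pid
# "=" means native byte order with no padding, which gives 16 bytes.
NLMSGHDR = struct.Struct("=IHHII")


def nlmsg_ok(buf: bytes) -> bool:
    """Return True if buf starts with a plausibly sized Netlink message."""
    if len(buf) < NLMSGHDR.size:
        return False  # not even a complete header was received
    nlmsg_len, _type, _flags, _seq, _pid = NLMSGHDR.unpack_from(buf)
    # The declared length must cover the header and fit in what was received.
    return NLMSGHDR.size <= nlmsg_len <= len(buf)


def make_msg(payload: bytes, declared_len=None) -> bytes:
    """Hypothetical helper: build a message, optionally lying about its length."""
    total = NLMSGHDR.size + len(payload)
    length = total if declared_len is None else declared_len
    return NLMSGHDR.pack(length, 0, 0, 1, 0) + payload


if __name__ == "__main__":
    print(nlmsg_ok(make_msg(b"abcd")))                   # consistent length
    print(nlmsg_ok(make_msg(b"abcd", declared_len=64)))  # claims more than was received
```

A message whose declared length is shorter than the header, or longer than the bytes actually received, fails the check. That is the general kind of condition a "wrong nlmsg len" error would describe.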
I don’t know exactly how long or how severe the outage was, since things deteriorated gradually after libvirtd crashed. I was going to add an automatic restart for libvirtd in systemd to prevent this specific failure in the future, but found that one was already in place. It was incompletely specified, though, so perhaps systemd choked on it. I have corrected that.
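For anyone curious what an automatic restart looks like in systemd, the fragment below is a minimal sketch of a drop-in override. It is not the configuration actually in use on igloo: the file path assumes the drop-in was created with `systemctl edit libvirtd`, and all of the values are illustrative assumptions.

```ini
# /etc/systemd/system/libvirtd.service.d/override.conf
# (assumed path; this is where `systemctl edit libvirtd` writes its override)

[Unit]
# Give up only after 5 failed starts within 10 minutes (illustrative values).
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
# Restart the daemon whenever it exits uncleanly (non-zero exit, signal, timeout).
Restart=on-failure
# Wait a few seconds between restart attempts.
RestartSec=5
```

After editing a unit file by hand, `systemctl daemon-reload` makes systemd pick up the change, and `systemctl show -p Restart libvirtd` reports the restart policy in effect.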
I received about eight tickets on this issue, and I really appreciate that the ticket system is being used. With outages of this magnitude, though, a phone call would also be good, because if I’m not actively at the terminal I may not be aware of the problem.