Tonight’s outage was caused by operator error, not by a hardware or software fault. I had built a new kernel and intended to try it on my workstation before deployment, but I also had a window open on the main file server, because that is where I store and distribute kernels and keep the configuration files. I went to reboot my workstation but was in the wrong terminal and rebooted the server instead. Because I hadn’t shut down the virtual machines on it properly, it did not come up cleanly. In particular, the kernel NFS server was snarled and restarting it did not correct the problem, so a second reboot was necessary.
We will be performing a kernel upgrade to 6.1.9 this Friday. This is not because of any obvious issues with 6.1.7, which, operator errors aside, has been very stable, but because I made an error and misconfigured it. I’ve corrected this on the web server, which is the most sensitive to it, but I really need to fix it on all machines. And since 6.1.9 does have some minor fixes, we might as well get that in place.
I am most looking forward to the release of 6.2, because it has some fixes that largely recover the performance lost to the various security workarounds for the Intel Skylake chips, and two of our physical servers are based on this architecture.