Server Hang – Kernel Upgrades 1/12/2023

     The server that provides home directories, shared web services, and two virtual private servers hung with the CPU stall bug, even though we’re about 12 incarnations down the road from where it first appeared.  This appears to be a use-after-free error in the kernel, but KFENCE is not finding it.  The kernel developers have asked me to compile in KASAN, which is another memory debugging tool.  The system is now running that kernel, and if it hangs again it will hopefully provide some debugging information.

     Between yesterday, when I compiled the kernel, and today, 6.1.5 has come out, so I am going to try to get kernels ready for another upgrade tomorrow evening.  If I make it, we will be doing a kernel upgrade at 11pm.  This affects all Eskimo North services.

Kernel Upgrade Aborted

     Tonight’s kernel upgrade is aborted because the kernel will not build with the debugging options the developers wanted me to include.  I’ve sent the compiler errors back to them and will resume when I have a fix.

Kernel Upgrade Tonight Sunday 11pm

     I am going to be upgrading the kernels on only the physical servers tonight in order to turn on some additional debugging options to help the developers chase down an error in the NFS code that is causing issues for us.  Apparently this bug only occurs when you have a mix of NFSv3 and NFSv4 clients, as we do (we also have an NFSv2 client), so it is rarely triggered, but our environment triggers it.  It is a use-after-free error that for some reason KFENCE is not finding, so they have asked me to turn on KASAN, a different, somewhat higher-overhead memory debugging tool.  This requires a rebuild of the kernel and a reboot of the physical servers.  Because this only affects the NFS servers, I will be installing this on Iglulik, Igloo, and Mail, but not the other servers at this time.  This will affect vps6, vps9, all the shell servers, and mail.  The outage window will be 11pm-11:30pm, with individual outages lasting no more than about 10 minutes, with the exception of yacy.eskimo.com, which takes about half an hour to 45 minutes to rebuild its database after a reboot.

     This will affect all Eskimo North services EXCEPT for vps1-vps5, vps7, and vps8.

     It will impact our Fediverse instances including https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.
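
     For anyone wanting to check which of the two memory debuggers mentioned above a given kernel was built with, here is a minimal sketch.  It assumes you have the kernel's build configuration as text; CONFIG_KASAN and CONFIG_KFENCE are the real option names, while the function names and the sample text are made up for illustration.

```python
import re

# The two memory debugging options discussed above (real Kconfig symbols).
DEBUG_OPTIONS = ("CONFIG_KASAN", "CONFIG_KFENCE")


def parse_kernel_config(text):
    """Return {option: value} for every option line in a kernel config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Enabled or valued options look like: CONFIG_FOO=y
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        # Disabled options look like: # CONFIG_FOO is not set
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            options[m.group(1)] = "n"
    return options


def debug_status(text):
    """Report whether each debugger is built in (True) or not (False)."""
    opts = parse_kernel_config(text)
    return {name: opts.get(name, "n") == "y" for name in DEBUG_OPTIONS}


# Hypothetical sample: a kernel with KFENCE but without KASAN.
SAMPLE = """\
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=100
# CONFIG_KASAN is not set
"""

if __name__ == "__main__":
    for name, enabled in debug_status(SAMPLE).items():
        print(name, "enabled" if enabled else "not enabled")
```

     On many distributions the configuration of the running kernel can be found in a file under /boot named after the kernel release, but that location is an assumption and varies from system to system.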

Kernel Upgrades 1/6 11PM PST (GMT-0800)

     Planning to upgrade to a 6.1.2 kernel Friday 1/6 at 11pm Pacific Time.  The present kernel, 6.0.15, has a nasty bug where it locks hard: no kernel dump, no auto reboot, no magic SysRq key; only power cycling the affected machine restores service.  The inability to get a kernel dump makes this bug particularly difficult to troubleshoot.  Since this bug has persisted from 6.0.12, I’m going to try a 6.1 kernel and hope for better.

     This will result in outages of all services between 11pm and 11:30pm, lasting about 5-10 minutes each, EXCEPT for yacy, which takes close to 45 minutes to rebuild its database after every reboot.

     This will affect all of Eskimo North’s paid services such as mail, web hosting, virtual private servers, shell accounts, etc., as well as our free services including https://nextcloud.eskimo.com/, https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, and as I mentioned, https://yacy.eskimo.com/.
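
     The recovery mechanisms that failed during the hang described above (magic SysRq, auto reboot after a panic) are governed by kernel settings exposed under /proc/sys/kernel on Linux.  The sketch below only reads those settings so they can be checked; it is not a fix for the hang, and the function name and descriptions are made up for illustration.

```python
from pathlib import Path

# Real sysctl names under /proc/sys/kernel; the descriptions are short summaries.
SETTINGS = {
    "sysrq": "magic SysRq key (0 = disabled)",
    "panic": "seconds to wait before rebooting after a panic (0 = never reboot)",
    "softlockup_panic": "panic when a soft lockup is detected (1 = yes)",
    "hardlockup_panic": "panic when a hard lockup is detected (1 = yes)",
}


def read_settings(root="/proc/sys/kernel"):
    """Return {name: value} for each setting, or None if it is not available."""
    result = {}
    for name in SETTINGS:
        try:
            result[name] = (Path(root) / name).read_text().strip()
        except OSError:
            # The lockup settings only exist if the kernel has the lockup detector.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_settings().items():
        shown = "unavailable" if value is None else value
        print(f"{name} = {shown}  ({SETTINGS[name]})")
```

     The root argument exists only so the function can be pointed at a test directory instead of the live system.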