Kernel Upgrades Friday Oct 7 11:00-11:30PM PDT (GMT-0700)

     We will be upgrading the remaining systems, including the physical hosts, to the 6.0.0 kernel release.  We presently have it installed on about a dozen servers and so far have seen only one expedited RCU CPU stall, which occurred during boot-up on the physical server with the most guest machines.  This can happen even with a good kernel because systemd’s parallelization puts large numbers of processes in the run state simultaneously, which in turn can keep the RCU process from getting CPU time within the 30ms allocated.  In a week we got only that one stall, whereas the previous 6.0-rc5 and rc6 kernels generated 20-30 in the first five minutes and were basically unusable.

     No single service should be out for more than about ten minutes EXCEPT for yacy, which rebuilds its indexes every time it restarts, a process that takes about half an hour.

     The 6.0 kernel re-wrote the scheduler in a way that benefits multi-core systems.  Although the changes were largely aimed at AMD’s Ryzen CPUs, we saw substantial improvements on our Intel-based servers as well, particularly in the area of scheduling latency.  It reduced the time it took our web server to load a WordPress page from about 280ms to 40ms, a sevenfold improvement and the largest improvement I’ve seen from any Linux kernel upgrade since kernel .98.
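
     For anyone who wants to make the same kind of before-and-after comparison on their own site, here is a minimal Python sketch.  It is not the tool used for the numbers above; the function names and the sample count are assumptions made up for illustration.  It times a caller-supplied fetch function several times, takes the median, and works out the improvement factor between two measurements.

```python
import statistics
import time


def median_latency_ms(fetch, samples=5):
    """Call fetch() `samples` times and return the median wall-clock time in ms.

    `fetch` is any zero-argument callable, for example a function that
    downloads one page (hypothetical; supply your own).
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def speedup(before_ms, after_ms):
    """Return how many times faster `after_ms` is than `before_ms`."""
    return before_ms / after_ms
```

     With the figures quoted above, speedup(280, 40) gives 7.0.  To time a real page, fetch could be something like lambda: urllib.request.urlopen(url).read() with a URL of your choosing.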

     This affects all Eskimo North services including shared web hosting, private virtual servers, shell servers, e-mail, and our free federated services https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.  If you wish to support our free services, please consider signing up for a Linux shell or web hosting account at https://www.eskimo.com/services/free-trial/.

Sad Sad

     Just found out that a longtime customer, Dave Bruels, who had Interlake China Tours, passed yesterday.  I’ve spent many lunches with him and enjoyed his photos from China.  He was a great man and I will miss him.  His wife had passed not long before and whatever the medical reason, I’m sure he passed of a broken heart.  She was a wonderful woman who fought cancer with a courage you rarely see in a human being, hiking and doing the things she loved right up to the end.

Zorin Borked

     Zorin will be partially down and won’t be fully operational again for several days.  Because Zorin 15 is based upon Ubuntu 16.04, there is NO support for bpfilters, and the current Linux kernel has deprecated the old iptables filtering scheme, so fail2ban and other firewall features do not work.

     Even though Zorin is a complete rip-off of Ubuntu, running do-release-upgrade only updates parts of the system, and you end up with a mix of Ubuntu 16.04 and 20.04 that does not play well together.

     Thus the ONLY way to upgrade Zorin to Zorin 16, which is STILL based on a 2-1/2 year old Ubuntu, 20.04, is a fresh install.  That means re-installing all the applications and re-configuring everything, which is several days’ work.  I paid for the “Pro” version of the previous release under the promise by Zorin developers that they were working on an in-place upgrade, but a year and a half later it is still vaporware, so I am only installing Zorin Core this time around.

6.0 Kernel Working Well

     The 6.0 kernel is working well.  So far there has been only one expedited RCU CPU stall, and that was on a virtual host during boot-up on a machine that has quite a few guests.  These can occur at boot-up simply because ALL of the guests are busy starting up, which briefly exceeds the available CPU cycles.  If it continues to run this clean until this Friday, then we’ll upgrade the physical servers to this kernel on that date at 11PM.

Linux Official 6.0 Release

     At this point I have Linux kernel 6.0 installed on all but private virtual machines and the physical hosts.  So far it’s running MUCH better than rc5 or rc6, but not perfectly.  I’ve gotten ONE expedited RCU CPU stall in a day’s time on a dozen or so machines.  With 6.0-rc5 and rc6 I would see dozens in the same time frame.
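
     If you want to count these stalls on your own machines, a minimal Python sketch follows.  It assumes the kernel log line contains the phrase "detected expedited stalls" (as in "rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks"); the exact wording can vary between kernel versions, and the function and host names are made up for illustration.

```python
import re

# Assumption: expedited RCU stall reports contain this phrase somewhere
# after an "rcu" prefix.  The pattern is kept loose because the exact
# message text differs between kernel versions.
STALL_RE = re.compile(r"rcu.*detected expedited stalls", re.IGNORECASE)


def count_expedited_stalls(log_lines):
    """Return how many of the given log lines report an expedited RCU stall."""
    return sum(1 for line in log_lines if STALL_RE.search(line))


def stalls_per_host(logs_by_host):
    """Map each host name (hypothetical) to its expedited stall count."""
    return {host: count_expedited_stalls(lines)
            for host, lines in logs_by_host.items()}
```

     The log lines could come from saving the output of journalctl -k or dmesg on each machine and passing the lines of each file to count_expedited_stalls.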

     The scheduler has been substantially re-worked in the 6.0 kernel.  Most of these changes were aimed at AMD CPUs, but they made for substantial performance improvements on Intel as well.  One stall in one day across a dozen machines means one process hung for up to two minutes in that time frame, not a HUGE concern but enough that I don’t want to put it on the physical hosts yet.  The scheduling changes cut the latency for the web server approximately sixfold; that ain’t chicken feed!

     I’ve posted the most recent expedited RCU CPU stall to the bug report I had filed with bugzilla.kernel.org.  For anyone interested, it is bug #216501 and you can read the details here: https://bugzilla.kernel.org/show_bug.cgi?id=216501.

Kernel Progress?

     I received a note from bugzilla.kernel.org that another user found 6.0.0-rc7 no longer had the expedited RCU CPU stall issue, but before I could grab and try it, the official release of 6.0 came out, so I’m going to build it and give it a try on a few select servers.  If it runs clean during the week I’ll propagate it to the other machines this coming weekend.

Service Affecting Kernel Upgrades Completed

     Kernel upgrades are completed on the physical hosts and critical machines.  They are not done on vps1-vps7, but those machines are all single-core, and single-core machines are not experiencing the CPU stall bugs in the 5.16-6.0 kernels.

     Some of the other shell servers are also not updated yet, but these are sufficiently non-busy that I can easily hit them when nobody is logged in, and for many I still have to build kernels before I can update.  5.15.71 booted cleanly on all the machines that it is presently installed on.

Kernel Upgrades 11pm Oct 2nd PDT (GMT-0700)

     Things are running smoothly but load is high with the stock 5.15.0 kernel from Ubuntu.  I’ve configured the latest 5.15 kernel (5.15.71) with no forced preemption, a 100HZ clock, and fully tickless operation, installed it on a number of machines, and it is also running well.  This configuration reduces overhead somewhat, so I am going to install it on the physical servers tonight at 11PM, which will require rebooting everything.  This will result in some downtime between 11PM-11:30PM, not more than about 10 minutes for any given service.
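
     For those building their own kernels, the Python sketch below checks a kernel config file for settings of this kind.  The mapping from the description above to the Kconfig symbols CONFIG_PREEMPT_NONE, CONFIG_HZ_100, CONFIG_HZ, and CONFIG_NO_HZ_FULL is my assumption about what those choices correspond to, not a copy of the config actually used here, and the function names are made up for illustration.

```python
# Assumed mapping from "no forced preemption, 100HZ clock, fully tickless"
# to Kconfig symbols; adjust to match your own build.
WANTED = {
    "CONFIG_PREEMPT_NONE": "y",  # no forced preemption
    "CONFIG_HZ_100": "y",        # 100 Hz timer tick selected
    "CONFIG_HZ": "100",          # resulting tick rate
    "CONFIG_NO_HZ_FULL": "y",    # full dynticks ("fully tickless")
}


def parse_kconfig(text):
    """Parse 'CONFIG_FOO=value' lines into a dict.

    Comment lines such as '# CONFIG_FOO is not set' are skipped, so an
    unset option is simply absent from the result.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value.strip('"')
    return options


def mismatches(text, wanted=WANTED):
    """Return {symbol: (expected, found_or_None)} for each option that differs."""
    found = parse_kconfig(text)
    return {name: (value, found.get(name))
            for name, value in wanted.items()
            if found.get(name) != value}
```

     On Ubuntu the config for the running kernel is normally found under /boot as config- followed by the output of uname -r; pass the contents of that file to mismatches, and an empty result means every option matched.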

     I will be installing this kernel on most of the other machines during the week but I need to get it on the physical hosts tonight.

     This will affect all Eskimo North services, including private virtual servers, shell servers, shared web hosting packages such as virtual domains, personal and business web hosting packages, and e-mail.

     It will also affect our fediverse services https://friendica.eskimo.com/ (a fediverse social media site), https://hubzilla.eskimo.com/ (another fediverse social media site), https://nextcloud.eskimo.com/ (a federated cloud service), and https://yacy.eskimo.com/ (a federated non-censored search engine).  

5.19.12 Broken

     Even though 5.19.11 ran okay on four busy servers for a week, 5.19.12 is NOT running well, and 5.19.11 is no longer available so I’m going to go back to a stock 5.15 Ubuntu kernel on most of the machines.  Tonight there will be additional reboots between 11pm-11:30pm to this end.

     It seems to me that some fundamental change was made to the Linux kernel in 5.16 and forward that greatly improved scheduling and context switching efficiency but introduced serious stability problems that are not yet addressed.  So at this point I’m going back to a stock 5.15 kernel and when the official release of 6.0 comes out we’ll experiment with that.  I may also experiment with some custom configurations of 5.15 to see if we can’t improve the scheduling and context switching efficiency of that kernel somewhat.

New Kernel Failed

     The new kernel is not running well on our main NFS server, so another reboot of iglulik, which is the server that provides the /home directories, was required to load a different kernel.  Iglulik is now running the stock Ubuntu 22.04 5.15.0 kernel.  Less than ideal, but 5.19.12 did not run well on it.  We had CPU stalls, but they weren’t the usual expedited RCU CPU stalls; they were a more generic 2 minute CPU stall that periodically broke NFS.

     Also, I had to switch iptables from legacy to nftables (which uses bpf) because the legacy iptables is no longer supported.  I had to do this for vps4, vps5, and vps7.
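
     To check which backend a machine is using, here is a minimal Python sketch.  It assumes an iptables 1.8 or later build, where iptables --version prints the backend in parentheses, for example "iptables v1.8.7 (nf_tables)" or "iptables v1.8.7 (legacy)"; the function name is made up for illustration.

```python
import re


def iptables_backend(version_output):
    """Return 'nf_tables', 'legacy', or None from `iptables --version` text.

    Assumes the iptables 1.8+ output format, where the backend name is
    printed in parentheses after the version number.
    """
    match = re.search(r"\((nf_tables|legacy)\)", version_output)
    return match.group(1) if match else None
```

     The text to pass in can be captured with subprocess.run(["iptables", "--version"], capture_output=True, text=True).stdout.  On Debian and Ubuntu systems the backend is usually selected through update-alternatives (for example update-alternatives --set iptables /usr/sbin/iptables-nft), though whether that matches the procedure used on these servers is an assumption.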