During this outage, I learned that four resources are particularly important to have available on more than one machine:
- DNS – Without it, mail will be returned with a "no such address" error.
- SSL Certificates – Without these, no encrypted services (mail, web, databases) can be started.
- Mail spool – Without it, there are no mail services.
- Home directories – Without them, there are no mail folders other than INBOX, no customer websites, and no shell services.
All of these are single points of failure that bring important services down entirely. Here are some general priorities I try to achieve.
- No loss of customer data, e-mail or home directory contents. I have been very successful at this objective.
- No loss of incoming mail. I am not at 100% on this, mostly because network outages have made DNS unavailable.
- The uninterrupted ability to send and receive mail. I have had more severe issues in this area.
- Uninterrupted availability of your websites.
- The availability of some shell servers. I do not strive to have all of them up all of the time, because there is enough overlap that services available on any one server are also available on others. Because these machines give direct access to the OS, security is a more difficult challenge than on other services, so I do require more time to address issues on them.
- Ancillary services such as Nextcloud, Friendica, Hubzilla, Mastodon, and Yacy.
Here are the plans I have at this point to address these issues. They are not fully developed, which is part of the reason I am sharing them now: to get your input during the process.
My workstation at home is an 8-core i7-9700k with 128GB of RAM and about 20TB of disk. It is on 24×7, but occasionally I boot Windows to play games, so it is not available 100% of the time; obviously I am not going to be playing games during an outage. Nothing I do really requires 128GB on this machine; I just happened to have an opportunity to acquire 128GB of RAM for about $36, so I did. Since I’m also running Linux on it, and I have five static IPs with Comcast, my idea for never disappearing from the net or losing incoming mail is to set up a virtual machine on my box here and install bind and postfix on it. It will act as a name server outside of the co-lo facility, so if our network connection goes down we will still have a working name server. Postfix will be set up as a store-and-forward server: it will be a lower-priority MX server, so if the first two are unreachable, mail will come to it and be stored, and when the primary servers come back online it will forward the mail to them. This would address the second issue, not ever losing incoming mail.
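To illustrate why a lower-priority MX catches mail only when the primaries are unreachable, here is a minimal Python sketch of how a sending mail server typically walks the MX list in preference order. The host names and preference values are hypothetical placeholders, not our real records, and the reachability check is a stand-in for an actual SMTP connection attempt.

```python
# Hypothetical MX records as (preference, host); lower preference is tried first.
MX_RECORDS = [
    (10, "mx1.example.net"),        # primary at the co-lo (assumed name)
    (20, "mx2.example.net"),        # second primary at the co-lo (assumed name)
    (50, "backup-mx.example.net"),  # store-and-forward VM at home (assumed name)
]


def pick_mx(records, is_reachable):
    """Return the first reachable host in ascending preference order, or None."""
    for _preference, host in sorted(records):
        if is_reachable(host):
            return host
    # Nothing reachable: a real sender keeps the message queued and retries later.
    return None


def everything_up(host):
    return True


def colo_down(host):
    # Simulates a co-lo network outage: only the home VM answers.
    return host.startswith("backup-mx")


if __name__ == "__main__":
    print(pick_mx(MX_RECORDS, everything_up))  # normal operation
    print(pick_mx(MX_RECORDS, colo_down))      # during a co-lo outage
```

In normal operation the backup never sees mail, because a primary always answers first; it only receives mail when both primaries fail to respond.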
The third issue is more difficult to address because the mail spool is a single point of failure. I could use rsync to maintain a near-real-time duplicate. The problem comes if we switch to the duplicate during an outage of the primary server. When the primary comes back, we would pause incoming mail (letting it queue on the store-and-forward server) while we rsync any changed mail spools back to the original spool directory. Any mail that arrived on the primary between the last rsync to the secondary spool and the outage would then be lost. I have to do some experimentation to determine how often rsync can reasonably be run and how small that time span can be made.
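The size of that exposure window can be reasoned about with a simple model. The Python sketch below is only an illustration under simplifying assumptions: rsync is started at a fixed interval, each run takes a fixed time, and a message counts as replicated only once a run that started after its arrival has finished. The function names and numbers are made up for the example.

```python
def worst_case_window(interval_s, sync_duration_s):
    """Upper bound, in seconds, on how long a delivered message can exist
    only on the primary spool: it just misses one run, then waits a full
    interval for the next run to start plus the time that run takes."""
    return interval_s + sync_duration_s


def mail_at_risk(arrival_times, last_sync_time, failure_time):
    """Arrival times of messages delivered after the last completed sync
    but before the primary failed; these exist only on the primary."""
    return [t for t in arrival_times if last_sync_time < t <= failure_time]


if __name__ == "__main__":
    # Example: rsync every 5 minutes, each run taking about 20 seconds.
    print(worst_case_window(300, 20))
    # Messages arriving at t=100, 290, 310, 400; last sync at 300, failure at 350.
    print(mail_at_risk([100, 290, 310, 400], 300, 350))
```

Shortening the interval shrinks the window, but it can never be smaller than the time a single rsync run takes, which is what the experimentation needs to measure.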
I can do something similar with home directories. This is less problematic than the mail spool: the mail spool contains all INBOX mail for a given user in one file, whereas most home directory files are not subject to such rapid change. Only those people who use procmail to sort mail into folders will risk any loss in this case, and we can rsync any files with a more recent modification time when primary storage comes back online. If we can duplicate home directories, then duplicating the web server is pretty trivial; in fact, when we get the big machine stable we will have two web servers operational under normal circumstances.
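As a rough sketch of the "copy back only what is newer" step, the Python helper below builds an rsync command using `-a` (archive mode) and `--update` (skip files that are newer on the receiving side), with `--dry-run` on by default so nothing is changed until the output has been reviewed. The paths are hypothetical placeholders, and this is one possible way to do it, not a finished procedure.

```python
import shlex


def rsync_back_cmd(src, dst, dry_run=True):
    """Build an rsync command that copies src into dst, skipping any file
    that is already newer on the destination (--update)."""
    cmd = ["rsync", "-a", "--update"]
    if dry_run:
        cmd.append("--dry-run")  # report what would be copied, change nothing
    # Trailing slashes make rsync copy directory contents, not the directory itself.
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths: duplicate home directories back to primary storage.
    cmd = rsync_back_cmd("/backup/home", "primary.example.net:/home")
    print(shlex.join(cmd))
```

The command is only printed here; running it would be a separate, deliberate step once the dry-run output looks right.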
So, while this is not totally thought out, I’m letting you know how I plan to address these issues, and I am open to input. In particular: if there is some risk of losing mail that arrived between the last rsync and the primary system going down, is that risk worth the ability to access mail during an outage of the primary server?
Now, in the more immediate future: the motherboard arrived for the ice. I don’t know for sure whether the problem is the motherboard or the power supply. I replaced the supply with one I had on hand and still had the same problem, but I’m not 100% sure that supply isn’t also dodgy, as it is from the same vendor and I do not remember its history. At any rate, I’m going to try to replace the motherboard tonight, and if the machine works, I will return it to the co-lo facility Friday evening. I will then take down Inuvik, which has friendica, hubzilla, mastodon, yacy, and roundcube on it, and take it back home to replace the power supply, probably returning it Saturday depending upon time frame. Power on that machine is kind of a nightmare but I should be able to replace it in one night.
Lastly, I am preparing kernel 6.11.1 for installation; 6.11 fixes a couple of issues. 6.10.x had an issue with some of our CPUs when changing clock speeds in response to load. It detects an error when writing an MSR (model-specific register), a register in the CPU that controls, among other things, the clock multiplier. The write actually succeeds, so clock speeds do change appropriately, but the kernel doesn’t know it succeeded and so generates kernel splats. This is fixed in 6.11.x. I will apply this when I am at the co-lo, so there will be a brief (around 2-3 minute) interruption in every service.