Mail, Web, New Server

Web and SSL mail service were briefly interrupted earlier this evening while I replaced the SSL certificate for *.eskimo.com, which was due to expire on July 6th.

Finally got the new server to boot consistently off of RAID.  The main issue was that the EFI system partition needed to be on a plain physical device, not on RAID or a logical device of any sort.  This makes sense, since the firmware has to read it before any RAID layer is running, and it is (or at least can be) shared between multiple operating systems.  Other than that, two BIOS releases from Asus were bad; on the third try I got one that mostly works.  I say mostly because it is still not without flaws: for example, when two boot devices share the same UUID, it shows only the first one.  Otherwise it is working, so I will be installing this at the co-lo shortly and then start moving applications to it.

Kernel Upgrade Tonight 11PM-Midnight

     I will be performing a kernel upgrade, requiring a reboot of all of our servers, starting at 11PM.  If all goes well, the reboots should be finished by 11:30PM, and by midnight the checks to make sure all services started properly, NIS bindings and NFS mounts completed, etc.

     This will be to kernel 6.1.33.  Barring a release that offers some substantial performance gain, I plan to stick with the 6.1.x long-term kernel series until at least the next long-term release.  To date this has been the best performing long-term kernel we have run.

     This one will be configured somewhat differently.  A kernel upgrade failed on our newest server before I could put it online because the nvram module did not load, so I am now compiling /dev/nvram support directly into the kernel rather than as a separate module, to avoid a future recurrence.
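     The kernel records this choice in its .config file: an option set to y is compiled in, an option set to m is built as a loadable module, and a "# CONFIG_... is not set" line means it is off.  The Kconfig symbol for /dev/nvram support is CONFIG_NVRAM.  As a minimal sketch (the helper name symbol_state is hypothetical, not part of the kernel tree), a pre-build check could confirm the option is built in:

```python
import re

# "CONFIG_FOO=value" lines set an option; y = built in, m = module.
_ASSIGN = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
# "# CONFIG_FOO is not set" lines mean the option is disabled.
_UNSET = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def symbol_state(config_text: str, symbol: str) -> str:
    """Return 'builtin', 'module', 'off', or 'other' for a Kconfig symbol."""
    for raw in config_text.splitlines():
        line = raw.strip()
        assign = _ASSIGN.match(line)
        if assign and assign.group(1) == symbol:
            value = assign.group(2)
            if value == "y":
                return "builtin"
            if value == "m":
                return "module"
            return "other"  # string or numeric options
        unset = _UNSET.match(line)
        if unset and unset.group(1) == symbol:
            return "off"
    # A symbol missing from .config is treated as disabled.
    return "off"


sample = "CONFIG_NVRAM=m\n"
print(symbol_state(sample, "CONFIG_NVRAM"))  # -> module
```

     Against a real tree this would be called as symbol_state(open(".config").read(), "CONFIG_NVRAM"), and a build script could stop unless it returns "builtin".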

     This will affect both our paid services, such as virtual private servers, web hosting, e-mail, and Linux shell accounts, and our free services, https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.

     Most services should be down for less than ten minutes, except yacy, which takes about 40 minutes to rebuild its in-memory database after a reboot.

New Server

I’ve got the new server physically assembled.  It POSTs, boots, and everything tests out well.  Now I need to load an OS so I can find the optimal settings for performance.

Someone suggested a good name for it a while back and I’ve misplaced it.  If that was you, please resend it.

Microsoft Issue Revealed an Issue with Our Mail Servers

     Microsoft’s Outlook mail service apparently added some servers, with hostnames under *.protection.outlook.com, without adding them to its SPF record.

     This revealed a problem with our SPF client.  It is supposed to flag such incoming mail rather than reject it, so that the mail goes to your spam box.  Unfortunately, even though it was configured not to, it is rejecting mail.
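     The intended flag-not-reject behavior can be sketched as a small policy function.  This is a hypothetical illustration, not our SPF client's actual configuration: the result names are the seven defined in RFC 7208, and the reject_on_fail switch stands in for whatever option the real software uses.  A sending server missing from a domain's SPF record typically evaluates to fail or softfail, depending on how the record ends; with the switch off, even a hard fail is only flagged so the spam filter can file the message.

```python
from enum import Enum

# The seven SPF check results defined in RFC 7208.
SPF_RESULTS = frozenset(
    {"pass", "fail", "softfail", "neutral", "none", "temperror", "permerror"}
)


class Action(Enum):
    ACCEPT = "accept"  # deliver normally
    FLAG = "flag"      # deliver, but mark it so the spam filter files it in "spam"
    DEFER = "defer"    # temporary SMTP failure; the sender will retry later
    REJECT = "reject"  # refuse the message during the SMTP transaction


def spf_action(result: str, reject_on_fail: bool = False) -> Action:
    """Map an SPF result to a delivery action (hypothetical policy).

    With reject_on_fail=False, even a hard "fail" is only flagged.
    """
    result = result.strip().lower()
    if result not in SPF_RESULTS:
        raise ValueError(f"unknown SPF result: {result!r}")
    if result == "fail":
        return Action.REJECT if reject_on_fail else Action.FLAG
    if result in ("softfail", "permerror"):
        return Action.FLAG
    if result == "temperror":
        return Action.DEFER
    return Action.ACCEPT  # pass, neutral, none
```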

     I have temporarily disabled SPF checking so that Outlook mail, which many large corporations use, can get through.

     I have contacted Microsoft’s technical contacts to make them aware that SPF records are missing for some of their new servers.

     While investigating why our SPF client is rejecting mail when it is not configured to do so, I found that the format of the configuration file has changed substantially since we installed it, and the configuration file was not converted when updates upgraded the software.

     The new software adds capabilities not present in the old version, hence the need for a new configuration.

     I am uninstalling all the SPF-related software and reinstalling it to correct these issues.  In the meantime we will still publish SPF records for our outgoing mail, but incoming mail is not being checked, so be very careful with any e-mail asking you for authentication; such messages are more likely than not forged.  Do not give banking information or login credentials to ANY e-mail that asks for them, and do not follow web links in e-mail.

     I will send additional notice when new software is operational.


Today’s May Outage

     Today’s outage, on May 30th between 1:45 PM and 4:10 PM Pacific Daylight Time, was caused by a circuit breaker failure at the co-location facility that houses our equipment.  Our equipment did not lose power, but the facility’s core routers did.

     This affected all of our paid services, including e-mail, web hosting, Linux shell accounts, and virtual private servers, and all of our free services: https://nextcloud.eskimo.com/, https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, and https://yacy.eskimo.com/.


Reboots Completed

     Reboots are complete, and all servers and services are up except manjaro, which is still being fixed after being reinstalled; some things are putting up a good fight.

Kernel Upgrades and Other Happenings

     Saturday evening starting at 11PM, we will be performing a kernel upgrade on all of our servers to version 6.1.30.  It includes significant fixes for bugs that haven’t bitten us yet but could.

     I expect reboots to be completed by 11:30PM, and any services that don’t restart properly, along with any NFS and NIS issues, to be resolved by midnight, provided everything works.

     I do not expect downtime for any individual service to exceed ten minutes, except for https://yacy.eskimo.com/, which will take 30-45 minutes to come back online because it keeps an index in memory that it must regenerate after each reboot.

     This will affect all of Eskimo’s paid services, including e-mail, Linux shells, web hosting, and virtual private servers, as well as free services such as https://friendica.eskimo.com, https://hubzilla.eskimo.com, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.

     Other positive news: I now have all the hardware for our new, bigger server, and I am beginning assembly tonight.  It will take some time to bring it into full operation, as the thermal budget is rather tight, and getting as much performance as possible out of the i9-10900X will take a lot of benchmarking and adjusting.  Because it will be housed in a co-location facility, I do not wish to use water cooling, yet the normal dissipation for this CPU is about 165 watts, and that can double with extreme overclocking.

     Because this CPU is likely to hit thermal limits before electrical ones, my plan is to start with stock settings and raise the clock until the chip hits thermal limits under heavy load, then reduce the voltage, looking for the point where thermal limits and electrical stability become limiting at approximately the same place, so that I get as much performance out of the chip as possible.
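     That plan amounts to a search for the point where the thermal and stability limits meet.  The sketch below is a hypothetical model, not a tool we run: measure_temp and is_stable stand in for real load tests and temperature readings, the step sizes are arbitrary, and the toy model in the example uses invented numbers rather than measurements of the i9-10900X.

```python
from typing import Callable, Tuple


def tune(
    max_temp_c: float,
    measure_temp: Callable[[float, float], float],
    is_stable: Callable[[float, float], bool],
    start_mhz: float,
    start_mv: float,
    step_mhz: float = 100.0,
    step_mv: float = 10.0,
    min_mv: float = 700.0,
) -> Tuple[float, float]:
    """Return the highest (clock MHz, core mV) found that is stable and under max_temp_c.

    measure_temp(mhz, mv) and is_stable(mhz, mv) are hypothetical stand-ins
    for real load tests; they are assumed to be deterministic.
    """
    mhz, mv = start_mhz, start_mv
    while True:
        # Phase 1: raise the clock at the current voltage until the next
        # step would run too hot or become unstable.
        while (is_stable(mhz + step_mhz, mv)
               and measure_temp(mhz + step_mhz, mv) <= max_temp_c):
            mhz += step_mhz
        # Phase 2: try the next clock step at lower voltage to shed heat,
        # stopping before the chip becomes electrically unstable.
        target, trial_mv = mhz + step_mhz, mv
        while (trial_mv - step_mv >= min_mv
               and is_stable(target, trial_mv - step_mv)
               and measure_temp(target, trial_mv) > max_temp_c):
            trial_mv -= step_mv
        if is_stable(target, trial_mv) and measure_temp(target, trial_mv) <= max_temp_c:
            mhz, mv = target, trial_mv  # lower voltage bought another clock step
        else:
            # Thermal and stability limits now bind at about the same point.
            return mhz, mv


if __name__ == "__main__":
    # Toy model with invented numbers, for illustration only.
    def toy_temp(mhz: float, mv: float) -> float:
        return 20.0 + mhz * mv / 80000.0

    def toy_stable(mhz: float, mv: float) -> bool:
        return mv >= 0.25 * mhz

    print(tune(90.0, toy_temp, toy_stable, 4000.0, 1200.0))  # -> (4700.0, 1190.0)
```

     In the toy run, stock voltage stalls at 4600 MHz on heat, a 10 mV undervolt buys one more step, and the next step fails on stability, which is the "both limits at once" stopping point the plan describes.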

     This is a very hot chip, but outside the Xeon line it is one of the very few chips in Intel’s lineup capable of addressing more than 128GB of RAM, and I don’t like Xeons because their memory controllers tend to be on the slow side, so you cannot get as much performance as the clock speed would indicate.  I don’t like AMD chips because they tend to suffer worse CPU rot, and their thermal protection generally consists of blowing holes in the die.  I’ve had some Intel chips arrive dead, but I’ve never had one fail in service.  My experience with AMD has been less pleasant, which is unfortunate, as they do tend to deliver more clock cycles per watt of heat than Intel; their thermal protection is just inadequate.

     This new machine will eventually replace Iglulik as the main web server and will also hold home directories; its large amount of RAM will let it cache more files and let yacy run more smoothly.  I plan to run the web server on bare metal to get as much performance as I possibly can.  Iglulik will then primarily host virtual private servers and some file systems like /misc.  With four memory channels and 48 PCIe lanes, this machine will have tremendous I/O capability, which should suit this application well.  The OS and web server software will sit on a couple of Western Digital NVMe SSDs in a RAID0 configuration, while user files, other non-speed-critical system files, and a swap partition will go on a couple of 14TB 7200 RPM spinning hard drives.  Though the write speed of these high-density drives isn’t great, with 256GB of RAM there will be plenty of memory to buffer writes, so they should not hurt overall system performance.

Spam Filtering Change

     The majority of spam filters here put spam in a folder named “spam” rather than rejecting it outright.

     However, there are two types of spam that I block manually when I discover them: viruses and phishing scams.  Viruses here means computer malware of various kinds, especially ransomware.  When I find that a server is infected, I block mail from that server until there is some indication the problem has been fixed.  The same goes for phishing scams, where people use social engineering to get your authentication information, here or elsewhere.

     There are a few really bad players in this area, and an outfit called SendGrid is the absolute worst.  I have had more than 30 of their servers blocked for ongoing malicious content; I’ve never gotten a response from them beyond a form letter, and I’ve never seen the abuse actually stop.  Unfortunately, they are also used by major corporations to contact their customers.  Therefore I try to be very selective about which servers I block and limit blocks to clearly infected servers, but occasionally I get overly broad.  These actions are also manual, which makes them less effective than they could be, because the scammer or spammer has often already sent to his entire list by the time I notice and take action.

     Yesterday I made a significant change in the way this is handled.  I am no longer blocking servers and address space manually.  Instead, I have created a fail2ban jail that recognizes many of these cases, along with others such as large numbers of messages sent to non-existent addresses and mail forged as being from eskimo.com that actually comes from external sources, and I am now using it to block these sites.
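     fail2ban works by matching log lines against a filter and banning a host once it produces too many matches within a time window (its maxretry and findtime settings).  A minimal, standalone Python model of that counting step is below; the class name and the default threshold and window are hypothetical placeholders, not the values used in our jail, and recognizing the log lines themselves is left to the filter.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class OffenseCounter:
    """Flag a host once it produces max_events matches within window_s seconds.

    Timestamps for a given host are assumed to arrive in non-decreasing order.
    """

    def __init__(self, max_events: int = 5, window_s: float = 600.0) -> None:
        self.max_events = max_events
        self.window_s = window_s
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, host: str, ts: float) -> bool:
        """Record one matching log event for host at time ts; True means ban it."""
        events = self._events[host]
        events.append(ts)
        # Drop events that have aged out of the sliding window.
        while events and ts - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.max_events:
            events.clear()  # start counting afresh once a ban is issued
            return True
        return False
```

     Counting per host, rather than per address range, is what keeps a single misbehaving server from taking its neighbors down with it.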

     After the first night with this in place, my spam box held about one third as much spam as it did before.  I believe this is because it acts much faster than I could manually.  An additional plus is that less legitimate mail will be blocked, for two reasons: blocking is ALWAYS done on a per-server basis, never on entire address blocks as I often did for some bad players, and these blocks are removed automatically after two days.  If the abuse is repeated from the same server, however, it will be blocked for longer.
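     The expire-after-two-days behavior, with longer bans for repeat offenders, can be modeled as an escalating ban timer.  In this sketch the two-day base comes from the text above, but the doubling factor, the 64-day cap, and the 30-day window for counting a ban as a repeat are assumptions for illustration; it is a standalone model, not fail2ban’s own code.

```python
from typing import Dict, Tuple

DAY = 86400.0  # seconds


class EscalatingBans:
    """Per-host bans that expire on their own and grow longer for repeat offenders.

    base_s (two days) follows the text; factor, max_s and repeat_window_s
    are illustrative assumptions.
    """

    def __init__(self, base_s: float = 2 * DAY, factor: float = 2.0,
                 max_s: float = 64 * DAY, repeat_window_s: float = 30 * DAY) -> None:
        self.base_s = base_s
        self.factor = factor
        self.max_s = max_s
        self.repeat_window_s = repeat_window_s
        self._history: Dict[str, Tuple[int, float]] = {}  # host -> (ban count, last ban time)
        self._expires: Dict[str, float] = {}  # host -> time the current ban lifts

    def ban(self, host: str, now: float) -> float:
        """Ban a single host (never an address block) and return the expiry time."""
        count = 1
        previous = self._history.get(host)
        if previous is not None and now - previous[1] <= self.repeat_window_s:
            count = previous[0] + 1  # repeat offense: escalate
        duration = min(self.base_s * self.factor ** (count - 1), self.max_s)
        self._history[host] = (count, now)
        self._expires[host] = now + duration
        return now + duration

    def is_banned(self, host: str, now: float) -> bool:
        """True while host's most recent ban has not yet expired."""
        expires = self._expires.get(host)
        return expires is not None and now < expires
```

     With these assumed settings, a first offense is banned for two days, a repeat within 30 days for four, the next for eight, and a server that stays quiet for more than 30 days after its last ban starts over at two days.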