Client Mail Server Under Attack

     The client mail server is still under attack.  A botnet is trying to guess users’ logins and passwords by brute force through Postfix authentication.  So far a little over 3,000 IP addresses are blocked, but fail2ban can’t work fast enough to catch them all in real time.
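
     As a rough illustration of the kind of counting fail2ban does, here is a minimal sketch that tallies failed logins per IP address from mail log lines.  The log line shape, the threshold, and the sample addresses are assumptions for illustration, not taken from this server’s configuration.

```python
import re
from collections import Counter

# Assumed log line shape (typical for Postfix SASL failures; check your own logs):
#   ... postfix/smtpd[101]: warning: unknown[203.0.113.7]: SASL LOGIN authentication failed: ...
FAILED_AUTH = re.compile(
    r"\[(\d{1,3}(?:\.\d{1,3}){3})\]: SASL \S+ authentication failed"
)


def offenders(log_lines, max_failures=3):
    """Return the sorted IPs with more than max_failures failed logins."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_AUTH.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n > max_failures)


# Hypothetical sample data using documentation-only address ranges.
sample = (
    ["Jul  1 09:30:01 mail postfix/smtpd[101]: warning: "
     "unknown[203.0.113.7]: SASL LOGIN authentication failed: UGFzc3dvcmQ6"] * 4
    + ["Jul  1 09:30:02 mail postfix/smtpd[102]: warning: "
       "unknown[198.51.100.2]: SASL PLAIN authentication failed:"]
)
print(offenders(sample))
```

     With thousands of distinct addresses each making only a few attempts, a per-address threshold like this trips slowly, which is consistent with fail2ban struggling to keep up.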

Reboots and DDoS

     Iglulik spontaneously rebooted again this morning.  I still have no idea what is causing this.

     Mail came under a DDoS attack at around 9:30 this morning and it was still ongoing at 10 AM, with fail2ban essentially saturated, locking out attacking IPs as fast as it can.  Load is heavy, but the server is still functional, just a bit slower than normal.

NFS on Linux

     I’ve had problems with NFS on Linux basically forever, and it became significantly better AND worse under NFSv4.  It got better in that locking mostly works under version 4.2, so things like alpine work correctly where the mail spool is NFS mounted.  It got worse in that sometimes mounts fail to mount, especially if a server goes away and returns.

     I noticed recently that this is a much larger problem with servers that have a lot of entries in /etc/exports.  This led me to investigate whether there might be a limit on the file size.  I did not find a limit, but I did find that the sanctioned method of exporting the same file system to multiple hosts is to put all the hosts on one line, separated by spaces, with each host followed immediately by its options.  I had been putting each host on a separate line.

     So I changed the format of /etc/exports from the old form, which is how SunOS expects it to be formatted, to the form officially sanctioned by Linux, and I will see if that helps.
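
     The conversion described above can be sketched as a small script that collapses repeated export paths into one line per path.  This is a minimal sketch, not the procedure actually used here: the host names and options in the sample are hypothetical, and it assumes simple entries with no quoted paths and no backslash line continuations.  Comments are dropped.

```python
def merge_exports(text):
    """Collapse repeated export paths into one line per path.

    Assumes each non-comment line is "<path> <client(options)> ...".
    """
    merged = {}  # dicts keep insertion order, so path order is preserved
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        path, *clients = line.split()
        merged.setdefault(path, []).extend(clients)
    return "\n".join(
        f"{path} {' '.join(clients)}" for path, clients in merged.items()
    )


# Hypothetical one-host-per-line input.
old_style = """\
/home hosta.example.com(rw,sync)
/home hostb.example.com(rw,sync)
/var/mail hosta.example.com(rw,sync)
"""
print(merge_exports(old_style))
```

     Note that in this file a host and its options must not be separated by a space, so splitting each line on whitespace keeps every host together with its own option list.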

2:30 Crash

     Around 2:30pm everything spontaneously rebooted.  Just prior to that our mail server had been under a DDoS attack by a botnet, and fail2ban had been working hard to lock out attacking IPs.  Still, the load on the machine was 800, and while I was trying to chase that down everything crashed.

     When things came back up, NFS did not work right, and mail had quota on the /root partition AND had mounted it read-only.

     I’ve got the physical hosts, web server, mail infrastructure, and Ubuntu shell server up and running.  I am now checking the rest of the servers for proper NIS/NFS connectivity.

Iglulik Status

     Sorry the downtime was longer than anticipated.  I ran many stress tests and was not able to get the machine to error, freeze, overheat, or otherwise act up.

     First, I upgraded the BIOS because experience has taught me that newer Asus BIOS software is generally more stable than old Asus software.  On my workstation I had to set the CPU core voltage to 1.39 V to be stable at 5 GHz with the older BIOS; the newest allowed me to reduce that to 0.95 V, which is easier on the CPU.

     The BIOS has a function that saves the configuration to a thumb drive, and generally you want to do this before an update because the update erases all the settings.  It saved just fine, but the new BIOS would not read the file created by the old one, so I had to reconfigure everything by hand.

     I did find some less than optimal settings.  For example, I had decode above 4G disabled.  The problem with this is that it forced Linux to use bounce buffers rather than letting the hardware DMA directly to or from the location where the data is required, which made I/O less efficient.  I fixed that.

     I also increased the CPU core voltage from 1.29 V to 1.35 V, which makes the CPU run slightly hotter, and decreased the clock from 4.3 GHz to 4.2 GHz, so that if it was on the edge of stability it should now have more margin.  However, I ran many stress tests and was unable to get it to fail before I made the changes.
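
     To put a rough number on “slightly hotter”, here is a back-of-the-envelope sketch using the common first-order rule that dynamic CPU power scales with voltage squared times clock frequency.  This is an approximation I am assuming for illustration, not a measurement from this machine.  It ignores leakage power, which also rises with voltage and temperature.

```python
def relative_dynamic_power(v_old, f_old, v_new, f_new):
    """Ratio of new to old dynamic power, assuming P is proportional to V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Values from the paragraph above: 1.29 V at 4.3 GHz, then 1.35 V at 4.2 GHz.
ratio = relative_dynamic_power(1.29, 4.3, 1.35, 4.2)
print(f"{(ratio - 1) * 100:.1f}% change in estimated dynamic power")
```

     Under that assumption the change works out to roughly 7 percent more dynamic power, since the voltage increase outweighs the small clock reduction.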

     I ran additional stress tests after completing these changes to make sure temperatures were still within an acceptable range and they were well within safe limits even with the higher CPU core voltage.  So at this point I am just going to watch it and see if it is still a problem.

Iglulik Web Outage

     I reverted Iglulik to an older, known-stable kernel and it still spontaneously rebooted last night (and did not start the web server upon recovery), so now I know there is a hardware problem.

     Tonight, shortly after midnight, I will be taking this machine down for a while to run some diagnostics to try to identify the hardware problem, since nothing is showing up in the logs.  It is most likely a CPU or memory error, because most other things would have been logged.

     It’s been a while since I last did a BIOS update on this machine, and the last Asus BIOS update I did on my workstation in April greatly improved stability, so I will check for a BIOS update while I’m at it.

     Because this server has the /home directories, ALL shell servers and the web server will be out of service and pop/imap will ONLY be able to access your INBOX and no others during this maintenance.

Iglulik Instability

     I believe I have located the major source of instability, but unfortunately the fix comes at a cost to performance.

     I have an Nvidia 210 video card in this machine for the console.  It’s a very low end card but adequate for that purpose.  However, in 2019 Nvidia discontinued driver support, so I had to switch to the Linux nouveau driver, which was not a big deal given the relatively low performance of the card.

     Recent Linux kernels have a bug in the driver for this card that results in the card DMA’ing into memory that has not been allocated to it, and when that memory happens to be in use by something else, the machine crashes.

     As it happens, Nvidia has again decided to support that card.  However, the drivers, now at version 340.108, are not compatible with newer kernels, so I was forced to go back to 5.4.0, which is considerably less efficient than 5.7.

Iglulik Still Unstable

     The 5.7.7 kernel was still unstable, and so was 5.8rc3, but at least the latter logged some information showing that some memory allocations failed with the contiguous memory allocator, a new feature recently introduced into the Linux kernel.

     I am building a new kernel with that disabled.  It really isn’t required, since there are no huge streaming I/O devices like video that might need it, and most everything can DMA through the MMU on this particular machine (which can map disparate memory regions into contiguous memory).  If the machine does not spontaneously boot into the new kernel first, I will reboot it this evening.
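
     One way to confirm that a build really has the allocator turned off is to inspect the kernel’s config file.  The sketch below assumes the relevant options are CONFIG_CMA and CONFIG_DMA_CMA, the usual Kconfig names for the contiguous memory allocator.  The source does not say which options were actually changed, and the sample config text is hypothetical.

```python
def enabled_options(config_text, names):
    """Map each option name to True if the config sets it to y or m."""
    state = {name: False for name in names}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key in state:
            state[key] = value in ("y", "m")
    return state


# Hypothetical excerpt; on a real system, read the text from the build's
# .config file instead.
sample_config = """\
CONFIG_MMU=y
# CONFIG_CMA is not set
CONFIG_SWIOTLB=y
"""
print(enabled_options(sample_config, ["CONFIG_CMA", "CONFIG_DMA_CMA"]))
```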

     There is also the possibility of hardware errors but so far it has not logged any.

Iglulik Spontaneous Boot

     Iglulik spontaneously rebooted again tonight.  This time, on 5.7.7, it made it four days between spontaneous boots, but I also discovered what triggered it, so I’ve got a bug report filed with bugzilla.kernel.org and I’m going to give 5.8rc4 a try if it proves semi-stable on my workstation.  I normally avoid pre-release kernels, but 5.7 has been buggy and so far 5.8rc3 has been totally stable on my workstation.

 

Iglulik Reboot Tonight

     One of our servers has been unstable on 5.7.6 and rebooted spontaneously twice in the last few days.  Oddly, only this server seems to be impacted but it is a newer CPU than the others so it may be a kernel problem specific to this CPU.

     I am going to reboot into 5.7.7 tonight IF it hasn’t spontaneously booted into it on its own between now and then.  This will happen just after midnight.
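
     The “only if it hasn’t already booted into it” check can be sketched in a few lines.  This is a minimal illustration, assuming the release string reported by the running kernel begins with the version number, optionally followed by a dash and a local suffix.  The function name is hypothetical.

```python
import platform


def needs_reboot(running_release, target_version):
    """True if the running kernel's base version differs from the target."""
    base = running_release.split("-", 1)[0]  # drop any "-suffix" part
    return base != target_version


# platform.release() reports the running kernel release on Linux.
print(needs_reboot(platform.release(), "5.7.7"))
```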

     This machine services the web, /home directories, and several shell servers.  Because basically everything relies on /home, everything will be briefly interrupted shortly after midnight except virtual private servers which will not be affected.

     If you are not on Mint, Debian, or Ubuntu, you should just see things lock up briefly.  If you are on one of these servers, you will be disconnected and will need to re-establish your connection after the boot completes.