Mail

The main NIS server is back up, but the mail server is still not operating correctly.  It is complaining about “extra groups”, a message I’ve never seen before, and I’m not having any luck googling it.  I suspect some old auth info is being cached somewhere, but damned if I know where, so I’ve purged the whole thing and am re-installing from scratch.

 

Authentication Problems

I re-configured the NIS system last night to use a Linux-based NIS master.  It was working then, but the main NIS server crashed during the night, and although authentication should still work from the slave servers, it is having problems.  So I’m going down to the co-lo facility to restore the crashed server, and if the problem still persists I will troubleshoot further.  This is affecting mail and some shell servers.

Router Replacement

     The router replacement that was planned for Saturday is probably going to be delayed until Monday.

     In the meantime, the old SunOS machine is being retired and I am setting up a new NIS master on a Linux machine.  Linux and SunOS implement NIS differently: SunOS keeps the items to be distributed network-wide in a separate set of files, whereas Linux propagates everything above a certain UID and GID to the network.  Because of this I am going to have to change some GIDs; some in the 50s will become higher numbers so that they will be in the NIS system.  This will require finds and chgrps on big file systems, so there may be a period where your GID shows as ‘57’ instead of ‘rmtonly’.  The ‘rmtonly’ group will become ‘shell’, since access is no longer part of the plan, nor relevant to the platform GID.
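
     For the curious, the find-and-chgrp pass could be sketched roughly as below.  This is only an illustration, not the script actually being run: the function names are made up for the example, and the new GID value in the usage note is hypothetical, since the real new numbers aren't listed here.  It builds a plan first so the changes can be reviewed before anything is touched.

```python
import os


def plan_gid_changes(root, gid_map):
    """List (path, old_gid, new_gid) for entries under root whose GID is in gid_map."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    changes = []
    for path in candidates:
        gid = os.lstat(path).st_gid  # lstat: look at symlinks themselves, not their targets
        if gid in gid_map:
            changes.append((path, gid, gid_map[gid]))
    return changes


def apply_gid_changes(changes):
    """Apply a plan from plan_gid_changes; normally requires root."""
    for path, _old_gid, new_gid in changes:
        os.lchown(path, -1, new_gid)  # -1 leaves the owning UID untouched
```

     A call such as plan_gid_changes("/home", {57: 1057}) would list everything still owned by GID 57; both the path and the 1057 are placeholders.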

     I will potentially resurrect SunOS some time in the future when I can get an emulator working, but we will need to find some alternative method of authentication for it, as I am going to start allowing 16-character usernames, longer passwords, and new, stronger encryption, as well as yppasswd on the client machines so you can change your password yourself.

Network Work this Saturday

     This will impact ALL of eskimo.com’s services.

     If all goes well I am going to replace our router this Saturday evening.  It may take a bit to get it working, as the interface is quite different from the old one and I’ve got some concerns about configuring the network side of things.  After this happens, and possibly before, the shell server “eskimo.com” will become “sunos.eskimo.com”.  This is because of the new router’s lack of support for port forwarding and the need to separate shell services from web services on this IP.

     The main reason for this change is that traffic has grown to where our existing router is challenged and the occasional denial of service attack, which is just a reality of being connected to the Internet, is enough to overload the CPU and cause traffic to be dropped or significantly delayed.

     The existing router has two 500 MHz PowerPC cores; the new unit has four 1.6 GHz PowerPC cores, or roughly six times as much CPU.  It also has a 1TB hard drive, so we can put some more useful software on it than on the existing machine.
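
     The “roughly six times” figure is just cores multiplied by clock speed.  As a back-of-the-envelope sketch (the function name is made up for the example, and this ignores any per-clock differences between the two PowerPC generations, so treat it as a crude proxy only):

```python
def aggregate_mhz(cores, mhz_per_core):
    """Crude capacity proxy: core count times clock speed in MHz."""
    return cores * mhz_per_core


old_router = aggregate_mhz(2, 500)    # 1000 MHz in total
new_router = aggregate_mhz(4, 1600)   # 6400 MHz in total
ratio = new_router / old_router
print(ratio)  # 6.4
```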

     I will do my best to minimize the outage time but as I’ve stated, the new interface may take some experimentation.

Work Tonight and Ongoing

This will happen sometime this evening, probably after 10PM, but I can’t be exact because of other commitments with unknown time frames.

I’m going to be taking the new server, Inuvik, down for about an hour or so to install some adapters for the drives I replaced last week.

The new drives have repurposed pin 3 of the power connector, which used to provide +3.3 volts (modern drives need only +5 and +12), to tell the drive to power down.

It would have made more sense to do it the other way around, so that drives on old cables that only use that pin to provide +3.3 volts would be ON by default, but ya know, standards committees and all that.

So all these adapters do is sit between the power supply and the drive and leave pin 3 open, so the drives turn on and spin up like normal drives.

This will affect Debian, Manjaro, and some web services.

Downtime

     This is taking me longer than I had hoped because Linux duplicated an entry in the /etc/mdadm/mdadm.conf file (a bug that crops up once in a while), and when that happens the machine won’t boot up all the way.  It gets to where it’s time to assemble the RAID device, but since the raw devices are assigned to multiple RAID devices (even though they are the same, just duplicate entries) it simply stops.  Normally I would boot with a rescue disk and fix it, but the rescue disk does not have mdadm on it, so first I have to install that.  However, the rescue disk I had was too old and its repos no longer existed, so I had to come back and burn one with the current repositories.  AARRRGGGHHH!
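
     A duplicated entry like this is easy to spot with a small script before rebooting.  The sketch below is an illustration, not a tool I actually run: the function name is invented, it assumes each ARRAY entry sits on a single line (mdadm.conf also allows continuation lines, which this ignores), and it treats two ARRAY lines as duplicates when they share a UUID, falling back to the device name when no UUID is given.

```python
def find_duplicate_arrays(conf_text):
    """Return (first_line_no, duplicate_line_no, key) for repeated ARRAY entries."""
    seen = {}
    duplicates = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        words = line.split()
        if not words or words[0].startswith("#") or words[0].upper() != "ARRAY":
            continue
        # Prefer the UUID as the identity of the array.
        key = None
        for word in words[1:]:
            if word.upper().startswith("UUID="):
                key = word.split("=", 1)[1].lower()
                break
        if key is None and len(words) > 1:
            key = words[1]  # fall back to the device name, e.g. /dev/md0
        if key is None:
            continue
        if key in seen:
            duplicates.append((seen[key], lineno, key))
        else:
            seen[key] = lineno
    return duplicates
```

     Feeding it the contents of /etc/mdadm/mdadm.conf returns an empty list when every array is defined only once.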

Maintenance Work Tonight

     Just a reminder that I am going to bring a server down to replace two failed hard drives tonight.  This will affect https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://eskimo.com/ (but NOT https://www.eskimo.com/), and the shell servers Manjaro.eskimo.com and Debian.eskimo.com.  Estimated downtime is about an hour and probably will start between 10:30 and 11:30 just depending upon when I get there.

Web Restored

     Everything is back in service in terms of the web service here.  For some reason, when I saved all the MySQL databases prior to changing the InnoDB table options, the dump did not save the mysql database (the one with all the permissions and grant tables) properly.  So I had to resurrect a backup, dump the mysql database from it, and re-import it, but that worked and now everything is back.

Partial Web Outage

     Database is taking longer than I expected to recover.

     The issue was caused by the fact that innodb-file-per-table was not set to true.  This caused all the InnoDB tables to be stored in the system tablespace file (ibdata1).  The problem with this is that space is never recovered, so the file grows until the disk is full and then the database crashes.  For this reason, MariaDB has shipped with this set by default after version 10.2 (we are on 10.6).  However, Ubuntu, in their infinite wisdom, ships it with the distro with it NOT set.  I’ve been bitten by this before when I installed the last server, so I should have known to check, but it’s been a few years and memory cells are aging.
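
     A quick check of the config text would have caught this.  The following is a minimal sketch under some assumptions: the function name is invented, it does not follow !include directives or distinguish between option groups, and it ignores the skip- and loose- option prefixes.  It accepts the option spelled with either dashes or underscores, and treats a bare option with no value as enabled.

```python
def file_per_table_setting(cnf_text):
    """Return True or False if the option is set explicitly, None if it is absent."""
    result = None
    for line in cnf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith(("[", "!", ";")):
            continue  # section headers, include directives, ';' comments
        name, _, value = line.partition("=")
        name = name.strip().lower().replace("-", "_")
        if name == "innodb_file_per_table":
            # The last occurrence in the text wins.
            result = value.strip().lower() in ("", "1", "on", "true")
    return result
```

     A result of None means the file never mentions the option, so whatever default the server was built with applies.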

      To fix this I first needed to copy the entire /var/lib/mysql directory to a larger disk, since I can’t even start the database with the disk full.  With more than 600GB of data, this took a while.  Then I’ve got to dump all the databases; there is around 160GB of legitimate data to dump, so this also takes a while.  After that I can delete the ibdata1 file and log, copy all the remaining system files back to the original disk, and then, with this configuration option correctly set, restore the database from the dump.  There is no shortcut that I am aware of that avoids losing data.

Partial Web Outage

     We have a partial web outage now.

     This was caused by a misconfiguration of the MariaDB database on the new server that caused it to eat itself.  I am in the process of correcting that configuration issue and restoring the server.  Unfortunately, owing to the size of the database, this will take some time, perhaps an hour or so.