Maintenance Work 2/8-2/9/17

     Tonight, February 8th, leading into early Thursday morning, February 9th, there will be maintenance outages lasting approximately 15 minutes per service.  Not all services will be down at the same time.

     Maintenance work is necessary to address a problem in which one physical host, the one that hosts the mail client server, keeps invoking the OOM (out-of-memory) killer and killing the mail virtual machine.

     I have not yet been able to identify the culprit that is eating all of the memory.  The machine normally has around 16GB-18GB of free memory, but something will suddenly eat it all up and the OOM killer will then kill a process.  It usually picks the largest process, and more often than not that is the mail virtual machine.
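     One way to narrow down a culprit like this is to log the available memory and the largest processes at regular intervals, so there is a record from just before the OOM killer fires.  The following is only a minimal sketch of that idea in Python, not the tooling actually used on this machine.  The function names are made up for illustration, and it assumes a Linux /proc filesystem with the usual MemAvailable and VmRSS fields.

```python
import os

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def rss_kb(status_text):
    """Return VmRSS in kB from /proc/<pid>/status text, or 0 if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line

def top_processes(n=5, proc_root="/proc"):
    """Return the n largest processes by resident memory as (kB, pid, name)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        # The first line of the status file looks like "Name:\t<command>".
        name = text.split("\n", 1)[0].partition(":")[2].strip()
        found.append((rss_kb(text), int(entry), name))
    return sorted(found, reverse=True)[:n]

def report(proc_root="/proc"):
    """Print available memory and the largest processes (hypothetical helper)."""
    with open(os.path.join(proc_root, "meminfo")) as fh:
        mem = parse_meminfo(fh.read())
    print("MemAvailable: %.1f GiB" % (mem.get("MemAvailable", 0) / 1048576))
    for kb, pid, name in top_processes(proc_root=proc_root):
        print("%10d kB  pid %-7d %s" % (kb, pid, name))
```

     Calling report() every minute or so from a timer and appending the output to a log file would show which process was growing just before the free memory disappeared.  Note that memory consumed by the kernel itself would not appear in the per-process list, only as a drop in MemAvailable.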

     This machine is also the only machine with a 4.4 kernel, so it may be a kernel issue.  I did not upgrade this machine to 4.8 along with the others because 4.8 had issues with mandatory locks in its NFSv4 code.  I have since learned that this is only a problem on the client side, so it will not affect the server.

     I have implemented limits in /etc/security/limits.conf which should be adequate for normal operation and at the same time keep memory consumption below the level that would cause the machine to invoke the OOM killer.

     The systemd scripts are not reliable on machines with RAID disk partitions and sometimes hang instead of rebooting.  They also do not work properly with some CPUs but that is a kernel issue.

     So I will be going to the co-location facility so that I can be there live and in person in case anything goes wrong.  Under the best of circumstances it takes about 15 minutes to reboot one of these machines, because it saves the existing virtual machine states before rebooting, boots, and then restores those machines to their previous state.  These saves can involve writing some very large files.

     It is possible, if the limits I have set are too low, that more than one reboot will be required to adjust them.

     In addition, I will be rebooting the server that is the NFS server for home directories and the host for a few virtual machines, in order to load a kernel that addresses some security issues.