Last night I spent several hours fine-tuning our newest machine, Inuvik. I was not able to get any faster CPU speeds; in fact I am running at 4.8 GHz now, because some tests, notably long FFT tests, did show instabilities, even though these have not manifested in a week's operation.
But I was able to significantly improve memory speed, from 2133 MHz to 3400 MHz, and on an 18-core CPU every memory cycle you can get is precious.
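To put a rough number on that memory change, here is a minimal sketch of the arithmetic. It assumes the two figures are DDR transfer rates (MT/s) on a standard 64-bit (8-byte) memory channel, and the channel count of 4 is purely an assumption for illustration, not something measured on Inuvik. These are theoretical peaks, not benchmark results.

```python
def transfer_rate_gain(old_rate: float, new_rate: float) -> float:
    """Fractional increase in transfer rate, e.g. 0.5 means +50%."""
    return new_rate / old_rate - 1.0


def peak_bandwidth_gbs(rate_mts: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s.

    rate_mts  : transfer rate in millions of transfers per second
    channels  : number of populated memory channels (assumed, see below)
    bus_bytes : bytes moved per transfer per channel (8 for a 64-bit channel)
    """
    return rate_mts * 1e6 * bus_bytes * channels / 1e9


if __name__ == "__main__":
    CHANNELS = 4  # assumption for illustration only
    gain = transfer_rate_gain(2133, 3400)
    print(f"transfer rate gain: {gain:.1%}")
    for rate in (2133, 3400):
        print(f"{rate} MT/s x {CHANNELS} channels: "
              f"{peak_bandwidth_gbs(rate, CHANNELS):.1f} GB/s theoretical peak")
```

The rate increase works out to a bit over 59%, whatever the channel count is. Real-world gains depend on timings and on how memory-bound the workload is.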
I was also able to increase the mesh frequency slightly, from 2.6 GHz to 2.8 GHz. The mesh is similar to the rings in lower-core-count CPUs; it provides communication between the cores. I don’t know how much traffic this actually carries in a Linux environment, so I don’t know to what degree this helps, but every cycle one can get anywhere counts.
OK, after that I turned to kernel upgrades, and the last machine to upgrade was the one used for our router. It had a drive that had previously shown some SMART errors, but after running a diagnostic they went away and it behaved until last night. Last night it absolutely would not boot. Strangest damned thing: I could read the drive and write the drive, but could not boot off of it. I’ve never seen a drive failure cause this, so I assumed a software issue and re-installed grub, re-installed the kernels, and re-built the initramfs. These are pretty much all the software components you should need to be able to boot, but no go. It would find and load grub, but grub couldn’t find the kernel. Very, very odd. I was so convinced this had to be a software issue that I had to re-install 25 times to convince myself otherwise. I finally stole one of the drives out of the RAID array and turned it into a new system disk, and that worked. But it does not have everything it needs, and it’s not a healthy young pup itself.
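For anyone wanting to keep an eye on a drive that has thrown SMART errors before, here is a minimal sketch of a health check. It assumes smartmontools is installed and that `smartctl -H` prints its usual overall-health line; the function names and the device path are hypothetical, and this is not the diagnostic I actually ran.

```python
import subprocess
from typing import Optional


def parse_smart_health(output: str) -> Optional[bool]:
    """Return True if healthy, False if failing, None if no verdict was found.

    Looks for the overall-health line that `smartctl -H` prints for ATA
    drives, and the equivalent status line for SCSI/SAS drives.
    """
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("SMART overall-health self-assessment test result:"):
            return line.split(":", 1)[1].strip() == "PASSED"
        if line.startswith("SMART Health Status:"):
            return line.split(":", 1)[1].strip() == "OK"
    return None


def check_drive(device: str) -> Optional[bool]:
    """Run `smartctl -H` on a device (usually needs root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag problems
    )
    return parse_smart_health(result.stdout)


if __name__ == "__main__":
    # "/dev/sda" is a placeholder device path.
    print(check_drive("/dev/sda"))
```

A passing verdict is no guarantee: this drive could still read and write, so a check like this might well have reported it healthy right up until it refused to boot.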
So for now I’ve moved the routing and all the virtual machines off this box. At present the two services that are down are the NIS master, which means you can’t change your password or login shell at the moment, and a DNS server, but we have six so that isn’t going to seriously impair things.
I have a new drive which I had purchased when the first drive started puking out SMART errors; it is also a 7200 RPM drive, with 4x the cache the old drive had. At present I’m copying all the data off the failed drive to the drive I stole out of the RAID array to bring the machine back up. Then I’m going to replace the failed drive with the new one, recover any data I need for NIS and for the name server off the borrowed drive, then reformat it and return it to the RAID array.
I am working on a new video conferencing feature to add to our site shortly. I tried to get another suite working, but it depended upon a message protocol that we were not able to get working. This one uses infrastructure I am more familiar with, so my chances are somewhat better.
I am also looking at RustDesk as a possible replacement for Guacamole, because the developers have really turned Guacamole into an unmaintainable disaster. Guacamole really is oriented towards LDAP auth and our network isn’t, so it doesn’t work so well for us.