Inuvik Too Hot

    Inuvik is running too hot.  This machine ran the small-FFT torture test at 4.8 GHz with 36 threads (2 threads per core) before I brought it over to the co-lo, but it is now exceeding 96°C, though only on a couple of cores.

     When a couple of cores on a multi-core CPU run hot while the rest are normal, this usually indicates an air bubble between the CPU and the cooler, so part of the heat spreader is not being cooled.  This is more pronounced with the i9-109x0 series of CPUs because the die is soldered to the heat spreader.  On most microprocessors there is thermal compound between the die and the heat spreader, which diffuses the heat somewhat.  That diffusion does not occur when the die is soldered to the heat spreader, so any air bubbles are more critical.
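     The "a couple of cores hot, the rest normal" pattern can be checked mechanically.  The following is a minimal sketch, not the tooling used on Inuvik: the function name, the 15-degree margin, and the sample readings are all hypothetical, chosen only to illustrate comparing each core against the median.

```python
from statistics import median


def hot_cores(temps_c, margin_c=15.0):
    """Return the ids of cores running more than margin_c above the median.

    temps_c maps a core id to its temperature in degrees C.
    margin_c is an arbitrary illustrative threshold, not a vendor figure.
    """
    mid = median(temps_c.values())
    return sorted(core for core, temp in temps_c.items() if temp - mid > margin_c)


# Hypothetical readings, not measurements taken from Inuvik.
sample = {0: 71.0, 1: 73.0, 2: 97.0, 3: 72.0, 4: 96.0, 5: 70.0}
print(hot_cores(sample))  # [2, 4]
```

     If every core were hot, the median would rise with them and nothing would be flagged, which is the point: uniform heat suggests load or airflow, while a few outliers suggest poor contact.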

     I’ve ordered some more Kryonaut Extreme, which should get here between October 1st and 3rd, at which time we will pull the machine from the co-lo for a few hours to clean the CPU and heat sink and re-paste it.  I will perhaps be just a smidgen more generous with the paste this time.  I am stingy not because of cost but because no matter how conductive thermal paste is, it is less conductive than the metals you are trying to transfer heat between, so you want as thin a layer as you can get away with.  But the worst thermal paste is better than the best air, so a little too much is less bad than not quite enough, which appears to be the case presently.
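     The trade-off can be put in rough numbers with the flat-layer conduction formula R = t / (k * A).  This is only a sketch under stated assumptions: the 30 mm x 30 mm contact area, the 50 micron layer thickness, and the paste conductivity of 10 W/(m*K) are illustrative guesses rather than measured or datasheet values, while the copper and air figures are the usual approximate room-temperature values.

```python
def layer_resistance(thickness_m, conductivity_w_mk, area_m2):
    """Conductive thermal resistance of a flat layer, R = t / (k * A), in K/W."""
    return thickness_m / (conductivity_w_mk * area_m2)


AREA_M2 = 0.03 * 0.03   # assumed 30 mm x 30 mm contact patch
GAP_M = 50e-6           # assumed 50 micron layer thickness

CONDUCTIVITY = {
    "copper": 400.0,    # approximate, W/(m*K)
    "paste": 10.0,      # assumed illustrative value, not a datasheet figure
    "air": 0.026,       # approximate, still air near room temperature
}

resistance = {
    name: layer_resistance(GAP_M, k, AREA_M2) for name, k in CONDUCTIVITY.items()
}

for name, r in resistance.items():
    print(f"{name:>6}: {r:.5f} K/W")
```

     Under these assumptions the same gap is far more resistive when filled with paste than with metal, and far more resistive again when filled with air, which is why a thin layer is ideal but a void is the worst case.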

     Between now and then I’ve reduced the speed of the machine from 4.8 GHz to 4.4 GHz and the CPU voltage from 1.37 V to 1.2 V to reduce heat generation.  This will reduce peak performance by slightly less than 10%, but given that the CPUs are around 97% idle, this should not be a problem, and it’s only temporary.
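     For a rough sense of what this change buys, here is a small sketch using the common first-order approximation that dynamic CPU power scales with frequency times voltage squared.  That scaling rule is an assumption brought in for illustration; it ignores leakage power and anything else specific to this chip, so the result is an estimate, not a measurement.

```python
def relative_clock(f_new_ghz, f_old_ghz):
    """Fraction of the original clock speed that remains."""
    return f_new_ghz / f_old_ghz


def relative_dynamic_power(f_new_ghz, f_old_ghz, v_new, v_old):
    """First-order estimate: dynamic power is proportional to f * V^2."""
    return (f_new_ghz / f_old_ghz) * (v_new / v_old) ** 2


clock = relative_clock(4.4, 4.8)
power = relative_dynamic_power(4.4, 4.8, 1.2, 1.37)

print(f"clock:         {clock:.1%} of original")
print(f"dynamic power: {power:.1%} of original")
```

     With the figures above this works out to roughly 92% of the original clock for roughly 70% of the original dynamic power, which is why dropping the voltage along with the frequency is worth the small performance cost.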

     Right now this is more of an issue than it otherwise would be because of a bug in the kernel code that writes to the MSR to change the CPU speed in response to excess temperature.  Without this bug the machine would simply have downclocked automatically, but the bug currently affects these particular CPUs.