load average calculation imperfections
Jon Turney
jon.turney@dronecode.org.uk
Mon May 16 16:49:11 GMT 2022
On 16/05/2022 06:25, Mark Geisert wrote:
> Corinna Vinschen wrote:
>> On May 13 13:04, Corinna Vinschen wrote:
>>> On May 13 11:34, Jon Turney wrote:
>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>
>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA,
>>>>>> but no errors on subsequent counter reads. This sounds like it now
>>>>>> matches what Corinna reported for W11. I wonder if she's running
>>>>>> build 1706 already.
>>>>>
>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>
>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and
>>>>> only subsequent calls returning valid(?) values, is what breaks the
>>>>> getloadavg function and, consequently, /proc/loadavg. So maybe xload
>>>>> now works, but Cygwin is still broken.
>>>>
>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant,
>>>> but one averaged over a period of time.
>>>>
>>>> This is what the following comment is meant to record:
>>>>
>>>> "Note that PDH will only return data for '% Processor Time' after
>>>> the second
>>>> call to PdhCollectQueryData(), as it's computed over an interval, so
>>>> the
>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>
>>> But.
>>>
>>> Every invocation of getloadavg() returns 0. Even under load. Calling
>>> `cat /proc/loadavg' is an exercise in futility.
>>>
>>> The only way to make getloadavg() work is to call it in a loop from the
>>> same process with a 1 sec pause between invocations. In that case, even
>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>
>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>> are frozen in the last state they had when stopping that process.
>>
>> Oh, and, stopping and restarting all Cygwin processes in the session will
>> reset the loadavg to 0.
>>
>>> Any suggestions how to fix this?
>
> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg'
> with the following update to Cygwin's loadavg.cc:
>
> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
> index 127591a2e..cceb3e9fe 100644
> --- a/winsup/cygwin/loadavg.cc
> +++ b/winsup/cygwin/loadavg.cc
> @@ -87,6 +87,9 @@ static bool load_init (void)
> }
>
> initialized = true;
> +
> + /* prime the data pump, hopefully */
> + (void) PdhCollectQueryData (query);
> }
Yeah, something like this might be a good idea, as at the moment we
report load averages of 0 for the 5 seconds after the first time someone
asks for it.
It's not ideal, because with this change, we go on to call
PdhCollectQueryData() again very shortly afterwards, so the first value
for '% Processor Time' is measured over a very short interval, and so
may be very inaccurate.
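For reference, the PDH call sequence looks roughly like the stand-alone
sketch below (untested, error handling omitted, the counter path and the
1 second wait are just illustrative; link with -lpdh):

#include <windows.h>
#include <pdh.h>
#include <stdio.h>

int main (void)
{
  PDH_HQUERY query;
  PDH_HCOUNTER counter;
  PDH_FMT_COUNTERVALUE value;

  PdhOpenQueryW (NULL, 0, &query);
  PdhAddEnglishCounterW (query, L"\\Processor(_Total)\\% Processor Time",
                         0, &counter);

  /* The first collection only establishes a baseline; asking for a
     formatted value at this point fails with PDH_INVALID_DATA. */
  PdhCollectQueryData (query);

  /* '% Processor Time' is an average over the interval between two
     collections, so the longer we wait here, the more meaningful the
     value is.  Collecting again immediately gives a near-zero interval
     and a correspondingly noisy result. */
  Sleep (1000);
  PdhCollectQueryData (query);

  PdhGetFormattedCounterValue (counter, PDH_FMT_DOUBLE, NULL, &value);
  printf ("%% Processor Time: %f\n", value.doubleValue);

  PdhCloseQuery (query);
  return 0;
}

The point being: it's not the priming call itself that's problematic, but
how short the interval is before the next PdhCollectQueryData().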
> return initialized;
>
> It's only somewhat better because it seems like multiple updaters of the
> load average act sort of independently. It's hard to characterize what
> I'm seeing but let me try.
>
> First let me shove xload aside by saying it shows instantaneous load and
> is thus a different animal. It only cares about total %processor time,
> so its load average value never goes higher than ncpus, nor does it have
> any decay behavior built-in.
>
> Any other Cygwin app I know of is using getloadavg() under the hood.
> When it calculates a new set of 1,5,15 minute load averages, it uses
> total %processor time and total processor queue length. It has a decay
> behavior that I think has been around since early Unix. What I haven't
> noticed before is an "inverse" decay behavior that seems wrong to me,
> but maybe Linux has this. That is, if you have just one compute-bound
> process the load average won't reach 1.0 until that process has been
> running for a full minute. You don't see instantaneous load.
In fact it asymptotically approaches 1, so it wouldn't reach it until
you've had a load of 1 for a long time compared to the time you are
averaging over.
Starting from idle, a unit load sustained for 1 minute would result in a
1-minute load average of (1 - (1/e)) = ~0.62. See
https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
for some discussion of that.
That's just how it works; it's a measure of demand over time, not
instantaneous load.
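For the record, the usual exponentially-damped update is something like
the sketch below (illustrative only, not a copy of what loadavg.cc does;
active_tasks would be something like busy CPUs plus the processor queue
length, and the 5 and 15 minute averages use the same formula with
larger time constants):

#include <math.h>

/* Decay the 1-minute average towards the current number of active
   tasks, weighted by how long it's been since the last update.  With a
   constant active_tasks of 1 starting from idle, the average reaches
   1 - 1/e of that after one time constant (60 s), and only approaches
   1 asymptotically. */
static void
update_loadavg_1min (double *avg, double active_tasks, double dt_seconds)
{
  double decay = exp (-dt_seconds / 60.0);
  *avg = *avg * decay + active_tasks * (1.0 - decay);
}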
> I guess that's all reasonable so far. But I think the wrinkle Cygwin is
> adding, allowing the load average to be calculated by multiple updaters,
> makes it seem like updaters are not keeping in sync with each other
> despite the loadavginfo shared data. I can't quite wrap my head around
> the current implementation to prove or disprove its correctness.
>
> Ideally, the shared data should have the most recently calculated 1,5,15
> minute load averages and a timestamp of when they were calculated. And
> then any process that calls getloadavg() should independently decide
> whether it's time to calculate an updated set of values for machine-wide
> use. But can the decay calculations get messed up due to multiple
> updaters? I want to say no, but I can't quite convince myself. Each
> updater has its own idea of the 1,5,15 timespans, doesn't it, because
> updates can occur at random, rather than at a set period like a kernel
> would do?
I think not, because last_time, the unix epoch time at which the last
update was computed, is part of the shared loadavginfo state, and
updating it is guarded by a mutex.
That's not to say that this code might not be wrong in some other way :)
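To spell out the pattern that's intended (names and details below are
illustrative, not the actual loadavg.cc code):

#include <time.h>

/* Illustrative shared state; in Cygwin this lives in memory shared by
   all processes and access to it is serialized by a named mutex. */
struct loadavginfo_t
{
  double loadavg[3];  /* 1, 5 and 15 minute averages */
  time_t last_time;   /* unix epoch time of the last recomputation */
};

/* Placeholders for the real locking and measurement code. */
extern void lock_loadavg_mutex (void);
extern void unlock_loadavg_mutex (void);
extern void recompute_loadavg (struct loadavginfo_t *info, time_t now);

/* Any process wanting the load average calls this; whichever process
   first notices the shared values are stale recomputes them for
   everyone.  Because last_time is read and written under the mutex,
   concurrent callers can't each apply the decay for the same elapsed
   interval. */
void
get_loadavg (struct loadavginfo_t *shared, double out[3])
{
  lock_loadavg_mutex ();
  time_t now = time (NULL);
  if (now - shared->last_time >= 5)  /* update at most every 5 seconds */
    {
      recompute_loadavg (shared, now);
      shared->last_time = now;
    }
  for (int i = 0; i < 3; i++)
    out[i] = shared->loadavg[i];
  unlock_loadavg_mutex ();
}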