Alois Kraus


Sunday, September 14, 2014

Recently I saw an interesting performance question on Stack Overflow which deals with Parallel.For in .NET. The code was basically

            Parallel.For(0, N, i => {});   // N: some large iteration count

The actual work is simulated by an empty delegate, which makes it easy to measure the overhead of Parallel.For. When you let this run on some multicore machines in x86 Release you get numbers like
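A rough way to reproduce the measurement looks like the following (the iteration count, the number of runs, and the Stopwatch harness are my assumptions, not the original benchmark):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ParallelForOverhead
{
    static void Main()
    {
        const int N = 10 * 1000 * 1000; // assumed iteration count, large enough to expose the overhead
        for (int run = 0; run < 5; run++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            Parallel.For(0, N, i => { }); // empty body: all measured time is Parallel.For bookkeeping
            sw.Stop();
            Console.WriteLine("Run {0}: {1} ms", run, sw.ElapsedMilliseconds);
        }
    }
}
```

Running it several times in a row is important, since the slowdowns described below show up only in some runs.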


The running time is not constant as one might expect; sometimes there are slowdowns on the order of 6x or more. With WPT I was quickly able to find the root cause: the parallel for loop counter increment method:


It is used to increment the loop counter in a lock-free way. With all cores in use we have massive memory contention on it, but not all the time. By measuring the counts per thread I could see that sometimes one thread gets much less work than the others, but there was no clear pattern that related to the slowdowns. Since we are dealing with memory contention it should surface in L2/L3 cache misses, because some other thread constantly manages to overwrite the value while the losing thread is rarely able to update the counter.
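Measuring the counts per thread can be sketched like this (the ConcurrentDictionary bookkeeping and the iteration count are my own illustration, not the original measurement code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PerThreadCounts
{
    static void Main()
    {
        var counts = new ConcurrentDictionary<int, int>();
        Parallel.For(0, 10 * 1000 * 1000, i =>
        {
            // Tally how many iterations each managed thread actually executed
            counts.AddOrUpdate(Thread.CurrentThread.ManagedThreadId, 1, (id, c) => c + 1);
        });
        foreach (var kvp in counts)
        {
            Console.WriteLine("Thread {0}: {1} iterations", kvp.Key, kvp.Value);
        }
    }
}
```

An uneven work distribution shows up here as one thread reporting far fewer iterations than the others.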

The ExchangeAdd64 method basically uses the following approach:

        static void InterlockedAdd()
        {
            long old, @new;
            do
            {
                old = Counter;
                @new = old + Increment;
            } while (old != Interlocked.CompareExchange(ref Counter, @new, old));
        }

It adds some value to the existing counter and then tries to exchange it atomically. This works reliably when running as x64, but not if you let it run as a 32-bit application.
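The public counterpart of this internal helper is Interlocked.Add; a minimal sketch of its use (my own example, not code from the original question):

```csharp
using System;
using System.Threading;

class AtomicAdd
{
    static long _counter;

    static void Main()
    {
        // Atomically add 5 to a shared 64-bit counter. On x64 the whole 64-bit
        // operand fits a single locked instruction; on x86 a compare-exchange
        // loop is needed for 64-bit operands.
        Interlocked.Add(ref _counter, 5);
        Console.WriteLine(Interlocked.Read(ref _counter)); // prints 5
    }
}
```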

The 64 bit code uses

lock xadd qword ptr [rdx+8],rax ds:xxxxx

whereas the x86 version uses

lock cmpxchg8b qword ptr [edi] ds:xxx

Both approaches use long (64-bit) values, although the Parallel.For counter we used is only an integer. This seems to cause funny side effects in the CPU cache lines. There are no Windows performance counters to measure cache miss rates, which makes it hard to verify this claim. But Intel has released the free Performance Counter Monitor (pcm.exe), which offers much deeper insight into what your processor is currently doing. Compiling it and getting it working is not entirely straightforward, since you need to build a kernel driver and install it test signed, or download a publicly signed version of it from the net. The Intel devs seem to be more Linux guys: the project files can be compiled with VS2013, but for every project the binaries end up in a Debug subfolder of the project (even for Release builds). There is NO common bin folder where all binaries are collected. Anyway.

The most interesting tool is pcm.exe, which can show you per-core statistics: how many instructions have been executed and what the hit/miss counters for the L2/L3 caches look like. The executed instructions per core show a sharp decrease, which explains why the running time of the Parallel.For loop is sometimes so slow. But we still do not see the reason for the drop in execution performance.



But when I look at the L3 cache misses it becomes clear: one core gets into a death spiral where for several seconds it consistently loses the race to update this value. This distorts the parallel execution of instructions, because the data needs to be read from the far slower main memory many times while the CPU caches synchronize.



I am not sure how to fix it, since it should not happen at all; hopefully it will be fixed by a future CPU version. It makes no sense to emit extra code for such corner conditions. This example makes a nice demo of what memory contention can do to your application performance. If you have no contention, any lock-free version will be fast. But if you have many concurrent accesses to the same memory locations, you can get much longer running times while at the same time wasting enormous amounts of CPU cycles.

Does this matter? Well, yes, it does if you use e.g. the lock keyword in .NET, which internally first spins a bit in the hope that the lock will be released quickly. If not, it takes a real lock and stops wasting CPU cycles. If you have e.g. 5 threads updating some central variable and you use the lock keyword, you will find many burned CPU cycles only to increment a counter. The Interlocked methods offer good alternatives to locks for simple tasks. They are nice, but the JIT compiler will not replace your code directly with the corresponding CPU instruction; you pay for an extra method call, which in the worst case can cost you a factor of 10 in performance, because quite a lot of data needs to be swapped out of registers onto the thread stack.
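The contrast between lock and the Interlocked methods for a trivial shared counter can be sketched like this (the thread and iteration counts are arbitrary assumptions; absolute numbers will vary by machine):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class LockVsInterlocked
{
    static long _counter;
    static readonly object _lock = new object();

    static void Time(string name, Action increment)
    {
        _counter = 0;
        Stopwatch sw = Stopwatch.StartNew();
        // 5 threads hammering the same variable, as in the scenario above
        Parallel.For(0, 5, t =>
        {
            for (int i = 0; i < 1000 * 1000; i++)
            {
                increment();
            }
        });
        sw.Stop();
        Console.WriteLine("{0}: {1} ms, counter={2}", name, sw.ElapsedMilliseconds, _counter);
    }

    static void Main()
    {
        Time("lock", () => { lock (_lock) { _counter++; } });
        Time("Interlocked.Increment", () => Interlocked.Increment(ref _counter));
    }
}
```

Both variants end with counter = 5,000,000; the interesting part is the wall-clock time and the CPU cycles burned to get there.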

I really like the pcm tool from Intel because it gives me the ability to measure things. Previously I had to guess why this or that was slowing down my application; now I can prove it. Here is some sample output so you can check whether the tool might be useful to you as well.

EXEC  : instructions per nominal CPU cycle
IPC   : instructions per CPU cycle
FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ  : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
IO    : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE |  IO   | TEMP

   0    0     0.08   1.05   0.08    0.31     355 K   1013 K    0.65    0.32    0.21    0.08     N/A     N/A     N/A     66
   1    0     0.00   0.20   0.01    0.31     111 K    226 K    0.51    0.17    0.53    0.13     N/A     N/A     N/A     66
   2    0     0.01   0.27   0.02    0.30     193 K    517 K    0.63    0.19    0.35    0.15     N/A     N/A     N/A     68
   3    0     0.08   1.00   0.08    0.53     706 K   1838 K    0.62    0.50    0.38    0.17     N/A     N/A     N/A     68
   4    0     0.01   0.38   0.03    0.32     247 K    667 K    0.63    0.22    0.32    0.13     N/A     N/A     N/A     66
   5    0     0.01   0.56   0.02    0.40     177 K    479 K    0.63    0.26    0.37    0.15     N/A     N/A     N/A     66
   6    0     0.01   0.35   0.03    0.32     235 K    500 K    0.53    0.24    0.37    0.10     N/A     N/A     N/A     70
   7    0     0.02   0.43   0.05    0.57     116 K    807 K    0.86    0.41    0.11    0.21     N/A     N/A     N/A     70
SKT    0     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     63
TOTAL  *     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     N/A

Instructions retired:  917 M ; Active cycles: 1318 M ; Time (TSC): 3986 Mticks ; C0 (active,non-halted) core residency: 10.81 %

C1 core residency: 15.87 %; C3 core residency: 0.96 %; C6 core residency: 0.00 %; C7 core residency: 72.36 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

PHYSICAL CORE IPC                 : 1.39 => corresponds to 34.79 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.06 => corresponds to 1.44 % core utilization over time interval