Alois Kraus

blog

Tuesday, August 25, 2015

All of us have experienced it but very few of us have gone into the details to really find the reasons for it. First of all, why should I care? After the first time the application starts much faster. Well, you should care if you want to give your customer not only a good second startup time but also a decent first startup experience. To do that you need to do repeated tests. The first startup is commonly referred to as cold startup and all subsequent runs are warm startup scenarios. To test a cold startup scenario you would need to reboot your computer every time before starting your application. Not very practical, and I have never seen anyone go down that road. The difference between warm and cold startup time is that the first time the dlls and data of the application need to be loaded from disc, which can take quite some time. All subsequent runs can read the data not from disk but from the in-memory file system cache which is managed by the operating system.

What Is The File System Cache?

Windows uses the memory of your system in a clever way. All memory that is used by your applications is shown in Task Manager. The rest of the memory is free and not used by anything. Really? Of course not. The "free" memory contains data of previously exited applications which was in use some time ago. This "free" memory is called the Standby List, which is where the file system cache of Windows really lives. It is therefore not a good idea to max out the memory usage of your machine, because doing so flushes the file system cache, which will result in a slow system. This cache is the reason why your second and all further startup times are much better: no disc accesses at all compared to the first run, if all goes well.

How Can You Flush It?

If you want to measure cold startup times you need to clear the file system cache somehow. The easiest way is to use RAMMap from SysInternals. It has an option to flush the Standby List.

image

That is nice, but unfortunately there is no command line tool available to do this in a test script to create repeatable tests. But not all is lost. Process Hacker is a better clone of Process Explorer which comes with full source code. There you can find the undocumented Windows API calls which flush the OS file system cache directly. It is basically

// Undocumented values taken from the Process Hacker sources: SystemMemoryListInformation = 80, MemoryPurgeStandbyList = 4.
// The calling process needs SeProfileSingleProcessPrivilege for this call to succeed.
SYSTEM_MEMORY_LIST_COMMAND command = MemoryPurgeStandbyList;
NtSetSystemInformation(SystemMemoryListInformation, &command, sizeof(SYSTEM_MEMORY_LIST_COMMAND));

and you are done. There is also another trick I want to share: what if you want to clear the file system cache only for a specific file? Windows will flush the cached data of a file if it was opened with the FILE_FLAG_NO_BUFFERING flag. Here is a little C# class that employs this trick:

    class FileSystemTools
    {
        const int FILE_FLAG_NO_BUFFERING = 0x20000000;
        const FileOptions Unbuffered = (FileOptions)FILE_FLAG_NO_BUFFERING;

        /// <summary>
        /// Flush the file system cache for this file. This ensures that all subsequent operations on the file really cause disc
        /// access and you can measure the real disc access time.
        /// </summary>
        /// <param name="file">full path to file.</param>
        public static void FlushFSCacheForFile(string file)
        {
            using (FileStream fileStream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 4096, Unbuffered))
            {
                // Opening the file unbuffered and closing it again is enough to evict its cached pages.
            }
        }
    }

Hidden Dragons

If the first test run is slow and all others are fast we know that the file system cache is to blame. If you see a pattern like this

image

you know that the first run was cold and all others were warm. If we now flush the file system cache we should get rock solid cold times with no other influences except virus scanners, the indexing service and other pesky applications making our test unpredictable. If we measure again - this time flushing the file system cache before each run - we get this:

image

 

This is not warm and not cold. What is it then? The answer: it was mostly a warm startup because another instance of the application was still running. If a process is alive and its working set was not trimmed then the dll data is NOT read from disc, since it still belongs to the working set of a running process. You therefore need to make sure that you close all applications which keep your dlls in memory (e.g. Visual Studio with its designers …) to test a real cold startup scenario. There are other pitfalls as well. If you kill a process without waiting until it has terminated you can get random good runs, because a process kill is an asynchronous operation. You have to wait on the process handle until it becomes signaled to be sure that it is really gone. If you are using .NET you need to call Process.WaitForExit() to be sure.
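The kill-then-wait pitfall is language independent. Here is a minimal sketch in Python (the .NET equivalent is Process.Kill() followed by Process.WaitForExit()); the dummy sleep process stands in for the application under test:

```python
import subprocess
import sys

# Start a dummy long-running process as a stand-in for the application under test.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

proc.kill()   # only *initiates* termination - the process may still be alive for a moment
proc.wait()   # block until the process is really gone (Process.WaitForExit() in .NET)

print(proc.returncode is not None)  # -> True
```

Only after the wait returns is it safe to flush the file system cache and start the next measured run.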

After we have fixed that we get this graph:

image

What the hell is going on here? This was quite hard to find out since there is very little information on the net about how to reliably test cold startup scenarios. The hidden thing is Superfetch.

Doesn't ring a bell? Besides speeding up the boot it also tries to prefetch often used application dlls into memory before the application is started, based on intelligent heuristics. This gets in our way when we flush the file system cache. After the flush the Superfetch system service named SysMain, which is hosted in a generic svchost.exe process, tries to prefetch the often used dlls again at low IO priority. This additional IO collides with our test where we want to load our dlls as well. To get consistent IO numbers it is therefore best to stop the Superfetch service:

sc stop SysMain

After doing that the graph still looks the same. What is interesting is that when our application reads about 30MB more data it starts nearly 10s faster. This is a mystery which only xperf can solve. Here is the Disc IO graph of a slow 23s run:

image

And here is a fast 14s run.

image

We read more data but the Disc Service Time is less! The average Disc Queue Depth is very high, which should indicate an overwhelmed disc. What stands out is that we have about 70% fewer disc read operations on the assembly folder where the native images of .NET applications are stored. Sometimes we read the disc in a slow and sometimes in a fast way. To answer the question why we read from the disc in such different ways we need to check out the Disc IO Init call stacks. These tell us who initiated the read operations and, most importantly, when. First comes the slow run:

image

It is no surprise that our slow cold startup causes many hard page faults which trigger the kernel to read the missing pages from disc. This is known to be slow. We can do little about it except execute less code during startup. The graph shows a nicely scattered pattern across the complete startup time. Now let's have a look at the fast run:

image

That looks radically different. We load the same dlls but we have no hard faults? All dll data is read before the application main thread starts running. In our process the Prefetcher reads its prefetch database, which is 1,3MB in size (see PfSnPrefetchMetaData, 687ms), which allows it to read our dlls in PfSnPrefetchSections (151MB in 3,1s) mostly with sequential reads in about half of the time. That is the real reason why we have such wildly varying startup times. The prefetcher will look at roughly the first 12s of application startup and will then update the prefetch database for your executable. Next time the startup is a little better and it will update the database again, causing it to prefetch a little more, and so on. This is a nice self tuning system which not many people are aware of. From time to time the prefetcher seems to do nothing, to check (that is my guess) if the prefetch database is helping at all. That explains the mostly good and sometimes very slow numbers where no prefetching is active.

If we aim at stable, repeatable cold tests we need to disable the Superfetch service, because this process is responsible for writing and updating the prefetch database. The prefetch database file is located at

C:\Windows\Prefetch\ExeName-FileHashPrefetchFlag.pf

We need to delete that file as well, and now we have real control over our IO during application startup. If you want to calculate the file name of your application's prefetch database you need to take the full path of your executable and run it through a hash function which is described at http://forensicswiki.org/wiki/Prefetch. There are still some sites out there which tell you to delete the prefetch folder to prevent caching of dlls and make your system faster. If these guys had really measured their startup time in a controlled and systematic manner they would not have written such misleading guidelines. As you can see, you need to take care of a lot of details to really understand your system. But it is worth it. If you have looked closely at the .pf file name there is a PrefetchFlag at the end. That is a number which you can pass to generic container processes to generate different prefetch databases for the same executable. This can be useful if you have generic container processes with wildly varying dll sets inside them. How do you tell the OS loader that you want to use a different prefetch database to speed up cold startup times?

Simple: pass it as a command line parameter! If you create e.g. an executable with the name prefetchTest.exe and pass /prefetch:N (N being a number) as command line argument to it you get different prefetch files for each scenario:

prefetchTest.exe
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B1.pf 
prefetchTest.exe /prefetch:1
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B2.pf
prefetchTest.exe /prefetch:2
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B3.pf
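The hash function mentioned above can be sketched in Python. This is the Vista/Windows 7 variant as described at forensicswiki; the constants come from public reverse engineering work, so treat this as a sketch rather than an official specification, and note that the input device path below is hypothetical:

```python
def prefetch_hash_vista(path: str) -> int:
    """Prefetch path hash, Vista/Windows 7 variant as described at
    http://forensicswiki.org/wiki/Prefetch. The input is the uppercase
    full path of the executable in NT device path form."""
    h = 314159
    for c in path.upper():
        h = (h * 37 + ord(c)) & 0xFFFFFFFF
    return h

# Hypothetical device path - the real one depends on your volume layout.
path = r"\DEVICE\HARDDISKVOLUME1\TEMP\PREFETCHTEST.EXE"
print(f"PREFETCHTEST.EXE-{prefetch_hash_vista(path):08X}.pf")
```

The /prefetch:N file names in the listing above differ by exactly one, which suggests that the flag value is simply added to the path hash; I have not verified that against the algorithm.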
That is a nice find. Perhaps you can make your customers happier by including a cold startup test in your test suite to make the pain public. Most test reports I have seen include only warm numbers because it is quite difficult to create repeatable and stable cold startup tests. Here are my tips as short bullets for repeatable cold startup measurements:

Test Machine Preparation

  • Stop/Disable the Superfetch (SysMain) service.
  • Disable Windows Search (WSearch) service.
  • Disable Virus scanners.
  • Disable scheduled tasks and other system management software.
  • Delete the prefetch database for your executable.

Test Execution

  • Kill all applications which keep your dlls loaded.
  • Flush the file system cache before the test.

When you follow these tips you will easily get repeatable and stable cold startup numbers. If for some reason much more code is executed during startup, causing many page faults, you can easily spot it. The warm startup time might not be affected at all, because the actual run time of the additional code might be low, but the new disc access pattern could be pricey in terms of cold startup time.
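The test machine preparation steps above can be sketched as a small command sequence, to be run from an elevated prompt. The service names are from this post; MYAPP.EXE is a placeholder for your executable:

    rem Hypothetical test machine preparation script - adjust names to your setup.
    sc stop SysMain
    sc config SysMain start= disabled
    sc stop WSearch
    sc config WSearch start= disabled
    rem Delete the prefetch database of the application under test.
    del C:\Windows\Prefetch\MYAPP.EXE-*.pf

Flushing the file system cache before each run then has to be done with the NtSetSystemInformation call shown earlier, since no stock command line tool does it.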


Monday, August 17, 2015

When you open an etl file you normally want to see which methods were executed most often. To decode the recorded stack pointers WPA needs your symbols. By default WPA adds the Microsoft symbol server to the list. If you click on Load Symbols

image

then you will experience some slowness, which is very nicely explained by Sasha here: http://blogs.microsoft.co.il/sasha/2014/12/02/diagnosing-native-memory-leaks-etw-wpa/. It is ok to get all relevant symbols from Microsoft the first time, but afterwards make sure you remove the Microsoft server from the symbol search list!

image

The symbol download folder is by default C:\Symbols. The directory after srv* is the pdb download folder, and after the second * the symbol server address follows.
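As a concrete sketch of the two settings (C:\Symbols is the default download folder; the server address is Microsoft's public symbol server, shown here as an example):

    srv*C:\Symbols*http://msdl.microsoft.com/download/symbols     (first run: slow, fills the local cache)
    srv*C:\Symbols                                                (later runs: local cache only, fast)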

If you wonder why you need a SymCache folder:

image

The pdb format generated by the compilers is so arcane, bloated and parser unfriendly that the guys doing the pdb parsing within dbghelp.dll (part of Windbg) decided to convert the bloated pdb format into something much smaller and much easier to parse. The SymCache folder therefore contains not the pdbs but the converted pdb files.

Here is the comparison:

image

And here the original pdb

image

Now you know why the SymCache folder is a good thing. You can safely delete it if it gets too big; as long as you still have the original pdbs around, nothing is lost. The vigilant reader has certainly noticed that I have added a path to C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache. This is the path where WPR and xperf store the generated managed pdbs while saving an ETL file to disk. By default WPR will create besides your xxx.etl file an extra folder named xxx.etl.NGENPDB which contains all managed pdbs of NGenned binaries. Since WPA of Windows 10 does not follow its own convention to expand the symbol search path of the loaded etl file with its accompanying pdb directory, I go directly to the source where WPR copies the generated pdbs from.

If you follow this advice and disable the MS symbol servers you will load the pdbs not within the next 20 minutes but in 3,6s for my sample ETL file. Kudos to the symbol cache and its inventors. While we are speaking of performance: PerfView has a much better (read: faster) managed pdb generation strategy. WPA and xperf will call ngen.exe createpdb xxxx.ni.dll for each and every native image in the trace recording. This is not necessary, since you can read the pdb age guid directly from the dll and then check in the pdb cache folder if the pdb already exists. No need to call ngen.exe all the time! Finally, it is not necessary to copy the generated pdbs to the xxx.etl.NGENPDB directory. My own trace collection tool simply creates a directory link to the NGenPdbs folder and I have all pdbs I could need without any file copies.

If you find that the default paths of WPA, xperf, … on your crowded C drive leave no room for pdbs, you can configure WPA to use a different storage folder. Or you leave everything as is and create, with mklink /D, directory links from your C drive to a drive which has enough space. I prefer the latter approach since in my experience tools tend to rely on the defaults and at some point they will come back to them (e.g. Windbg …).


Sunday, August 16, 2015

Did you ever see a nice tool with a fancy UI and think: Hey, that thing is powerful. I will try it! But later you were left scratching your head why others can use this tool and get great results but you just don't get it?

Then I have news for you: I have created a WPA profile which aims at user mode and not kernel mode developers. Many columns were removed from the context menus to give you all the power you need to find issues in your application where no one else has found the root cause.

The ETW profile can be downloaded from here as a zip file. Unpack it to a directory and you are ready to go. So what is inside it?

 

  • Simple.wpaProfile: The main WPA profile you can use now.
  • JIT.xml: WPA region file referenced by Simple.wpaProfile to get JIT times of your application like PerfView.
  • GC.xml: WPA region file referenced by Simple.wpaProfile to get garbage collection metrics like in PerfView, only better!
  • HookRegions.xml: WPA region file referenced by Simple.wpaProfile to see mouse clicks in your ETL file when ETWControler is running and capturing your keyboard and mouse events.
  • default.stacktags: Stacktag file which serves as a base for your application to find common .NET issues like event handle leaks, too much regex usage, monitor contention and much more.

 

Besides this there are also some WPR profiles included to enable recording of specific .NET events like exceptions together with GC events into a larger 1 GB buffer.

WPR of Windows 10 has under Scenario Analysis a .NET Activity profile, which has a very small 160MB ring buffer that is way too small for my needs. It might be good for you, but I have added extra WPR profiles for GC and JIT as well.

To show you the difference I have created a small PerformanceIssueGenerator.exe application. This generates various issues which you can analyze with WPA. I have recorded the data already and put the 7z file here. If you want to extract it you need to download 7z from the official site. To view the ETL file you need a >= Windows 8 machine and the Windows Performance Toolkit from the Windows 10 SDK.

 

When you have downloaded the profile and the sample etl file you can apply the profile under Profiles - Apply…

image

Then you get two tabs. The first one contains .NET specific analysis stuff like GC and JIT. The second tab can be used for the usual bottleneck analysis regarding CPU, disk and memory usage as well as wait chain analysis. You do not need to load the profile every time. You can save it as your default profile by clicking on Save Startup Profile to open all of your ETL files in this view from now on.

image

Normally I use PerfView for GC issues to check out the GC Stats of an application to see if anything significant is going on. If you have ever used PerfView then you will have noticed that it is a great tool combined with a crappy UI. At least for some key scenarios we can now use WPA with hand crafted region files instead of PerfView.

 

GC Views

You now get a GC view like this

image

You can visualize each GC generation type and the time it took to execute. The Count column also tells you how many GCs you had. This can help a lot if you want to get GC metrics only for a specific time region in your application. Now it is easy to see how much time of your use case was available to your application and how much time your threads had to wait for the garbage collector. This was my main issue with PerfView and its GCStats view: it is calculated for the complete duration of the profiling run. Most of the time I want GC metrics only for specific parts of my application, because I am mostly optimizing at a single place at a time.

Here is how you get the GC statistics in PerfView:

 

image

 

The numbers in GC Rollup By Generation match pretty well to the region file. The GC pause times also correlate quite well with the distribution of generations; the timings are not 100% the same but the ratios are a good fit.

image

Since PerfView and ETW use the same data you can rightfully ask why there is a difference at all. The answer is that WPA does not sum up all GC regions by their duration. A WPA region is defined by a start and a stop event which is then displayed in the UI like above. But if regions happen in parallel in several threads, WPA will use as sum only the time where at least one region was active. This diagram illustrates how WPA region summation works:

image
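The difference can be illustrated with a toy calculation. Assuming (as described above) that WPA's region sum is the time where at least one region is active, i.e. the union of the intervals, two overlapping 5s regions on different threads do not add up to 10s:

```python
def union_duration(regions):
    """Total time where at least one region is active (WPA-style sum)."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(regions):
        if cur_end is None or start > cur_end:   # disjoint: flush the previous run
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlapping: extend the current run
            cur_end = max(cur_end, end)
    return total + (cur_end - cur_start if cur_end is not None else 0)

regions = [(0, 5), (3, 8)]                 # two regions on two threads, in seconds
print(sum(e - s for s, e in regions))      # naive duration sum -> 10
print(union_duration(regions))             # WPA-style region sum -> 8
```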

This detail is not so relevant for GC, but it will be very important when we look at JIT statistics. Since a foreground GC blocks your complete application there is little concurrency going on. We can also visualize when GCs of which type are happening. The view Generic Events GC Generations will show you how many GCs of which type were happening. Initially it is collapsed so you have to open it. This view takes a little while to build up since it needs to parse all ETW events for GC Start events, which contain as data point the generation number which is visualized here.

image

This is the greatest thing about WPA of Windows 10: now you can visualize the data of your custom traces. With a little WPA graph configuration you can create key graphs for your own application which show e.g. frames per second, # of db transactions, … Who needs performance counters when you can visualize your trace data much better now?

Marker Events

Marks are a key point for knowing where to start looking for an issue. In WPRUI you can press Ctrl+Win+x to open a window which allows you to write a marker event to your trace file while you are recording data. Unfortunately this functionality is not exposed to the outside world so you cannot set marks programmatically. If you revert to xperf you can set marks programmatically with xperf -m if you wish to. But since marks are so useful to navigate in an ETL file I really wonder why the method which xperf calls is not documented at all. If you want to know how real power users are using WPA then you need to call wpa /?

image

I guess at Microsoft they let their tests run with profiling enabled while saving screenshots as ETW events. If something happens, the ETL file is downloaded from a central repository and opened with a test specific profile. The file is opened and zoomed into the relevant test part, which is identified by markers or regions from a region file. At least some people at Microsoft use this tool so often that it makes sense to automate it even further, since the controls of WPA are UI automatable so you can script nearly your complete analysis. Now THAT is what I call good performance regression testing.

A little reverse engineering of which APIs xperf -m calls finally turned up the method EtwSetMark, which is exported by ntdll.dll. I am sure the readers of my blog can figure out the rest as well. But I really wish it would become an officially supported API since it is so useful. Sure, you can define your own marker events as well, but since the support in WPA is already built in it would really help. It would also be nice if xperf would emit the mark event not only to the "NT Kernel Logger" ETW session but to all active kernel sessions, so you could also mark the WPA kernel session, which is currently not possible.
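For reference, setting a mark from a script while an "NT Kernel Logger" session is recording looks like this:

    xperf -m "Start of use case X"

The label then shows up in WPA's Marks graph and gives you a precise anchor to zoom to.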

 

JIT View

PerfView can also give you the timings of how much time each method took to compile. This is useful if you are compiling too much code on the fly for your WCF proxies or your serializers. Sure, it is done only once, but if this time plays a significant part of your application startup you should rethink how your application works. Since .NET 4.5 will NGen all assemblies on the fly if they are used often enough, you no longer need to consider using NGen explicitly for your application. But if you have much dynamic code generation going on you can still suffer from long JIT times.

 

image

You can get a similar view by switching from Garbage Collection to JIT Time per Thread

image

As I have shown in the GC section, the summation of regions is not a simple sum of the durations of all JIT events. The JIT compiler can compile code in many different threads concurrently. The Duration sum over all threads of JIT time reported by WPA is therefore largely irrelevant if you want to compare two different runs of your application. Instead you need to look at the JIT times of each thread. You can copy the duration column with the context menu "Copy Column Selection" into Excel

image

image

which gives us the exact same JIT time as PerfView. Now I do not need PerfView for JIT analysis anymore. I am more and more getting away from programming towards configuring WPA to give me just the view I need.

If you expand JITs you get at least the namespace of each JIT event. This is the best I could come up with since WPA does not support concatenating strings of different fields into a region name.

image

But you can still open the (also customized) view _Activity by Provider Task Process to view the "raw" data if you need more information. If you drop e.g. Field 5 to the left of the yellow bar you get a nice summary of how many methods in this namespace were compiled.

image

As you can guess there are lots more goodies inside the now published profiles to make WPA really usable for non kernel developers. I have streamlined nearly every context menu, removing all useless or kernel only columns. You have a much easier time now concentrating on the important things. With this profile WPA is nearly a new application and a central analysis station for managed and unmanaged performance analysis.

In the next posts I will walk you through the problems of PerformanceIssueGenerator and how you can visualize them in WPA efficiently.


Friday, July 31, 2015

It has been a long wait to finally get the new version of the Windows Performance Toolkit. I did not have much time to test the betas since until now managed call stack resolution was not working. WPT for Windows 10 makes some nice progress; you can download it here. Click on Download Standalone SDK and run it. Then uncheck everything except Windows Performance Toolkit and it will install on your machine within minutes. If you download the complete Windows 10 SDK you are not downloading 170MB but several GB.

 

Management Summary

  • Graphs are more readable.
  • Graphs with short spikes are much better rendered, without the need to zoom in to see them.
  • Colors for graphs are changeable (finally).
  • Filtering can be undone now in the context menu.
  • Quick Search in all tables.
  • Symbol load dialog was improved.
  • You can configure your own stack tag files in Trace Properties.
  • No new Graph Types (at least with the default providers I enable).
  • 5 Added columns to CPU Usage Sampled
    • Two could be extremely useful: Source File Name and Source Line Number
  • 12 Added columns to CPU Usage Precise.
    • Two are useful to user mode developers: ProcessOutOfMemory and CPU Usage (in view).
  • Occasional crashes are also featured now.

CPU Usage Sampled

image

The biggest improvement is certainly the possibility to see source file and line numbers. You can find out this way on which line most CPU is spent, if the CPU consumption happens in your code.

A quick recap for the newbies: the CPU Usage Sampled graph is generated by taking the stack trace of all running threads 1000 times per second (this is the default). These call stacks are then added together. A count of 1000 in your Main method means that the method has used one second of CPU time.

If the call stack does not end in e.g. your Main method when the stack trace is taken then your source file will not be displayed there. As usual you have to take a sharp look at the numbers. Your method may be causing high CPU consumption but it might never show up with a source and line number, because the stack always ends in an external library call (printf in my case) for which I did not have the pdb loaded. It would be nice to have file and line numbers for managed code as well, but this feature is of limited use as it is now. Ideally I want to see this stuff while I am drilling into a call stack all the way down, and not only if the call stack ends in a method which I did compile.
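The recap above in numbers: assuming the default sampling rate of 1000 samples per second, converting an inclusive sample count into estimated CPU time is a simple division:

```python
SAMPLES_PER_SECOND = 1000  # WPA default sampling rate

def cpu_time_seconds(sample_count, rate=SAMPLES_PER_SECOND):
    """Estimated CPU time represented by an inclusive sample count."""
    return sample_count / rate

print(cpu_time_seconds(1000))  # -> 1.0 (one second of CPU time in Main)
print(cpu_time_seconds(250))   # -> 0.25
```

Keep in mind that sampling is statistical: short-lived methods between two samples are simply missed.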

 

For reference here is the complete list of columns you can choose from. Green are unchanged columns, bold are new ones. I have never found out how to give a thread a name in WPT. If anyone knows more about the mystical thread names and how they can be set I would be very interested in it.

WPT 8.1 | WPT 10
% Weight | % Weight
Address | Address
All Count | All Count
Annotation | Annotation
Compiler Optimization | Compiler Optimization
Count | Count
CPU | CPU
Display Name | Display Name
DPC/ISR | DPC/ISR
Function | Function
  | Image RVA
Inlined Functions | Inlined Functions
Is PGO'ed | Is PGO'ed
Module | Module
PGO Counts | PGO Counts
PGO Dynamic Instruction Count | PGO Dynamic Instruction Count
Priority | Priority
Process | Process
Process Name | Process Name
Rank | Rank
  | Section Name
  | Section RVA
  | Source File Name
  | Source Line Number
Stack | Stack
Table | Table
Thread Activity Tag | Thread Activity Tag
Thread ID | Thread ID
Thread Name | Thread Name
Thread Start Function | Thread Start Function
Thread Start Module | Thread Start Module
TimeStamp | TimeStamp
Trace # | Trace #
Weight | Weight
Weight (in view) | Weight (in view)

 

CPU Usage Precise

This is by far the most complex table. With WPT 10 we have 66 columns to choose from to find out how our threads interact with each other. I am not sure how WPA detects that a process was out of memory, but I think the kernel knows quite well if something went wrong. The ProcessOutOfMemory column could be handy to check stress tests when the machine was under heavy load, which could explain subsequent application failures. CPU Usage (in view) seems to be an attempt to display a more accurate thread running time. I am not sure how this value is calculated, but it seems that if all cores are in use and you get significant Ready times then it differs quite a lot. Otherwise the differences are marginal.

 

There has been quite some renaming of columns, which might render your custom filter and enablement conditions useless; WPA will then show you an error.

image

If your old preset enabled by default all processes except the Idle process, which clutters up the view, it was set to:

([Series Name]:="NewProcess" AND NOT ([NewProcess]:="Idle (0)"))

With WPT 10 you need to change it to:

([Series Name]:="New Process" AND NOT ([New Process]:="Idle (0)"))

This happens to all users of WPT 8.1 who have saved a custom default profile which contains this enablement rule.

With an empty enablement rule your view will look like

image

and with the rule

image

you will see no difference when you apply it. Why? Because this rule only enables matching processes, but it does not disable the non matching ones! I am with you that this is very counter intuitive, but it is the way it is. If you save this as your new default profile and then open a new file, you will no longer get a view where the Idle process is enabled by default.

For reference I list here all available columns with the old and new names so you have an easier time updating your custom filter, enablement and expansion rules.

Green lines are unchanged, yellow are renamed columns and bold are new columns in WPT 10.

WPT 8.1 | WPT 10
% CPU Usage | % CPU Usage
  | AdjustIncrement
  | AdjustReason
Annotation | Annotation
Count | Count
Count:Waits | Count:Waits
Cpu | Cpu
CPU Usage (ms) | CPU Usage
  | CPU Usage (in view)
  | DirectSwitchAttempt
IdealCpu | Ideal Cpu
  | KernelStackNotResident
LastSwitchOutTime (s) | Last Switch-Out Time
NewInPri | New In Pri
NewOutPri | New Out Pri
NewPrevOutPri | New Prev Out Pri
NewPrevState | New Prev State
NewPrevWaitMode | New Prev Wait Mode
NewPrevWaitReason | New Prev Wait Reason
NewPriDecr | New Pri Decr
NewProcess | New Process
NewProcess Name | New Process Name
NewQnt | New Qnt
NewState | New State
NewInSwitchTime (us) | New Switch-In Time
NewThreadId | New Thread Id
NewThreadStack | New Thread Stack
NewThreadStartFunction | New Thread Start Function
NewThreadStartModule | New Thread Start Module
NewWaitMode | New Wait Mode
NewWaitReason | New Wait Reason
  | NewThreadPrevRemainingQuantum
  | NewThreadRank
  | NewThreadRemainingQuantum
NextSwitchOutTime (s) | Next Switch-Out Time
OldOutPri | Old Out Pri
OldProcess | Old Process
OldProcess Name | Old Process Name
OldQnt | Old Qnt
OldState | Old State
OldInSwitchTime (us) | Old Switch-In Time
OldThreadId | Old Thread Id
OldThreadStartFunction | Old Thread Start Function
OldThreadStartModule | Old Thread Start Module
OldWaitMode | Old Wait Mode
OldWaitReason | Old Wait Reason
  | OldThreadRank
  | OldThreadRemainingQuantum
PrevCState | Prev CState
  | ProcessOutOfMemory
Ready (us) | Ready
ReadyThreadStack | Ready Thread Stack
ReadyTime (s) | Ready Time
ReadyingProcess | Readying Process
ReadyingProcess Name | Readying Process Name
ReadyingThreadId | Readying Thread Id
ReadyingThreadStartFunction | Readying Thread Start Function
ReadyingThreadStartModule | Readying Thread Start Module
  | ReadyThreadInDPC
SwitchInTime (s) | Switch-In Time
Table | Table
Thread Activity Tag | Thread Activity Tag
Thread Name | Thread Name
TimeSinceLast (us) | Time Since Last
Trace # | Trace #
Waits (us) | Waits

File IO View

The File IO view now has an additional column named File Path which acts like the Path Tree in Disc IO, which is extremely useful if you do differential IO analysis. It seems that my complaints about the incomplete feature parity between the Disc and File views have helped a bit.

image

This view still has a bug: if I zoom into it, all graphs are disabled and I have to re-enable them after every zoom operation. If I zoom out everything is ok.

Generic Events

Besides some minor changes the biggest one is that it finally has got a DateTime (local) column. It shows the clock time and not the time since trace start.

This can be very useful for advanced scenarios when someone reported an issue and cared to write down the time when it happened.

image

Process View

At first I thought the Process Tree had been removed and was no longer with us (it did not seem really useful anyway). In fact the Process Tree was newly added to WPT, which is the reason I never saw it before. After using it for some time I find it easier to navigate the tree if you have a parent process which creates many children.

 

Disc IO View

The Disc IO view has got only one additional column, named Boosted, which for my disc IO is always false. I am not sure if this is a Windows 10 only feature. If someone knows more about it please comment.

 

Windows Performance Recorder

The recorder has not changed much. It has got some additional profiles, e.g. for .NET, the Edge browser, … The buffer size for .NET Activity is hard coded to a 160 MB ring buffer, which might not be enough for machines where many .NET apps are running. I still do not buy into this easy to use UI since the xml configs need the values hard coded into them. A value can be absolute or a percentage of the system memory. Both choices are bad if I want to automate performance regression testing with very specific workloads and tracing requirements. This GUI and the command line tool (wpr.exe) are just too inflexible.

Yes, I know the complete xml has inheritance built into it, so I could create derived profiles which override only the specific settings I care about. But this is still not dynamic enough if I want to create a configuration for each problem and pick one depending on the use case at hand.

image

  • xperf is still a good tool to stop a profiling session because it does a full CLR rundown.
  • But PerfView is much faster at creating the managed pdbs by not trying to recreate them over and over again with NGen.
  • Tracelog from the Windows SDK is still the only tool to start a trace session where you can configure the Win 8.1 enhancements to the ETW infrastructure.
    • E.g. filter events from a specific process, or disable events or stackwalking for user mode providers.

Summary

The new WPA tool has got some nice new features, but they still need some polishing to become truly awesome. In my daily work I miss the automatic addition of the etl.NGenPdb folder (when present) for the generated pdbs of an etl file. This is a useful convention, and the old WPA was nice enough to automatically include it in my symbol path.

WPA is very sophisticated. Using it right is an art which needs nearly daily practice. That could explain why this tool, although it has a great UI, still has not gained much traction. It is like Windbg in disguise which will quickly cause information overload to the untrained brain.

But with some basic tricks which I will show in the near future you can crack otherwise unsolvable performance nuts with a few mouse clicks.


Thursday, June 18, 2015 #

Some bugs are easy, some are hard. If you are dealing with a massively multithreaded distributed application consisting of many different processes talking to each other, performance issues are pretty much impossible to detect with standard profilers. But with WPT the word impossible can be replaced with possible, although much work is still needed. Recently I had an interesting regression issue where a use case became over one second slower, which is quite a big difference. So it should be easy to spot, right? Two contributions were easy to find, but one fraction (60%) was still hiding in the data. Since the title of this post is already very clear, here is a small test application which shows the issue:

    class Program
    {
        static Stopwatch StartTime = Stopwatch.StartNew();

        static TimeSpan WaitTime = TimeSpan.FromMilliseconds(2);

        static bool ReturnTrueAfter2ms()
        {
            if( StartTime.Elapsed.Ticks > WaitTime.Ticks )
            {
                StartTime.Restart();
                return true;
            }

            return false;
        }
        
        static void Main(string[] args)
        {
            while (true)
            {
                SpinWait.SpinUntil(ReturnTrueAfter2ms);
            }
        }
    }
In the main method we are continuously spin waiting for an operation which usually takes some milliseconds (e.g. network IO) to complete. The goal of SpinWait is to spare you context switches at the cost of burning CPU cycles. We are talking about sparing some microseconds instead of paying for a context switch. SpinWait will therefore by default spin for a few microseconds until it gives up and tries to yield to other threads with increasing persistence. If you look at the code of SpinWait

 

image

you find that every 20th run it will sleep for 1 ms. That in itself is not a big deal if your method takes longer to execute. Looking only at the code, everything seems sound and reasonable. What could possibly go wrong? Let's have a look at the costs of a context switch with Thread.Yield, Thread.Sleep(0) and Thread.Sleep(1). Thread.Yield is quite cheap if the other thread does not have much work to do. The cost looks like 6 microseconds for doing only a context switch and up to 70 microseconds if the other thread does some work until it waits again for something.

image

Thread.Sleep(0) on the other hand lets the thread run forever if no other threads want to run on this CPU. The scheduler checks on the thread about every 250 ms, where a non switching context switch event is logged, but apart from that nothing stops my thread.

Things become interesting for Thread.Sleep(1) where we forcefully give up our thread quantum.

 

image

Remember the goal of SpinWait is to spare you an expensive context switch. It spins for about 20-80 microseconds. Then it tries from time to time a Thread.Yield, a Thread.Sleep(0) and every 20th time a Thread.Sleep(1). In the comments of SpinWait it is noted that a thread can wait for a long time. In fact a Thread.Sleep(1) can sleep up to 15 ms because you give up your current thread quantum, and the scheduler checks whether your thread is ready to run again only at the next clock tick, which is up to 15.6 ms later. SpinWait is therefore only useful if your operation completes within a few microseconds in 99.9% of all cases. But if your operation frequently or nearly always takes much longer than a typical context switch then this class will not make you any faster. Instead of sparing a few microseconds you are potentially wasting up to 15 milliseconds every time you call SpinWait.SpinUntil!
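The backoff strategy described above can be sketched in code. This is a rough approximation of what SpinWait.SpinOnce does, with thresholds taken from the reference source at the time of writing; treat the exact numbers as implementation details, not a contract:

```csharp
using System;
using System.Threading;

class SpinWaitSketch
{
    // Rough approximation (assumption based on the reference source) of the
    // SpinWait.SpinOnce backoff: busy spin first, then yield with growing
    // aggressiveness, degrading every 20th yield to Thread.Sleep(1).
    public static void SpinOnceSketch(ref int count, int yieldThreshold)
    {
        if (count >= yieldThreshold)
        {
            int yieldsSoFar = count - yieldThreshold;
            if (yieldsSoFar % 20 == 19)
                Thread.Sleep(1);         // gives up the quantum: up to 15.6 ms!
            else if (yieldsSoFar % 5 == 4)
                Thread.Sleep(0);         // yields only to same/higher priority threads
            else
                Thread.Yield();          // yields to any ready thread on this core
        }
        else
        {
            Thread.SpinWait(4 << count); // busy spin, doubling each round
        }
        count++;
    }

    static void Main()
    {
        int count = 0;
        for (int i = 0; i < 50; i++)
            SpinOnceSketch(ref count, yieldThreshold: 10);
        Console.WriteLine("Spun {0} times", count);
    }
}
```

Note how after 50 spin rounds the Thread.Sleep(1) branch has already been taken twice, each of which can cost a full clock tick.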

If you have many call stacks in your application which cause SpinWait to sleep for a thread quantum of 15ms then you are using SpinWait wrongly!

image

This one was very hard to find because I usually search for long pauses when I am after a one second delay. But this time 600ms were scattered across dozens of calls almost evenly distributed throughout the use case because SpinWait was used in a central low level component which was frequently called.
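If the awaited operation routinely takes milliseconds rather than microseconds, a kernel wait with a small built-in spin is usually the better choice. A minimal sketch of such a replacement (the names and the 2 ms delay are illustrative, mirroring the sample above):

```csharp
using System;
using System.Threading;

class EventWaitSketch
{
    // ManualResetEventSlim spins briefly on its own and then falls back
    // to a real kernel wait, so the waiting thread neither burns a core
    // nor oversleeps a full 15.6 ms clock tick like Thread.Sleep(1).
    static readonly ManualResetEventSlim Done = new ManualResetEventSlim(false, spinCount: 100);

    static void Main()
    {
        var worker = new Thread(() =>
        {
            Thread.Sleep(2);   // simulate the ~2 ms operation from the sample
            Done.Set();
        });
        worker.Start();

        // Blocks until signaled instead of polling via SpinWait.SpinUntil.
        bool signaled = Done.Wait(TimeSpan.FromSeconds(5));
        Console.WriteLine(signaled ? "completed" : "timed out");
        worker.Join();
    }
}
```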

If you looked at the Excel chart above and wondered where the difference between "with" and "without Visual Studio" for Thread.Sleep(1) comes from, I can tell you now. By default the Windows timer ticks every 15.6 ms, which then triggers the thread scheduler to check on its running threads. The odd 15.6 ms value comes from 64 ticks/s = 15.6 ms/tick. If someone uses a multimedia resolution timer or changes the system wide timer tick rate, your threads will be checked on more often by the thread scheduler. That is the reason why Thread.Sleep(1) can wait for "only" 2 ms or for 15 ms, depending on which applications are running that may alter the system wide clock rate. Another great source of many more details can be found at https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/.

You can use the clockres tool from SysInternals to check your current timer resolution. The dynamic Windows timer tick rate makes it especially tricky to find performance degradations with unit tests on machines where Visual Studio or another WPF GUI is running, because there the tick rate is 1 kHz. But on an installed server the clock tick rate is 64 Hz, which makes a big difference. Luckily the performance regression tests did show the issue long before any customer would get this version. It really pays off to run performance regression tests. It is even better if you not only execute performance regression tests but also use xperf to record profiling data for all performance tests and store the profiling data somewhere. Otherwise, when a test breaks, you need to go to the testing guys and ask them to install the current baseline which has the issue and an old baseline which was good, then profile both use cases, and only then can you start analyzing. What a waste of time!
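You can observe the effect directly from managed code by timing Thread.Sleep(1). A small sketch; the measured value will depend on which timer-resolution-changing applications happen to be running:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class SleepResolution
{
    static void Main()
    {
        const int Rounds = 20;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Rounds; i++)
            Thread.Sleep(1);
        sw.Stop();

        // On the default 64 Hz clock each Sleep(1) costs ~15 ms; with a
        // 1 kHz multimedia timer active (e.g. Visual Studio or a WPF app
        // running) it drops to ~1-2 ms.
        Console.WriteLine("Average Thread.Sleep(1) duration: {0:F1} ms",
                          sw.Elapsed.TotalMilliseconds / Rounds);
    }
}
```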

It is much easier to let your central performance tests run with ETW profiling, which will affect even heavy use cases by less than 10%. In return you get all the data you could wish for to find the root cause much faster. If you train people how to use WPA to analyze issues and how to use PerfView, Windbg, VMMap and memory dumps, you can create software which has significantly fewer issues than your competitors' because you really understand your system. The usual thinking of treating system libraries and the operating system as black boxes where miracles happen has its place, so you are not distracted by the many things you need to consider before you write one line of software. But you should not stop at the black boxes. Instead you should develop the ability to look into the black boxes when necessary and write a nice blog post about the found issues so other people can learn how things (do not) work.


Tuesday, April 21, 2015 #

Everyone is talking about Windows 10, but what about the developer tools? Since 4/20/2015 you can also download the beta of the upcoming Visual Studio 2015 and the Windows / Phone SDKs. This also includes a new version of the (at least for me) long awaited Windows Performance Analyzer. You can get the ISO image here:

http://dev.windows.com/en-us/windows-10-developer-preview-tools?ocid=WIP_r_Mar_Body_SDK

From there you can download the iso file for the tools. You do not need to install the Visual Studio CTP for this. Now mount the iso file and install for x64 Windows versions (x86 is also there)

\packages\OneSDK\Windows App Certification Kit x64-x86_en-us.msi

This will put the new Windows Performance Toolkit under

C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit

You do not need Windows 10 to be able to install the toolkit beta. You can install the kit at least on Win 8.1 and (not tested) also on Win7 machines.

So what's inside? As far as I can tell the changes are not too dramatic but some nice improvements are there.

 

Xperf

A new -injectonly option was added to -merge.

    -injectonly                  Do not copy (merge) events, only inject 
                                 image identification information. Cannot be 
                                 used with -mergeonly.

What is the meaning of this? You can extract the image identification events from a kernel trace file into a separate file and then merge that with a user etl file, which then becomes readable with WPA (albeit the process names do not show up).

This will generate the extract of the kernel etl file:

xperf -merge  c:\temp\LowLatencyGCEvents.kernel.etl c:\temp\lowlatnetinjected.etl -injectonly

Now you can merge it with a user mode session etl file without ending up with a multi GB etl file, provided the user mode session contains only the few events you are interested in.

xperf -merge  c:\temp\LowLatencyGCEvents.etl c:\temp\lowlatnetinjected.etl c:\temp\mergedWithImage.etl

Besides this there are no new features to xperf that are worth noting.

WPA

Now you can open the enriched user mode etl file with WPA and finally you can read the events:

image

Which new features do you spot in the picture above?

  • Colors! You can select the colors you want for your graphs.
  • You can graph numerical data from custom events now!
    • This works currently only to some extent. I hope it will get better in the released version.
    • The main issue is that it cannot graph values which are defined in their event template with a hex outType:

<template tid="GCHeapStats_V1">
    <data name="GenerationSize0" inType="win:UInt64" outType="win:HexInt64"/>

    • Another issue is that I would like the possibility to graph a running value as a line chart (e.g. memory usage, durations, …). Currently it is displayed only as a spike in the line diagram. Ok, I can use bar charts, but the line diagram type does not look too useful to me since I never know how many events are hiding behind a bar. Yes, I can hover over the bar and press Alt+Space (also a new feature) to see more information about the graphed values.
  • Finally you can add, at least for generic events, a DateTime (Local) column in the View Editor. This is a first step in the right direction. But why on earth is it still not possible to switch on local time for the x axis? This would really help when I spot an issue: I can look at the clock, write down the time, save the trace and continue with the tests. Later I can correlate the gathered data with the timepoints from my notes to see if something suspicious was going on. This is even more annoying for long term recordings (hours or even days) where the customer tells you that around 2 o'clock something bad happened. Currently I have to look into Trace - System Configuration - Traces - Start/End (UTC) Time and calculate from the seconds since trace start and the reported time the actual time window I should look at.
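Until WPA can show local time on the x axis, the manual conversion boils down to simple DateTime arithmetic. A minimal sketch (the trace start time below is a made-up example value):

```csharp
using System;

class TraceTimeConversion
{
    static void Main()
    {
        // Start (UTC) time from Trace - System Configuration - Traces
        // (a made-up example value).
        DateTime traceStartUtc = new DateTime(2015, 4, 20, 23, 55, 0, DateTimeKind.Utc);

        // Seconds since trace start as shown on the WPA x axis.
        double secondsSinceTraceStart = 312.5;

        DateTime eventUtc = traceStartUtc.AddSeconds(secondsSinceTraceStart);
        Console.WriteLine("Event time (UTC):   {0:yyyy-MM-dd HH:mm:ss.fff}", eventUtc);
        Console.WriteLine("Event time (local): {0:yyyy-MM-dd HH:mm:ss.fff}", eventUtc.ToLocalTime());
    }
}
```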

Other Features

Symbol loading was significantly improved. Now you can deselect symbol locations you do not want via a checkbox instead of manually editing a long string. That is a good thing, although I miss the automatic heuristic of adding the xxx.etl.NGenPDB folder of an etl file to the symbol path if the folder exists.

The symbol settings are remembered, which is good. What I do not like is that although I have disabled the MS symbol server, it still tries to load missing symbols from MS when I select to load symbols automatically after trace load.

image

When this is enabled there is no way to cancel the symbol load after it has started and reconfigure it. This is not the best solution for me because it can take a long time until all symbols are downloaded. I have not checked what happens on a machine with no internet connection. The old WPA had a cancel button which only worked after a long delay, but at least it was possible to cancel the symbol download.

image

Custom stack tag files can now be added. Previously you needed to overwrite the default.stacktags file in the folder where WPA was installed. That was easy, but I still felt dirty when overwriting the file.

If you have not used stack tags yet you should definitely have a look at them, because if you see the same issue over and over again you can add the offending method to your stack tag file under a problem node. When the problem node gets much CPU you can spot the "old" issue within seconds without the need to drill deep into call stacks.

image

 

What is still on my Wish List?

  • Path Tree as in File IO like in the Disc IO table view.
  • Add a process command line as column in CPU Usage graphs so I can see which generic process caused which CPU consumption.
  • Some help with Wait Chain analysis. I have seen this option in some WPA video but it seems not to be published yet.
  • WPA still knows nothing about managed code. Basic GC metrics as in PerfView should be part of it!
  • Region files where the SelfNest feature works as expected. Ideally I would like to create a "stack trace" like structure from my custom traces as regions.
    • Currently WPA seems to ignore the TID option and for method enter/leave traces simply uses the nearest trace in the timeline, which gives garbage call stacks.
  • Improved Stack Tagging
    • I want to see for expensive methods how expensive they really are, including GC overhead. Currently this is grouped separately under CLR - GC because the stack tags "steal" from other methods if they have a better matching call stack. The exact logic is not obvious, but it could be something like a match counter where the rule with the most matches wins. It would be nice to add some nesting to stacktags so I can visualize the GC overhead of specific known expensive methods.

image

My Stacktag file looks like this one:

  <Tag Name="MyApplication">
    <Tag Name="Main">
        <Entrypoint Module="LowLatencyGC.exe" Method="*Program::Main*"/>
        <Tag Name="GC">
            <Entrypoint Module="clr.dll" Method="*GC*"/>
        </Tag>

    </Tag>
  </Tag>


  <Tag Name="CLR"> 
    <Tag Name="GC">
      <Entrypoint Module="clr.dll" Method="*GC*"/>
    </Tag>
 
  </Tag>

     

I want to group all methods which call Main under MyApplication and then factor out the GC overhead. For all other methods the generic CLR GC node should match, so I can see the GC overhead of specific methods and the "other" GC overhead nicely lined up. This is currently not possible. Perhaps an IsChild="true" attribute which alters the tagging logic would do the trick?

That's all for now. I am sure I have missed some new features like the Analysis Assistant, which currently looks like a scratch pad for my notes. Perhaps wait chain traversal will show up in a later build? Let's keep our fingers crossed that the existing bugs of the beta get fixed and we get an even better version of a tool which is already awesome.


Wednesday, March 25, 2015 #

Today's little program has a simple task: it should decode a hex bitmask into the original bits as hex values from which it was composed. As a bonus it can also parse an ETW manifest and print the corresponding keywords. Since most low level tools use hex numbers which have specific bits turned on, I often want to know which things were actually enabled. This is useful e.g. to see which providers PerfView enables for .NET specific stuff.

If you check the .NET Alloc checkbox in PerfView, this is what gets enabled:

xperf -Loggers | findstr /i Microsoft-Windows-DotNETRuntime

".NET Common Language Runtime":0xc14fccbd

BitmaskDecoder.exe 0xc14fccbd C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

0x1             GCKeyword
0x4             FusionKeyword
0x8             LoaderKeyword
0x10            JitKeyword
0x20            NGenKeyword
0x80            EndEnumerationKeyword
0x400           SecurityKeyword
0x800           AppDomainResourceManagementKeyword
0x4000          ContentionKeyword
0x8000          ExceptionKeyword
0x10000         ThreadingKeyword
0x20000         JittedMethodILToNativeMapKeyword
0x40000         OverrideAndSuppressNGenEventsKeyword
0x80000         TypeKeyword
0x400000        GCHeapSurvivalAndMovementKeyword
0x1000000       GCHeapAndTypeNamesKeyword
0x40000000      StackKeyword
0x80000000      ThreadTransferKeyword

"Microsoft-Windows-DotNETRuntimePrivate":0x4002000b

BitmaskDecoder.exe 0x4002000b d:\privclr\ClrEtwAll.man


0x1             undocumented
0x2             undocumented
0x8             undocumented
0x20000         MulticoreJitPrivateKeyword
0x40000000      StackKeyword

 

If you on the other hand enable ETW .NET Alloc you get

BitmaskDecoder.exe 0xc16fccbd C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

0x1             GCKeyword
0x4             FusionKeyword
0x8             LoaderKeyword
0x10            JitKeyword
0x20            NGenKeyword
0x80            EndEnumerationKeyword
0x400           SecurityKeyword
0x800           AppDomainResourceManagementKeyword
0x4000          ContentionKeyword
0x8000          ExceptionKeyword
0x10000         ThreadingKeyword
0x20000         JittedMethodILToNativeMapKeyword
0x40000         OverrideAndSuppressNGenEventsKeyword
0x80000         TypeKeyword
0x200000        GCSampledObjectAllocationHighKeyword
0x400000        GCHeapSurvivalAndMovementKeyword
0x1000000       GCHeapAndTypeNamesKeyword
0x40000000      StackKeyword
0x80000000      ThreadTransferKeyword

The only difference between .NET Alloc and ETW .NET Alloc is that PerfView additionally enables the high sampling rate (max 100 alloc events/s) with GCSampledObjectAllocationHighKeyword. If you want to record longer with fewer events you can use this mask and replace GCSampledObjectAllocationHighKeyword with GCSampledObjectAllocationLowKeyword to get 20 times fewer allocation events (max 5 alloc events/s).
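You can verify that the two masks differ in exactly this one bit with a quick XOR:

```csharp
using System;

class MaskDiff
{
    static void Main()
    {
        ulong dotNetAlloc    = 0xc14fccbd; // .NET Alloc
        ulong etwDotNetAlloc = 0xc16fccbd; // ETW .NET Alloc

        // XOR keeps exactly the bits that differ between the two masks.
        Console.WriteLine("0x{0:X}", dotNetAlloc ^ etwDotNetAlloc); // 0x200000
    }
}
```

0x200000 is GCSampledObjectAllocationHighKeyword from the decoded list above.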

This makes it easy to find out what other people consider good ETW settings, provided you have the manifest for the provider. Once you have found the list of hex values you want to combine you can call

BitmaskDecoder.exe -encode 0x1 0x2 0x4 0x80000

Encoded value: 0x80007

to get the corresponding hex mask. You can of course also use it for other bit masks if you omit the ETW manifest file name.

Here is the code for this simple tool:

Update: Now it also supports keywords. Thanks for the suggestion Andre.

You can now also write

BitmaskDecoder.exe -encode C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man  JittedMethodILToNativeMapKeyword GCHeapAndTypeNamesKeyword

The executable can also be found here.

using System;
using System.Globalization;
using System.IO;
using System.Linq;
 
namespace BitMaskDecoder
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                if (args.Length == 0)
                {
                    Help();
                    return;
                }
 
                string[] lines = null;
 
                if (args[0] == "-encode")
                {
                    if( args.Length < 2 )
                    {
                        Help();
                        return;
                    }
 
                    int skip = 1;
                    if( File.Exists(args[1]) )
                    {
                        lines = File.ReadAllLines(args[1]);
                        skip++;
                    }
                    ulong res = args.Skip(skip).Select(x => x.TrimStart(LeadingHex))
                                               .Select(str => DecodeKeyword(str,lines))
                                               .Aggregate((x, y) => x | y);
                    Console.WriteLine("Encoded value: 0x{0:X}", res);
                    return;
                }
 
                if (args.Length > 1)
                {
                    lines = File.ReadAllLines(args[1]);
                }
 
                string hex = args[0].TrimStart(LeadingHex);
 
                ulong hexNum = ulong.Parse(hex, System.Globalization.NumberStyles.AllowHexSpecifier);
                for (int i = 0; i < 64; i++)
                {
                    ulong mask = 1ul << i;
                    ulong hexMask = mask & hexNum;
                    if (hexMask != 0)
                    {
                        string value = String.Format("0x{0:X}", hexMask);
                        string searchValue = value + "\"";
                        bool bFound = false;
                        if (lines != null)
                        {
                            string curProvider = null;
                            for (int line = 0; line < lines.Length; line++)
                            {
 
                                string lineStr = lines[line];
                                if( lineStr.Contains("<provider") )
                                {
                                    curProvider = ExtractName(lineStr);
                                }
 
                                // we are after lines like <keyword name="GCKeyword" mask="0x1" message="$(string.RuntimePublisher.GCKeywordMessage)" symbol="CLR_GC_KEYWORD"/>
                                if (lineStr.Contains(searchValue) && lineStr.Contains("keyword"))
                                {
                                    Console.WriteLine("{0,-8}\t{1,-50}\t{2}", value, ExtractName(lineStr), curProvider);
                                    bFound = true;
                                }
                            }
                        }
 
                        if (!bFound)
                        {
                            Console.WriteLine("{0,-8}\t{1}", value, "undocumented");
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: {0} {1}", ex.Message, ex.StackTrace);
            }
        }
 
        static void Help()
        {
            Console.WriteLine("BitmaskDecoder 0xddddd [ETWManifest] or -encode [manifestfile] 0xddd 0xddddd keyword1 keyword2 ....");
            Console.WriteLine("\tDecode a hex value into its bits and print the bits as hex values.");
            Console.WriteLine("\tWhen ETWManifest file is present the file is parsed and the matching lines of the manifest with the keywords are displayed");
        }
 
        static ulong DecodeKeyword(string hexOrKeyword, string[] manifest)
        {
            ulong lret = 0;
            if (!ulong.TryParse(hexOrKeyword, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out lret))
            {
                lret = ulong.Parse(ExtractMaskValue(hexOrKeyword, manifest), NumberStyles.AllowHexSpecifier);
            }
            return lret;
        }
 
        static string ExtractName(string line)
        {
            return ExtractAttribute(line, "name");
        }
 
        static string ExtractMaskValue(string keyword, string[] lines)
        {
            string hex = lines.Where(line => line.Contains("<keyword") && line.Contains(keyword))
                              .Select(line => ExtractAttribute(line, "mask"))
                              .FirstOrDefault();
            if( hex == null )
            {
                throw new NotSupportedException(String.Format("The keyword {0} was not found in the manifest", keyword));
            }
            return hex.TrimStart(LeadingHex);
        }
 
        static string ExtractAttribute(string line, string attribute)
        {
            return new string(line.Substring(line.IndexOf(attribute))
                                            .SkipWhile(c => c != '"') // search first "
                                            .Skip(1)                  // skip "
                                            .TakeWhile(c => c != '"') // take all until next "
                                            .ToArray());
        }
 
        static char[] LeadingHex = new char[] { '0', 'x' };
 
    }
} 


Sunday, March 22, 2015 #

Since ETW know-how is still considered an arcane science by most people, I get the chance to look at a lot of "easy" things which affected performance a lot but went by unnoticed until I looked into them with ETW tracing. While checking some GC related stuff to reduce memory allocations I noticed that one process had object[] allocations on the Large Object Heap. This is nothing unusual at all, but the allocation stack was interesting.

image

The allocations were coming from Delegate.Remove. What does this tell me when I see large object allocations in this method? Inside MulticastDelegate you can find the allocation with e.g. ILSpy at

private object[] DeleteFromInvocationList(object[] invocationList, int invocationCount, int deleteIndex, int deleteCount)
{
    object[] array = this._invocationList as object[];
    int num = array.Length;
    while (num / 2 >= invocationCount - deleteCount) // Try to use a much smaller array /2 or /4 or even less if the remaining delegates can fit into it.
    {
        num /= 2;
    }
    object[] array2 = new object[num];    // Here is the delegate list allocation
    for (int i = 0; i < deleteIndex; i++)  // Copy all delegates until we get to the deletion index
    {
        array2[i] = invocationList[i];
    }
    for (int j = deleteIndex + deleteCount; j < invocationCount; j++)  // continue filling up the array with the remaining delegates after the deletion index and deletion count 
    {
        array2[j - deleteCount] = invocationList[j];
    }
    return array2;
}

This code basically tries to allocate a smaller array (half of the original size or less) if possible. Otherwise an array with the same size is allocated. The for loops copy the remaining delegates into the new invocation array.

Ok now we know that we have allocated a new array for the delegate invocation list. What is the error here?

Nope not yet.

Still not.

I strongly suspect that we have an event handler leak here. Why? Delegate.Remove is called mainly by event handler remove methods. Since the allocated object array goes to the LOH it must be bigger than 85000 bytes. How many object references fit into an 85k array under x64?

85000/8 = 10625 Delegates

We know therefore that the event which comes later in the call stack had more than 10625 subscribers. So many subscribers are usually the sign of an event handler leak, and sure enough it was one. It was a static event which was the source of a quite huge memory leak, very visible in some memory dumps which were taken later. Thanks to ETW allocation profiling I was able to see the root cause of the memory leak without using a debugger.
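The back-of-the-envelope calculation is easy to check in code. A sketch with an illustrative subscriber count:

```csharp
using System;

class HandlerLeakSketch
{
    // Illustrative leaky event: handlers are added but never removed.
    static event EventHandler Leaky = delegate { };

    static void Main()
    {
        const int Subscribers = 11000;
        for (int i = 0; i < Subscribers; i++)
            Leaky += (s, e) => { };

        int count = Leaky.GetInvocationList().Length;

        // Under x64 every slot of the invocation object[] is an 8 byte
        // reference, so more than 85000 / 8 = 10625 handlers push the
        // array onto the Large Object Heap.
        long arrayBytes = (long)count * 8;
        Console.WriteLine("{0} handlers, ~{1} bytes, on LOH: {2}",
                          count, arrayBytes, arrayBytes >= 85000);
    }
}
```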

If you are not using PerfView for some reason but prefer xperf or tracelog to configure ETW tracing you can enable .NET Allocation Sample profiling with

xperf -on proc_thread+loader+profile -stackwalk profile -f c:\temp\kernel.etl -start clr -on Microsoft-Windows-DotNETRuntime:0x40200095 -f c:\temp\clr.etl

xperf -stop -stop clr -d c:\temp\traces.etl

The magic value 0x40200095 comes from

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

Keyword Name Description
0x0000000000000001 GCKeyword GC events
0x0000000000000004 FusionKeyword        Binder (Log assembly loading attempts from various locations)
0x0000000000000008 LoaderKeyword       Loader (Assembly Load events). Also necessary to get the module name in JITed call stacks.
0x0000000000008000 ExceptionKeyword Exception
0x0000000000200000 GCSampledObjectAllocationHighKeyword When enabled limit rate of allocation events to 100 events/s.
0x0000000002000000 GCSampledObjectAllocationLowKeyword When enabled limit rate of allocation events to 5 events/s.
0x0000000040000000 StackKeyword When enabled a CLR Stackwalk is performed for every GCSampledObjectAllocation event. This is how PerfView gets its call stacks even for non Ngenned x64 code on Win7 machines where the ETW stackwalks would skip Jited code.
0x0000000000000010 JitKeyword Needed when you want to get call stacks of an application. If the application was already started you need to force the enumeration of all running processes with an extra CLR Rundown provider which gets all required data.
0x0000000000000080 EndEnumerationKeyword Needed to enumerate JITed methods upon process exit which were not recorded by the JitKeyword

The xperf line above is all you need if the process starts after your ETW session is already running and terminates while the session is still running. If not, you need to force the CLR to enumerate all methods and modules which were JITed. E.g. if your process was already running and still runs when you stop profiling, you need a second session to enumerate all already JITed methods. This is called CLR rundown and uses another provider.

xperf -start clrrundown -on Microsoft-Windows-DotNETRuntimeRundown:0x118 -f c:\temp\rundown.etl

To merge it with the previous recording you can use

xperf -merge c:\temp\traces.etl c:\temp\rundown.etl c:\temp\tracesMerged.etl

The value 0x118 is the combination of LoaderRundownKeyword (0x8), JitRundownKeyword (0x10) and EndRundownKeyword (0x100). The other values, which are responsible for NGen stuff, are not needed anymore because we can generate pdbs from the native images with xperf and PerfView while stopping an ETW recording, which is the fastest approach.
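The keyword arithmetic above is easy to double-check; a minimal sketch, assuming the rundown keyword values quoted from the manifest:

```csharp
// Verifies that the rundown mask 0x118 is the OR of LoaderRundownKeyword,
// JitRundownKeyword and EndRundownKeyword from the CLR rundown provider manifest.
using System;

class RundownKeywords
{
    public const long LoaderRundownKeyword = 0x8;
    public const long JitRundownKeyword    = 0x10;
    public const long EndRundownKeyword    = 0x100;

    static void Main()
    {
        long mask = LoaderRundownKeyword | JitRundownKeyword | EndRundownKeyword;
        Console.WriteLine("0x{0:X}", mask); // prints 0x118
    }
}
```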

Since CoreCLR is open sourced we can finally get our hands on all the internals. It turns out that we are not limited to 5 (GCSampledObjectAllocationLowKeyword) or 100 (GCSampledObjectAllocationHighKeyword) allocation events/s. If we want to get them all we can set the environment variable

set COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec=10000

to the number of allocation events/s we want. If you set it to 10000 you get nearly all allocations at the cost of a slower system, but if you need exact numbers this can help in some situations. If you want to see the other hidden COMPLUS flags, have a look at https://github.com/dotnet/coreclr/blob/master/src/inc/clrconfigvalues.h. There are some pretty interesting tweaks inside it. E.g. you can configure the spinning behavior of CLR locks aka Monitor and other stuff which could come in handy to tweak or debug an application.
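Since the CLR reads such variables at startup, they must be in the environment of the target process before it launches. A hedged sketch of preparing such a launch from code (the variable name is taken from clrconfigvalues.h as quoted above; "YourApp.exe" is a placeholder):

```csharp
// Prepares a child-process launch with the allocation event rate limit raised.
// The variable must be set in the child's environment before it starts, because
// the CLR evaluates COMPLUS settings only once at process startup.
using System;
using System.Diagnostics;

class StartWithAllocEvents
{
    public static ProcessStartInfo Prepare(string exe)
    {
        var psi = new ProcessStartInfo(exe) { UseShellExecute = false };
        psi.EnvironmentVariables["COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec"] = "10000";
        return psi;
    }

    static void Main()
    {
        ProcessStartInfo psi = Prepare("YourApp.exe"); // placeholder executable
        // Process.Start(psi); // would launch the target with the variable set
        Console.WriteLine(psi.EnvironmentVariables["COM_PLUS_UNSUPPORTED_ETW_ObjectAllocationEventsPerTypePerSec"]);
    }
}
```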

PerfView knows other .NET private and stress providers, but there used to be no manifests for them. This is no longer true. Since CoreCLR is basically a clone of the desktop CLR we can take a look at ClrEtwAll.man which contains the definitions of all CLR providers. The only missing thing is the resource dll to register it. I have compiled CoreCLR and put the resulting clretwrc.dll along with the manifest for the Microsoft-Windows-DotNETRuntimeStress (CC2BCBBA-16B6-4cf3-8990-D74C2E8AF500) and Microsoft-Windows-DotNETRuntimePrivate (763FD754-7086-4dfe-95EB-C01A46FAF4CA) providers here: https://drive.google.com/folderview?id=0BxlmobpeaahAbEdvWHhQb1NCcFU&usp=sharing&tid=0BxlmobpeaahAa2dMVEMyVG9nRnc.

You need to put the resource dll into D:\privclr or edit the path in ClrEtwAll.man to point to your location of the dll to make the registration with

wevtutil im ClrEtwAll.man

successful. If you really want to look into these logs you can do it with WPA now as well. Some events could be interesting for troubleshooting:

image

The xperf calls above were only examples; -BufferSize, -MinBuffers, -MaxBuffers and -MaxFile were omitted to keep them readable. In reality you want to record either until a specific recording size is reached (-MaxFile; e.g. 4 GB is already quite huge), or you configure -FileMode Circular to record into a ring buffer, which enables you to record for a very long time until something interesting happens and you stop the recording. This can be achieved e.g. with performance counter triggers which execute a command when a performance counter goes above or below a threshold.

A more typical xperf command line to record into a ring buffer with 2 GB of buffer size, which captures

  • Context switches with call stacks
  • Sampling profiling with a 1 ms sample rate and call stacks
  • Disk IO with file names
  • Process lifetimes
  • Process memory information

is

xperf -on proc_thread+loader+meminfo+disk_io+filename+cswitch+dispatcher+profile -stackwalk profile+cswitch+readythread -f C:\kernelCircular.etl -MinBuffers 100 -BufferSize 512 -MaxBuffers 2000 -MaxFile 2000 -FileMode Circular

WPR always uses a 10% physical memory ring buffer (1,6GB on my 16 GB box) which is allocated right from the start. The xperf trace buffers on the other hand can grow up to 2GB in 512KB chunks. This is better because you usually need two trace sessions (one user and one kernel session), and you do not want to allocate the maximum, potentially never reached, size upfront for each session.

That's it for today. Happy troubleshooting and try out your new tools. If you had good or bad experiences with these tools I would love to hear about it.


Monday, January 26, 2015 #

Since most of us use multiple threads we get all sorts of race conditions, which we can solve with proper locking and concurrent data structures. A simple example: an object exposes a Dictionary<string,string>, and you get exceptions back when concurrent writers add the same data to it. You would then either need to coordinate the writers or use a different dictionary which allows concurrent adds. As it turns out, ConcurrentDictionary is just the collection you were searching for, and your exceptions are gone. You let the tests run. All green, check in and submit it to the main branch.

A long time later, at the end of the project, performance issues that were ignored until now tend to get identified and fixed. Someone calls you to tell you that your checkin from long ago is the root cause of a major performance issue and that you show up as Mount Everest in the VirtualAlloc reports.

 

image

How can that be? The problem with the concurrent collections, which claim to be largely lock free, is that to achieve uncontended writes they allocate their key data structures per core. This means you are allocating memory proportional to the number of cores. That is not a problem if you have relatively few instances around, but if you exchange the dictionary in key data structures which are heavily allocated, things get much slower. You can test it for yourself with a small sample app like this:

    class DataObject // Original Version
    {
        public IDictionary<string, string> Data { get; private set; }

        public DataObject()
        {
            Data = new Dictionary<string, string>();
        }
    }

    class FixedDataObject // Fixed version with ConcurrentDictionary
    {
        public IDictionary<string, string> Data { get; private set; }

        public FixedDataObject()
        {
            Data = new ConcurrentDictionary<string, string>();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<DataObject> dataList = new List<DataObject>();
            List<FixedDataObject> dataListFixed = new List<FixedDataObject>();

            var sw = Stopwatch.StartNew();
            const int Runs = 300*1000;
            for(int i=0;i<Runs;i++)
            {
                if( args.Length == 0 ) // No args use old
                {
                    dataList.Add(new DataObject());
                }
                else // some args passed use fixed version
                {
                    dataListFixed.Add(new FixedDataObject());
                }
            }
            sw.Stop();
            Console.WriteLine("{0:N0} did take {1:F2}s", Runs, sw.Elapsed.TotalSeconds);

            GC.KeepAlive(dataList);
            GC.KeepAlive(dataListFixed);
        }
    }

Old Version

300.000 did take 0,14s

Fixed Version with Concurrent Dictionary

300.000 did take 1,08s

Update

On a 24 core server it takes 3,3s and consumes 1,4GB of memory. Data structures which allocate proportionally to the number of cores for the sake of thread performance do not sound like a good idea to me.

This is over 7 times slower and introduces tons of GCs in your application, limiting what you can do in parallel even further. The worst part is that the bigger the server you let it run on, the slower it will get (more cores means more memory to allocate!). On a slow dev machine with 2 or 4 cores you might not notice it much. But if you deploy this on real server hardware with 40 cores you have a really bad performance problem at hand. That is the reason why you need to do performance testing on virtually every hardware configuration you support.

The fix was to go back to a plain dictionary with a lock. ConcurrentDictionary has a constructor that takes a concurrency level, which controls how many internal partitions it allocates for isolated writes, but it was still a factor of two slower.
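The "plain dictionary with a lock" fix can be sketched as a thin wrapper (LockedDictionary is a made-up name for illustration; the real code would forward more of the IDictionary surface):

```csharp
// Minimal sketch of the fix described above: a Dictionary guarded by a single
// lock. Per-instance memory cost stays constant regardless of the core count.
using System.Collections.Generic;

class LockedDictionary<TKey, TValue>
{
    readonly Dictionary<TKey, TValue> _dict = new Dictionary<TKey, TValue>();
    readonly object _lock = new object();

    public void AddOrUpdate(TKey key, TValue value)
    {
        lock (_lock)
        {
            _dict[key] = value; // last writer wins, no exception on duplicate adds
        }
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        lock (_lock)
        {
            return _dict.TryGetValue(key, out value);
        }
    }
}
```

The single lock is contended only while writers actually collide, whereas the per-core arrays of ConcurrentDictionary are paid for by every instance, whether it ever sees contention or not.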

So beware: by simply replacing non-concurrent data structures with their lock-free but much more memory hungry counterparts, you might make your system much slower. In effect we were loading data more slowly the bigger the server got. This is definitely not what the customer paid for.


Sunday, December 21, 2014 #

We all make dumb mistakes which are not interesting to share. But sometimes people make interesting mistakes. Below is a simple helper class that makes combining several delegates in a logical AND manner very easy.

 

    static class Constraints<T>
    {
        /// <summary>
        /// Logical and of a list of predicates.
        /// </summary>
        public static Predicate<T> And(params Predicate<T>[] ands)
        {
            // return a new delegate which checks if all delegates in the array did return 
            // true for a given item
            return item => ands.All(x => x(item));
        }
    }

You can use it then like this:

   class Program
    {
        static void Main(string[] args)
        {
            int[] data = new int[20 * 1000 * 1000]; // 20 million ints

            Predicate<int> CheckForZero = Constraints<int>.And(i => i > -1, 
                                                               i => i < 1);

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < data.Length; i++)
            {
                if (CheckForZero(data[i]) == false)
                {
                    Console.WriteLine("Violation found at {0}", i);
                }
            }
            sw.Stop();
            Console.WriteLine("Did take: {0:F2}s", sw.Elapsed.TotalSeconds);
        }
   }
Now you have a nice way to combine separate checks in a reusable way. The only issue is that for many calls to CheckForZero the running time is quite high.

Did take: 0,90s

If we compare this to the "normal" way

            for (int i = 0; i < data.Length; i++)
            {
                if ( !( data[i] > -1 && data[i] < 1) )
                {
                    Console.WriteLine("Violation found at {0}", i);
                }
            }

Did take: 0,01s

Whoa. The delegate version is 90 times slower! I found this issue by looking at the Mount Everests in terms of CPU consumption. What is causing so much overhead?

Well you just did wake up a sleeping dragon

image

with your LINQ delegate version. It is the Garbage Collector! Here is the allocation report of our slow sample:

image

We have an 80MB int array with 20 million elements. But we allocate 2,5GB of closure classes! Our Constraints.And method needs one closure to capture the list of delegates.

This closure is allocated only once, since we cache the returned delegate, and it will not show up in the profiler at all.

image

 

Things get interesting when we invoke the And method of the captured class. In this case we are allocating one additional closure to give access to the current item to check, and the enclosing capturing class which gives us access to the ands array also needs to be allocated. On top of that we need to create a new Func<Predicate<T>, bool> delegate instance as argument to the All LINQ statement.

image

To sum it up we allocate on each iteration

  • 1 c_DisplayClass4 closure: 32 bytes
  • 1 Func<Predicate<T>, bool> delegate instance: 64 bytes
  • 1 SZArrayHelper+SZGenericArrayEnumerator: 32 bytes
    • Where does this come from? It is caused by the All LINQ statement, which iterates over an IEnumerable. When you foreach over an IEnumerable it needs to get an enumerator from the object to enumerate. Since an array has none out of the box, the CLR allocates a helper enumerator, which is this strange SZArrayHelper class. If you directly foreach over an array this is NOT used. Instead more efficient code is generated.
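The enumerator allocation from the last bullet can be demonstrated in isolation. A sketch: iterating an array through its IEnumerable<T> interface forces a heap-allocated enumerator object, while a direct foreach over the array compiles down to an index loop.

```csharp
// Both methods compute the same sum, but only the interface-based one causes
// the CLR to hand out an SZGenericArrayEnumerator instance per call.
using System;
using System.Collections.Generic;

class EnumeratorDemo
{
    public static int SumViaInterface(IEnumerable<int> data) // enumerator allocated per call
    {
        int sum = 0;
        foreach (int i in data) sum += i;
        return sum;
    }

    public static int SumDirect(int[] data) // compiled to an index loop, allocation free
    {
        int sum = 0;
        foreach (int i in data) sum += i;
        return sum;
    }

    static void Main()
    {
        int[] data = { 1, 2, 3 };
        Console.WriteLine(SumViaInterface(data)); // prints 6, but allocates an enumerator
        Console.WriteLine(SumDirect(data));       // prints 6, no allocation
    }
}
```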

 

In total we allocate 128 bytes for each integer (4 bytes) we want to check, which is quite some overhead. 128*20.000.000=2.560.000.000 bytes, which is exactly what the profiler reports. Since I used the ETW .NET Alloc method of PerfView

image 

I get a correct total sum of the allocated bytes back. But the types which are responsible for the allocations, and especially the counts, can be misleading because the events are sampled. If you want exact allocation numbers you need to check the .NET Alloc checkbox instead, which injects a profiling dll into all newly started processes to record every allocation. The downside is that the use case no longer runs in 1s but takes more than 2 minutes, which is not something I really like to wait for. Unfortunately this mode does not seem to work with either the old or the current PerfView (1.7) version, where I see anything but my new allocations.

image

When in doubt I would always use ETW .NET Alloc to get reliable, although sampled, numbers. This also works for JITed x64 code on Windows 7 machines, because for each allocation event a managed stackwalk is performed in addition, which gives you the full allocation stack. Plain ETW stackwalks do not work for JITed x64 code on Win7, except if you precompile the code with NGen as a workaround. On Win 8 and beyond this is no longer an issue, except if you use API hooking: there the x64 stack walker still stops walking the stack when the method interception is done in a form it cannot recognize.

This sounds like some complex stuff. Can we fix it? How about providing an overload to our class?

        public static Predicate<T> And(Predicate<T> p1, Predicate<T> p2)
        {
            return item => p1(item) && p2(item);
        }
 

This little overload spares us pretty much all of the overhead. We do not allocate a single byte while iterating over our array. This time we get

Did take: 0,14s

That is over six times faster (0,90s vs 0,14s). This is quite good for such a simple overload, which the compiler prefers as the best match when it is applicable. Now we only pay for 3 delegate invocations which are not avoidable if we want to keep the flexibility here.
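If more than two predicates are needed, a hedged alternative keeps the params flexibility but drops LINQ (renamed ConstraintsNoLinq here so it does not clash with the original class): the returned delegate walks the array with foreach, which for arrays compiles down to an index loop and therefore allocates nothing per invocation.

```csharp
// A variant of the And helper from above without LINQ. The ands array is
// captured once when And() is called; invoking the returned predicate
// allocates nothing, unlike the All()-based version.
using System;

static class ConstraintsNoLinq<T>
{
    public static Predicate<T> And(params Predicate<T>[] ands)
    {
        return item =>
        {
            foreach (Predicate<T> p in ands) // no enumerator object for arrays
            {
                if (!p(item))
                    return false;
            }
            return true;
        };
    }
}

class AndDemo
{
    static void Main()
    {
        Predicate<int> checkForZero = ConstraintsNoLinq<int>.And(i => i > -1, i => i < 1);
        Console.WriteLine(checkForZero(0)); // prints True
        Console.WriteLine(checkForZero(5)); // prints False
    }
}
```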

Now go and check your LINQ statements for hidden dragons. 2,5GB of allocations from one line are quite easy to achieve with delegates and LINQ. These subtle delegate issues are the main reason why LINQ is banned from the Roslyn project. It is kind of ironic that the inventors of LINQ do not want to use their own invention in their C# compiler backend for performance reasons.


Sunday, December 14, 2014 #

When you want fast and efficient file IO in Windows you can specify a myriad of flags to CreateFile. But which of these many flags help in which scenarios? MSDN says that some read ahead buffering is enabled if a sequential scan is hinted. First, a little look into the docs can help:

 

FILE_FLAG_SEQUENTIAL_SCAN  0x08000000

Access is intended to be sequential from beginning to end. The system can use this as a hint to optimize file caching.

This flag should not be used if read-behind (that is, reverse scans) will be used.

This flag has no effect if the file system does not support cached I/O and FILE_FLAG_NO_BUFFERING.

For more information, see the Caching Behavior section of this topic.

Under Caching Behavior is more information:

Specifying the FILE_FLAG_SEQUENTIAL_SCAN flag can increase performance for applications that read large files using sequential access. Performance gains can be even more noticeable for applications that read large files mostly sequentially, but occasionally skip forward over small ranges of bytes. If an application moves the file pointer for random access, optimum caching performance most likely will not occur. However, correct operation is still guaranteed.

FILE_FLAG_RANDOM_ACCESS   0x10000000

Access is intended to be random. The system can use this as a hint to optimize file caching.

This flag has no effect if the file system does not support cached I/O and FILE_FLAG_NO_BUFFERING.

For more information, see the Caching Behavior section of this topic.

 

The Old New Thing clears up the confusion by adding that if you specify neither flag, the OS is free to use read ahead or not, depending on the observed usage pattern. An interesting note is that if you specify RANDOM_ACCESS, the read pages are evicted from the file system cache with the usual LRU policy. This ensures that random reads do not pollute the file system cache. But if you read the same stuff several times shortly after the first read you still get the cached pages back. Let's try it out by reading the first 10KB from 1000 files with two 5000 byte reads per file:

        static void ReadOnly(string dir)
        {
            string[] files = Directory.GetFiles(dir, "*.bin");
            var sw = Stopwatch.StartNew();
            int bytesRead = 0;
            foreach(var file in files)
            {
                Parser.FlushFileBufferOfFileByOpeningItUnbuffered(file);
                using (FileStream stream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 1, FileOptions.RandomAccess))
                {
                    byte[] buffer = new byte[5000];

                    bytesRead += stream.Read(buffer, 0, buffer.Length);
                    bytesRead += stream.Read(buffer, 0, buffer.Length);
                }
            }
            sw.Stop();
            Console.WriteLine("Did read {0:N0} bytes from {1} Files in {2:F2}s", bytesRead, files.Length, sw.Elapsed.TotalSeconds);
        }

To see the actual disc access every time we execute this code, we need to flush the file system cache for this file. It turns out that you can do this quite easily by opening the file once for unbuffered IO like this:

        const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

        public static void FlushFileBufferOfFileByOpeningItUnbuffered(string fileName)
        {
            var tmp = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 1, FileOptions.SequentialScan | FileFlagNoBuffering);
            tmp.Dispose();
        }

FileOptions.RandomAccess

Now the file data is always read from the physical disc through the File IO layer. Since the file system cluster size is usually 4KiB, the first 5000 byte read fetches 8192 bytes and the second one another 4096 bytes to satisfy the 10 000 requested bytes. After enough data has been read from the device you can see the two actual 5000 byte reads from offset 0 and 5000, as you would expect them. From the hard disc 12288 bytes are actually read. From an IO perspective it therefore makes no difference whether you read 10000 bytes or 12KiB from a disc with a 4KiB cluster size.

image
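The cluster rounding above can be sketched as a small calculation, assuming the 4 KiB cluster size mentioned in the text:

```csharp
// Buffered reads are satisfied in whole clusters, so the cost of a read is its
// byte count rounded up to the next cluster boundary.
using System;

class ClusterMath
{
    const long ClusterSize = 4096; // 4 KiB cluster size as in the example

    public static long RoundUpToCluster(long bytes)
    {
        return ((bytes + ClusterSize - 1) / ClusterSize) * ClusterSize;
    }

    static void Main()
    {
        Console.WriteLine(RoundUpToCluster(5000));  // first 5000 byte read  -> prints 8192
        Console.WriteLine(RoundUpToCluster(10000)); // 10 000 bytes in total -> prints 12288
    }
}
```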

If you want to find the read sizes actually requested by the application you need to filter for the File IO Init stacks (checked with Win 8.1)

fltmgr.sys!FltpPerformPreCallbacks

fltmgr.sys!FltpPassThrough

whereas the other stacks deal with read ahead caching and the translated calls to the actual device, where the read size is adjusted to the sector read size the device actually supports.

image

You might ask what the IRP column is for. An IRP is an I/O request packet, which is similar to an opaque file handle. Each device driver can create its own IRP when it issues IO requests to other drivers. That way you can follow how different drivers split up their work. So far I have not found a useful grouping for it.

It looks like FileOptions.RandomAccess behaves largely like unbuffered IO, where you read exactly as much data as you need in the granularity of the cluster size of the actual device.

FileOptions.SequentialScan

Now we can try it a second time using FileOptions.SequentialScan. The first read is 8KiB and the next one is 184KiB, which comes from the read ahead prefetching mechanism. The assumption that we will read much more than the first 10KB is quite optimistic and leads in this case to over 18 times more hard disc reads than we would need for this use case.

image

FileOptions.SequentialScan is certainly a bad choice for small buffered reads. The timings show that as well.

Scenario: 1K files with 2x5000 byte reads    Time in s
FileOptions.RandomAccess                     3,9
FileOptions.SequentialScan                   6,5

FileOptions.None

Last but not least, what happens when we do not specify any option (FileOptions.None)? At least in this case it behaves exactly like sequential scan: 196MB of data is fetched from the disc although we only wanted to read 10MB of file headers.

Conclusions

That sounds as if you should never use SequentialScan. But in most scenarios, where you read a file, parse a little bit and then read the next chunk of the file, this is exactly what you want. In theory you can use async IO to read the file asynchronously while parsing the stuff on another thread. But if you are lucky you can do it without extra threading on your side, while the OS is kind enough to read ahead of your application logic while you are still reading and parsing the file single threaded. You get the benefits of multithreading without using extra application threads for it!

The read ahead logic is not optimal in all cases, but it helps where otherwise much more complex code would be necessary to read from a hard disc at full speed (60-300MB/s). What is the moral of this story? Never use FileOptions.SequentialScan or FileOptions.None if you want to read only the file headers! Otherwise you are consuming much more IO bandwidth than necessary, and you are left scratching your head why switching from unbuffered IO to buffered IO has made reading your files so much slower.


Monday, December 1, 2014 #

Recently I got an interesting drawing issue which looks like a Windows bug. Let's suppose you want to draw a Chinese string in a Windows Forms or a WPF application. There you have two options.

  1. TextRenderer.DrawText
  2. Graphics.DrawString

The problem is that both methods lead to badly rendered text when the used font is bold. The "old" method using Graphics.DrawString renders some Chinese glyphs wrong. If you try a new shiny WPF application you will find out that WPF uses the equivalent of Graphics.DrawString, since the text looks the same. Graphics.DrawString and TextRenderer draw the following string with an Arial font like this:

image

Did you notice that the last 3 characters are slightly shifted upwards and rendered in some bold font? This is normally a sign that the used font (Arial in this case) did not contain all Unicode characters of the string to draw. Before drawing empty rectangles, Windows uses the next suitable font, which looks like Microsoft YaHei in this case.

To fix this you need to tweak the registry and add a key for Arial to prefer a different font, so that the complete string is rendered with the same font. With the following setting you make SimSun the first preferred font for Arial (the Arial key did not even exist on my 8.1 machine).

image
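For reference, the setting shown above uses the standard Windows font linking mechanism. A sketch of the value (the registry path is the documented FontLink location; the exact font file name SIMSUN.TTC is an assumption that depends on which fonts are installed):

```
Key:        HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink
Value name: Arial   (REG_MULTI_SZ)
Value data: SIMSUN.TTC,SimSun
```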

After this modification you need to log out and log in again to enable the change. The only issue is that Graphics.DrawString draws the string just like before. No change here. Either it ignores Font Linking in favor of some awkward switch like if (char > dddd) "Microsoft YaHei" else "Arial", or it has a plain bug in it. The usual suspects on StackOverflow recommend switching to TextRenderer, which works much better. But it breaks if you try to render bold fonts, which look quite bad if you do not explicitly set a background color. Here are some pictures of how the text changes as the font size grows:

FontRendering

In my application I cannot set a background color since I am drawing onto a bitmap which is overlayed with other graphics that must remain visible behind the rendered string. The other option is to use e.g. Magenta as background and use it as a transparent color. Unfortunately this doesn't work due to the anti aliasing, which leaves some purple shadows on the drawn text and looks really ugly. So far it looks like Windows leaves you with two bad options if you want to make a professional looking, localizable graphical application:

  • Either the Chinese characters are drawn with two different fonts. (TextRenderer).
  • Or you end up with barely readable text  (Graphics.DrawString).

If you have solved the issue without setting an explicit background color I would love to hear about your solution. The code to draw the text in both Picture boxes is like this:

        const string combined = "测试标记好不好This is a aber long";
        private void cFontSize_ValueChanged(object sender, EventArgs e)
        {
            Bitmap bmp = new Bitmap(750, 140);
            Font f = new Font("Arial", (float)cFontSize.Value, IsBold ? FontStyle.Bold : FontStyle.Regular);

            using (var g = Graphics.FromImage(bmp))
            {
                g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;

                // Rendering with a Background Color solves the issue but 
                // this would produce boxes of white or black color in the displayed image which looks even worse
                TextRenderer.DrawText(g, combined, f, new Point(0, 0), FontColor);
            }
            cPictureBox.Image = bmp;

            Bitmap bmp2 = new Bitmap(750, 140);

            using (var g = Graphics.FromImage(bmp2))
            {
                g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;
                g.DrawString(combined, f, FontBrush, 0, 0);
            }

            cPicture2.Image = bmp2;
        }

Pretty straightforward code, I would think. WPF is also not the solution, since it draws the text in the same way as Windows Forms. Given the size of the Chinese market, I wonder how this issue can still be unfixed even in Windows 8.1.

Update 1

After adding Arial Bold as Font Link with the settings

It looks like Graphics.DrawString is the winner with the best rendered string again.

image

WPF still does not love my Font Linking settings. Whatever. Graphics.DrawString looks good now in all languages which is just what I need for now.

image

Update 2

It turns out that a German Windows 8.1 behaves differently from an English Windows 8.1, although I have installed the Chinese language pack on both.

  1. You do not need to log out and log in again to see the changes in effect, and the Arial Bold setting was not required at all. It is sufficient to have one Font Linking setting in place on a German Windows 8.1.
  2. On an English Windows 8.1 the bold fonts are still rendered wrong. Windows 7 ditto. This is not a solution.

The only real solution I have come up with was to disable bold fonts for Chinese UI languages and live with non-bold fonts. The Old New Thing has an "old" post on this called The danger of using boldface for Chinese characters, which also points in the direction that there is no way to render small bold Chinese fonts which remain readable.


Sunday, October 12, 2014 #

Although there was no part one, my other post "Can You Execute Code While Waiting For A Lock?" can be considered part one. I hope this will not become a larger series where code is executed in the wrong order. Suppose you want to execute the commands

  1. Init
  2. Action
  3. OtherAction
  4. Dispose
  5. Shutdown

in this order on the UI thread of an application from another thread. Then you would do something like this:

using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

namespace MessagePumping
{
    public partial class Form1 : Form
    {
        public SynchronizationContext WindowsFormsSynchronizationContext;

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
            BlockFor();
        }

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock                    
                    Thread.Sleep(300); // enter the lock on the UI thread

                    Action otherAction = () => Debug.Print("OtherAction");
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
                    WindowsFormsSynchronizationContext.Post((o) => otherAction(), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases lock
            {
            }
        }

    }
}

This will print the messages in order just as you would expect:

Init
Action
OtherAction
Dispose
Shutdown

Now lets slightly change the code of BlockFor

 

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock                    
                    Thread.Sleep(300); // enter the lock on the UI thread 
                    Action otherAction = () => Debug.Print("OtherAction");
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
                    WindowsFormsSynchronizationContext.Post((o) =>
                    {
                        WindowsFormsSynchronizationContext.Send((o2) => otherAction(), null);
                    },null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases lock
            {
            }
        }

and check if during a post a send operation is working as expected:

Init
Action
OtherAction
Dispose
Shutdown

The output is still the same. So far so good. Now keep the code but change something else, and we get:

Init
Action
Dispose
Shutdown
OtherAction

What has changed?

Still not it.

 

The SynchronizationContext! You were thinking you were using a WindowsFormsSynchronizationContext? No, you weren't. WindowsFormsSynchronizationContext.Current does not return a WindowsFormsSynchronizationContext but a SynchronizationContext instance, because the static Current property is inherited from SynchronizationContext. You were using the current SynchronizationContext, whatever it was. In the first run it was actually a WPF DispatcherSynchronizationContext instance, which worked just as expected. But if we force the usage of a WindowsFormsSynchronizationContext by setting it explicitly

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            SynchronizationContext.SetSynchronizationContext(new WindowsFormsSynchronizationContext());
            WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
            BlockFor();
        }

Then we get the faulty behavior that Shutdown is executed before our OtherAction, which was not our intention. The first question is: why are we getting this? When using window messages you are adding messages to a queue. When you post a message you add it at the end of the queue. The same happens when you Send a message, with the only exception that the call blocks until all pending messages and your message have been processed. Our OtherAction message is special since it calls Send to the message pump while it is being processed. The logical picture you have in mind is

image

 

but what actually happens is that while your wrapper call is executed it enqueues OtherAction at the end of the message queue and waits until OtherAction has been called. But since the Dispose and Shutdown messages come first our OtherAction message will be called far too late.

image

Ok, that explains why WinForms sucks. But why did it work with the WPF Dispatcher? The WPF Dispatcher uses a little optimization: if you try to enqueue a message on the thread that runs the message pump, it will NOT be enqueued but executed directly. The WPF Dispatcher model is therefore much saner than the WinForms one and prevents you from shooting yourself in the foot. If you do not use the WinForms synchronization context but call Control.BeginInvoke/Invoke directly, you can fix this faulty model by checking InvokeRequired first:

                    Action otherAction = () => Debug.Print("OtherAction");
                    this.BeginInvoke((Action) (() => Debug.Print("Init")));
                    this.BeginInvoke((Action) (() => Debug.Print("Action")));
                    this.BeginInvoke((Action) (() =>
                    {
                        if (this.InvokeRequired) // only enqueue to the message queue if we are being called from a different thread
                        {
                            this.Invoke((Action)(() => otherAction()));
                        }
                        else
                        {
                            otherAction(); // otherwise execute the callback directly
                        }
                    }));
                    this.BeginInvoke((Action) (() => Debug.Print("Dispose")));
                    this.BeginInvoke((Action) (() => Debug.Print("Shutdown"))); 

The rule to check InvokeRequired before actually enqueuing messages works well and helps to alleviate the problem of unintended message reordering. Unfortunately this does not work with WindowsFormsSynchronizationContext, since it exposes no InvokeRequired to do the check yourself. Until this is fixed, your best bet is to abandon WindowsFormsSynchronizationContext because of this, or to simply use the WPF DispatcherSynchronizationContext, which works well. But you should be aware that you then have up to three message queues in parallel. First there is the window message queue of your window. Then comes the WinForms queue, which uses its own window message to start working on queued delegates. And as the last queue you have the WPF Dispatcher queue, which also has its own private queue and window message.
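As a sketch of a possible workaround (the wrapper class and its name are my own invention, not a framework API), you can wrap any SynchronizationContext and execute Send callbacks inline when they originate on the owning thread, mimicking the WPF Dispatcher optimization described above:

```csharp
using System;
using System.Threading;

// Hypothetical wrapper: inlines Send when called on the creating thread,
// mimicking the WPF Dispatcher optimization. Not a framework class.
class InliningSynchronizationContext : SynchronizationContext
{
    readonly SynchronizationContext _inner;
    readonly int _ownerThreadId;

    public InliningSynchronizationContext(SynchronizationContext inner)
    {
        _inner = inner;
        _ownerThreadId = Thread.CurrentThread.ManagedThreadId;
    }

    public override void Send(SendOrPostCallback d, object state)
    {
        if (Thread.CurrentThread.ManagedThreadId == _ownerThreadId)
        {
            d(state); // already on the pump thread: execute directly, no reordering
        }
        else
        {
            _inner.Send(d, state); // cross-thread: marshal via the wrapped context
        }
    }

    public override void Post(SendOrPostCallback d, object state)
    {
        _inner.Post(d, state); // Post keeps its queueing semantics
    }
}

class ContextDemo
{
    public static bool RanInline()
    {
        var ctx = new InliningSynchronizationContext(new SynchronizationContext());
        bool ran = false;
        ctx.Send(_ => ran = true, null); // same thread, so executed immediately
        return ran;
    }

    static void Main()
    {
        Console.WriteLine(RanInline()); // True
    }
}
```

With a real WindowsFormsSynchronizationContext as the wrapped context, only cross-thread Sends would go through the WinForms queue; whether this relaxed ordering is acceptable depends on your application.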

If you interleave WPF and WinForms messages in some calling code like this

        public WindowsFormsSynchronizationContext WinFormsContext;
        public DispatcherSynchronizationContext WPFContext;

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            WinFormsContext = new System.Windows.Forms.WindowsFormsSynchronizationContext();
            WPFContext = new DispatcherSynchronizationContext();
            BlockFor();
        }

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock                    
                    Thread.Sleep(300); // give the UI thread time to block on the lock

                    Action otherAction = () => Debug.Print("OtherAction");
                    WPFContext.Post((o) => Debug.Print("Init"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Init"),null);
                    WPFContext.Post((o) => Debug.Print("Action"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Action"), null);
                    WPFContext.Post((o) =>
                       WPFContext.Send((o2) => otherAction(), null), null);
                    WPFContext.Post((o) => Debug.Print("Dispose"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Dispose"), null);
                    WPFContext.Post((o) => Debug.Print("Shutdown"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait(); // wait until the worker thread owns the lock
            lock (this) // block the UI thread while the worker posts its messages
            {
            }
        }

then you do not get interleaved messages as the calling code suggests but this:

Init
WinForms Init
WinForms Action
WinForms Dispose
WinForms Shutdown

Action
OtherAction
Dispose
Shutdown

This is another subtle difference between WPF and WinForms message pumping. WPF will execute only one posted message at a time, whereas WinForms will execute all pending posted messages at once until none are left. If you have a mixed WPF/WinForms application and want to replace Invokes with BeginInvokes because message pumping is evil, you are about to find the next trap door late during your system integration test. If you can, stick to only one message queue and prefer Invoke/Send where possible. If you use WinForms BeginInvoke you not only do not know when it will run (perhaps too late), but if you need coordination with WPF window messages you are out of luck in this scenario. I am sure you will not design a system this way, but some day someone will add a hidden dependency between both message pumps, leaving you little chance to design yourself out of this mess.

I hope that MS will address these (obvious?) deficiencies in Winforms message pumping so we do not have to worry about stuff we never wanted to know in such great detail in the first place.

Update 1

No, I will not make a separate post out of it. The last example has an interesting twist if you are using the WPF Dispatcher: its behavior changed between .NET Framework 4.0 and 4.5. Depending on the target platform you compile for (.NET 4.0 or 4.5+) you get different behavior. The .NET 4.0 Dispatcher WILL pump messages just like WinForms. This means the code above, posting only to the WPF context, prints under .NET 4.5:

Init
Action
OtherAction
Dispose
Shutdown

But if you compile for .NET 4.0

image

you get

Init
Action
Dispose
Shutdown
OtherAction

just like in the bad case of Windows Forms. The magic happens in the DispatcherSynchronizationContext class at:

public override void Send(SendOrPostCallback d, object state)
{
    if (BaseCompatibilityPreferences.GetInlineDispatcherSynchronizationContextSend() && this._dispatcher.CheckAccess())
    {
        this._dispatcher.Invoke(DispatcherPriority.Send, d, state);
        return;
    }
    this._dispatcher.Invoke(this._priority, d, state);
}

GetInlineDispatcherSynchronizationContextSend returns true or false depending on what is set in the TargetFramework attribute of your currently executing assembly.

[assembly: TargetFramework(".NETFramework,Version=v4.5.2", FrameworkDisplayName = ".NET Framework 4.5.2")]

Interesting. You can find the future .NET Framework version numbers (but not their release dates) in the BinaryCompatibilityMap, which contains the promisingly named method AddQuirksForFramework.

 

        if (buildAgainstVersion >= 50000)
        {
            this.TargetsAtLeast_Desktop_V5_0 = true;
        }
        if (buildAgainstVersion >= 40504)
        {
            this.TargetsAtLeast_Desktop_V4_5_4 = true;
        }
        if (buildAgainstVersion >= 40503)
        {
            this.TargetsAtLeast_Desktop_V4_5_3 = true;
        }
        if (buildAgainstVersion >= 40502)
        {
            this.TargetsAtLeast_Desktop_V4_5_2 = true;
        }
        if (buildAgainstVersion >= 40501)
        {
            this.TargetsAtLeast_Desktop_V4_5_1 = true;
        }
        if (buildAgainstVersion >= 40500)
        {
            this.TargetsAtLeast_Desktop_V4_5 = true;
            this.AddQuirksForFramework(TargetFrameworkId.Phone, 70100);
            this.AddQuirksForFramework(TargetFrameworkId.Silverlight, 50000);
            return;
        }
       …

So we can expect a .NET 4.5.3, 4.5.4 and .NET 5.0 in the future with some confidence now.


Tuesday, October 7, 2014 #

Threading can sometimes lead to impossible bugs where you have a hard time finding out how you got into this state. Let's suppose you have a nice WPF application which you dutifully test via a UIAutomation provider, which can alter some state in your application.

    /// <summary>
    /// Interaction logic for TestWindow.xaml
    /// </summary>
    public partial class TestWindow : Window
    {
        int Data;

        public TestWindow()
        {
            InitializeComponent();
        }

        public void AlterData(int newValue)
        {
            Debug.Print("Executing AlterData {0}", Environment.StackTrace);
            Data = newValue;
        }

        protected override AutomationPeer OnCreateAutomationPeer()
        {
            var lret = new OwnPeer(this);
            return lret;
        }

        class OwnPeer : WindowAutomationPeer
        {
            public OwnPeer(TestWindow owner)
                : base(owner)
            { }

            protected override HostedWindowWrapper GetHostRawElementProviderCore()
            {
                Debug.Print("GetHostRawElementProviderCore called");
                ((TestWindow)base.Owner).AlterData(500);
                return base.GetHostRawElementProviderCore();
            }
        }
    }

When you look at the call stacks all is normal:

Executing AlterData    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.DispatcherOperation.InvokeImpl()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Windows.Threading.DispatcherOperation.Invoke()
   at System.Windows.Threading.Dispatcher.ProcessQueue()
   at System.Windows.Threading.Dispatcher.WndProcHook(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

So far, so expected. Now let's make our well-behaved application a little more unexpected and take a lock for some time. To make it simple, I deadlock the main UI thread by trying to enter a lock which is owned by another thread:

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            Deadlock();
        }

        public void Deadlock()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock
                    Thread.Sleep(Timeout.Infinite); // never leave the lock
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases lock
            {
            }
        }
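Barrier does the rendezvous here. As a minimal, UI-free sketch of that rendezvous (class and method names are mine), two threads meet at the barrier and only proceed once both have signaled:

```csharp
using System;
using System.Threading;

class BarrierDemo
{
    // Returns the value the worker wrote before the rendezvous.
    public static int Run()
    {
        var b = new Barrier(2);
        int state = 0;
        var t = new Thread(() =>
        {
            state = 1;         // write happens before the worker signals
            b.SignalAndWait(); // wait here until the main thread also arrives
        });
        t.Start();
        b.SignalAndWait();     // released only after both threads have signaled
        t.Join();
        return state;          // always 1: the barrier orders the write before the read
    }

    static void Main()
    {
        Console.WriteLine(Run()); // 1
    }
}
```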

If you have never seen Barrier: it is a class introduced with .NET 4.0 which is very handy to synchronize threads to a specific point in time. Only when both threads have signaled are both released, at the same time. In our case this ensures a nice deadlock. Here is the quiz:

Can something bad happen while the main UI thread is deadlocked?

As it turns out it can. Lets have a look at this impossible call stack:

Executing AlterData    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 25
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 44
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Internal.Automation.ElementUtil.Invoke(AutomationPeer peer, DispatcherOperationCallback work, Object arg)
   at MS.Internal.Automation.ElementProxy.get_HostRawElementProvider()
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Automation.Provider.AutomationInteropProvider.ReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Interop.HwndTarget.CriticalHandleWMGetobject(IntPtr wparam, IntPtr lparam, Visual root, IntPtr handle)
   at System.Windows.Interop.HwndTarget.HandleMessage(WindowMessage msg, IntPtr wparam, IntPtr lparam)
   at System.Windows.Interop.HwndSource.HwndTargetFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at System.Threading.SynchronizationContext.WaitHelper(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Threading.Monitor.Enter(Object obj)
   at ThreadingIsHard.TestWindow.Deadlock() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at ThreadingIsHard.TestWindow.Button_Click(Object sender, RoutedEventArgs e) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.Controls.Button.OnClick()
   at System.Windows.Controls.Primitives.ButtonBase.OnMouseLeftButtonUp(MouseButtonEventArgs e)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.ReRaiseEventAs(DependencyObject sender, RoutedEventArgs args, RoutedEvent newEvent)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.UIElement.RaiseTrustedEvent(RoutedEventArgs args)
   at System.Windows.Input.InputManager.ProcessStagingArea()
   at System.Windows.Input.InputProviderSite.ReportInput(InputReport inputReport)
   at System.Windows.Interop.HwndMouseInputProvider.ReportInput(IntPtr hwnd, InputMode mode, Int32 timestamp, RawMouseActions actions, Int32 x, Int32 y, Int32 wheel)
   at System.Windows.Interop.HwndMouseInputProvider.FilterMessage(IntPtr hwnd, WindowMessage msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at System.Windows.Interop.HwndSource.InputFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

Hm, the lock is still held, but how can AlterData be called while we are trying to enter the lock, which was clearly never released? We even see the blocked Monitor.Enter in the call stack above. To understand this you need to take a deep look back into the history of STA threads and window message pumping. Monitor.Enter is not a drop-in replacement for WaitForSingleObject on an STA thread. Instead it calls into CoWaitForMultipleHandles, which ultimately calls MsgWaitForMultipleObjectsEx. Here is the (simplified) call stack:

USER32!MsgWaitForMultipleObjectsEx
combase!CCliModalLoop::SetPointerInputModeAsAppropriateForQueueAttachmentConditions
combase!CCliModalLoop::BlockFn
combase!ClientCallRetryContext::StartTimer
combase!ClassicSTAThreadWaitForHandles
combase!CoWaitForMultipleHandles
System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[], Boolean, Int32)
System.Threading.Monitor.Enter(System.Object) 
ThreadingIsHard.TestWindow.Deadlock() [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 69] 
ThreadingIsHard.TestWindow.Button_Click(System.Object, System.Windows.RoutedEventArgs) [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 52] 
System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)

MsgWaitForMultipleObjectsEx is just like WaitForMultipleObjects with the important exception that the wait can also return when specific window messages are put into the window message queue. The official docs are silent about which window messages are pumped. I have tried

  • Keyboard press
  • Mouse move
  • Mouse button
  • WM_PAINT
  • BeginInvoke
  • Invoke
  • Dispatcher.Invoke

NONE of these window messages is pumped while the thread is blocked waiting for the lock. Chris Brumme writes about this in his blog:

On lower operating systems, the CLR uses MsgWaitForMultipleHandles / PeekMessage / MsgWaitForMultipleHandlesEx and whatever else is available on that version of the operating system to try to mirror the behavior of CoWaitForMultipleHandles. The net effect is that we will always pump COM calls waiting to get into your STA. And any SendMessages to any windows will be serviced. But most PostMessages will be delayed until you have finished blocking.

Although the blog post is over 10 years old, the information is still relevant. From what I have seen so far, all window messages are blocked except for COM calls, as is the case for UIAutomation calls, which use COM as the transport vehicle to marshal UI automation calls across processes. If you have a custom UI Automation provider you should check twice whether you are calling into state-altering methods of your UI objects, which can be called reentrantly while your application is trying to acquire a lock. While you were waiting for the lock your window could already have been disposed! Bad things can and will happen. UIAutomation is not only used for integration testing but also by screen readers and tools for people who have problems reading a computer screen. It is therefore possible that these issues show up not only during tests but also on customer machines. If you get mysterious crashes at places where you would not have expected them, check the call stack for AutomationInteropProvider calls. If there are some, you have most likely hit a problem due to unexpected "continuations" while taking a lock. Unfortunately there is no really good solution known to me, except to pay vigilant attention to your UI Automation provider objects, which must not call any methods that could alter the state of your objects. Pending Begin/Invoke calls in your message queue will still not get pumped; only the direct COM calls and their side effects can tunnel through blocking calls. Perhaps I am telling you things you already knew, but this one was quite surprising to me. Perhaps it would be best to stop pumping COM messages at all, so these calls line up nicely with the other window messages as they should. But I guess there is still too much legacy code out there which would deadlock if no message pumping occurred anymore. The current Windows 10 build still pumps messages as always.
Perhaps we could get a flag to switch over to the "new" behavior to prevent such nasty problems.


Monday, September 29, 2014 #

One common task while debugging is to find out who owns or references a handle. For x86 processes you can simply search the full address space, but for x64 processes this brute-force approach no longer works. While debugging a hang from a memory dump I found that the process MainWindowHandle had changed to a different window which was not the main window. Searching the values of all NativeWindow objects also left no trace on the managed heap, so it had to be some unmanaged window handle.

First I look at the Process objects which could contain interesting data

0:033> !DumpHeap -type System.Diagnostics.Process -stat
Statistics:
              MT    Count    TotalSize Class Name
000007fef3c7d090        1           24 System.Diagnostics.ProcessModuleCollection
000007fef3c76470        1           24 System.Diagnostics.ProcessThreadCollection
000007fef3c62c70        1           64 System.Func`2[[System.Diagnostics.Process, System],[System.Boolean, mscorlib]]
000007fef3c47718        8         2240 System.Diagnostics.Process
000007fef3c89140       57         3192 System.Diagnostics.ProcessThread
000007fef5b44458        2         3664 System.Object[]
000007fef3c71b90      291        16296 System.Diagnostics.ProcessModule
000007fef3c871c0      161        20608 System.Diagnostics.ProcessInfo

When I enable .prefer_dml 1 in Windbg I can click on the type and get a list of all 8 Process objects, each of which I can then inspect. The -mt option with the Method Table (MT) as argument is the only way to dump only Process objects and nothing else. This is exactly what the blue underlined links in Windbg execute when you click on them.

   
0:033> !DumpHeap /d -mt 000007fef3c47718
         Address               MT     Size
0000000003b52658 000007fef3c47718      280     
0000000003c29970 000007fef3c47718      280     
0000000003c8b680 000007fef3c47718      280     
0000000003e192e0 000007fef3c47718      280     
0000000003e19d80 000007fef3c47718      280     
0000000003e1c818 000007fef3c47718      280     
0000000003e1ce90 000007fef3c47718      280     
0000000003e520d0 000007fef3c47718      280    

By clicking through the blue links I do not need to enter any command by hand which is quite convenient.

0:033> !DumpObj /d 0000000003b52658
Name:        System.Diagnostics.Process
MethodTable: 000007fef3c47718
EEClass:     000007fef38a2238
Size:        280(0x118) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fef5b97de0  4002c1e       f0       System.Boolean  1 instance                1 haveProcessId
000007fef5b992b8  4002c1f       d8         System.Int32  1 instance             5848 processId
000007fef5b97de0  4002c20       f1       System.Boolean  1 instance                0 haveProcessHandle
000007fef3c4f2c8  4002c21       20 ...SafeProcessHandle  0 instance 0000000000000000 m_processHandle
000007fef5b97de0  4002c22       f2       System.Boolean  1 instance                0 isRemoteMachine
000007fef5b96508  4002c23       28        System.String  0 instance 0000000002b347d0 machineName
000007fef3c871c0  4002c24       30 ...stics.ProcessInfo  0 instance 0000000000000000 processInfo
000007fef5b992b8  4002c25       dc         System.Int32  1 instance          2035711 m_processAccess
000007fef3c76470  4002c26       38 ...sThreadCollection  0 instance 0000000000000000 threads
000007fef3c7d090  4002c27       40 ...sModuleCollection  0 instance 0000000000000000 modules
000007fef5b97de0  4002c28       f3       System.Boolean  1 instance                1 haveMainWindow
000007fef5b9a338  4002c29       b8        System.IntPtr  1 instance          11505b4 mainWindowHandle

The main question was who owns the window handle. While examining all NativeWindow instances, which wrap a window handle, I could not find the handle in use by any managed window. In that case I want to search the complete address space, which is non-trivial in a multi-GB x64 application.

Luckily there exists a not-so-well-known Windbg command to help you out: the !address command, which shows all allocated regions of memory (mapped files, loaded dlls, thread stacks, native heaps, …). It also has scripting support to execute a command for every address region. Now I only need to search every allocated memory region for the window handle. %1 is the start address and %2 the length of the memory region.

0:033> !address -c:"s -d %1 %2 11505b4"
00000000`00df6560  011505b4 00000000 00000004 00000000  ................
00000000`03b52710  011505b4 00000000 00000000 00000000  ................
00000000`28b57370  011505b4 00000000 0000200e 00000000  ......... ......
00000000`28ce4568  011505b4 00000000 000100d4 00000000  ................
00000000`4c603798  011505b4 00000000 00010052 00000000  ........R.......

Then I search, one by one, for who references the addresses pointing to the window handle:

0:033> !address -c:"s -d %1 %2 28b57370"
00000000`1b6d2220  28b57370 00000000 000030f8 00000000  ps.(.....0......
00000000`1b6d2740  28b57370 00000000 000030f8 00000000  ps.(.....0......
00000000`1b6e5740  28b57370 00000000 00000000 00000000  ps.(............
00000000`1b706530  28b57370 00000000 00000003 00000000  ps.(............
00000000`203387e8  28b57370 00000000 00001fd7 00000000  ps.(............
00000000`20355450  28b57370 00000000 00000010 00000000  ps.(............
00000000`20448af8  28b57370 00000000 b60115d0 ffffffff  ps.(............
00000000`204854b0  28b57370 00000000 c15f5308 000007fe  ps.(.....S_.....
00000000`204d6a90  28b57370 00000000 00000000 00000000  ps.(............
00000000`28b53290  28b57370 00000000 00000000 00000000  ps.(............
00000000`2bb8f7a0  28b57370 00000000 00000000 00000000  ps.(............
00000000`2c0ffc40  28b57370 00000000 00000001 00000000  ps.(............

Knowing that some thread must hold the window handle, I checked the found addresses against all stack locations. One of the hits (00000000`2bb8f7a0) lies inside a thread stack region:

!address
+        0`2b790000        0`2bb8c000        0`003fc000        Stack      [~33; 16d8.2bbc]
+        0`2bb8c000        0`2bb90000        0`00004000        Stack      [~33; 16d8.2bbc]

The chances are high that this thread is the owner of the window handle. From its call stack it becomes obvious:

0:051> ~33s
ntdll!NtWaitForMultipleObjects+0xa:
00000000`77c8186a c3              ret
0:033> kb
# RetAddr           : Args to Child                                                           : Call Site
00 000007fe`fdb01430 : 00000000`00000000 00000000`004dbe58 00000000`004dc200 00000000`77c8592e : ntdll!NtWaitForMultipleObjects+0xa
01 00000000`77a31723 : 00000000`2bb8f820 00000000`2bb8f810 00000000`00000000 00000000`2bb8f860 : KERNELBASE!WaitForMultipleObjectsEx+0xe8
02 00000000`77b48f7d : 00000000`2bb8f8a0 00000000`000927c0 00000000`00000001 00000000`000010f4 : kernel32!WaitForMultipleObjectsExImplementation+0xb3
03 00000000`77b462b2 : 00000000`2bb8f9f0 00000000`20481fb0 00000000`00000001 00000000`2bb8f9e0 : user32!RealMsgWaitForMultipleObjectsEx+0x12a
04 000007fe`ff67acd6 : 00000000`2bb8f9e0 00000000`00000000 00000000`2bb8f9f0 00000000`20481fb0 : user32!MsgWaitForMultipleObjectsEx+0x46
05 000007fe`ff79a2a2 : 00000000`80010115 00000000`2bb8faf0 00000000`80010115 00000000`203f1c22 : ole32!CCliModalLoop::BlockFn+0xc2
06 000007fe`c11c1e6d : 00000000`250ab220 00000000`20481f50 00000000`00000000 00000000`20481fb0 : ole32!CoWaitForMultipleHandles+0x102
07 000007fe`c11a33ba : 00000000`00000000 00000000`20481f50 00000000`00000000 00000000`00000000 : mshtml!CDwnTaskExec::ThreadExec+0x153
08 00000000`77a259ed : 00000000`00000000 000007fe`c10d0000 00000000`00000000 00000000`00000000 : mshtml!CExecFT::ThreadProc+0x4e
09 00000000`77c5c541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
0a 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

It was a hosted Internet Explorer window which had become active for a few tenths of a second, breaking code which assumed that MainWindowHandle always is the main window of the application, and causing sporadic hangs on some automated test agents.

There is one more thing. When you work in a bigger software shop you have some form of continuous integration and automated testing after a build has succeeded. Usually the tests are automatically deployed to several test agents to run in parallel. This makes it a challenge to debug sporadically failing or hanging tests. What do you do? Try to repro on your machine, add tracing, run the tests in a loop in isolation? There is one thing you can do to get down to the root cause of sporadically failing tests without all of that:

[Test]
public void Test()
{
    try
    {
        // test code that sporadically fails or hangs
    }
    catch(Exception)
    {
        // Capture the failure state before it is lost, then let the test fail.
        ProcessDumper.DumpProcess(); // Take a dump of your own process, or several more if you need to
        throw;
    }
}

Too easy? If you have a test which hangs or fails only one time in fifty, and you cannot attach a debugger or add tracing because the error goes away when you change the timing, automatically taking memory dumps when the test fails can work miracles. This only makes sense if you can open the dump in Windbg. Even long-time developers do not want to get into this arcane debugging stuff because "this happens only once a year, I do not want to learn all that strange stuff!". I can assure you that you will take dumps constantly once you know how to read them, because it is the fastest and most efficient way to diagnose random failures without much guessing about the probable root cause.

For your tests you can simply add procdump from the SysInternals suite as a resource, write it to %temp%, and you are ready to create one or more dump files.

I am a huge fan of gathering as much data as possible to nail down a hard-to-find issue. Memory dumps are an important part of collecting forensic data about crashes and hangs.

If you still do not like Windbg, here is a secret: you can get quite far with Visual Studio 2013 and its latest updates! The only thing you must have are matching pdbs from a symbol server or your drop folder. The Parallel Stacks window in Visual Studio is a great way to inspect a memory dump with many threads. Besides, Windbg is not really good at resolving STL containers correctly in release builds.

The following code will cause an unhandled exception in C++ in release builds because of a double free call:

#include "stdafx.h"
#include <comutil.h>

bstr_t *g_globalVar;

DECLSPEC_NOINLINE void Allocate()
{
    g_globalVar = new bstr_t(L"Hello world");
    delete g_globalVar;
    _tprintf(L"Created and deleted");
}

DECLSPEC_NOINLINE void FreeASecondTime()
{
    delete g_globalVar;
    _tprintf(L"Deleted a second time");
}

int _tmain(int argc, _TCHAR* argv[])
{
    ::Sleep(10 * 1000); // Give some time to attach dumper
    Allocate();
    FreeASecondTime();

    return 0;
}

With procdump you can create a dump e.g. on first chance exceptions (you can even filter for specific exception types):

C:\>procdump -accepteula -ma -e 1  DoubleFree c:\temp\DoubleFreeDump.dmp

ProcDump v6.00 - Writes process dump files
Copyright (C) 2009-2013 Mark Russinovich
Sysinternals - www.sysinternals.com
With contributions from Andrew Richards

Process:               DoubleFree.exe (5872)
Number of dumps:       1
Hung window check:     Disabled
Exception monitor:     First Chance+Unhandled
Exception filter:      *
Terminate monitor:     Disabled
Dump file:             c:\temp\DoubleFreeDump_YYMMDD_HHMMSS.dmp


Press Ctrl-C to end monitoring without terminating the process.

[23:35:48] Exception: C0000005.ACCESS_VIOLATION

First-chance Exception.
Writing dump file c:\temp\DoubleFreeDump_140929_233548.dmp ...
Writing 10MB. Estimated time (less than) 1 seconds.
Dump written.

Dump count reached.

Now open the dump file in VS2013 via the File - Open menu:

image

 

Then press Debug with Native Only to start "debugging" and open the Call Stack window to find the second free call:

image

 

This works with unmanaged, managed and mixed-code dumps alike. A very useful VS addition is the Parallel Stacks window, which can help you a lot in finding hangs in your code. For our single-threaded application it does not look terribly useful, but with 100 threads it can save you a lot of time.

image

What can you do if you do not have the pdbs because the build ran, the test failed, and binaries and pdbs were deleted after a few days, leaving you with only a memory dump? It is not as bad as it sounds. Since you know the baseline of your software, you can rebuild the dlls and pdbs on your local machine.

When you load the dump initially and try to get a stack trace you get:

0:000> kb
ChildEBP RetAddr  Args to Child              
WARNING: Stack unwind information not available. Following frames may be wrong.
00c8f430 003f1d05 00c8f5f0 00c8f7cc 7f82f000 DoubleFree+0x11b9c
00c8f510 003f180b 00c8f6d0 00c8f7cc 7f82f000 DoubleFree+0x11d05
00c8f5f0 003f191b 00c8f7c4 00c8f7cc 7f82f000 DoubleFree+0x1180b
00c8f6d0 003f1b0b 00000001 00c8f898 00c8f7cc DoubleFree+0x1191b
00c8f7c4 003f1d6c 00000000 00000000 7f82f000 DoubleFree+0x11b0b
00c8f898 003f2fe9 00000001 00e6cad8 00e6cb60 DoubleFree+0x11d6c
00c8f8e8 003f31dd 00c8f8fc 7717919f 7f82f000 DoubleFree+0x12fe9
00c8f8f0 7717919f 7f82f000 00c8f940 7744a22b DoubleFree+0x131dd
00c8f8fc 7744a22b 7f82f000 e2df5dd0 00000000 kernel32!BaseThreadInitThunk+0xe
00c8f940 7744a201 ffffffff 7743f20b 00000000 ntdll!__RtlUserThreadStart+0x20
00c8f950 00000000 003f10d2 7f82f000 00000000 ntdll!_RtlUserThreadStart+0x1b

Not terribly useful. But you can add the output folder where your newly built pdb resides to your symbol path.

For this test I have converted the project to a managed C++ project for reasons you will see shortly:

.sympath+  D:\Source\DoubleFree\Debug\

and then force Windbg to load it even if the pdb timestamps do not match:

.reload /i DoubleFree.exe

you get full fidelity with line numbers back:

0:000> !ClrStack
OS Thread Id: 0xd74 (0)
Child SP       IP Call Site
00e3f2f0 7743cd7c [InlinedCallFrame: 00e3f2f0] Unknown
00e3f260 00fe1522 <Module>._bstr_t.Data_t.Release(Data_t*) [c:\program files (x86)\microsoft visual studio 12.0\vc\include\comutil.h @ 773]
00e3f270 00fe1791 <Module>.FreeASecondTime() [c:\program files (x86)\microsoft visual studio 12.0\vc\include\comutil.h @ 639]
00e3f27c 00fe1181 <Module>.wmain(Int32, Char**) [d:\source\doublefree\doublefree.cpp @ 28]
00e3f2f0 00f5a135 [InlinedCallFrame: 00e3f2f0] 
00e3f2ec 00fe112e DomainBoundILStubClass.IL_STUB_PInvoke()
00e3f2f0 73dc2552 [InlinedCallFrame: 00e3f2f0] <Module>._wmainCRTStartup()
00e3f498 73dc2552 [GCFrame: 00e3f498] 

Unfortunately Visual Studio has no way to load mismatched pdbs even if you have rebuilt the dll and pdb from the same source code base.

 

image

But all is not lost. You can get Visual Studio to load mismatched pdbs:

image

How? There is a little tool called chkmatch which modifies your "new" pdb to match an "old" binary. For a managed application you can save the dll from the dump to e.g. c:\temp with the managed debugger extension command !SaveModule in Windbg, which is why I switched from a pure C++ to a managed C++ application: it makes it possible to dump a dll from memory to disc. The .writemem command did not work, so I had to resort to a managed host process to debug a pure C++ issue.

Then you can make the "old" exe match your new pdb:

C:\Windows\system32>chkmatch -m c:\temp\DoubleFree.exe c:\temp\DoubleFree.pdb
ChkMatch - version 1.0
Copyright (C) 2004 Oleg Starodumov
http://www.debuginfo.com/

Executable: c:\temp\DoubleFree.exe
Debug info file: c:\temp\DoubleFree.pdb

Executable:
TimeDateStamp: 5429d867
Debug info: 2 ( CodeView )
TimeStamp: 5429d867  Characteristics: 0  MajorVer: 0  MinorVer: 0
Size: 68  RVA: 00007aec  FileOffset: 00006eec
CodeView format: RSDS
Signature: {36317c4a-a90c-482a-a8a7-c96ac8050a2e}  Age: 1
PdbFile: D:\Source\DoubleFree\Release\DoubleFree.pdb
Debug info: 12 ( Unknown )
TimeStamp: 5429d867  Characteristics: 0  MajorVer: 0  MinorVer: 0
Size: 20  RVA: 00007b30  FileOffset: 00006f30

Debug information file:
Format: PDB 7.00
Signature: {7cec4b85-0aca-4680-b797-82ead65f95ad}  Age: 1

Writing to the debug information file...
Result: Success.

Voilà. You have matching pdbs, and file and line numbers are back so you can debug your dump file with Visual Studio. A common scenario where this is useful: your build stores only the binaries in a repository but, for build performance reasons, does not upload the pdbs. When you get dumps from failed test runs on some test machines, you perhaps still have the binaries but not the pdbs. In that case Windbg or the chkmatch tool can help you get file and line information back from otherwise unreadable call stacks.

I know that the VS product group does not want a "load mismatched pdbs" feature in VS, but it would be very useful when you have a dump but no exactly matching symbols. At least for the people who do not want to learn Windbg for such "rare" cases.