Search This Blog

Thursday 2 February 2017

Common pitfalls of the Windows Performance Toolkit

I've been playing around with several versions of the Windows Performance Toolkit for some time now and decided to gather all those little annoying things that can make your life miserable. This way, all known issues (and hopefully their solutions) will be mentioned in one place and you don't need to spend hours of googling just to find the answer amongst the comments of some long abandoned forum.
I also find it important to highlight that for most (but not all) problems below it was not me who found the solution. I merely try to collect everything in one place and this will take some time. The credit shall mainly go to Alois Kraus and Bruce Dawson. These guys did an awesome job in documenting the various strange behavior of the tools involved, like wpr, wprui, xperf & wpa. I definitely recommend you to check out the blogs of these guys.

The issues:
- x64 ETW stackwalking is broken under Windows 7
- Data for CLR event providers are not decoded correctly in WPA (Generic Events)
- Data for custom event providers are not decoded correctly in WPA (Generic Events)
- Symbol decoding does not work for native images in WPA (?!?)
- Symbol decoding does not work for managed assemblies in WPA (?!?)
- .etl file taken on Windows 8(.1?) machine requires at least Win 8(.1) for analysis
- xperf / wpr won't start if NT Kernel Logger session is already runnning (procexp, procmon etc) under Windows 7

x64 ETW stackwalking is broken under Windows 7

I read a few articles mentioning this issue before actually bumping into it. The problem is that stack walking stops as soon as it detects being inside a JIT-ted method. The result is that you get broken call stacks for all your managed stuff. It's very easy to detect this error, as many methods hit in the problematic assembly are shown as a direct child of the [Root] element in the tree. Make no mistake, the resulting .etl file is corrupted here, so you can't fix the callstacks without taking the appr. corrective action and re-profiling.

<img placeholder>
<Figure x. No method call hierarchy, only a flat list>

However this works fine for x86 code. As you can see, canvas1_Loaded() is now deep within the call hierarchy. We should see the same for x64, should there be no bug in the ETW stack walker.

<img placeholder>
<Figure x. Expected method call hierarchy>

Solution: either switch to x86 for profiling or NGEN your x64 assemblies and repeat profiling.

Data for CLR event providers are not decoded correctly in WPA (Generic Events)

Now this is a real life ruiner. I was experimenting with the CLR Runtime / Rundown Provider(s) the other day and after spending hours to map hexadecimal values to CLR ETW keywords, levels and events in WPA that actually make sense I finally gave up and asked myself the question "Is this really worth it?". There had to be a way to actually see fields that make sense. Column names were also totally useless.

Values and column names missing or make no sense

Solution: reinstall the CLR manifest, by running the following commands:

wevtutil um c:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

wevtutil im c:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

Note: for a 32-bit OS, just use the "Framework" folder instead of "Framework64".

Unfortunately, you have to repeat your profiling scenario, because once again, it's the .etl file that contains invalid information. After fixing the issue, it takes some time to get used to the dynamic column headers, but that should be your biggest problem in life. :-)

All values / column names are shown correctly

Data for custom event providers are not decoded correctly in WPA (Generic Events)

My other favorite. If you happen to use System.Diagnostics.Tracing.EventSource to create your own ETW event provider you might bump into this. To keep it short, avoid APi names like Start or Stop. WPA (even the one in Creators Update SDK) seems to perform some arcane magic while reading / decoding the information in the .etl file and this unfortunately seems to conflict with certain (let's call 'em "reserved") names. I might write a separate article about this problem in the future.

Values and column names are not decoded properly


Perfview can decode everything properly


This is how it should be

Solution #1: just try to name your methods in your custom event provider in a standard way, like when you work with a manifest-based generated class, i.e. EventWrite* will do.

Solution #2: if you're targeting .NET 4.6+ you can opt-in to use a self describing event format for your events. Just make sure you new up EventSource with the appropriate constructor and then WPA will decode everything correctly.

Symbol decoding does not work for native images in WPA (?!?)

In this case, WPA can typically decode the module but not the function name in a callstack.

Comparing two WPA versions

Solution: make sure Visual Studio is closed during data collection, as it seems ngen createpdb fails while it is open.

Symbol decoding does not work for managed assemblies in WPA (?!?)

This problem can be recognized by the ?!? symbols in the callstacks:

Module and function symbols are not resolved

I guess there can be several reasons, the most obvious one for me was that Visual Studio (2013, 2015, don't know about the others) was running during data capture (with WPRUI). I did not dig deeper here as closing VS and recollecting data solved this. I wonder if collecting with xperf solves this too... I leave that exercise to you. :-)

Solution: make sure Visual Studio is closed while capturing traces.

There you go


To be continued...