Friday, 11 September 2015

Live debugging with WinDbg and VMMap. A not-managed, not-unmanaged mystery

And so it happened. Because of the nature of an issue one of our customers recently experienced, I needed to figure out how to debug it in a live WinDbg session. And it turned out to be quite easy and handy for the type of situation we had.

So, the situation was as follows: something was eating memory in the customer's solution. When I got the memory dump and checked the memory sizes, this is what I got:

Of a 10 GB memory dump:

2 GB was managed memory
0.1 GB was unmanaged heap

and that's it. WinDbg showed only about 2 GB of managed and unmanaged memory altogether, so what about the other 8 GB? I felt so helpless, I had never seen anything like that. This mystery caught me so much that I went to a party still thinking about what it could be.

Maybe a little drinking at the party helped, but after some attempts and a couple of days the mystery was solved.

So, if you have a weird memory distribution like this, here is what you should do:

1. Download the beautiful VMMap

2. Attach it to your w3wp process and look at the memory:


And right here I felt like I had found a way to solve the mystery. Look at the Committed values. Managed Heap is the GC heap in memory dumps, the part you can debug nicely with !dumpheap and other handy commands. Heap is what is shown as the unmanaged heap in memory dumps, which can be debugged not so nicely, but it is still possible.
And then there is Private Data, which takes more than half of the memory for this process. Where does that even come from?
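By the way, the first two categories can be roughly cross-checked against the memory dump in WinDbg (with SOS loaded); the numbers will not match VMMap byte for byte, but they should be in the same ballpark:

!eeheap -gc - total size of the GC heap, roughly VMMap's Managed Heap
!heap -s - summary of the native NT heaps, roughly VMMap's Heap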

3. Select the problematic area (in my case, Private Data) and observe the allocations. In my case it was obvious that there was a huge number of allocations of 7,220 KB and 14,440 KB, plus another "pair" of 7,222 KB and 14,444 KB. The second value is always 2x the first.

This is related to the way memory is allocated in Private Data: first the smaller amount is allocated, and when it is not enough, twice as much is allocated instead (the usual grow-by-doubling buffer strategy). I'm not really a queen of the details here, but this is approximately how it works.

An unsolved question for me is also why this memory does not show up as Heap in the memory dump. It is probably a silly question for memory gurus, but I couldn't find an answer.
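If you only have the dump, one command worth trying is the following; as far as I understand, memory committed straight through VirtualAlloc that does not belong to any recognized heap is reported under <unknown> there rather than under Heap, which would explain why the heap statistics miss it:

!address -summary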

4. Then it became obvious that I needed to find out what exactly makes these allocations. I've read that memory ends up in Private Data when the native VirtualAlloc() is called directly, rather than through a heap. So the idea is to find the VirtualAlloc calls that allocate around 7,220 and 14,440 KB of data.
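One assumption to double-check here (in case the breakpoint from the next step never fires): on newer Windows versions the allocation might go through kernelbase!VirtualAlloc or ntdll!NtAllocateVirtualMemory rather than kernel32!VirtualAlloc, so it is worth listing the candidates first:

x kernel32!VirtualAlloc*
x kernelbase!VirtualAlloc*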

NOTE: the values shown in VMMap are approximate! This gave me a bit of a nightmare while trying to figure out why I couldn't catch the exact memory allocations, and the answer was simple - they are just not entirely accurate.

5. After finding out what makes the allocations (in my case it was VirtualAlloc), we need to catch the allocation in real time. This is how you can do it:

- Open WinDbg and go to File – Attach to a Process, then select your w3wp process there.
- If the process stops on some exception, type “g” to let it continue.
- Press Ctrl+Break (the Pause/Break key) to break into the process.
- Then enter the following in the WinDbg command window:

bp kernel32!virtualalloc ".if ((@edx>=0x70C000)&(@edx<=0x70F000))|((@edx>=0xE10000)&(@edx<=0xE1C000)) { .printf \"Allocating %lu bytes, which is 7,224k and is a sign of a memory leak\n\", @edx; kb 4 } .else {g}"
It creates a breakpoint on the native allocation function that stops only for allocations of around 7,224 KB (or roughly double that, around 14,440 KB).
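You can check that the breakpoint was actually registered by listing the current breakpoints:

bl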

A bit of explanation:

Here we use kernel32!VirtualAlloc because in my case that was the "guilty" method, but it can be anything else.

@edx in this case is a register that contains the "amount to allocate" (dwSize) parameter passed to the VirtualAlloc function - in a 64-bit process this parameter arrives in the rdx register, and @edx is its lower 32 bits, which is enough for allocations of this size.

0x70C000...0x70F000 are the hexadecimal values for 7,389,184...7,401,472 bytes, which is 7,216...7,228 KB. I tried to catch all amounts close to 7,224 KB.

0xE10000...0xE1C000 is roughly the same range doubled (14,400...14,448 KB). I used it too, because I saw such allocations in VMMap as well.
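If you need to work out such ranges for your own allocation sizes, WinDbg's expression evaluator does the hex math for you; the output should look something like this:

? 0x70C000
Evaluate expression: 7389184 = 00000000`0070c000

and 7,389,184 / 1024 = 7,216 KB.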

.else {g} means that if this is just a usual VirtualAlloc with some other amount of memory, execution simply continues.
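A side note: this breakpoint assumes a 64-bit process, where the dwSize parameter arrives in rdx (@edx being its lower 32 bits). If your w3wp happens to be 32-bit, the parameter is passed on the stack instead and sits at esp+8 on function entry, so a sketch of the same breakpoint would look like this:

bp kernel32!virtualalloc ".if ((poi(@esp+8)>=0x70C000)&(poi(@esp+8)<=0x70F000)) { .printf \"Allocating %lu bytes\n\", poi(@esp+8); kb 4 } .else {g}"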


- After you have added the breakpoint, type “g” to continue.

- Wait until the process stops at the breakpoint. Note the “Allocating…” message – it is what we set our breakpoint to print out.

- Type .loadby sos clr and press Enter. This loads SOS, which will allow us to see the managed code.
- Type !clrstack to get the managed stack trace.

Ideally, after all these actions we'll see the stack trace that led to the bad memory allocation:



(my pictures are just examples, taken with a different number of bytes from a running Sitecore instance)

- To see the objects on the stack trace, type !dso. It will list all the objects, printing out the strings among them, which can also help to figure out where this code comes from.
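Once !dso has listed the object addresses, any individual object can be inspected further with !do (DumpObj), substituting an address taken from the !dso output:

!do <object address>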


Less ideally, we'll only get a native stack trace there. But even then, we can at least google the specific native calls and try to figure out where they came from.
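If the native frames show up as bare addresses or module+offset only, it usually helps to point WinDbg at the Microsoft public symbol server and re-walk the stack:

.symfix
.reload
kb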

- To delete all the breakpoints so that the process continues normally, go to Edit – Breakpoints – Remove All.
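The same can be done from the command window, and since this is a live w3wp it is also worth detaching cleanly rather than just closing WinDbg (which would normally terminate the process):

bc * - clear all breakpoints
qd - quit and detach, leaving the process running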


The post lacks some pictures and details for now, so it's the post's beta release ("'cause it's beta than nothing").