Debugging an Unkillable Process

I recently encountered an issue where Microsoft Publisher would intermittently end up in an unkillable state. Publisher is started by Office interops from an NT service. The problem seems to only occur when Publisher runs in a non-interactive session (which all NT services since Vista do), and when a line feed character is present in a text shape.

The latter is not particularly interesting because it's easy enough to work around the problem with a library that implements Microsoft's Compound File Binary Format specification.[1] Changing the line feed characters to carriage returns seems to avoid the problem code path.
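To make the workaround concrete, here is a minimal Python sketch of the byte-level patch, loosely following the window-and-scan idea from note 1. It is not the code from the commercial product. The function name, the window size, and the assumption that the text is stored as UTF-16LE are all mine for illustration. It takes the raw file bytes plus the text that was read from the shape, finds each line feed by searching for a few characters of surrounding context, and overwrites it in place so the file length does not change.

```python
def patch_line_feeds(data: bytes, text: str, encoding: str = "utf-16-le",
                     window: int = 4) -> bytes:
    """Return a copy of `data` with every line feed of `text` turned into a
    carriage return. `encoding` and `window` are illustrative assumptions."""
    lf = "\n".encode(encoding)
    cr = "\r".encode(encoding)
    patched = bytearray(data)
    for i, ch in enumerate(text):
        if ch != "\n":
            continue
        # Build a small window of context around the line feed.
        start = max(0, i - window)
        prefix = text[start:i].encode(encoding)
        needle = text[start:i + window + 1].encode(encoding)
        # Search the ORIGINAL bytes so earlier patches cannot hide a match.
        pos = data.find(needle)
        while pos != -1:
            offset = pos + len(prefix)
            # Same-length overwrite: no stream or sector sizes change.
            patched[offset:offset + len(lf)] = cr
            pos = data.find(needle, pos + 1)
    return bytes(patched)


if __name__ == "__main__":
    shape_text = "one\ntwo\nthree"
    raw = b"HEAD" + shape_text.encode("utf-16-le") + b"TAIL"
    fixed = patch_line_feeds(raw, shape_text)
    print(len(raw) == len(fixed), b"\n\x00" in fixed)
```

A real file would need more care than this: a short window can match unrelated bytes, so a larger window or a check against the stream's known location would be sensible.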

I really wanted to understand why the line feed characters caused this unkillable state. Looking at the process properties with Process Monitor, I noticed one of the threads was waiting on a push lock.

And the stack trace confirmed the KeWaitForMultipleObjects call that's blocking the thread. What's interesting is the last user-mode API call: win32u.dll!NtGdiMakeFontDir. Although this function is undocumented, I think the name gives away its purpose.

I generated a crash dump, opened it in WinDbg, and looked at the stack frame for CreateScalableFontResourceW since that's the actual API call Publisher is making. (PTXT9.DLL is an Office library.)

In that memory location I noticed C:\Windows\Temp\MSV77B1.tmp and C:\Windows\Temp\MSV77B0.tmp. Presumably these are the second and third arguments, respectively, of CreateScalableFontResourceW. Walking down the stack to NtGdiMakeFontDir, I noticed C:\Windows\Temp\MSV77B0.tmp again.

According to the file tool, these files are TrueType Font data. I opened them in a font editor and found the name of the font: MSVisi Regular. I couldn't find any information about this font or why Publisher is using it. However, Process Monitor showed those files being created.

I included the full sequence to show how I believe those files are used. First, MSV77B0.tmp is created and written to. That's clearly the third argument of CreateScalableFontResourceW. The second file is created and then deleted, which I suspect is to ensure it doesn't exist. The second argument of CreateScalableFontResourceW takes a file path that should not exist, or it fails. The highlighted line is where NtGdiMakeFontDir opens the font resource file.

Looking again at the stack trace above, I believe NtGdiMakeFontDir is ultimately attempting to fill in a GDIINFO structure. Line 6 shows win32kfull.sys!IsDwmActive as the last kernel-mode function before the push lock. Dwm probably refers to the Desktop Window Manager, which handles compositing. And after looking at the documentation for XLATEOBJ_piVector, DWM seems to be needed in order to fill in part of the GDIINFO structure.

My theory is IsDwmActive needs to be called from an interactive session where DWM runs. This would explain why running the same application from a Remote Desktop session on the same box doesn't produce this wait state. If that's the case, the function is waiting on a push lock that will never signal.[2]

I already knew that replacing line feed characters with carriage returns seemingly works around the problem. But I didn't have a way to know definitively why. So to test my theory about DWM, I wrote a fun little toy that reaches into the Publisher memory space and patches the import address table.[3] This technique is known as IAT hooking.

First I created GDIHOOK.DLL with an exported hook function that matches the calling convention of CreateScalableFontResourceW.[4] The tripwire tool starts Publisher in a suspended state and injects GDIHOOK into the Publisher process. PTXT9.DLL is also injected because it has the imported function I needed to patch, and it's not loaded until Publisher opens a document.

Tripwire searches the export address table of GDIHOOK and finds the address of CreateScalableFontResourceW_hook. It then searches the import address table of PTXT9 for the address of the call to CreateScalableFontResourceW. The hook function's address is written to the real function's import thunk.
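The lookup-and-overwrite step can be illustrated with a toy model in Python. This is not real PE parsing and not the tripwire source: the "address space" is a dictionary, the addresses are made up, and every name other than CreateScalableFontResourceW, CreateScalableFontResourceW_hook, and GDI32.dll is hypothetical. It only shows the three moves described above: look up the hook's address in an export table, find the matching import thunk, and write the hook's address into it.

```python
from dataclasses import dataclass


@dataclass
class ImportThunk:
    dll: str
    name: str
    address: int  # slot filled in by the loader; calls go through it


def patch_import(imports, dll, name, new_address):
    """Overwrite the thunk for dll!name and return the address it held."""
    for thunk in imports:
        if thunk.dll.lower() == dll.lower() and thunk.name == name:
            old_address = thunk.address
            thunk.address = new_address
            return old_address
    raise LookupError(f"{dll}!{name} is not imported")


def call_import(memory, imports, name, *args):
    """Model of an indirect call through the import address table."""
    thunk = next(t for t in imports if t.name == name)
    return memory[thunk.address](*args)


calls = []  # records every call that reaches the hook


def real_function(hidden, font_res, font_file, current_path):
    return True  # stand-in for the real GDI32 function


def hook_function(hidden, font_res, font_file, current_path):
    calls.append(font_file)
    return False  # report failure, as described in note 5


# Made-up addresses standing in for the target process's memory.
memory = {0x1000: real_function, 0x2000: hook_function}
hook_exports = {"CreateScalableFontResourceW_hook": 0x2000}
ptxt9_imports = [ImportThunk("GDI32.dll", "CreateScalableFontResourceW", 0x1000)]

if __name__ == "__main__":
    hook_address = hook_exports["CreateScalableFontResourceW_hook"]
    patch_import(ptxt9_imports, "gdi32.dll",
                 "CreateScalableFontResourceW", hook_address)
    print(call_import(memory, ptxt9_imports, "CreateScalableFontResourceW",
                      1, "res.tmp", "font.tmp", None), calls)
```

In a real process the same write lands on a pointer-sized slot in the module's import address table, which usually needs its page protection changed first.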

Lastly, tripwire resumes the Publisher thread, opens a named pipe, and waits. When Publisher makes a call to CreateScalableFontResourceW, the hook function is called instead. The hook opens the other end of the named pipe and writes a signal, letting tripwire know the function was called. I want to take a moment to emphasize that this is horrible and you should never use this technique in production software.
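The signalling handshake is simple enough to sketch in Python. This is a stand-in, not the native code: it uses the standard library's multiprocessing.connection, which is backed by a named pipe on Windows and a Unix socket elsewhere, and it plays the "hook" from a thread instead of from an injected DLL. The function names and the message are hypothetical.

```python
import threading
from multiprocessing.connection import Client, Listener

SIGNAL = b"CreateScalableFontResourceW called"


def hook_side(address):
    """Stand-in for the hook: open the other end of the pipe and signal."""
    with Client(address) as conn:
        conn.send_bytes(SIGNAL)


def wait_for_signal(trigger, timeout=5.0):
    """Stand-in for tripwire: open the pipe, let the target run, then wait."""
    with Listener() as listener:  # platform default: named pipe on Windows
        worker = threading.Thread(target=trigger, args=(listener.address,))
        worker.start()  # plays the role of resuming the suspended thread
        # accept() blocks until the hook connects, which mirrors tripwire
        # waiting indefinitely when the hooked function is never called.
        with listener.accept() as conn:
            received = conn.recv_bytes() if conn.poll(timeout) else None
        worker.join()
    return received


if __name__ == "__main__":
    print(wait_for_signal(hook_side))
```

The listener is created before the target is allowed to run, so the hook can never fire before there is a pipe to connect to.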

When I run Publisher using tripwire and open a file with line feed characters, the signal is sent, indicating CreateScalableFontResourceW was called. After I patch the Publisher file with carriage returns and repeat this process, tripwire waits patiently and never receives the signal. This confirms that replacing line feed characters avoids the problem code path, and it shows that the call to CreateScalableFontResourceW succeeds in an interactive session.[5]

1. Although libcfbf is open source, the actual patching logic has been implemented in a commercial product. But Apache POI is a good reference for writing the necessary wrapper around libcfbf. Also worth noting is the library only supports reading. I solved the writing problem by computing a window around the line feed character, and scanning the file in binary read mode to find the byte offset.

2. I believe this is a bug in Windows. But there isn't enough whiskey in the world for me to suffer through Microsoft's absurd process for reporting a bug.

3. The Image Loader and CreateProcess sections of Windows Internals were very helpful references for this project. Unfortunately I have an older edition that's missing the section on unkillable processes. But if you're mucking around in low-level APIs, then I can't recommend this book enough.

4. I may have been able to do the IAT patching from DllMain and save myself a lot of marshaling between processes. But I hadn't thought that far ahead before I wrote most of it in tripwire. In hindsight, I like this approach better anyway.

5. An interesting discovery from hooking CreateScalableFontResourceW is that returning FALSE from the hook (indicating the function failed) causes Publisher to carry on its merry way. As far as I can tell, there were no adverse side-effects.