From: Dirk Zabel on
Hi,
first of all, thanks to all who responded.
I played around with the disassembly command and found the following:
the instruction '18 a0 ff 15 b0 6b' is indeed suspicious, as Ivan
Brugiolo wrote. It sits at a000266b = LeaveCrit+0x4, i.e. not
inside my binary. This address is not at the beginning but in the middle
of the first instruction of LeaveCrit.

If I disassemble LeaveCrit, I see:
win32k!LeaveCrit:
a0002667 8b0de02318a0 mov ecx,dword ptr [win32k!gpresUser (a01823e0)]
a000266d ff15b06b17a0 call dword ptr [win32k!_imp_ExReleaseResourceLite (a0176bb0)]

This explains the MISALIGNED_IP line in the output of the !analyze -v
command.
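
The match between the suspicious bytes and the LeaveCrit listing can be checked by hand, or with a few lines of Python. This is only a sketch: the byte string is copied from the two instructions quoted above, and the names LEAVE_CRIT_BASE and bytes_at are made up for illustration, not anything the debugger provides.

```python
# Bytes of the first two instructions of win32k!LeaveCrit, copied from
# the disassembly above (mov ecx,... followed by call dword ptr [...]).
LEAVE_CRIT_BASE = 0xA0002667
code = bytes.fromhex("8b0de02318a0" "ff15b06b17a0")

def bytes_at(address, length=6):
    """Return `length` bytes of the listing starting at `address`."""
    offset = address - LEAVE_CRIT_BASE
    return code[offset:offset + length]

# Start decoding four bytes into the function, i.e. inside the mov.
misaligned = bytes_at(LEAVE_CRIT_BASE + 4)
print(misaligned.hex(" "))  # 18 a0 ff 15 b0 6b
```

The six bytes starting at LeaveCrit+4 are the tail of the mov followed by the head of the call, which is exactly the byte sequence reported as the faulting instruction.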

Unfortunately, the whole stack trace does not include instructions
from user-mode address space, so I cannot see which user program
eventually called KiThreadStartup.

The next lower stack frame reads

bfd6fb5c a00ad527 00000000 00000000 00000000 win32k!TimersProc+0x133
and disassembly of a00ad520 gives

a00ad520 75a6 jne win32k!RawInputThread+0x4ea (a00ad4c8)
a00ad522 e8ab53f5ff call win32k!TimersProc (a00028d2)
a00ad527 a1dc9e18a0 mov eax,dword ptr [win32k!gnRetryReadInput (a0189edc)]

Disassembly of TimersProc begins with
win32k!TimersProc:
a00028d2 55 push ebp
a00028d3 8bec mov ebp,esp
a00028d5 83ec0c sub esp,0Ch
a00028d8 53 push ebx
a00028d9 56 push esi
a00028da 57 push edi
a00028db e8b8fdffff call win32k!EnterCrit (a0002698)
a00028e0 ba0000fe7f mov edx,offset SharedUserData (7ffe0000)
a00028e5 8b02 mov eax,dword ptr [edx]
a00028e7 f76204 mul eax,dword ptr [edx+4]
a00028ea 0facd018 shrd eax,edx,18h
a00028ee 8b350c2518a0 mov esi,dword ptr [win32k!gptmrFirst (a018250c)]
a00028f4 8bf8 mov edi,eax

i.e. TimersProc calls EnterCrit(icalSection?). Some pages later I see:

a00029db 99 cdq
a00029dc 68f0d8ffff push 0FFFFD8F0h
a00029e1 52 push edx
a00029e2 50 push eax
a00029e3 e8dcfcffff call win32k!_allmul (a00026c4)
a00029e8 6a00 push 0
a00029ea 52 push edx
a00029eb 50 push eax
a00029ec ff35042518a0 push dword ptr [win32k!gptmrMaster (a0182504)]
a00029f2 ff15ec6d17a0 call dword ptr [win32k!_imp__KeSetTimer (a0176dec)]
a00029f8 e86afcffff call win32k!LeaveCrit (a0002667)
a00029fd 5f pop edi
a00029fe 5e pop esi
a00029ff 5b pop ebx
a0002a00 c9 leave
a0002a01 c3 ret


Now I don't see how any corruption of the user-mode program or
corruption of the data which a user-mode program feeds to a system call
can result in an incorrect jump INSIDE some instruction of a
kernel-mode procedure -- this looks more like a hardware quirk sending
the CPU to LeaveCrit+4 instead of LeaveCrit to
me. Or is this impression too far-fetched?

Another question, though: I could not get an overview of which
processes were active when the fault occurred. I had thought the
command View | Processes and Threads does this. The result is only a
window showing (transcribed as ASCII art):
[-] 000:f0f0f0f0 ntoskrnl.exe
+-000:1

This does not seem to be a problem specific to this dump, however, as I got
the same result when I deliberately produced a dump from another W2k
system running inside VirtualPC (using NotMyFault from Mark
Russinovich), loaded the resulting full memory dump into windbg and
tried the "Processes and Threads" command on this dump. I guess I am
doing something simple wrong, but what? I did set up windbg to use the
MS symbol server and the symbol cache seems to be OK.

Thank you for any comments,

- Dirk





From: Ivan Brugiolo [MSFT] on
>
> Now I don't see how any corruption of the user-mode program or
> corruption of the data which a user-mode program feeds to a system call
> can result in an incorrect jump INSIDE some instruction of a
> kernel-mode procedure -- this looks more like a hardware quirk sending
> the CPU to LeaveCrit+4 instead of LeaveCrit to
> me. Or is this impression too far-fetched?


The byte code for `call win32k!LeaveCrit (a0002667)` is `e8 6a fc ff ff`.
It's encoded with an @eip-relative offset. I'd bet that by flipping one bit
at a time in the offset you can easily get the `+4` displacement.
This looks like single-bit corruption of a code page.
Unless you have ECC memory with MCA events,
it's hard to make further progress.
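
The bit-flip bet can be tried mechanically. The following is a rough sketch, not debugger output: it assumes the call sits at a00029f8 (as in the TimersProc disassembly earlier in the thread) and that execution landed at a000266b (LeaveCrit+0x4); the function and constant names are invented for the example.

```python
import struct

CALL_ADDR = 0xA00029F8                  # address of the call instruction
ENCODING = bytes.fromhex("e86afcffff")  # e8 opcode + little-endian rel32
OBSERVED = 0xA000266B                   # LeaveCrit+0x4, where execution landed

def call_target(call_addr, encoding):
    """Decode a near relative call (opcode e8) and return its target."""
    assert encoding[0] == 0xE8
    (rel32,) = struct.unpack("<i", encoding[1:5])
    next_ip = call_addr + len(encoding)  # rel32 is relative to the next instruction
    return (next_ip + rel32) & 0xFFFFFFFF

def single_bit_flips(encoding):
    """Yield every encoding that differs by one bit in the rel32 field."""
    for byte_index in range(1, 5):
        for bit in range(8):
            corrupted = bytearray(encoding)
            corrupted[byte_index] ^= 1 << bit
            yield byte_index, bit, bytes(corrupted)

matches = [
    (byte_index, bit, corrupted.hex())
    for byte_index, bit, corrupted in single_bit_flips(ENCODING)
    if call_target(CALL_ADDR, corrupted) == OBSERVED
]

print(hex(call_target(CALL_ADDR, ENCODING)))  # 0xa0002667
print(matches)                                # [(1, 2, 'e86efcffff')]
```

Of the 32 possible single-bit flips in the offset, exactly one redirects the call to LeaveCrit+0x4: bit 2 of the first displacement byte, turning `6a` into `6e`.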

--
This posting is provided "AS IS" with no warranties, and confers no rights.
Use of any included script samples are subject to the terms specified at
http://www.microsoft.com/info/cpyright.htm



From: Jeffrey Walton on
On May 29, 11:52 pm, "Alexander Grigoriev" <a...(a)earthlink.net> wrote:
> If a single byte is corrupted, it looks like a memory fault. The OP needs to
> run memory diagnostics.
>
> I dare to suggest my test from http://home.earthlink.net/~alegr/download/memtest.htm

Hi Alexander,

You are probably correct. I'll add this to my war chest.

Jeff

>
> "Ivan Brugiolo [MSFT]" <ivanb...(a)online.microsoft.com> wrote in message news:uNuwwNhoHHA.1240(a)TK2MSFTNGP04.phx.gbl...
> >
> > The instruction being executed `18 a0 ff 15 b0 6b` looks suspicious.
> > Can you compare the code stream with a known good binary?
> > I'd suspect some form of code corruption, one example of which
> > (that I debugged recently from a crashdump) is reported below.
> > The code you are crashing at is known not to take external input,
> > and to have been reasonably stable,
>
> >
> > SNIP [Dump Analysis]
> >
> > "Dirk Zabel" <dza...(a)community.nospam> wrote in message
> > news:BCBED70E-F1C4-455F-BBE9-F742A77AEC9A(a)microsoft.com...
> >> Hi,
> >> on a Windows 2000 machine I had cooperating programs running (one is
> >> communicating with an external device via the RS-232 port and exchanges
> >> data with the other using UDP/IP). The computer was running for some weeks
> >> without problems, but now a blue screen occurred. I had configured it to
> >> generate a full memory dump in such a case. When analyzing the dump using
> >> windbg, I see this:
> >> kd> !analyze -v
> >> SNIP
> >>
> >> As far as I know, my programs cannot be responsible for any blue screen,
> >> as they are running in ring 3 (user mode). So I expected some driver to be
> >> listed in the dump. But


From: Dirk Zabel on
Ivan Brugiolo [MSFT] wrote:
>> Now I don't see how any corruption of the user-mode program or
>> corruption of the data which a user-mode program feeds to a system call
>> can result in an incorrect jump INSIDE some instruction of a
>> kernel-mode procedure -- this looks more like a hardware quirk sending
>> the CPU to LeaveCrit+4 instead of LeaveCrit to
>> me. Or is this impression too far-fetched?
>
>
> the byte-code for `call win32k!LeaveCrit (a0002667)` is `e8 6a fc ff
> ff`.
> It's encoded with a @eip relative offset. I'd bet that flipping one bit at
> the time
> in the offset you can easily get the `+4` displacement.
> This looks like code-page single-bit corruption.
> Unless you have ECC memory with MCA events,
> it's hard to make further progress
>
OK, I think this is exactly what happened.
The !analyze -v output said:
LAST_CONTROL_TRANSFER: from a00029fd to a000266b

disassembly around the calling instruction:
kd> u a00029e8
win32k!TimersProc+0x11e:
a00029e8 6a00 push 0
a00029ea 52 push edx
a00029eb 50 push eax
a00029ec ff35042518a0 push dword ptr [win32k!gptmrMaster (a0182504)]
a00029f2 ff15ec6d17a0 call dword ptr [win32k!_imp__KeSetTimer (a0176dec)]
a00029f8 e86afcffff call win32k!LeaveCrit (a0002667)
a00029fd 5f pop edi
a00029fe 5e pop esi
i.e. at a00029f8 there is a call to a0002667, but the next instruction
executed was at a000266b. This happens if the displacement byte "6a" is
incorrectly read as "6e", so this seems to be a flip of bit 2.
I will try the memory test Alexander recommended and take a look at
the BIOS settings (memory timing).
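
The same conclusion can be reached backwards from the LAST_CONTROL_TRANSFER line alone. This is a small sketch under the assumption that the "from" address is the return address pushed by the call (a00029fd) and the "to" address is where execution actually continued; the helper name rel32 is made up for the example.

```python
RETURN_ADDR = 0xA00029FD  # "from" address: instruction after the call
INTENDED = 0xA0002667     # win32k!LeaveCrit
OBSERVED = 0xA000266B     # "to" address: LeaveCrit+0x4

def rel32(next_ip, target):
    """Displacement a near relative call needs to reach `target`."""
    return (target - next_ip) & 0xFFFFFFFF

good = rel32(RETURN_ADDR, INTENDED)
bad = rel32(RETURN_ADDR, OBSERVED)
diff = good ^ bad
flipped_bits = [bit for bit in range(32) if diff >> bit & 1]

print(f"{good:08x} {bad:08x} {flipped_bits}")  # fffffc6a fffffc6e [2]
```

The two displacements differ in a single bit, bit 2 of the lowest byte, which is consistent with one flipped bit in the code page rather than a wild jump.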

Thank you again to all who helped. I think I learned something about
interpreting windbg output.

Yours
- Dirk