From: Bx.C / x87asm on 18 Aug 2007 04:27

>> Have you ever tried it ? Wouldn't it slow down everything ?
>> I always followed the recommendation of CPU manuals and
>> align GDT/IDT to cache bounds.

> I have a number of apps that happen to have GDT at arbitrary offsets
> (multiple of 16, 2 and even odd). All worked. As for slowing down,
> once the segment descriptors are cached in the CPU segment registers
> (their hidden parts), I wouldn't expect any noticeable slowdown.

It wouldn't slow down "everything", just the mov instructions that load a segment register...

Set up a full GDT (every entry filled) and try a couple thousand loads of DS, ES, FS, and GS, using the next GDT entry for each load. Time the couple thousand loads and see what you come up with, with the GDT aligned as follows:

16-byte (GDT base ends in 0)
8-byte but not 16-byte (GDT base ends in 8)
4-byte but not 8-byte (GDT base ends in 4 or C)
2-byte but not 4-byte (GDT base ends in 2, 6, A, or E)
1-byte but not 2-byte (GDT base ends in 1, 3, 5, 7, 9, B, D, or F)

The task is trivial, but I'm too lazy to do it myself (and it doesn't matter to me because I usually align the GDT to a page boundary as a matter of preference).

-- Bx.C / x87asm
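For anyone who wants to run the experiment Bx.C describes, the five alignment buckets can be computed from the GDT base address. The following is only a minimal sketch, not something from the thread: the function name `gdt_alignment_class` and the sample base addresses are made up for illustration, and the actual timing of segment register loads still has to be done in ring 0.

```python
def gdt_alignment_class(base: int) -> int:
    """Return the largest power of two, capped at 16, that divides base.

    This matches the buckets in the post: 16, 8, 4, 2 or 1.
    The function name is hypothetical, chosen for this sketch.
    """
    if base < 0:
        raise ValueError("base must be non-negative")
    if base == 0:
        return 16  # address 0 is divisible by every power of two
    # base & -base isolates the lowest set bit, i.e. the natural alignment
    return min(base & -base, 16)


if __name__ == "__main__":
    # Sample bases are arbitrary illustration values, one per bucket.
    for base in (0x1000, 0x1008, 0x100C, 0x1006, 0x1003):
        print(f"{base:#06x}: {gdt_alignment_class(base)}-byte aligned")
```

A base whose last hex digit is 4 or C lands in the 4-byte bucket, 2/6/A/E in the 2-byte bucket, and any odd digit in the 1-byte bucket, as in the list above.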
From: Wolfgang Kern on 18 Aug 2007 04:28

Alexei A. Frounze wrote:

[about VM86 speed ...]
>> Right Alex, but if you check it within protected rings PL>0 then you
>> may see VM86 faster than my method because of the RM<->PM32 links
>> will take much more time then.

> I didn't get it. Can you elaborate?

Let's see if I can explain it in short words, mmh, may not work ..:)
so better a timing example:

My link routines reside together with the stack in the HMA region,
so both RM and PM32 can access data and call everything there by
sharing one single stack with CS (BASE = 1.MB; TOP = end of HMA).
I don't use paging, but if I did, it would have the HMA area physically
mapped anyway.

I call BIOS services from within PM32sys this way:

_______
:use32
66 e8 xx xx          CALL near Enter_RM
:use16
                     mov ds,ax         ;if required
                     mov es,ax         ;
                     ....              ;call a BIOS-function
                                       ;and/or do something faster in true RM
                     ....              ;
e8 xx xx             CALL near Leave_RM
:use32
Im_back:
                     ;jmp/ret/cont...
______

Enter_RM:            ;(call from PM32sys (high 16 address bits are zero))
                     ;these aren't procedures (no frame,no arg,no C-convention)
                     CLI
                     LIDT [RM_IDT]
                     MOV eax,CR0
                     AND eax,-2
                     MOV cr0,eax
66 EA xx xx 00 18    JMPF 0018: dw PM16
:use16
PM16:
                     mov ax,0FFFF
                     mov ss,ax
EA xx xx FFFF        JMPF FFFF: dw RM16
RM16:
                     STI
                     RET
_______
Leave_RM:            ;(call from RM)
                     CLI
                     LIDT [PM_IDT]
                     MOV eax,CR0
                     OR eax,1          ;or al,1 may do as well, but...
                     MOV CR0,eax
66 EA xx xx xx xx    JMPF 0028: dq PM32
PM32:
                     MOV eax,08
                     MOV SS,eax
                     STI
                     MOV DS,eax        ;just in case
                     MOV ES,eax        ;
                     RET
_______

The summary of what's in use for my PM32->RM->PM32 switch sequence:

near call/ret pairs:                     2
non task nor stack altering far-jumps:   3
CR0 accesses:                            2+2
SEG-reg loads:                           4 (6 if required)
EAX/AX load/and/or:                      4
CLI/STI pairs:                           2
LIDT:                                    2

And if I now raw calculate a worst case cycle-count:
2*(3+5) +3*16 +2*4+2*2 +6*4 +4 +2*(4+4) +2*35 = 190 cycles (AMD K7)
for the whole back and forward switch.
I measured just 90..135 in live tests.

The BIOS then can access data and I/O unrestricted, without being
delayed by the I/O permission and privilege checks and the segment/paging
translation that occur within VM86.
So these 'lost' 190 cycles (vs. 2 * >240 for PL3<->VM86 task switches :)
are easily gained back.

If you now try to use my method from PL3 to RM and back, it must have
a valid task segment with matching SS:ESP pairs, which then asks
for a special dedicated task for this, and you may add 'some'
cycles for first going from PL3 to PL0 and the reverse for coming back.
So then you may end up close to VM86 task timing.

__
wolfgang
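The worst-case sum in the post can be checked term by term. The sketch below is not from the thread: the per-item costs are read straight off the formula as posted, but the pairing of each term with an item from the summary list is an interpretation, and the names `TERMS` and `worst_case_cycles` are made up for illustration. In particular the post does not say which of the two CR0 terms (2*4 and 2*2) is the read and which is the write, so they are labeled neutrally.

```python
# (item, count, cycles per item), taken from the formula
# 2*(3+5) +3*16 +2*4+2*2 +6*4 +4 +2*(4+4) +2*35 in the post (AMD K7).
# Which term belongs to which item is an assumption of this sketch.
TERMS = [
    ("near call/ret pairs", 2, 3 + 5),
    ("far jumps", 3, 16),
    ("CR0 accesses, first group", 2, 4),
    ("CR0 accesses, second group", 2, 2),
    ("segment register loads (6 if required)", 6, 4),
    ("EAX/AX load/and/or", 4, 1),
    ("CLI/STI pairs", 2, 4 + 4),
    ("LIDT", 2, 35),
]


def worst_case_cycles(terms=TERMS) -> int:
    """Sum count * cost over all terms of the round trip PM32 -> RM -> PM32."""
    return sum(count * cost for _name, count, cost in terms)


if __name__ == "__main__":
    for name, count, cost in TERMS:
        print(f"{name:40s} {count} * {cost:2d} = {count * cost:3d}")
    print("total:", worst_case_cycles())  # total: 190
```

The total agrees with the 190 cycles quoted, and it stays below the 2 * 240 cycles given as a lower bound for the PL3<->VM86 round trip.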
From: Wolfgang Kern on 18 Aug 2007 04:39

"Alexei A. Frounze" wrote:

>> Mmh.. if I look at the IA32 GDTR/IDTR layout it is possible
>> to have the tables at an odd address ...
>> Have you ever tried it ? Wouldn't it slow down everything ?
>>
>> I always followed the recommendation of CPU manuals and
>> align GDT/IDT to cache bounds.

> I have a number of apps that happen to have GDT at arbitrary offsets
> (multiple of 16, 2 and even odd). All worked. As for slowing down,
> once the segment descriptors are cached in the CPU segment registers
> (their hidden parts), I wouldn't expect any noticeable slowdown.

Interesting, but as Bx.C said, it would slow down all seg-reg loads.
I'd better keep to my alignment, at least there :)

__
wolfgang
From: Alexei A. Frounze on 18 Aug 2007 12:40

On Aug 18, 1:28 am, "Wolfgang Kern" <nowh...(a)never.at> wrote:
> Alexei A. Frounze wrote:
>
> [about VM86 speed ...]
>
> >> Right Alex, but if you check it within protected rings PL>0 then you
> >> may see VM86 faster than my method because of the RM<->PM32 links
> >> will take much more time then.
> > I didn't get it. Can you elaborate?
>
> Let's see if I can explain it in short words, mmh, may not work ..:)
> so better a timing example:
>
> My link routines reside together with the stack in the HMA region,
> so both RM and PM32 can access data and call everything there by
> sharing one single stack with CS (BASE = 1.MB; TOP = end of HMA).
> I don't use paging, but if I did, it would have the HMA area physically
> mapped anyway.
>
> I call BIOS services from within PM32sys this way:
>
> _______
> :use32
> 66 e8 xx xx          CALL near Enter_RM
> :use16
>                      mov ds,ax         ;if required
>                      mov es,ax         ;
>                      ...               ;call a BIOS-function
>                                        ;and/or do something faster in true RM
>                      ...               ;
> e8 xx xx             CALL near Leave_RM
> :use32
> Im_back:
>                      ;jmp/ret/cont...
> ______
>
> Enter_RM:            ;(call from PM32sys (high 16 address bits are zero))
>                      ;these aren't procedures (no frame,no arg,no C-convention)
>                      CLI
>                      LIDT [RM_IDT]
>                      MOV eax,CR0
>                      AND eax,-2
>                      MOV cr0,eax
> 66 EA xx xx 00 18    JMPF 0018: dw PM16

I think the two above instructions should be swapped.

> :use16
> PM16:
>                      mov ax,0FFFF
>                      mov ss,ax
> EA xx xx FFFF        JMPF FFFF: dw RM16
> RM16:
>                      STI
>                      RET
> _______
> Leave_RM:            ;(call from RM)
>                      CLI
>                      LIDT [PM_IDT]
>                      MOV eax,CR0
>                      OR eax,1          ;or al,1 may do as well, but...
>                      MOV CR0,eax
> 66 EA xx xx xx xx    JMPF 0028: dq PM32
> PM32:
>                      MOV eax,08
>                      MOV SS,eax
>                      STI
>                      MOV DS,eax        ;just in case
>                      MOV ES,eax        ;
>                      RET
> _______
> The summary of what's in use for my PM32->RM->PM32 switch sequence:
>
> near call/ret pairs:                     2
> non task nor stack altering far-jumps:   3
> CR0 accesses:                            2+2
> SEG-reg loads:                           4 (6 if required)
> EAX/AX load/and/or:                      4
> CLI/STI pairs:                           2
> LIDT:                                    2
>
> And if I now raw calculate a worst case cycle-count:
> 2*(3+5) +3*16 +2*4+2*2 +6*4 +4 +2*(4+4) +2*35 = 190 cycles (AMD K7)
> for the whole back and forward switch.
> I measured just 90..135 in live tests.

And this is also your interrupt latency.

Most importantly, since in the above picture I see no reference to the
PIC, I assume that you don't reprogram it (at least, across RM<->PM
switches), which probably means that in PM you share the same
interrupt vectors between exceptions and IRQs:

int 8:  IRQ0 (timer), #DF
int 9:  IRQ1 (keyboard), FPU overrun 386-
int 10: IRQ2 (8259 PIC's cascade IRQ), #TS
int 11: IRQ3 (COM2/4), #NP
int 12: IRQ4 (COM1/3), #SS
int 13: IRQ5 (LPT/SB/?), #GP
int 14: IRQ6 (FDD), #PF
int 15: IRQ7 (LPT/SB/?), (reserved)

If this is so, then either handling of the two things is complicated
by distinguishing between IRQs and exceptions (which worsens interrupt
latency further) or exception handling is generally non-functional in
PM. I'd still want to catch #DF, #NP, #SS and #GP even when TSS and
paging aren't used.

Do you not use all of these: timer, COMs, LPT/SB/?
Or do you hope to never get any exception in this range?

> The BIOS then can access data and I/O unrestricted, without being
> delayed by the I/O permission and privilege checks and the segment/paging
> translation that occur within VM86.
> So these 'lost' 190 cycles (vs. 2 * >240 for PL3<->VM86 task switches :)
> are easily gained back.

Now, what exactly do you mean by "PL3<->VM86-task switches"?
Do you actually perform a task switch (jump/call/int/iret to/from
TSS)?

> If you now try to use my method from PL3 to RM and back, it must have
> a valid task segment with matching SS:ESP pairs, which then asks
> for a special dedicated task for this, and you may add 'some'
> cycles for first going from PL3 to PL0 and the reverse for coming back.
> So then you may end up close to VM86 task timing.

Hold on. I want to know answers to the above questions.

Alex
From: Wolfgang Kern on 19 Aug 2007 10:26

Alexei A. Frounze wrote:

[about VM86 speed ...]
>> Let's see if I can explain it in short words, mmh, may not work ..:)
>> so better a timing example:

>> My link routines reside together with the stack in the HMA region,
>> so both RM and PM32 can access data and call everything there by
>> sharing one single stack with CS (BASE = 1.MB; TOP = end of HMA).
>> I don't use paging, but if I did, it would have the HMA area physically
>> mapped anyway.

>> I call BIOS services from within PM32sys this way:
[...the example...]
>> I measured just 90..135 in live tests.

> And this is also your interrupt latency.

?? Don't know what it's got to do with INT.
But right, if I'd use SW-interrupts they would take ~90 cycles (naked).

> Most importantly, since in the above picture I see no reference to the
> PIC, I assume that you don't reprogram it (at least, across RM<->PM
> switches), which probably means that in PM you share the same
> interrupt vectors between exceptions and IRQs:

> int 8:  IRQ0 (timer), #DF
> int 9:  IRQ1 (keyboard), FPU overrun 386-
> int 10: IRQ2 (8259 PIC's cascade IRQ), #TS
> int 11: IRQ3 (COM2/4), #NP
> int 12: IRQ4 (COM1/3), #SS
> int 13: IRQ5 (LPT/SB/?), #GP
> int 14: IRQ6 (FDD), #PF
> int 15: IRQ7 (LPT/SB/?), (reserved)

My TRUE realmode IDT is still at 0000:0000 and I have all IRQs
reprogrammed (to 50...5f), and all INT 00..7f (which includes all
exceptions) are handled by the system for both PM32 and true Real mode
anyway [real mode exceptions differ from PM quite a bit].
But this won't/mustn't be altered to call BIOS functions, because I use
INT6D instead of INT10 (even though there is no exception 10h in RM).

> If this is so, then either handling of the two things is complicated
> by distinguishing between IRQs and exceptions (which worsens interrupt
> latency further) or exception handling is generally non-functional in
> PM. I'd still want to catch #DF, #NP, #SS and #GP even when TSS and
> paging aren't used.

I catch all exceptions for RM, PM16/32 and even Big Real, and enter or
just display the debugger box if it can't be handled by the system.

> Do you not use all of these: timer, COMs, LPT/SB/?

All of them can be used if present and desired.

> Or do you hope to never get any exception in this range?

NO! I'm not at all a believer :)

>> The BIOS then can access data and I/O unrestricted, without being
>> delayed by the I/O permission and privilege checks and the segment/paging
>> translation that occur within VM86.
>> So these 'lost' 190 cycles (vs. 2 * >240 for PL3<->VM86 task switches :)
>> are easily gained back.

> Now, what exactly do you mean by "PL3<->VM86-task switches"?

Swap stack from PL3 to PL0, swap again to enter VM86, grant pages ...
and the whole thing back to return.

> Do you actually perform a task switch (jump/call/int/iret to/from TSS)?

Not necessarily a full hardware task switch, but it needs at least
some privilege-checked reads and writes to/from the current task segment.

Perhaps I missed a faster way, so how long will it take you to call
any 16-bit BIOS function from PM32 using VM86 ?
(I may need it one day because of long mode..)

>> If you now try to use my method from PL3 to RM and back, it must have
>> a valid task segment with matching SS:ESP pairs, which then asks
>> for a special dedicated task for this, and you may add 'some'
>> cycles for first going from PL3 to PL0 and the reverse for coming back.
>> So then you may end up close to VM86 task timing.

> Hold on. I want to know answers to the above questions.

OK. :)

__
wolfgang
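
The vector overlap Alexei asks about, and why remapping the IRQs to 50..5f avoids it, can be sketched as follows. This is not code from the thread: the function names are made up for illustration, the BIOS default bases (08h for the master PIC, 70h for the slave) are the usual PC values, and the split of 50..5f into a master base of 50h and a slave base of 58h is an assumption, since the post only gives the overall range.

```python
# In protected mode, vectors 0..31 are reserved by the CPU for exceptions.
EXCEPTION_VECTORS = range(0, 32)


def irq_vectors(master_base: int, slave_base: int) -> list:
    """Vectors used by IRQ0..7 (master 8259) and IRQ8..15 (slave 8259)."""
    return list(range(master_base, master_base + 8)) + \
           list(range(slave_base, slave_base + 8))


def colliding_vectors(master_base: int, slave_base: int) -> list:
    """IRQ vectors that fall inside the CPU exception range."""
    return [v for v in irq_vectors(master_base, slave_base)
            if v in EXCEPTION_VECTORS]


if __name__ == "__main__":
    # Usual BIOS defaults: IRQ0..7 at 08h..0Fh, IRQ8..15 at 70h..77h.
    print([hex(v) for v in colliding_vectors(0x08, 0x70)])
    # Remapped as in the post (50..5f); the 50h/58h split is assumed.
    print([hex(v) for v in colliding_vectors(0x50, 0x58)])
```

With the default bases, all eight master IRQ vectors (08h..0Fh) share numbers with exception vectors; with the remapped bases the list is empty.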