Next: What is the difference between using IOCTL to write/read IO ports vs. modifying IOPM and using _INP/OUTP_ ??
From: Alberto on 2 Sep 2009 12:42

At 4 GHz, one microsecond covers 4,000 cycles. The Intel Optimization Guide says that an OUT instruction, for example, has a latency of less than 225 cycles. From the processor's point of view, 1.67 microseconds could handle quite a few I/O instructions!

On the other hand, a 33 MHz PCI bus covers only 33 such cycles per microsecond. That's more in line with the 1.67 microseconds per I/O that the OP saw.

But then, that should limit memory access speed too, no? I don't know off the top of my head whether a PCI bus I/O cycle is any slower than a memory cycle - I'm not sure the bus knows anything beyond the fact that these are two different address spaces. My intuition is that at PCI bus level there should be no difference between memory and I/O cycles, but I may be wrong. If those cycles go through subtractive decoding in the south bridge, that might slow the I/O instructions down, but otherwise I don't see how it could be that much slower. The poster said it was a PCI bus card, no? Not an ISA board.

I would write a program that loops forever issuing INs or OUTs - at instruction level, no HAL, no software in between - put my scope on the bus, and take a good peek. Then I would write a loop that reads and/or writes PCI memory, and again get a good timing.

Anything above that must be software overhead!

Alberto.

On Sep 2, 1:37 am, Tim Roberts <t...(a)probo.com> wrote:
> L337 <vern.engineer...(a)gmail.com> wrote:
>
> >Brief question for someone who sort of understands the Windows IO
> >Manager and IO subsystem model. I have a PCI device that operates at
> >0x300 IO mapped address. To use this card, we used a DLL written in C
> >and used in VB6 back in Win 98. In WinXP, I now use Userport which
> >claims it modifys Permission MAP of the Windows Subsystem to let all
> >user mode programs run at Ring 0.
>
> No, that's not what it does. It does modify the I/O permission map, but
> the effect of that is to tell the processor that I/O instructions are
> allowed to be executed at ring 3 without trapping to kernel mode. Your
> user mode code is still in ring 3.
>
> >When I use the C dll to do writes and reads, I achieve a speed of 1.65
> >uS between writes. Decent speed. But I wonder if I can get any
> >better/faster then this?
>
> No. The I/O port instructions simply do not run any faster than that. The
> low I/O ports are ISA compatible, and are limited by some of the
> motherboard components to essentially ISA bus speeds.
> --
> Tim Roberts, t...(a)probo.com
> Providenza & Boekelheide, Inc.
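The cycle arithmetic in Alberto's post can be checked with a few lines of C. This is only a back-of-the-envelope sketch using the nominal clock figures quoted above (4 GHz CPU, 33 MHz PCI, 1.67 uS per I/O); the function names are made up for illustration, and nothing here measures real hardware.

```c
#include <stdio.h>

/* Clock cycles that elapse in `microseconds` at `freq_mhz` MHz.
   1 MHz is one cycle per microsecond, so this is a plain product. */
double cycles_in(double freq_mhz, double microseconds)
{
    return freq_mhz * microseconds;
}

/* Prints the cycle budget discussed in the post (nominal figures only). */
void print_cycle_budget(void)
{
    const double io_time_us = 1.67; /* per-I/O time quoted in the post */

    printf("CPU cycles per uS at 4 GHz:  %.0f\n", cycles_in(4000.0, 1.0));
    printf("PCI clocks per uS at 33 MHz: %.0f\n", cycles_in(33.0, 1.0));
    printf("CPU cycles in %.2f uS:       %.0f\n",
           io_time_us, cycles_in(4000.0, io_time_us));
    printf("PCI clocks in %.2f uS:       %.1f\n",
           io_time_us, cycles_in(33.0, io_time_us));
}
```

Calling print_cycle_budget() shows that 1.67 uS corresponds to roughly 55 PCI clocks at 33 MHz, which is the scale Alberto is comparing against the OUT instruction's own latency.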
From: L337 on 2 Sep 2009 13:53

On Sep 1, 3:47 pm, "Maxim S. Shatskih" <ma...(a)storagecraft.com.no.spam> wrote:
> > have said. But why is it so slow when I do that? Is there THAT much
> > overhead?
>
> What is the METHOD_xxx code of the IOCTL?
>
> Also note that IO ports are not designed to be fast and sustain major
> data flows. Major data flows are all via DMA these days. IO ports are
> a) declared obsolete by MS since around 1999 b) only used for
> control/status registers.
>
> Also, you can use KERNRATE or Intel vTune to find the bottleneck.
> Probably this will be the handle validation in NtDeviceIoControlFile
> or such.
>
> Also, if you want more speed accessing IO ports, at least use REP INSW
> and REP OUTSW opcodes, which are READ/WRITE_PORT_BUFFER_USHORT in the
> kernel. The IOCTL should also be "write port X with data buffer D of
> length L". This is actually the first thing to start with.
>
> --
> Maxim S. Shatskih
> Windows DDK MVP
> ma...(a)storagecraft.com
> http://www.storagecraft.com

Wow, so many replies, thanks a bunch everyone. Let's see, where do I start?

I was first using portio from the DDK, but that seemed a bit slow. It uses METHOD_BUFFERED for the IOCTL. Then I went and used PortTalk's source code, and that code looks something like this in the kernel-mode driver:

    switch (irpSp->Parameters.DeviceIoControl.IoControlCode)

    case IOCTL_WRITE_PORT_UCHAR:
        if (inBufLength >= 3) {
            KdPrint(("PORTTALK: IOCTL_WRITE_PORT_UCHAR(0x%X,0x%X)",
                     ShortBuffer[0], CharBuffer[2]));
            WRITE_PORT_UCHAR((PUCHAR)ShortBuffer[0], CharBuffer[2]);
        } else
            ntStatus = STATUS_BUFFER_TOO_SMALL;
        pIrp->IoStatus.Information = 0; /* Output Buffer Size */
        ntStatus = STATUS_SUCCESS;
        break;

In the IOCTL header, it looked like this:

    void outportb(unsigned short PortAddress, unsigned char byte)
    {
        unsigned int error;
        DWORD BytesReturned;
        unsigned char Buffer[3];
        unsigned short *pBuffer;

        pBuffer = (unsigned short *)&Buffer[0];
        *pBuffer = PortAddress;
        Buffer[2] = byte;

        error = DeviceIoControl(PortTalk_Handle,
                                IOCTL_WRITE_PORT_UCHAR,
                                &Buffer,
                                3,
                                NULL,
                                0,
                                &BytesReturned,
                                NULL);

        if (!error)
            printf("Error occured during outportb while talking to PortTalk driver %d\n", GetLastError());
    }

Then of course in the main() code we just do an "outportb(0x378, 0xFF);".

So the bottom line is: it seems the consensus is that with my PCI card I cannot get ANY better speed, even if I built some type of proper device driver, since the INP/OUTP assembly instruction routines are faster than all the overhead?

My PCI card is a simple (no-interrupt) card operating at 33 MHz per the PCI 2.3 spec. Read transactions complete in about 6 PCI cycles and writes complete in about 3 cycles. The board that's attached below does all kinds of functions, from ADC to reading registers from a mixed-signal device, etc., so that is why there are turn-around times.

I have not been able to figure out how to map the address to memory space or issue a CBE of 6 (mem read) or a CBE of 7 (mem write) on the PCI bus. I take it this won't improve my speed over I/O space anyway.
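Maxim's suggestion quoted above, an IOCTL of the form "write port X with data buffer D of length L", could be sketched on the user-mode side roughly as follows. The buffer layout and the name pack_port_write are hypothetical; they are not part of PortTalk or of the DDK samples. The driver would need a matching handler, for example one that validates the lengths and hands the words to WRITE_PORT_BUFFER_USHORT.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request layout (an assumption for this sketch):
   bytes 0-1: port address
   bytes 2-3: number of 16-bit words that follow
   bytes 4..: the words themselves, in host byte order */
enum { PORT_WRITE_HEADER_BYTES = 4 };

/* Packs one "write port X with buffer D of length L" request into `out`.
   Returns the number of bytes used, or 0 if the arguments are invalid
   or `out` is too small to hold the header plus payload. */
size_t pack_port_write(uint8_t *out, size_t out_cap,
                       uint16_t port, const uint16_t *words, uint16_t count)
{
    size_t payload = (size_t)count * sizeof(uint16_t);
    size_t total = PORT_WRITE_HEADER_BYTES + payload;

    if (out == NULL || out_cap < total || (count > 0 && words == NULL))
        return 0;

    memcpy(out, &port, sizeof port);
    memcpy(out + 2, &count, sizeof count);
    if (payload > 0)
        memcpy(out + PORT_WRITE_HEADER_BYTES, words, payload);
    return total;
}
```

The point of batching is that the fixed per-call cost of DeviceIoControl is paid once per buffer rather than once per port write. Whether that helps depends on the device accepting back-to-back writes to the same port.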
From: L337 on 2 Sep 2009 13:56

On Sep 1, 3:47 pm, "Maxim S. Shatskih" <ma...(a)storagecraft.com.no.spam> wrote:
> > have said. But why is it so slow when I do that? Is there THAT much
> > overhead?
>
> What is the METHOD_xxx code of the IOCTL?
>
> Also note that IO ports are not designed to be fast and sustain major
> data flows. Major data flows are all via DMA these days. IO ports are
> a) declared obsolete by MS since around 1999 b) only used for
> control/status registers.
>
> Also, you can use KERNRATE or Intel vTune to find the bottleneck.
> Probably this will be the handle validation in NtDeviceIoControlFile
> or such.
>
> Also, if you want more speed accessing IO ports, at least use REP INSW
> and REP OUTSW opcodes, which are READ/WRITE_PORT_BUFFER_USHORT in the
> kernel. The IOCTL should also be "write port X with data buffer D of
> length L". This is actually the first thing to start with.
>
> --
> Maxim S. Shatskih
> Windows DDK MVP
> ma...(a)storagecraft.com
> http://www.storagecraft.com

Sorry, the actual IOCTL function codes look like this:

    #define PORTTALK_TYPE 40000

    #define IOCTL_READ_PORT_UCHAR \
        CTL_CODE(PORTTALK_TYPE, 0x904, METHOD_BUFFERED, FILE_ANY_ACCESS)

    #define IOCTL_WRITE_PORT_UCHAR \
        CTL_CODE(PORTTALK_TYPE, 0x905, METHOD_BUFFERED, FILE_ANY_ACCESS)
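For anyone decoding these values in a debugger or a trace, the sketch below reproduces the bit layout that the Windows CTL_CODE macro uses (device type in bits 16-31, access in bits 14-15, function in bits 2-13, method in bits 0-1). The helper names ctl_code and ctl_method are made up for this example; METHOD_BUFFERED and FILE_ANY_ACCESS are both 0 in the Windows headers and are passed here as plain numbers so the snippet builds without them.

```c
#include <stdint.h>

/* Same bit layout as the Windows CTL_CODE macro, written as a function
   on unsigned values so the shift of the device type is well defined. */
uint32_t ctl_code(uint32_t device_type, uint32_t function,
                  uint32_t method, uint32_t access)
{
    return (device_type << 16) | (access << 14) | (function << 2) | method;
}

/* Extracts the transfer method (low two bits); 0 is METHOD_BUFFERED. */
uint32_t ctl_method(uint32_t code)
{
    return code & 0x3u;
}
```

With PORTTALK_TYPE = 40000 (0x9C40), ctl_code(40000, 0x904, 0, 0) gives 0x9C402410 for the read code and ctl_code(40000, 0x905, 0, 0) gives 0x9C402414 for the write code. The low two bits are 0 in both, which is how to tell from the numeric value alone that the request is METHOD_BUFFERED.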
From: L337 on 2 Sep 2009 14:08

On Sep 2, 9:42 am, Alberto <more...(a)terarecon.com> wrote:
> At 4Ghz, one microsecond would cover 4,000 cycles. The Intel
> Optimization Guide says that an OUT instruction, for example, has a
> latency less than 225 cycles. From the processor's point of view, 1.67
> microseconds could handle quite a few I/O instructions!
>
> On the other hand, a 33Mhz PCI bus would cover only 33 such cycles per
> microsecond. That's more in line with the 1.67 microseconds per I/O
> that the OP saw.
>
> But then, that should limit memory access speed too, no ? I don't know
> off my hat whether a PCI bus I/O cycle is any slower than a memory
> cycle - I'm not sure the bus knows anything beyond the fact that these
> are two different address spaces. My intuition is that at PCI bus
> level there should be no difference between memory and I/O cycles, but
> I may be wrong. If those cycles go through subtractive decoding in the
> south bridge, that might slow the I/O instructions down, but otherwise
> I don't see how it could be that much slower. The poster said it was a
> PCI bus card, no ? Not an ISA board.
>
> I would write a program that loops forever issuing INs or OUTs - at
> instruction level, no HAL, no software in between - put my scope on
> the bus, and take a good peek. Then I would write a loop that reads
> and/or writes PCI memory, and again, get a good timing.
>
> Anything above that must be software overhead!
>
> Alberto.
>
> On Sep 2, 1:37 am, Tim Roberts <t...(a)probo.com> wrote:
>
> > L337 <vern.engineer...(a)gmail.com> wrote:
> >
> > >Brief question for someone who sort of understands the Windows IO
> > >Manager and IO subsystem model. I have a PCI device that operates at
> > >0x300 IO mapped address. To use this card, we used a DLL written in C
> > >and used in VB6 back in Win 98. In WinXP, I now use Userport which
> > >claims it modifys Permission MAP of the Windows Subsystem to let all
> > >user mode programs run at Ring 0.
> >
> > No, that's not what it does. It does modify the I/O permission map, but
> > the effect of that is to tell the processor that I/O instructions are
> > allowed to be executed at ring 3 without trapping to kernel mode. Your
> > user mode code is still in ring 3.
> >
> > >When I use the C dll to do writes and reads, I achieve a speed of 1.65
> > >uS between writes. Decent speed. But I wonder if I can get any
> > >better/faster then this?
> >
> > No. The I/O port instructions simply do not run any faster than that. The
> > low I/O ports are ISA compatible, and are limited by some of the
> > motherboard components to essentially ISA bus speeds.
> > --
> > Tim Roberts, t...(a)probo.com
> > Providenza & Boekelheide, Inc.

Well, I have done this already, as best I could. Using the assembly IN/OUT instructions with the C DLL, I achieved 1.65 uS between writes and reads. But when I use an IOCTL call, with the device driver sending an IRP down the stack, it takes about 8-10 uS between each write or read, as measured on the scope. In my code I'm basically just repeating the commands one after the other.

My 40 MHz FPGA device down below handles it no problem. But the issue is the reads. There are lots of ADCs, multimeter measurements (GPIB), and analog stuff that require some time to measure and latch in. I would prefer the 1-2 uS that I have now, not the 8-10 uS retry times.

I just want to know that I am doing everything right and that my timings are consistent: that 1-2 uS is the best I can do bypassing the HAL, and that 8-10 uS is considered "normal" when doing METHOD_BUFFERED IOCTL WRITE_PORT_UCHAR commands, etc.

Oh, and this is with WinXP on a P4 2.2 GHz system with a 32-bit PCI 2.3 bus.

Thanks.
From: L337 on 2 Sep 2009 18:12

I just used DebugView to look at the IOCTL calls. The timings between each PCI IO transaction are consistent with what I'm seeing on the scope.

    PORTTALK: IOCTL_WRITE_PORT_UCHAR(0x300,0xFF)  // at 0.00000000 sec
    STATUS_SUCCESS RETURNED                       // at 0.00000299 sec (2.99 uS), this is when ntStatus = STATUS_SUCCESS;
    PORTTALK: IOCTL_WRITE_PORT_UCHAR(0x300,0xAA)  // at 0.00001238 sec (12.38 uS), the next IO WRITE
    STATUS_SUCCESS RETURNED                       // at 0.00001454 sec (14.54 uS), here is when it completes again

So it looks like it takes about 2-3 microseconds from the kernel making the call to when the PCI bus actually delivers the command. Actually, I can't be specific, since I don't know precisely where to trigger. But looking at this timing, I would say that too much time is spent between the consecutive writes. As we can see, the first request completes at 2.99 uS and the second one does not start until 12.38 uS, about 9.4 uS later. Wow, this is taking too long on the software side. Perhaps I can write faster code?

There isn't much difference between the generic Genport WinDDK driver and the PortTalk IOCTL calls. They both look identical in function to me in the way they process and pass the IOCTL to the kernel-mode driver.

The way I see it, I want to do fast back-to-back burst IO writes and IO reads.
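The gaps in the DebugView log can be worked out directly from the raw timestamps in seconds. The sketch below does just that arithmetic; the array and function names are made up for illustration, and the values are copied from the log in this post (0.00001238 sec is 12.38 uS and 0.00001454 sec is 14.54 uS).

```c
#include <stdio.h>

/* DebugView timestamps from the post, in seconds:
   [0] first write logged, [1] first write completed,
   [2] second write logged, [3] second write completed. */
const double debugview_stamps_s[4] = {
    0.00000000, 0.00000299, 0.00001238, 0.00001454
};

/* Interval between two timestamps, converted from seconds to microseconds. */
double gap_us(double earlier_s, double later_s)
{
    return (later_s - earlier_s) * 1e6;
}

/* Prints the three intervals of interest from the log. */
void print_gaps(void)
{
    const double *t = debugview_stamps_s;

    printf("first write, inside driver:   %.2f uS\n", gap_us(t[0], t[1]));
    printf("idle gap between the writes:  %.2f uS\n", gap_us(t[1], t[2]));
    printf("second write, inside driver:  %.2f uS\n", gap_us(t[2], t[3]));
}
```

On these numbers the driver spends roughly 2-3 uS per request, and about 9.4 uS passes between one request completing and the next one arriving, which lines up with the 8-10 uS seen on the scope. Note that the KdPrint calls themselves add some time, so the in-driver figures are upper bounds.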