From: Alberto on

At 4 GHz, one microsecond covers 4,000 cycles. The Intel
Optimization Guide says that an OUT instruction, for example, has a
latency of less than 225 cycles. From the processor's point of view,
1.67 microseconds could handle quite a few I/O instructions!

On the other hand, a 33 MHz PCI bus covers only 33 such cycles per
microsecond. That's more in line with the 1.67 microseconds per I/O
that the OP saw.
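
As a rough sketch of the arithmetic above (the function name is made up
for illustration), the number of clock cycles that fit in an interval is
just the clock rate in MHz times the interval in microseconds:

```c
/* Number of clock cycles that elapse in interval_us microseconds
 * on a clock running at clock_mhz MHz (1 MHz = 1 cycle per microsecond).
 * Hypothetical helper, for illustration only. */
static double cycles_in_interval(double clock_mhz, double interval_us)
{
    return clock_mhz * interval_us;
}
```

For example, cycles_in_interval(4000.0, 1.0) gives the 4,000 CPU cycles
per microsecond, and cycles_in_interval(33.0, 1.67) gives roughly 55 PCI
clocks for one 1.67 microsecond I/O.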

But then, that should limit memory access speed too, no? I don't know
off the top of my head whether a PCI bus I/O cycle is any slower than
a memory cycle - I'm not sure the bus knows anything beyond the fact
that these are two different address spaces. My intuition is that at
the PCI bus level there should be no difference between memory and I/O
cycles, but I may be wrong. If those cycles go through subtractive
decoding in the south bridge, that might slow the I/O instructions
down, but otherwise I don't see how it could be that much slower. The
poster said it was a PCI bus card, no? Not an ISA board.

I would write a program that loops forever issuing INs or OUTs - at
instruction level, no HAL, no software in between - put my scope on
the bus, and take a good peek. Then I would write a loop that reads
and/or writes PCI memory, and again, get a good timing.

Anything above that must be software overhead!


Alberto.





On Sep 2, 1:37 am, Tim Roberts <t...(a)probo.com> wrote:
> L337 <vern.engineer...(a)gmail.com> wrote:
>
> >Brief question for someone who sort of understands the Windows IO
> >Manager and IO subsystem model.  I have a PCI device that operates at
> >0x300 IO mapped address.  To use this card, we used a DLL written in C
> >and used in VB6 back in Win 98.  In WinXP, I now use Userport which
> >claims it modifys Permission MAP of the Windows Subsystem to let all
> >user mode programs run at Ring 0.
>
> No, that's not what it does.  It does modify the I/O permission map, but
> the effect of that is to tell the processor that I/O instructions are
> allowed to be executed at ring 3 without trapping to kernel mode.  Your
> user mode code is still in ring 3.
>
> >When I use the C dll to do writes and reads, I achieve a speed of 1.65
> >uS between writes.  Decent speed.  But I wonder if I can get any
> >better/faster then this?
>
> No.  The I/O port instructions simply do not run any faster than that.  The
> low I/O ports are ISA compatible, and are limited by some of the
> motherboard components to essentially ISA bus speeds.
> --
> Tim Roberts, t...(a)probo.com
> Providenza & Boekelheide, Inc.

From: L337 on
On Sep 1, 3:47 pm, "Maxim S. Shatskih"
<ma...(a)storagecraft.com.no.spam> wrote:
> > have said.  But why is it so slow when I do that?  Is there THAT much
> > overhead?
>
> What is the METHOD_xxx code of the IOCTL?
>
> Also note that IO ports are not designed to be fast and sustain major data flows. Major data flows are all via DMA these days. IO ports are a) declared obsolete by MS since around 1999 b) only used for control/status registers.
>
> Also, you can use KERNRATE or Intel vTune to find the bottleneck. Probably this will be the handle validation in NtDeviceIoControlFile or such.
>
> Also, if you want more speed accessing IO ports, at least use REP INSW and REP OUTSW opcodes, which are READ/WRITE_PORT_BUFFER_USHORT in the kernel. The IOCTL should also be "write port X with data buffer D of length L". This is actually the first thing to start with.
>
> --
> Maxim S. Shatskih
> Windows DDK MVP
> ma...(a)storagecraft.com http://www.storagecraft.com


Wow, so many replies, thanks a bunch everyone. Let's see, where do I
start.

I was first using portio from the DDK but that seemed a bit slow. That
uses the METHOD_BUFFERED for the IOCTL. Then I went and used
porttalk's source code and that code looks something like this in the
kernel mode driver:

switch ( irpSp->Parameters.DeviceIoControl.IoControlCode ) {
case IOCTL_WRITE_PORT_UCHAR:
    if (inBufLength >= 3) {
        KdPrint( ("PORTTALK: IOCTL_WRITE_PORT_UCHAR(0x%X,0x%X)",
                  ShortBuffer[0], CharBuffer[2]) );
        WRITE_PORT_UCHAR((PUCHAR)ShortBuffer[0], CharBuffer[2]);
        ntStatus = STATUS_SUCCESS;
    } else {
        /* Input must hold a 2-byte port address plus 1 data byte. */
        ntStatus = STATUS_BUFFER_TOO_SMALL;
    }
    pIrp->IoStatus.Information = 0; /* Output Buffer Size */
    break;
}

In the IOCTL header, it looked like this:

#include <windows.h>
#include <stdio.h>

/* PortTalk_Handle and IOCTL_WRITE_PORT_UCHAR come from the PortTalk sources. */
void outportb(unsigned short PortAddress, unsigned char byte)
{
    unsigned int error;          /* nonzero means DeviceIoControl succeeded */
    DWORD BytesReturned;
    unsigned char Buffer[3];     /* bytes 0-1: port address, byte 2: data */
    unsigned short * pBuffer;

    pBuffer = (unsigned short *)&Buffer[0];
    *pBuffer = PortAddress;
    Buffer[2] = byte;

    error = DeviceIoControl(PortTalk_Handle,
                            IOCTL_WRITE_PORT_UCHAR,
                            &Buffer,
                            3,
                            NULL,
                            0,
                            &BytesReturned,
                            NULL);

    if (!error)
        printf("Error occurred during outportb while talking to "
               "PortTalk driver %lu\n", GetLastError());
}


Then of course in the main() code we just do a "outportb(0x378,
0xFF);".
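
Along the lines of the "write port X with data buffer D of length L"
suggestion quoted above, here is a minimal sketch of how the user-mode
side could pack many bytes into one request, so that the per-IOCTL
overhead is paid once per batch instead of once per byte. The buffer
layout and the name pack_port_batch are assumptions made up for this
example; PortTalk does not define them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request layout, NOT part of PortTalk:
 *   bytes 0..1  port address (host byte order, as in outportb above)
 *   bytes 2..3  number of data bytes that follow
 *   bytes 4..   data bytes, to be written to the port in order
 */
#define BATCH_HEADER_SIZE 4u

/* Packs one batched write into out.
 * Returns the total number of bytes used, or 0 if the arguments are
 * invalid or the request does not fit in out_size bytes. */
static size_t pack_port_batch(uint8_t *out, size_t out_size,
                              uint16_t port,
                              const uint8_t *data, uint16_t count)
{
    size_t total = BATCH_HEADER_SIZE + (size_t)count;

    if (out == NULL || (count > 0 && data == NULL) || out_size < total)
        return 0;

    /* memcpy avoids unaligned 16-bit stores into a byte buffer. */
    memcpy(out, &port, sizeof port);
    memcpy(out + 2, &count, sizeof count);
    if (count > 0)
        memcpy(out + BATCH_HEADER_SIZE, data, count);

    return total;
}
```

The packed buffer would then go out in a single DeviceIoControl call
under a new control code, and the driver would loop over the data bytes
(or hand them to WRITE_PORT_BUFFER_UCHAR). Whether the card tolerates
back-to-back writes at that rate is something only the scope can tell.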



So the bottom line is, the consensus seems to be that with my PCI card
I cannot get ANY better speed, even if I built some type of proper
device driver? Since the INP/OUTP assembly instruction routines are
already faster than anything that goes through all the driver overhead?

My PCI card is a simple (no-interrupt) card operating at 33 MHz to the
PCI 2.3 spec. Read transactions complete in about 6 PCI cycles
and writes complete in about 3 cycles. The board that's attached below
does all kinds of functions, from ADC to reading registers from a mixed
signal device, etc., so that is why there are turn-around times.
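
To put those cycle counts next to the measured times, here is a small
sketch that converts PCI clocks to nanoseconds. It assumes a nominal
33 MHz clock and ignores arbitration and wait states; the function name
is made up for the example.

```c
/* Duration in nanoseconds of a bus transaction that takes `cycles`
 * clocks on a bus running at bus_mhz MHz.
 * One clock lasts 1000 / bus_mhz nanoseconds.
 * Hypothetical helper, for illustration only. */
static double pci_cycles_to_ns(unsigned cycles, double bus_mhz)
{
    return (double)cycles * 1000.0 / bus_mhz;
}
```

With these assumptions a 6-cycle read comes out at about 182 ns and a
3-cycle write at about 91 ns, so if the cycle counts are right, the bus
transaction itself would account for only a small part of the time
between accesses.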

I have not been able to figure out how to map the address to Memory
Space or issue a CBE of 6 (mem read) or a CBE of 7 (mem write) on the
PCI bus. I take it this wouldn't improve my speed over IO space anyway.

From: L337 on
On Sep 1, 3:47 pm, "Maxim S. Shatskih"
<ma...(a)storagecraft.com.no.spam> wrote:
> What is the METHOD_xxx code of the IOCTL?

Sorry, the actual IOCTL function codes look like this:

#define PORTTALK_TYPE 40000

#define IOCTL_READ_PORT_UCHAR \
CTL_CODE(PORTTALK_TYPE, 0x904, METHOD_BUFFERED, FILE_ANY_ACCESS)

#define IOCTL_WRITE_PORT_UCHAR \
CTL_CODE(PORTTALK_TYPE, 0x905, METHOD_BUFFERED, FILE_ANY_ACCESS)
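
For reference, here is a sketch of how those definitions expand into
32-bit control codes. It reproduces the bit layout of the CTL_CODE macro
under a different name so it compiles without the Windows headers; the
sketch_ and SKETCH_ names are made up for this example.

```c
#include <stdint.h>

/* Same bit layout as CTL_CODE in the Windows headers:
 *   bits 31..16  DeviceType
 *   bits 15..14  Access
 *   bits 13..2   Function
 *   bits  1..0   Method
 * Unsigned arithmetic is used so that 40000 << 16 cannot overflow. */
static uint32_t sketch_ctl_code(uint32_t device_type, uint32_t function,
                                uint32_t method, uint32_t access)
{
    return (device_type << 16) | (access << 14) | (function << 2) | method;
}

enum {
    SKETCH_METHOD_BUFFERED = 0,     /* numeric value of METHOD_BUFFERED */
    SKETCH_FILE_ANY_ACCESS = 0,     /* numeric value of FILE_ANY_ACCESS */
    SKETCH_PORTTALK_TYPE   = 40000  /* PORTTALK_TYPE from the post above */
};

/* Control code for IOCTL_READ_PORT_UCHAR (function 0x904). */
static uint32_t sketch_ioctl_read_port_uchar(void)
{
    return sketch_ctl_code(SKETCH_PORTTALK_TYPE, 0x904,
                           SKETCH_METHOD_BUFFERED, SKETCH_FILE_ANY_ACCESS);
}

/* Control code for IOCTL_WRITE_PORT_UCHAR (function 0x905). */
static uint32_t sketch_ioctl_write_port_uchar(void)
{
    return sketch_ctl_code(SKETCH_PORTTALK_TYPE, 0x905,
                           SKETCH_METHOD_BUFFERED, SKETCH_FILE_ANY_ACCESS);
}
```

The low two bits carry the transfer method, so both codes end in 00
binary, which is METHOD_BUFFERED: the I/O manager copies the user buffer
into a system buffer for every call.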


From: L337 on
On Sep 2, 9:42 am, Alberto <more...(a)terarecon.com> wrote:
> I would write a program that loops forever issuing INs or OUTs - at
> instruction level, no HAL, no software in between - put my scope on
> the bus, and take a good peek. Then I would write a loop that reads
> and/or writes PCI memory, and again, get a good timing.
>
> Anything above that must be software overhead!


Well, I have already done this as best I could. Using the IN/OUT
assembly instructions from the C DLL, I achieved 1.65 uS between writes
and reads. But when I use an IOCTL call, where an IRP is sent to the
device driver and then down the stack, it takes about 8-10 uS between
each write or read, as measured on the scope.

In my code I'm basically just repeating the commands one after the
other.

My 40 MHz FPGA device down below handles it no problem. But the issue
is the reads. There are lots of ADCs, multi-meter measurements (GPIB),
and analog stuff that require some time to measure and latch in. I
would prefer the 1-2 uS that I have now, not the 8-10 uS retry times.


I just want to know that I am doing everything right and that my
timings are consistent: that 1-2 uS is the best I can do bypassing the
HAL, and that 8-10 uS is considered "normal" when doing METHOD_BUFFERED
IOCTL WRITE_PORT_UCHAR commands, etc. Oh, and this is with WinXP on a
P4 2.2 GHz system with a 32-bit PCI 2.3 bus.

thanks.

From: L337 on
I just used DebugView to look at the IOCTL calls. The timings between
each PCI IO transaction are consistent with what I'm seeing on the
scope.

PORTTALK: IOCTL_WRITE_PORT_UCHAR(0x300,0xFF)  // at 0.00000000 sec
STATUS_SUCCESS RETURNED                       // at 0.00000299 sec (2.99 uS), this is when ntStatus = STATUS_SUCCESS;
PORTTALK: IOCTL_WRITE_PORT_UCHAR(0x300,0xAA)  // at 0.00001238 sec (12.38 uS), the next IO WRITE
STATUS_SUCCESS RETURNED                       // at 0.00001454 sec (14.54 uS), here is when it completes again

So it looks like it takes about 2-3 microseconds from the kernel
making the call to when the PCI bus actually delivers the command.
Actually, I can't be specific, since I don't know precisely where to
trigger. But looking at this timing, I would say that too much time
is spent between the consecutive writes. As we can see, the first
write completes at 2.99 uS and the second request does not start
until 12.38 uS, about 9.4 uS later. Wow, this is taking too long on
the software side. Perhaps I can write faster code? There isn't much
difference between the way the generic Genport WinDDK driver and
porttalk handle IOCTL calls. They both look identical in function to
me in the way they process and pass the IOCTL to the kernel mode
driver.
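
To double-check the gaps, here is a small sketch that turns the raw
DebugView timestamps (in seconds) into microsecond deltas between
consecutive log lines. The helper name is made up for the example.

```c
#include <stddef.h>

/* Writes the time between consecutive events, in microseconds, into
 * out_us (which must hold at least n - 1 values).
 * t_sec holds n timestamps in seconds, in log order.
 * Returns the number of deltas written. */
static size_t deltas_us(const double *t_sec, size_t n, double *out_us)
{
    size_t i;

    if (n < 2)
        return 0;
    for (i = 1; i < n; i++)
        out_us[i - 1] = (t_sec[i] - t_sec[i - 1]) * 1e6;
    return n - 1;
}
```

Fed with 0.00000000, 0.00000299, 0.00001238 and 0.00001454, this gives
about 2.99 uS inside the first write, about 9.39 uS between the first
completion and the second request, and about 2.16 uS inside the second
write. So roughly three quarters of each round trip is spent outside
the driver's write path.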

The way I see it is that I want to do fast back-to-back burst IO
writes and IO reads.