From: Mike Hayward on 3 Mar 2010 04:40

I'm not sure who is working on block io these days, but hopefully an
active developer can steer this feedback toward folks who are as
interested in io performance as I am :-)

I've spent the last several years or so developing a user space
distributed storage system and I've recently gotten down to some io
performance tuning. Surprisingly, my results indicate that the
O_NONBLOCK flag produces no noticeable effect on read or writev to a
Linux block device. I always perform aligned ios which are a multiple
of the sector size, which also allows the use of O_DIRECT if desired.

For testing, I've been using 2.6.22 and 2.6.24 kernels (Fedora Core
and Ubuntu distros) on both x86_64 and 32 bit arm architectures and
get similar results on every variation of hardware and kernel tested,
so I figure the behavior may still exist in the most recent kernels.

To extract the following data, I used the following set of system
calls in a loop driven by poll, surrounding read and write calls
immediately with time checks:

    fd = open( filename, O_RDWR | O_NONBLOCK | O_NOATIME );
    gettimeofday( &time, 0 );
    read( fd, pos, len );
    writev( fd, iov, count );
    poll( pfd, npfd, timeoutms );

Byte counts are displayed in hex. On my core 2 duo laptop, for
example, io to or from the buffer cache typically takes 100 to 125
microseconds to transfer 64k.

----------------------------------------------------------------------
BUFFER CACHE NOT FULL, NONBLOCKING 64K WRITES AS EXPECTED

write fd:3 0.000117s bytes:10000 remain:0
write fd:3 0.000115s bytes:10000 remain:0
write fd:3 0.000116s bytes:10000 remain:0
write fd:3 0.000118s bytes:10000 remain:0
write fd:3 0.000125s bytes:10000 remain:0
write fd:3 0.000126s bytes:10000 remain:0
write fd:3 0.000101s bytes:10000 remain:0

----------------------------------------------------------------------
READING AND WRITING, BUFFER CACHE FULL

read fd:3 0.006351s bytes:10000 remain:0
write fd:3 0.001235s bytes:200 remain:0
write fd:3 0.002477s bytes:200 remain:0
read fd:3 0.005010s bytes:10000 remain:0
write fd:3 0.001243s bytes:200 remain:0
read fd:3 0.005028s bytes:10000 remain:0
write fd:3 0.000506s bytes:200 remain:0
write fd:3 0.000106s bytes:10000 remain:0
write fd:3 0.000812s bytes:200 remain:0
write fd:3 0.000108s bytes:10000 remain:0
write fd:3 0.000807s bytes:200 remain:0
write fd:3 0.002652s bytes:200 remain:0
write fd:3 0.000107s bytes:10000 remain:0
write fd:3 0.000141s bytes:10000 remain:0
write fd:3 0.002232s bytes:200 remain:0

These are not worst-case but rather best-case results. For an example
of worse results: using a usb flash device, I frequently (about once a
second or so) see reads or writes under heavier load blocked for 500ms
or more while vmstat and top report more than 90% idle / wait. 500ms
to perform a 512 byte "non blocking" io with a nearly idle cpu is an
eternity in computer time; more than 10,000 times longer than it
should take to memcpy all or even a portion of the data or return
EAGAIN.

I discovered this because, even though they succeed, all of these
"non" blocking system calls block so much that they easily choke my
process's nonblocking socket io. As a workaround to this failed
attempt at nonblocking disk io, I now intend to implement a somewhat
more complex solution using aio or scsi generic to prevent block
device io from choking network io.
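A minimal, self-contained version of the read-timing half of that
harness might look like the following sketch; the device path,
transfer size, and iteration count here are illustrative assumptions,
not the actual test program:

#define _GNU_SOURCE             /* for O_NOATIME */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define IO_SIZE 0x10000         /* 64k, matching the traces above */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/sdb"; /* assumed device */
    int fd = open(path, O_RDWR | O_NONBLOCK | O_NOATIME);
    if (fd < 0) { perror("open"); return 1; }

    /* Sector-aligned buffer so the same harness can also run with O_DIRECT. */
    void *buf;
    if (posix_memalign(&buf, 512, IO_SIZE)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    for (int i = 0; i < 16; i++) {
        struct timeval t0, t1;
        gettimeofday(&t0, 0);
        ssize_t n = read(fd, buf, IO_SIZE);
        gettimeofday(&t1, 0);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("read fd:%d %.6fs bytes:%zx\n", fd, s, (size_t)(n < 0 ? 0 : n));
        if (n < 0)
            perror("read"); /* a working O_NONBLOCK would yield EAGAIN here */
    }

    free(buf);
    close(fd);
    return 0;
}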
I think this O_NONBLOCK behavior has aspects that could probably be
classified as both a documentation and a kernel defect, depending upon
whether the existing open(2) man page documents the intended behavior
of read and write or not.

If O_NONBLOCK is meaningful at all against block devices (see the
man page for semantics), one would expect a nonblocking io involving
an unbuffered page to return either a partial result, if a prefix of
the io can be completed immediately, or EAGAIN; to schedule an io
against the device; and then to wake a blocking select or poll type
call once the relevant page at the file descriptor's cursor becomes
available in the buffer cache. (A sketch of a caller relying on those
semantics follows below.) The timing and results of each read or
write call speak for themselves: specifying O_NONBLOCK does not
convert unbuffered ios to async buffer cache ios as expected.
Typically blocking ios (i.e. unbuffered reads, or sustained writes to
a full, dirty buffer cache) definitely block in my app, whether or not
O_NONBLOCK is specified.

I've spent a tremendous amount of time building and benchmarking a
program based upon the Linux documentation for the previously
mentioned system calls, only to find out the kernel doesn't behave as
specified. To save someone else from my fate: if O_NONBLOCK doesn't
prevent reads and writes to block devices from blocking, that should
be documented in the man page, and open or fcntl should preferably
also return an error when the flag is supplied for a block device.
That's the easy solution. The harder solution would be to make the
system calls actually be nonblocking when O_NONBLOCK is specified.

Furthermore, I've noticed these kernels also allow O_NONBLOCK and
O_DIRECT to be simultaneously specified against a block device, even
though this is not logically possible: by definition the buffer cache
is not involved, and the process will have to wait for the io to
synchronously complete. This flag incompatibility should probably be
documented for clarity, and it would be straightforward for it to
return an error if these contradictory behaviors are simultaneously
specified, unintentionally of course.

Thoughts anyone?

- Mike
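To make the semantics requested above concrete: assuming read() on a
block device really did return EAGAIN on a buffer cache miss after
queueing the disk io, and poll() woke once the page arrived (neither
of which, per the rest of this thread, Linux implements), a caller
could drive disk reads exactly like nonblocking socket reads. A
hypothetical sketch:

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * HYPOTHETICAL: assumes read() returns EAGAIN on a buffer cache miss
 * after scheduling the disk io, and that poll() wakes when the page
 * lands in the cache. Linux implements neither for block devices.
 */
static ssize_t read_when_ready(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0 || errno != EAGAIN)
            return n;               /* data, EOF, or a real error */

        /* EAGAIN: the io is in flight; sleep until the cache has it. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
}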
From: Alan Cox on 3 Mar 2010 06:50

> If O_NONBLOCK is meaningful at all against block devices (see the
> man page for semantics), one would expect a nonblocking io

It isn't... The manual page says "When possible, the file is opened
in non-blocking mode". Your write is probably not blocking - but the
memory allocation for it is forcing other data to disk to make room,
ie it didn't block, it was just "slow".

O_NONBLOCK on a regular file does influence how it responds to leases
and mandatory locks.

> documented for clarity, and it would be straightforward for it to
> return an error if these contradictory behaviors are simultaneously
> specified, unintentionally of course.

and risk breaking existing apps.

> Thoughts anyone?

Alan
From: Mike Hayward on 3 Mar 2010 15:10

Hi Alan,

> > If O_NONBLOCK is meaningful at all against block devices (see the
> > man page for semantics), one would expect a nonblocking io
>
> It isn't...

Thanks for the reply. It's good to get confirmation that I am not all
alone in an alternate nonblocking universe. The linux man pages
actually had me convinced O_NONBLOCK would actually keep a process
from blocking on device io :-)

> The manual page says "When possible, the file is opened in
> non-blocking mode". Your write is probably not blocking - but the
> memory allocation for it is forcing other data to disk to make room,
> ie it didn't block, it was just "slow".

Even though I know quite well what blocking is, I am not sure how we
define "slowness". Perhaps when we do define it, we can also define
"immediately" to mean anything less than five seconds ;-)

You are correct that io to the disk is precisely what must happen to
complete, and last time I checked, that was the very definition of
blocking. Not only are writes blocking, even reads are blocking. The
docs for read(2) also say it will return EAGAIN if "Non-blocking I/O
has been selected using O_NONBLOCK and no data was immediately
available for reading."

There is no doubt the kernel is blocking the process whether or not
O_NONBLOCK is specified. Look again at the timings I sent; the flag
doesn't affect io at all. I think we can probably agree that reading
from an empty buffer cache should by definition return EAGAIN within
a few microseconds if it isn't going to block the process. But it
doesn't. I can easily make a process "run slowly" for an entire half
of a second or longer just trying to perform a 512 byte "non
blocking" read on a system with a virtually idle cpu.

Writing is no different from reading when the buffer cache cannot
immediately service either kind of request (i.e. all pages are dirty,
the write touches a page not in the cache, and there is no more free
ram). If a process can't run while the kernel performs io to a device
to service a writev call, the kernel is by definition blocking said
process. I certainly concur that blocking is also both slow and not
very immediate :-)

Why is blocking io an issue? As an example, time nonblocking reads to
a drive and it takes, say, 5ms to return from a 64k read. Run several
processes simultaneously doing the same thing and it takes, say, 10ms
to service each "non blocking" read request. Do a couple hundred ios
per second in each process and you'll soon find out your processes
(or threads) have nearly zero time at the cpu, despite the fact that
the system is virtually idle and you are performing 100% "linux non
blocking" device io.

I've been doing unix io for a very long time and can assure you that
this is precisely why most high performance io applications use
asynchronous io libraries or multiple threads. It isn't that they are
necessarily compute intensive, but if read and write are going to
block your process, how else can you simultaneously execute ios to
different devices or perform computation while waiting on device io?

----------------------------------------------------------------------
There is currently and quite literally no point in specifying
O_NONBLOCK in Linux when opening a block device to affect anything
other than locking semantics, since it doesn't do anything.
----------------------------------------------------------------------

I'm not arguing that linux either should or should not provide
nonblocking read and write calls, but pointing out that the
documentation claims it does when clearly O_NONBLOCK doesn't do
anything related to io, at least not with a block device. Probably it
doesn't do anything related to read or write against file systems
either.

> > documented for clarity, and it would be straightforward for it to
> > return an error if these contradictory behaviors are
> > simultaneously specified, unintentionally of course.
>
> and risk breaking existing apps.

Changing anything risks breaking an app somewhere :-) You are right;
I completely agree it isn't appropriate to remove the flag, since its
meaning has been overloaded and it affects locking semantics with
O_DIRECT. Perhaps the man pages are partly derived from POSIX specs,
and nonblocking read and write calls are where linux eventually wants
to be? Updating the docs to describe its actual behavior as it stands
(or rather, the lack thereof) should be fairly low impact on existing
apps.

How much effort do you think it would take to build consensus to
update the man pages? Accurate man pages don't really break code and
should really cut down on a lot of confusion, emails, and wasted
effort going forward. Do you think we should post a documentation
defect as opposed to a kernel defect?

- Mike
From: Alan Cox on 3 Mar 2010 16:30

> blocking. Not only are writes blocking, even reads are blocking. The
> docs for read(2) also say it will return EAGAIN if "Non-blocking I/O
> has been selected using O_NONBLOCK and no data was immediately
> available for reading."

The read case is more clearly blocking. We don't implement non
blocking disk I/O in that sense, although AIO sort of does and
threads are very cheap for I/O tasks.

> There is no doubt the kernel is blocking the process whether or not
> O_NONBLOCK is specified. Look again at the timings I sent; the flag
> doesn't affect io at all. I think we can probably agree that reading
> from an empty buffer cache should by definition return EAGAIN within
> a few microseconds if it isn't going to block the process.

That might make sense in its own way, but there would then be no
reason for the I/O ever to complete. Non blocking tends to mean
"don't wait for some external non kernel event" (eg serial data
arriving, hitting a button).

> I've been doing unix io for a very long time and can assure you that
> this is precisely why most high performance io applications use
> asynchronous io libraries or multiple threads. It isn't that they
> are necessarily compute intensive, but if read and write are going
> to block your process, how else can you simultaneously execute ios
> to different devices or perform computation while waiting on device
> io?

The big challenge is that you may need to do disk I/O in many
situations you don't expect. Eg finding out whether the disk block
you want is available in the cache might itself require disk I/O. You
would end up with an implementation model in the kernel that was
essentially

    if (O_NDELAY) {
        try_op;
        if (blocking)
            create_thread;
    }

which would badly underperform threading it in the first place. Unix
perhaps never got it entirely right, but we inherited that model. VMS
SYS$QIO v SYS$QIOW is a good deal more elegantly structured.

> claims it does when clearly O_NONBLOCK doesn't do anything related
> to io, at least not with a block device. Probably it doesn't do
> anything related to read or write against file systems either.

Correct - except for things like mandatory locks where it has a real
meaning.

> Perhaps the man pages are partly derived from POSIX specs, and
> nonblocking read and write calls are where linux eventually wants to
> be? Updating the docs to describe its actual behavior as it stands
> (or rather, the lack thereof) should be fairly low impact on
> existing apps.

I've not read the SuS entries on this for a while. There was some
discussion a while ago on what would be needed to create a behaviour
where, as soon as something blocked, the kernel created a thread that
continued to perform the I/O side and returned an error. It's not an
easy problem to solve, and it's not clear that solving it is actually
worth it versus using threads and making sure our thread
implementation is fast and has fast synchronization primitives.

> How much effort do you think it would take to build consensus to
> update the man pages? Accurate man pages don't really break code and
> should really cut down on a lot of confusion, emails, and wasted
> effort going forward. Do you think we should post a documentation
> defect as opposed to a kernel defect?

I would go one further... post a documentation patch to
linux-man(a)vger.kernel.org for discussion and merging.
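For reference, the "AIO sort of does" route is Linux native AIO via
libaio. A sketch under assumptions (illustrative device path and
sizes; link with -laio); true asynchrony here requires O_DIRECT and
aligned buffers:

#define _GNU_SOURCE                    /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT); /* assumed device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;                                      /* aligned 64k buffer */
    if (posix_memalign(&buf, 512, 0x10000)) return 1;

    io_context_t ctx = 0;
    if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 0x10000, 0);        /* 64k from offset 0 */
    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* ...the process is free to service sockets or compute here... */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);             /* reap the completion */
    printf("aio read completed: %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}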
Alan
From: M vd S on 4 Mar 2010 20:50

> > > If O_NONBLOCK is meaningful at all against block devices (see
> > > the man page for semantics), one would expect a nonblocking io
> >
> > It isn't...
>
> Thanks for the reply. It's good to get confirmation that I am not
> all alone in an alternate nonblocking universe. The linux man pages
> actually had me convinced O_NONBLOCK would actually keep a process
> from blocking on device io :-)

You're even less alone, I'm running into the same issue just now. But
I think I've found a way around it, see below.

> > The manual page says "When possible, the file is opened in
> > non-blocking mode". Your write is probably not blocking - but the
> > memory allocation for it is forcing other data to disk to make
> > room, ie it didn't block, it was just "slow".
>
> Even though I know quite well what blocking is, I am not sure how we
> define "slowness". Perhaps when we do define it, we can also define
> "immediately" to mean anything less than five seconds ;-)
>
> You are correct that io to the disk is precisely what must happen to
> complete, and last time I checked, that was the very definition of
> blocking. Not only are writes blocking, even reads are blocking. The
> docs for read(2) also say it will return EAGAIN if "Non-blocking I/O
> has been selected using O_NONBLOCK and no data was immediately
> available for reading."

The read(2) manpage reads, under NOTES: "Many file systems and disks
were considered to be fast enough that the implementation of
O_NONBLOCK was deemed unnecessary. So, O_NONBLOCK may not be
available on files and/or disks."

The statement ("fast enough") maybe only reflects the state of
affairs at that time - a 10 ms seek takes an eternity at 3 GHz, and
times 100k it takes an eternity IRL as well. I would define
"immediately" as: the data is available from kernel (or disk)
buffers.

I need to do vast amounts (100k+) of scattered and unordered small
reads from a harddisk and want to keep my seeks short by sorting
them. I have done some measurements and it seems perfectly possible
to derive the physical disk layout from statistics on some 10-100k
random seeks, so I can solve everything in userland. But before
writing my own I/O scheduler I thought I'd give the kernel's and/or
SATA's NCQ tricks a shot. Now the problem is how to tell the
kernel/disk which data I want without blocking:

- readv(2) apparently reads the requests in array order.

- Multithreading doesn't sound too good for just this purpose.

- posix_fadvise(2) sounds like something: "POSIX_FADV_WILLNEED
  initiates a non-blocking read of the specified region into the page
  cache." But there's apparently no signalling to the process that an
  actual read() will indeed not block.

- readahead(2) blocks until the specified data has been read.

- aio_read(3) apparently doesn't issue a real nonblocking read
  request, so you get the unneeded overhead of one thread per
  outstanding request.

- mmap(2) / madvise(2) / mincore(2) may be a way around things
  (although non-atomic), but I haven't tested it yet. It might also
  solve the problem that started this thread, at least for the
  reading part of it.

Writing a small read()-like function that operates through mmap()
doesn't seem too complicated (see the sketch below). As for writing,
you could use msync() with MS_ASYNC to initiate a write. I'm not sure
how to find out if a write has indeed taken place, but at least
initiating a nonblocking write is possible. munmap() might then still
block.

Maybe some guru here can tell beforehand if such an approach would
work?
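A sketch of that read()-through-mmap() idea; untested, as said above,
and the helper name and bookkeeping are illustrative. The
non-atomicity caveat stands: a page can be evicted between the
mincore() check and the copy.

#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes at offset off out of an existing file mapping (e.g.
 * map = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0)) without
 * faulting. Returns len, or -1 ("EAGAIN") if the pages are not
 * resident yet, in which case madvise(MADV_WILLNEED) has asked the
 * kernel to start reading them in; the caller retries later.
 */
static ssize_t mmap_read_nonblock(char *map, size_t off, void *dst, size_t len)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off & ~(pagesz - 1);        /* page-align the range */
    size_t span = (off + len) - start;
    size_t npages = (span + pagesz - 1) / pagesz;

    unsigned char vec[npages];                 /* bit 0 set: page resident */
    if (mincore(map + start, span, vec) < 0)
        return -1;

    for (size_t i = 0; i < npages; i++) {
        if (!(vec[i] & 1)) {
            /* Not all resident: kick off the read-in, but don't touch
             * the pages -- touching them would fault and block. */
            madvise(map + start, span, MADV_WILLNEED);
            return -1;
        }
    }
    memcpy(dst, map + off, len);               /* resident: no fault */
    return (ssize_t)len;
}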
Cheers,

M.