From: David Schwartz on 6 May 2010 14:19

On May 6, 4:40 am, Rainer Weikusat <rweiku...(a)mssgmbh.com> wrote:

> There is none. POLLHUP and POLLERR are two revents-only values which
> report possibly interesting connection state changes other than 'data
> to read available' or 'bufferspace to write data into
> available'. Since both errors and 'hangups' can occur during normal
> I/O operations, the respective input- and output-handlers need to be
> able to deal with these conditions, anyway, and there is no reason to
> care specially for either of both. Because of this, I usually do
> something like
>
> if (revents & ~(POLLIN | POLLOUT)) revents |= POLLIN;
>
> and let the ordinary input handler deal with it.

I sometimes wonder why 'poll' didn't do that by default and let you
ignore IN if HUP or ERR is set, if you want to. Either way would work,
but setting POLLIN would be more consistent with how 'select' behaves.
I wonder if this was felt to be a deficiency in 'select'.

In any event, it's a minor issue. You can certainly treat HUP/ERR the
same way as IN if you want. Sometimes it makes your code more
efficient not to, but it does complicate handling half-open
connections if you need to.

The point is that both 'select' and 'poll' have small wrinkles that
can bite a first-time user. None of this is a big deal though; every
interface is like that.

DS
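To make the idiom above concrete, here is a minimal sketch of a poll()
dispatch loop using Rainer's mapping; handle_input() and
handle_output() are hypothetical application-supplied handlers, not
anything posted in this thread:

    #include <poll.h>

    void handle_input(int fd);   /* hypothetical: read(), sees EOF/errors */
    void handle_output(int fd);  /* hypothetical: write() */

    /* Fold anything other than POLLIN/POLLOUT (POLLHUP, POLLERR,
       POLLNVAL) into POLLIN; the input handler then observes the EOF
       or the pending error from read() itself. */
    void dispatch(struct pollfd *pfd, int nfds)
    {
        int i;
        short revents;

        for (i = 0; i < nfds; i++) {
            revents = pfd[i].revents;
            if (revents & ~(POLLIN | POLLOUT))
                revents |= POLLIN;
            if (revents & POLLIN)
                handle_input(pfd[i].fd);
            if (revents & POLLOUT)
                handle_output(pfd[i].fd);
        }
    }

(A real loop must also cope with a handler closing the descriptor
before the POLLOUT branch runs; that bookkeeping is omitted here.)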
From: Ersek, Laszlo on 6 May 2010 15:40

On Thu, 6 May 2010, Rainer Weikusat wrote:

> "Ersek, Laszlo" <lacos(a)caesar.elte.hu> writes:
>> On Wed, 5 May 2010, Rainer Weikusat wrote:
>
> [...]
>
>>> and usually map anything 'weird' which might be returned in revents
>>> to POLLIN. If it is an EOF, read will detect that. The same is true
>>> for any kind of error condition.
>>
>> (I kind of feel a contradiction between this and:
>>
>> ----v----
>> From davids(a)webmaster.com Wed May 5 12:49:19 2010
>> Date: Wed, 5 May 2010 03:49:19 -0700 (PDT)
>> From: David Schwartz <davids(a)webmaster.com>
>> Newsgroups: comp.unix.programmer
>> Subject: Re: experienced opinions
>>
>> [snip]
>>
>> In fact, the only common error I see with select is thinking that
>> 'poll' will return writability or readability if the connection
>> closes or errors.
>
> [...]
>
> There is none. POLLHUP and POLLERR are two revents-only values which
> report possibly interesting connection state changes other than 'data
> to read available' or 'bufferspace to write data into
> available'. Since both errors and 'hangups' can occur during normal
> I/O operations, the respective input- and output-handlers need to be
> able to deal with these conditions, anyway, and there is no reason to
> care specially for either of both. Because of this, I usually do
> something like
>
> if (revents & ~(POLLIN | POLLOUT)) revents |= POLLIN;
>
> and let the ordinary input handler deal with it.

Oh, now I see it. I misunderstood your topmost sentence:

>> On Wed, 5 May 2010, Rainer Weikusat wrote:
>>> and usually map anything 'weird' which might be returned in revents to
>>> POLLIN. [...]

I interpreted this as "and usually map anything 'weird' which might be
returned in revents to POLLIN [in *events*]", while what you actually
meant was "and usually map anything 'weird' which might be returned in
revents to POLLIN [in *revents*]". The second form corresponds to the
code you posted.

The first (misinterpreted) form does match what David describes as an
error, doesn't it? The first form says "I'll just set POLLIN in
/events/ and I'll get POLLIN in /revents/ too if anything weird
happens".

Stuck in the select() mindset, I am :)

Thanks again,
lacos
From: Ersek, Laszlo on 6 May 2010 18:06

On Tue, 4 May 2010, David Schwartz wrote:

> On May 4, 10:01 am, "Ersek, Laszlo" <la...(a)caesar.elte.hu> wrote:
>
>> I like to understand function specifications in depth before calling
>> said functions.
>
> As I recall, 'select' actually takes three FD sets. Does anyone know
> precisely what that third set is for?

SUSv4 XSH 2.10.11 "Socket Receive Queue" [0] and onwards describes
"out-of-band data" and the "out-of-band data mark". The word "segment"
is used there in a logical sense, not as a TCP segment. The select()
spec [1] says:

----v----
The pselect() function shall examine the file descriptor sets whose
addresses are passed in the /readfds/, /writefds/, and /errorfds/
parameters to see whether some of their descriptors are ready for
reading, are ready for writing, or have an exceptional condition
pending, respectively.

[...]

If a socket has a pending error, it shall be considered to have an
exceptional condition pending. Otherwise, what constitutes an
exceptional condition is file type-specific. For a file descriptor for
use with a socket, it is protocol-specific except as noted below.

[...]

If a descriptor refers to a socket, the implied input function is the
/recvmsg()/ function with parameters requesting normal and ancillary
data, such that the presence of either type shall cause the socket to
be marked as readable. The presence of out-of-band data shall be
checked if the socket option SO_OOBINLINE has been enabled, as
out-of-band data is enqueued with normal data. [...]

[...] A socket shall be considered to have an exceptional condition
pending if a receive operation with O_NONBLOCK clear for the open file
description and with the MSG_OOB flag set would return out-of-band
data without blocking. (It is protocol-specific whether the MSG_OOB
flag would be used to read out-of-band data.) A socket shall also be
considered to have an exceptional condition pending if an out-of-band
data mark is present in the receive queue. Other circumstances under
which a socket may be considered to have an exceptional condition
pending are protocol-specific and implementation-defined.
----^----

<rant>

The text lists "state -> exceptional condition" implications. When one
checks a bit in the third fd_set, one needs the reverse direction:
"exceptional condition -> what state?". In my interpretation, at least
the following situations are possible:

(1) Pending error -- use getsockopt(sock, SOL_SOCKET, SO_ERROR, ...).

Continuing with TCP in mind, which should

- support out-of-band data,
- support the out-of-band data mark,
- enqueue out-of-band data at the end of the queue,
- not place ancillary-data-only segments in the queue (that is,
  segments with neither normal nor out-of-band data),

(2) An out-of-band data mark is present in the Receive Queue.
(Regardless of whether SO_OOBINLINE was set.) Out-of-band data may not
be readable without blocking when the third fd_set fires -- only the
mark may be present. Supposing TCP over IP(v4), a large TCP segment
may be fragmented. When the first fragment (containing the TCP header
and thus the urgent pointer) is processed, the mark may be placed
immediately in the queue.
logical segment  | hole to be filled | mark | expected logical segment
with normal data | with normal data  |      | with out-of-band data

For me this means that the third fd_set can't be used at all in a
select() that is meant to block, as such a select may return
immediately, and a subsequent blocking receive call may still block
(or a nonblocking one may still return with -1/EAGAIN (or
EWOULDBLOCK), resulting in spinning).

(The figure above is intended to display the in-line enqueueing of
out-of-band data, but it really doesn't matter now -- it would only
affect how that data would be available (or how that data would become
lost) once we got to the mark.)

My approach is to set SO_OOBINLINE. This allows me to work with the
first two sets only. Errors are returned with read()/write(). The
finalization of a pending connect() is signalled as writability, and
the result can be queried via SO_ERROR. The presence of an out-of-band
data *mark* without the OOB data itself won't wake the select(). When
woken and FD_ISSET() reports readability, out-of-band data is simply
checked for with sockatmark() before each read(). No read() will
coalesce normal and OOB data. If sockatmark() returns 1, the next byte
to read() is the urgent byte. (No MSG_OOB is needed.)

If the kernel processes multiple TCP headers with urgent pointers
before I get to call sockatmark(), each single urgent byte but the
last one will be inlined in the normal data stream. (I seem to
remember that I experimented with this on Linux, and this was the case
even with SO_OOBINLINE turned off. An urgent byte was only dropped if
SO_OOBINLINE was turned off and I actively read past the mark, before
the mark was moved. So I didn't really care about race conditions.)

(I'm not sure how one could do without SO_OOBINLINE. A TCP segment
carrying a single urgent byte wouldn't wake a select() that didn't
pass a third fd_set. On the other hand, a select() passing a third
fd_set could lead to indefinite spinning, or indefinite blocking.)

The problem with sockatmark() is that it was first introduced in
SUSv3. I wrote my port forwarder for SUSv1. I think I worked around it
by calling select() in a non-blocking manner right before and after
the read(), to see if there was an exceptional condition pending that
ceased due to the read(). (If the "before" condition was true due to
an error, then the "after" condition wasn't checked, because read()
returned that error first.) I chose this as the default workaround.

I (hopefully) implemented this before-after check "primitive" with
recv() too. Once the SO_OOBINLINE socket signalled readability, I
temporarily turned off SO_OOBINLINE, and tried to read a single byte
with recv(..., MSG_PEEK | MSG_OOB), then turned SO_OOBINLINE back on.
A successful receive meant "at the mark" (and left the urgent byte
over to the normal, subsequent read), -1/EINVAL meant "no mark", and
-1/EAGAIN (or EWOULDBLOCK) meant "mark nearby". (This was no problem,
because a before-after "mark nearby" -> "no mark" transition, due to
the normal, in-lined read(), was interpreted as "urgent data consumed"
just the same.)

Looking back, perhaps I could have found a third way: recvmsg() can
report on output, in msg_flags, whether it consumed out-of-band data.
Since OOB (logical) segments don't coalesce with normal (logical)
segments, recvmsg() would have consumed a sole byte at these times.
I'm not sure how I would have had to fiddle with SO_OOBINLINE, though.
I never considered SIOCATMARK.
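For illustration, the SO_OOBINLINE plus sockatmark() scheme described
above might be sketched as follows. This assumes a SUSv3 system (where
sockatmark() exists); the function names and the read-exactly-one-byte
policy at the mark are illustrative, not taken from forward3.c:

    #include <sys/socket.h>
    #include <unistd.h>

    /* Enable in-line delivery of out-of-band data; done once per
       socket, right after it is created or accepted. */
    static int enable_oobinline(int sock)
    {
        int one = 1;

        return setsockopt(sock, SOL_SOCKET, SO_OOBINLINE, &one,
                          sizeof one);
    }

    /* Read so that the urgent byte is never coalesced with preceding
       normal data: if the socket is at the mark, consume exactly one
       byte -- the urgent byte. *urgent tells the caller which case
       occurred. sockatmark() was introduced in SUSv3. */
    static ssize_t read_checking_mark(int sock, char *buf, size_t len,
                                      int *urgent)
    {
        int at_mark;

        at_mark = sockatmark(sock);
        if (at_mark == -1)
            return -1;
        *urgent = at_mark;
        if (at_mark)
            len = 1;
        return read(sock, buf, len);
    }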
(If anyone cares, the code is "forward3.c" under [2]; most recently,
it's been running on my workstation since Mar 25 to log SOAP.)

I would never touch out-of-band data again; not with a ten foot pole.

--o--

If we're already talking about what socket functions precisely do, can
anyone judge whether Solaris' and Linux' handling of accept() is POSIX
conformant when accept() fails with -1/EMFILE (or due to another
resource scarcity)? On Linux, such an event throws away the pending
connection (I think the peer gets a FIN instead of an RST because the
kernel pre-completes the handshake). On Solaris, the incoming
connection remains pending, and one must not FD_SET the listening
socket before the next close(), or else the select()-accept() loop
will spin.

Any constructive criticism is greatly appreciated.

</rant>

Cheers,
lacos

[0] http://www.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_10_11
[1] http://www.opengroup.org/onlinepubs/9699919799/functions/select.html
[2] http://lacos.hu
From: David Schwartz on 6 May 2010 23:37

On May 6, 3:06 pm, "Ersek, Laszlo" <la...(a)caesar.elte.hu> wrote:

> I would never touch out-of-band data again; not with a ten foot pole.

Agreed. It was a bad idea, poorly executed, that has festered since.

> If we're already talking about what socket functions precisely do,
> can anyone judge whether Solaris' and Linux' handling of accept() is
> POSIX conformant when accept() fails with -1/EMFILE (or due to
> another resource scarcity)? On Linux, such an event throws away the
> pending connection (I think the peer gets a FIN instead of an RST
> because the kernel pre-completes the handshake). On Solaris, the
> incoming connection remains pending, and one must not FD_SET the
> listening socket before the next close(), or else the
> select()-accept() loop will spin.

The 'select'/'accept' loop should spin in that case. This is why you
must perform operations that reduce resource consumption before
operations that increase them. And if you make no forward progress,
you must implement a rate-limiter. And if you detect resource
exhaustion, you must do something about it! If the implementation
tells you that you have too many open files, it is not sensible to
react by trying to open more files.

I find both behaviors sensible, FWIW. I like Solaris' behavior
because, knowing the behavior, it's easier to code around it sensibly.
Linux's behavior works better if you don't consider the case
specifically.

Generally, in the case where you can't accept any more incoming
connections, you need/want to close any attempts as quickly as
possible. Sensible clients will interpret this as an overload
condition.

DS
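One common way to implement "close any attempts as quickly as
possible" when descriptors run out is the reserve-descriptor trick --
a well-known idiom, not something from the posts above. This sketch
assumes spare_fd was opened at program startup:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/socket.h>

    static int spare_fd = -1;  /* at startup: open("/dev/null", O_RDONLY) */

    /* On EMFILE/ENFILE, momentarily give up the spare descriptor so
       the pending connection can be accepted and immediately closed;
       the client sees an orderly close instead of sitting in the
       backlog (which matters on Solaris, where the connection would
       otherwise remain pending). */
    static int accept_or_shed(int listener)
    {
        int fd;

        fd = accept(listener, NULL, NULL);
        if (fd == -1 && (errno == EMFILE || errno == ENFILE)
            && spare_fd != -1) {
            close(spare_fd);
            fd = accept(listener, NULL, NULL);
            if (fd != -1) {
                close(fd);          /* shed the connection */
                fd = -1;
                errno = EMFILE;     /* still report exhaustion */
            }
            spare_fd = open("/dev/null", O_RDONLY);
        }
        return fd;
    }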
From: Ersek, Laszlo on 7 May 2010 07:26
On Thu, 6 May 2010, David Schwartz wrote:

> Generally, in the case where you can't accept any more incoming
> connections, you need/want to close any attempts as quickly as possible.
> Sensible clients will interpret this as an overload condition.

Thank you for the advice.

I considered closing the server socket and setting it up again, so as
to "flush" all connection requests pending in the listen queue.
However, I was afraid that I might not be able to re-bind the same
local address (even with SO_REUSEADDR, another process might "steal"
the port), and then all would be lost.

I implemented a primitive "rate limiter": no more connections are
accepted until a living socket is closed. A client waiting for an
acknowledgement of its connect() may time out, but for the caliber at
hand this approach seemed workable.

How could the Solaris implementation be made to refuse new connections
while the "rate limiter" is in effect? Simply by setting up a low
backlog value with the initial listen()? Or by manipulating the
backlog dynamically with repeated listen() calls during peak loads?
(I don't know if that's possible at all. The listen() spec in SUSv4
[0] doesn't seem to disallow it or to define an error condition for
it. It could be an interesting experiment to see whether, with say 16
connections pending and waiting for an accept(), a listen(srv, 10)
would immediately reset the last six connection requests; in effect
flushing the listen queue and protecting further clients from
waiting.)

Thank you,
lacos

[0] http://www.opengroup.org/onlinepubs/9699919799/functions/listen.html
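For what it's worth, the "no new connections until one closes" limiter
described above might look like this inside a select() loop. Here
n_clients, max_clients and the clients[] array are hypothetical
bookkeeping maintained elsewhere in the loop:

    #include <sys/select.h>

    /* Build the read set for one select() round: the listening socket
       only enters the set while there is room for another client, so
       select() never wakes us for connections we cannot accept.
       Returns the nfds argument for select(). */
    static int build_read_set(fd_set *rfds, int listener,
                              const int *clients, int n_clients,
                              int max_clients)
    {
        int i, maxfd = -1;

        FD_ZERO(rfds);
        if (n_clients < max_clients) {
            FD_SET(listener, rfds);
            maxfd = listener;
        }
        for (i = 0; i < n_clients; i++) {
            FD_SET(clients[i], rfds);
            if (clients[i] > maxfd)
                maxfd = clients[i];
        }
        return maxfd + 1;
    }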