Prev: Threading with Activerecord
Next: Aliasing classes
From: Raul Parolari on 30 Jun 2010 10:52 I have some final results on the problem described. First, I must correct something on my initial post; I had stated that there was an occasional high delay between the moment of the 'select' call and the 'select' return (ie, although the timeout set in the select was 50msec, the delay could be of 5 seconds). Actually, tracing across all calls in that section of code, the delay occurs between the select return and the 'recvfrom'. The details of what follows may be of interest to anyone using Ruby for fast communication. Test environment: - pure Ruby 1.9.2 (no gems, just the 'socket' library) on an ubuntu machine (lots of memory) - Ruby sends 4 Udp msgs per second to a micro-controller - The micro (C/assembler) responds (Udp) within a 10-30 milliseconds range - So it is 4 msgs sent and 4 responses rcvd every second This is what I saw since midnight in one of the systems (the symbol '<->' means 1 msg sent and response; the symbol '!!' was inserted to grep all abnormal results): # Time as Hour:Min:Sec:Msec; the 'delay_sel_rcv' (the time between return of 'select' and 'recvfrom') value is in Seconds # log from midnight; all perfect until 1:21 am 01:21:19:914: <->: !! delay_sel_rcv=10.006525661 01:21:29:928: <->: !! delay_sel_rcv=10.010217133 01:21:39:937: <->: !! delay_sel_rcv=10.004327574 01:21:49:954: <->: !! delay_sel_rcv=10.011541082 01:21:59:972: <->: !! delay_sel_rcv=10.005877574 01:22:05:973: <->: !! delay_sel_rcv= 5.998151639 # then all ok unti: 02:22:27:374: <->: !! delay_sel_rcv=10.008022384 02:22:37:394: <->: !! delay_sel_rcv=10.008430684 02:22:47:401: <->: !! delay_sel_rcv=10.004019076 02:22:57:409: <->: !! delay_sel_rcv=10.005836859 02:23:07:580: <->: !! delay_sel_rcv=10.008476556 02:23:17:610: <->: !! delay_sel_rcv=10.007506338 02:23:27:642: <->: !! delay_sel_rcv=10.007311141 02:23:37:655: <->: !! delay_sel_rcv=10.008225368 02:23:47:685: <->: !! delay_sel_rcv=10.018187389 # then all ok until 04:24:08:873: <->: !! delay_sel_rcv=10.006355125 ... We can see from the above: - the first 80 minutes (from midnight to 01:21) went fine - then we see several delays of 10 seconds, in the same minute (each 10 seconds apart from the other) - for 1 hour all was pefect again, exchanging some 12,000 messages with perfect timing - then we have 9 delays of 10 seconds (again separated by 10 seconds) - for 1 hour all went fine again; then the cycle repeats This pattern can only indicate (in my view) the garbage collector, which Ruby seems to run for 10 seconds several times in the same minute or so. I could not put the calls to GC.disable/enable (to have the final proof), around the select/recvfrom (not to interfere with a real experiment that was moving heavy equipment). Notice that, if it is the GC, disabling/enabling it will only shift the problem from one area of the communication handler to another (and thus having a similar impact on the applications using the comm handler). Interestingly, this problem does not happen within 1 computer; I used the identical Ruby program but replacing the Firmware with a Ruby simulator (with same machines, same Udp and the same binary strings exchanged); in a test of 10 hours, I only saw occasional "delays" betwen select and recvfrom, but in the order of 100 milliseconds, and never of 5/10 seconds. This would seem to indicate an inefficiency in the Udp stack (when used across computers). My conclusion is that if you want a predictable delay (with values spread across a 'tight bell' curve, not just increasing the timeout to cope with 'everything'), you must use (for that section of the software) a compiled language; at least until the technology of garbage collector changes. I hope that this is useful to others who use Ruby for high speed communication (and the ones working on garbage collectors). -- Last note: one year ago I met in a party a JPL engineer working on the Mars exploration program; he was admirative of Ruby, but after some jokes on the expressivity and beauty in old and new languages, he added that they would never use scripting languages because "we don't want the garbage collector to enter in action just when we should to begin to slow down the spacecraft near Mars and miss the landing! in fact, we don't even use C++, as we did not find it totally predictable; so we will still use C for years to come". I never knew how well I would learn to appreciate his point Raul Parolari raulparolari(a)gmail.com -- Posted via http://www.ruby-forum.com/.
From: Roger Pack on 30 Jun 2010 11:32 > - for 1 hour all went fine again; then the cycle repeats > > This pattern can only indicate (in my view) the garbage collector, which > Ruby seems to run for 10 seconds several times in the same minute or so. You can test your hypothesis by running it against jruby with a "constant time" GC. If you're using 1.9 you could also test your hypothesis by using the GC::Profiler http://en.wikibooks.org/wiki/Ruby_Programming/Reference/Objects/IO/GC/Profiler Also the "exact" 5 second delays makes me wonder if it isn't a DNS lookup. Try your code with BasicSocket.do_not_reverse_lookup=true -r -- Posted via http://www.ruby-forum.com/.
From: Raul Parolari on 30 Jun 2010 13:40 Roger Pack wrote: > You can test your hypothesis by running it against jruby with a > "constant time" GC. > If you're using 1.9 you could also test your hypothesis by using the > GC::Profiler > http://en.wikibooks.org/wiki/Ruby_Programming/Reference/Objects/IO/GC/Profiler > > Also the "exact" 5 second delays makes me wonder if it isn't a DNS > lookup. > > Try your code with > BasicSocket.do_not_reverse_lookup=true Three observations: a) I do not think that there can be a DNS lookup between a 'select(x, y, z)' and a 'socket.recvfrom'. I also clarify that the 2 statements are almost sequential; after the select, you verify that the there is a 'read condition' and then you issue the 'recvfrom'. The whole section takes less than 10 lines of code (with printouts, etc), and it should not generate any external query. b) for the same reason I did not need a profiler (in a real system moving machinery) to evaluate the time spent between those 2 statements. Just computing the difference between the 2 times ('Time.now' when select and recvfrom return), and then printing when the difference was > a threshold was enough. As I said, this happened regularly (after many thousands of messages were exchanged with perfect timing), producing several 'processing gaps' of either 5 or 10 seconds every hour (approx). This occurred only when UDP was used over Ethernet between 2 machines (with the apps inside the same computer I never saw the problem). c) I could not try jruby. Although Java may not be immune to perhaps similar problems (see problems found in the Android phone). My conclusion is that dynamic allocation and deallocation of memory for tasks that occur every 100 msec or so is to be minimized (if possible; else, move that task to a C program). Still, more fundamentally, I think that we need a different garbage collector architecture for Ruby. All the games disabling/enabling Gc, tinkering, restructuring code, etc, are not the solution (for me). -- Posted via http://www.ruby-forum.com/.
From: Roger Pack on 30 Jun 2010 16:26 >> Try your code with >> BasicSocket.do_not_reverse_lookup=true > > Three observations: > > a) I do not think that there can be a DNS lookup between a 'select(x, y, > z)' and a 'socket.recvfrom'. It might, see http://www.ensta.fr/~diam/ruby/online/ruby-doc-stdlib/libdoc/socket/rdoc/classes/Socket.html#M001258 Why don't you try it and find out? > I could not try jruby. Did it not install? Also did you try 1.9.1 with it's GC::Profiler? -r -- Posted via http://www.ruby-forum.com/.
From: Joel VanderWerf on 30 Jun 2010 17:03
Roger Pack wrote: >>> Try your code with >>> BasicSocket.do_not_reverse_lookup=true >> Three observations: >> >> a) I do not think that there can be a DNS lookup between a 'select(x, y, >> z)' and a 'socket.recvfrom'. > > It might, see > http://www.ensta.fr/~diam/ruby/online/ruby-doc-stdlib/libdoc/socket/rdoc/classes/Socket.html#M001258 > > Why don't you try it and find out? Seconded: it's worth a try. Also, if you really believe this is GC-related, you can prove it by GC.disable before the select, GC.enable after the recvfrom. It seems highly unlikely that GC would be eating up 10 seconds. |