From: Derek Cannon on 18 Apr 2010 06:17

Sure, I'll post HTML examples. In this non-simplified version, there are 20 columns per row, which are:

availability, course_reference_number, subject, course_number, section, campus, credit_hours, title, days, time, cap, registered, remaining, xl_cap, xl_registered, xl_remaining, professor, date, location, attributes

Here are three specific examples of the HTML that cover all the possibilities (normal class, course with TBA day, and labs):

<!--1. Normal class-->
<TR>
<TD class="dddefault"><ABBR title="Not available for registration">NR</ABBR></TD>
<TD class="dddefault"><A href="https://ggc.gabest.usg.edu/pls/B400/bwckschd.p_disp_listcrse?term_in=201008&subj_in=ACCT&crse_in=2101&crn_in=80983" onmouseover="window.status='Detail'; return true" onfocus="window.status='Detail'; return true" onmouseout="window.status=''; return true" onblur="window.status=''; return true">80983</A></TD>
<TD class="dddefault">ACCT</TD>
<TD class="dddefault">2101</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">3.000</TD>
<TD class="dddefault">Intro to Financial Accounting</TD>
<TD class="dddefault">MW</TD>
<TD class="dddefault">08:00 am-09:15 am</TD>
<TD class="dddefault">30</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">30</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1880</TD>
<TD class="dddefault"> </TD>
</TR>
<TR>

<!--2. No day/time separation due to class days being "TBA":-->
<TR>
<TD class="dddefault"><ABBR title="Closed">C</ABBR></TD>
<TD class="dddefault"><A href="https://ggc.gabest.usg.edu/pls/B400/bwckschd.p_disp_listcrse?term_in=201008&subj_in=BUSA&crse_in=4700&crn_in=81085" onmouseover="window.status='Detail'; return true" onfocus="window.status='Detail'; return true" onmouseout="window.status=''; return true" onblur="window.status=''; return true">81085</A></TD>
<TD class="dddefault">BUSA</TD>
<TD class="dddefault">4700</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">3.000</TD>
<TD class="dddefault">Selected Topics in Business</TD>
<TD colspan="2" class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault"> </TD>
</TR>

<!--3. Class with lab. First row is class, second row is lab details -->
<TR>
<TD class="dddefault"><ABBR title="Not available for registration">NR</ABBR></TD>
<TD class="dddefault"><A href="https://ggc.gabest.usg.edu/pls/B400/bwckschd.p_disp_listcrse?term_in=201008&subj_in=CHEM&crse_in=1151K&crn_in=80073" onmouseover="window.status='Detail'; return true" onfocus="window.status='Detail'; return true" onmouseout="window.status=''; return true" onblur="window.status=''; return true">80073</A></TD>
<TD class="dddefault">CHEM</TD>
<TD class="dddefault">1151K</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">4.000</TD>
<TD class="dddefault">Survey of Chemistry I w/Lab</TD>
<TD class="dddefault">MF</TD>
<TD class="dddefault">11:00 am-12:15 pm</TD>
<TD class="dddefault">20</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">20</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">David Pursell (<ABBR title="Primary">P</ABBR>)</TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1400</TD>
<TD class="dddefault"> </TD>
</TR>
<TR>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault">W</TD>
<TD class="dddefault">11:00 am-01:45 pm</TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault">David Pursell (<ABBR title="Primary">P</ABBR>)</TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1290</TD>
<TD class="dddefault"> </TD>
</TR>

-- 
Posted via http://www.ruby-forum.com/.
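The three row shapes above could be normalized before any further processing. What follows is only a minimal sketch, not code from the thread: it assumes each row has already been reduced to [text, colspan] pairs (with Nokogiri, roughly cell.text.strip and cell['colspan']), and the helper names expand_cells and lab_row? are hypothetical.

```ruby
# Column names exactly as listed in the post above.
COLUMNS = %w[
  availability course_reference_number subject course_number section campus
  credit_hours title days time cap registered remaining xl_cap xl_registered
  xl_remaining professor date location attributes
].freeze

# Hypothetical helper: repeat each cell's text once per column it spans,
# so a colspan="2" TBA cell fills both the days and the time column.
def expand_cells(cells)
  cells.flat_map { |text, colspan| [text] * (colspan || 1) }
end

# Hypothetical helper: a lab row leaves the first eight columns blank.
def lab_row?(row)
  row.size == COLUMNS.size && row.first(8).all?(&:empty?)
end

# Example 2 from the post (the TBA course), as [text, colspan] pairs.
tba_cells = [
  ["C", nil], ["81085", nil], ["BUSA", nil], ["4700", nil], ["01", nil],
  ["A", nil], ["3.000", nil], ["Selected Topics in Business", nil],
  ["TBA", 2]
] + [["0", nil]] * 6 + [
  ["TBA", nil], ["08/23-12/09", nil], ["TBA", nil], ["", nil]
]

row    = expand_cells(tba_cells)   # 19 cells become 20 columns
course = COLUMNS.zip(row).to_h     # column name => cell text
```

With every row padded to 20 columns, the same column names apply to all three cases, and lab_row? can flag the second row of a class-with-lab pair.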
From: Derek Cannon on 18 Apr 2010 06:21

> There are lots of ways to identify more precisely which part of the HTML
> you want, using CSS selectors. Most easily, if the rows are inside
> <table id='courses'> then something like 'table#courses tr' could do it.

Since my original post, I've been playing around with the code some more. I made a new way of getting courses that automatically filters out "non-course" rows. The code is:

    table = doc.css("tr").collect { |row|
      row.css(".dddefault").collect { |column|
        column.text.strip
      }
    }

This way, "non-courses" appear as empty arrays. I still don't know how to neatly get rid of the empty arrays... I tried .compact! but that doesn't seem to work.

> doc = Nokogiri::HTML(open(url))
> raw_course_list = doc.css("tr").collect { |row|
>   t_row = row.css("td").collect { |column| column.text.strip }
>   t_row.insert(2, "") if (t_row[1] == "TBA")
> }.reject{ |i| i.size != 4 }

Excellent example, I think this is much better than what I had earlier. I guess I could now replace your reject with i.empty?, right?

PS - I changed raw_course_list to table to make it more readable.
From: Ehsanul Hoque on 18 Apr 2010 07:42

> This way, "non-courses" appear as empty arrays. I still don't know how
> to neatly get rid of the empty arrays... I tried .compact! but that
> doesn't seem to work.

Try #flatten! instead. #compact! just gets rid of nil entries, and an empty array is not the same as nil.

- Ehsan
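A small sketch of the difference, using made-up sample rows rather than the real page. compact leaves empty arrays alone, flatten removes them but also merges all remaining rows into one flat array, and reject with empty? drops only the empty rows, which is probably what is wanted when the row structure has to survive. The helper name drop_empty_rows is hypothetical.

```ruby
# Hypothetical helper: keep every non-empty row, leave the rows themselves intact.
def drop_empty_rows(table)
  table.reject(&:empty?)
end

# Sample data shaped like the scraped table: empty arrays are "non-course" rows.
table = [[], ["ACCT", "2101"], [], ["BUSA", "4700"]]

compacted = table.compact           # no nil entries, so nothing is removed
flattened = table.flatten           # empty rows vanish, but the rows are merged
kept      = drop_empty_rows(table)  # only the empty rows are dropped
```

The in-place variant would be table.reject!(&:empty?) or table.delete_if(&:empty?).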
From: David A. Black on 18 Apr 2010 10:23

Hi --

On Sun, 18 Apr 2010, Derek Cannon wrote:

>> There are lots of ways to identify more precisely which part of the HTML
>> you want, using CSS selectors. Most easily, if the rows are inside
>> <table id='courses'> then something like 'table#courses tr' could do it.
>
> Since my original post, I've been playing around with the code some
> more. I made a new way of getting courses that automatically filters out
> "non-course" rows. The code is:
>
> table = doc.css("tr").collect { |row|
>   row.css(".dddefault").collect { |column|
>     column.text.strip
>   }
> }
>
> This way, "non-courses" appear as empty arrays. I still don't know how
> to neatly get rid of the empty arrays... I tried .compact! but that
> doesn't seem to work.
>
>> doc = Nokogiri::HTML(open(url))
>> raw_course_list = doc.css("tr").collect { |row|
>>   t_row = row.css("td").collect { |column| column.text.strip }
>>   t_row.insert(2, "") if (t_row[1] == "TBA")
>> }.reject{ |i| i.size != 4 }
>
> Excellent example, I think this is much better than what I had earlier.
> I guess I could now replace your reject with i.empty?, right?
>
> PS - I changed raw_course_list to table to make it more readable.

I think having the condition and the reject as the last things in the code is going to make it hard to follow later. I wouldn't rule out doing something a tiny bit more procedural but maybe a little easier to parse visually, like this:

    doc = Nokogiri::HTML(open(url))
    table = []
    doc.css("tr").each do |row|
      cells = row.css("td").map {|cell| cell.text.strip }
      # Pad "TBA" rows first so that they reach four cells,
      # then skip anything that is not a four-cell course row.
      cells.insert(2, "") if cells[1] == "TBA"
      next unless cells.size == 4
      table << cells
    end

You could also extract some methods, and end up with something like:

    table = doc.css("tr").
      select {|row| valid_row?(row) }.
      map {|row| prepare_row(row) }

(The above is all untested.)

David

-- 
David A. Black, Senior Developer, Cyrus Innovation Inc.
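The extracted methods are left undefined in the post, so here is one possible sketch of them. To keep it runnable without Nokogiri, these hypothetical versions of valid_row? and prepare_row take arrays of stripped cell text instead of row nodes, and the four-column layout (title, days, time, professor) is an assumption about the simplified table, not something the thread spells out.

```ruby
# Hypothetical helper: a "TBA" days cell spans two columns in the HTML,
# so the row arrives one cell short; add the missing (empty) time cell.
def prepare_row(cells)
  cells[1] == "TBA" ? cells.dup.insert(2, "") : cells
end

# Hypothetical helper: a course row has exactly four cells once padded.
def valid_row?(cells)
  prepare_row(cells).size == 4
end

rows = [
  [],                                                                   # header or spacer row
  ["Intro to Financial Accounting", "MW", "08:00 am-09:15 am", "TBA"],  # normal row
  ["Selected Topics in Business", "TBA", "TBA"]                         # colspan row: three cells
]

table = rows.
  select { |cells| valid_row?(cells) }.
  map    { |cells| prepare_row(cells) }
```

Putting the padding rule in one place means select and map cannot disagree about what a valid row looks like, at the cost of padding each kept row twice.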
From: Phrogz on 18 Apr 2010 13:49
On Apr 18, 1:23 am, Derek Cannon <novelltermina...(a)gmail.com> wrote:
> [...]
> Earlier, someone on the forum showed me a very elegant way to collect
> this information (I use Nokogiri). It was:
>
> doc = Nokogiri::HTML(open(url))
>
> raw_course_list = doc.css("tr").collect { |row|
>   row.css("td").collect { |column|
>     column.text.strip
>   }
> }
> [...]
> This works perfectly, except in 3 main cases.
>
> *** Problem 1: The <tr> does not contain course information. (It's some
> irrelevant part of the HTML). In this case, I did the following:
> raw_course_data.reject! { |i| i.size != 4 }, which filtered out
> non-courses. Note: no tables without course data had the size of one
> with course data (in the non-simplified version, the size is actually
> much larger).
>
> So, already I think it's ugly coding! It first loads ALL <tr> contents
> into arrays, then rejects them after creation.
> [...]

Generalized, you have an array of values and you want to map a subset of them to a new array. There are (at least) four patterns you can use to handle this sort of situation:

1) Map the unwanted elements to a 'broken' value and then reject the broken values later. (What you are doing now.) This can be hard if you don't have a way of creating a broken value. For example, you might be mapping all values directly to an object, but you don't have enough information for the object constructor and no way of making up clearly spurious values. Further, it's inefficient, as you do the work and use the memory of creating the object only to throw it out later.

2) Map the unwanted elements to nil and compact the array afterwards. In your case, you'd need to look at the TDs in your row and decide whether to map the row to its mapped cells or to nil. This is convenient in terms of one-liners, but still slightly inefficient because you're creating an intermediary array packed with nils that you don't want. (You should be clear, though, that computational efficiency is not always more important than programmer convenience or code clarity.)

3) Instead of using map (or the same effect under the longer name 'collect', as Robert apparently likes) to create a new array from your original, explicitly create the new array and push values only when they are valid. This is basically the same as above, but without the nil values and the later compact. For example:

    raw_course_list = []
    doc.css("tr").each { |row|
      tds = row.css("td")
      if tds.have_the_values_I_want
        raw_course_list << tds.map{ |col| ... }
      end
    }

4) Use map (collect) on the array as in #1 or #2, but before that do a pass through your source array and sanitize it. Sanitization might be mapping values to nil and then compacting (thus very similar to #2), or fixing values (as in your TBA or continued description case). This feels cleaner, but note that this has you doing one (or two, in the case of map+compact) passes on your data before you get around to mapping it.

Here's (very roughly) what I might do given what you wrote:

    # Assuming you're using Ruby 1.9
    course_info = []
    trs = doc.css('tr')
    trs.each.with_index{ |row,i|
      tds = row.css('td')
      title = ...
      prof = ...
      days = ...
      times = ...
      desc = ...
      next_row = trs[i+1]
      if next_row && next_row.is_a_continuation?
        # Add content from next_row to description
        # If needed, invalidate next_row so it will be skipped
      elsif title && prof && days
        # If you have all the information you need
        course_info << Course.new( title, prof, days )
      end
    }

Regardless of the approach you use, remember that even though you're annoyed that you are 'processing' (in one form or another) invalid entries, you have to touch every row to find out if you like it or not. It's up to you how you detect which rows are invalid and how you handle them.
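Pattern 3 combined with the continuation-row idea could look roughly like the sketch below. It is not the code from the post: it works on plain arrays of cell text instead of Nokogiri nodes, assumes a simplified four-column layout (title, days, time, professor), and attaches a continuation row to the previous course rather than peeking ahead at the next row. Course, continuation? and build_courses are hypothetical names.

```ruby
# Hypothetical value object: one course with all of its meeting times.
Course = Struct.new(:title, :prof, :meetings)

# Hypothetical check: a continuation (lab) row has a blank title cell
# but still carries a days value.
def continuation?(title, days)
  title.to_s.empty? && !days.to_s.empty?
end

# Pattern 3: build the result explicitly and push only valid rows.
def build_courses(rows)
  courses = []
  rows.each do |title, days, time, prof|
    if continuation?(title, days)
      # Attach the extra meeting to the course that came just before it.
      courses.last.meetings << [days, time] if courses.last
    elsif title && days && prof
      courses << Course.new(title, prof, [[days, time]])
    end
    # Anything else (empty or partial rows) is skipped.
  end
  courses
end

rows = [
  [],
  ["Survey of Chemistry I w/Lab", "MF", "11:00 am-12:15 pm", "David Pursell (P)"],
  ["", "W", "11:00 am-01:45 pm", "David Pursell (P)"],
  ["Intro to Financial Accounting", "MW", "08:00 am-09:15 am", "TBA"]
]

courses = build_courses(rows)
```

Looking back at courses.last avoids having to mark the next row as consumed, but it does assume that a continuation row always directly follows the course it belongs to.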