From: Derek Cannon on 18 Apr 2010 03:23

Hello everyone. It's me: Derek, again! Sorry for writing a novel here, but I'd really appreciate some help.

I'm still working on the same program -- a way to show valid course combinations for my school schedule, using an HTML file that contains all the courses for the semester.

I have a rough draft of it working, but I'd like to see an example of a more elegant coding style than my own.

Here's a (simplified) example of the data I'm working with:

    <tr>
      <td>Intro to Programming</td>  # title
      <td>MW</td>                    # days
      <td>9:00am-10:30am</td>        # time
      <td>Dr. Smith</td>             # professor
    </tr>
    <tr>
      <td>Intro to Knitting</td>
      <td>TR</td>
      <td>9:00am-10:30am</td>
      <td>Dr. Mittens</td>
    </tr>

Earlier, someone on the forum showed me a very elegant way to collect this information (I use Nokogiri). It was:

    doc = Nokogiri::HTML(open(url))

    raw_course_list = doc.css("tr").collect { |row|
      row.css("td").collect { |column|
        column.text.strip
      }
    }

This gives me an array of arrays in the format [[courseA, data, data], [courseB, data, data]].

E.g., in this case it would yield:

    [["Intro to Programming", "MW", "9:00am-10:30am", "Dr. Smith"],
     ["Intro to Knitting", "TR", "9:00am-10:30am", "Dr. Mittens"]]

This works perfectly, except in 3 main cases.

*** Problem 1: The <tr> does not contain course information. (It's some irrelevant part of the HTML.) In this case, I did the following:

    raw_course_list.reject! { |i| i.size != 4 }

which filters out non-courses. Note: no row without course data has the same size as a row with course data (in the non-simplified version, the size is actually much larger).

So, already I think it's ugly coding! It first loads ALL <tr> contents into arrays, then rejects them after creation.

*** Problem 2: In a few cases, some courses do not have specified days and times yet. In those cases, the course's days column reads "TBA" (to be announced), and there is no column for time. Thus, the array for such a course is 1 element shorter than in the normal case.

E.g.:

    <tr>
      <td>Algebra</td>
      <td>TBA</td>             # notice there is only 1 <td> for day/time now
      <td>Dr. Calculator</td>
    </tr>

Thus, I make Ruby go back over the elements of raw_course_list ANOTHER time. This code is placed right before problem 1's fix:

    raw_course_list.each { |i|
      if i.size == 3
        i.insert(2, "")
      end
    }

So again, if an array has a size of 3, I figure it's a valid course, just with no time assigned, so I insert a blank element between the day and professor, just to satisfy the Course class, which the inner arrays will ultimately become. E.g. of call: Course.new(title, day, time, professor)

*** Problem 3: Some rows of the HTML are actually a continued description of the course in the row above. For example, a course that has a lab might look like this:

    <tr>
      <td>Chemistry /w Lab</td>
      <td>TR</td>
      <td>9:00am-10:30am</td>
      <td>Dr. Chemicals</td>
    </tr>
    <tr>
      <td></td>                  # Empty, since the above row provides the course name
      <td>R</td>                 # Day of the lab
      <td>11:00am-12:30pm</td>   # Time of the lab
      <td>Dr. Chemicals</td>     # Lab professor
    </tr>

The good news is it's the same length as a normal class. So for this, I add a bit more code to problem 2's code (the each block), changing the each method to .each_with_index:

    raw_course_list.each_with_index { |i, index|
      if i.size == 3
        i.insert(2, "")
      end

      # NEW CODE FOR LABS (still working out the kinks, but hopefully I won't need this)
      # lab will always have a size of 4 and an empty first element so:
      if i[0].empty?
        # add all the data from the lab to the previous course:
        raw_course_list[index-1].push(i.each { |element| element })

        # then remove lab from raw_course_list
        raw_course_list.pop(index)

        # index has to go back one to avoid skipping an element (since we popped one)
        index -= 1
      end
    }

==================================

So there you have it. Can anyone think of a way I can improve the quality and elegance of this code?
--
Posted via http://www.ruby-forum.com/.
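The three fixes described in this post can also be folded into a single pass over the rows. What follows is only a sketch, assuming the simplified four-column layout from the post; `normalize_rows` and `COLUMNS` are hypothetical names, not part of the original program. Two details of the lab snippet motivate the approach: `Array#pop(n)` removes the last n elements rather than the element at index n, and reassigning `index` inside the block has no effect on the iteration. The sketch therefore builds a new array instead of mutating the one being iterated.

```ruby
COLUMNS = 4 # assumed width of a course row in the simplified example

# Hypothetical helper: takes the array of arrays produced by the scrape
# and returns cleaned course rows in one pass, leaving the input untouched.
def normalize_rows(raw_rows)
  raw_rows.each_with_object([]) do |row, courses|
    row = row.dup

    # Problem 2: rows without a time column are one short, so pad with "".
    row.insert(2, "") if row.size == COLUMNS - 1

    # Problem 1: anything that is still not course-sized is unrelated markup.
    next unless row.size == COLUMNS

    if row[0].empty? && !courses.empty?
      # Problem 3: a lab row continues the course before it, so append
      # its day, time and professor to that course.
      courses.last.concat(row[1..-1])
    else
      courses << row
    end
  end
end

raw_course_list = [
  ["Some page header"],
  ["Intro to Programming", "MW", "9:00am-10:30am", "Dr. Smith"],
  ["Algebra", "TBA", "Dr. Calculator"],
  ["Chemistry /w Lab", "TR", "9:00am-10:30am", "Dr. Chemicals"],
  ["", "R", "11:00am-12:30pm", "Dr. Chemicals"]
]

courses = normalize_rows(raw_course_list)
courses.each { |course| p course }
```

Appending the lab fields flat onto the course row is a design choice made for this sketch; the original snippet pushes the lab in as a nested array, and either shape would need matching support in the Course class.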
From: Elliot Crosby-McCullough on 18 Apr 2010 03:48

Do you have any control over the HTML at all? Some semantic HTML classes would go a long way to simplifying this.
From: Derek Cannon on 18 Apr 2010 03:56

> Do you have any control over the HTML at all? Some semantic HTML
> classes would go a long way to simplifying this.

The HTML comes from a page on my school's website :(
From: Brian Candler on 18 Apr 2010 04:59

Derek Cannon wrote:
> Here's a (simplified) example of the data I'm working with:

Can you post a complete sample page somewhere?

There are lots of ways to identify more precisely which part of the HTML you want, using CSS selectors. Most easily, if the rows are inside <table id='courses'> then something like 'table#courses tr' could do it. But otherwise, you can select the table based on its location in the page relative to other elements (nth, nth-child).
From: gf on 18 Apr 2010 05:44
For your contemplation:

    doc = Nokogiri::HTML(open(url))

    raw_course_list = doc.css("tr").collect { |row|
      t_row = row.css("td").collect { |column| column.text.strip }
      # Pad rows whose day/time is still "TBA" with an empty time column.
      t_row.insert(2, "") if t_row[1] == "TBA"
      t_row  # return the row itself; a false `if` modifier evaluates to nil
    }.reject { |i| i.size != 4 }

This isn't guaranteed to work because we're dealing with only pieces of the HTML you are parsing. Give us a sample with ALL the variations in it and we can stop shooting in the dark.
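One detail worth checking in a chain like this: a statement with an `if` modifier evaluates to nil when its condition is false, so if it is the last expression in a `collect` block, every row that fails the condition becomes nil. A small sketch on plain arrays shows the difference; the `rows` data is made up for illustration.

```ruby
rows = [
  ["Intro to Programming", "MW", "9:00am-10:30am", "Dr. Smith"],
  ["Algebra", "TBA", "Dr. Calculator"],
  ["Unrelated header"]
]

# Conditional insert as the block's last expression: non-TBA rows become nil.
lossy = rows.map { |row| row.dup.insert(2, "") if row[1] == "TBA" }

# Returning the row explicitly keeps every row, padded or not.
padded = rows.map do |row|
  row = row.dup
  row.insert(2, "") if row[1] == "TBA"
  row
end

# Same filter as reject { |i| i.size != 4 }, written positively.
courses = padded.select { |row| row.size == 4 }

p lossy.count(&:nil?)
p courses
```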