Using Hpricot to take steps to rid the world of abusive cut and paste

I’ve been working on a little script to pull data from the extremely cool and useful weather feeds from the NOAA. One XML element, <weather>, describes the general weather conditions with a short string like “Sunny” or “Thunderstorms”. I needed some sort of spec for this field: its maximum length and the possible strings. The closest thing I could find was this, which contains all the info I needed but in a completely useless format. In the good old days (two months ago, even), I would have taken half an hour to copy and paste all the text into a flat file, and then run some sort of macro over and over again to format it properly. No more.

Thanks to _why’s Hpricot, Ruby can do all the parsing and formatting for me – with very little code.

First, looking at the source of the document, I noticed that at least the formatting was consistent. All of the possible weather strings were within <td> tags with the class “graybackground” and were delimited by '|'. It’s this easy:

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.weather.gov/data/current_obs/weather.php"))

# Collect every weather string from the pipe-delimited table cells.
types = []
doc.search('td.graybackground').each do |el|
  types.concat el.inner_html.gsub(/<p.+p>/, '').split('|')
end
types.collect! { |t| t.strip }
# Print each non-empty string with its length, shortest first.
types.sort { |a, b| a.length <=> b.length }.each do |t|
  puts "#{t.length} : #{t}" unless t.empty?
end

to get this output:

3 : Fog
4 : Haze
4 : Fair
4 : Snow
4 : Sand
4 : Rain
4 : Hail
4 : Dust
5 : Smoke
5 : Windy
5 : Clear
7 : Drizzle
8 : Rain Fog
8 : Overcast
8 : Fog/Mist
8 : Snow Fog
9 : Rain Snow
9 : Snow Rain
10 : Sand Storm
10 : Heavy Snow
. . .
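
Since what I was really after was the maximum length and the set of possible strings, the clean-up step can also be sketched on its own, without the network or Hpricot. This is only a rough sketch: the weather_types helper and the sample cells are made up for illustration, and it swaps in a non-greedy regex plus uniq and reject, which the script above does not use.

```ruby
# Hypothetical helper (not part of the script above): turn raw cell
# contents into a de-duplicated list, shortest strings first.
def weather_types(cells)
  cells.flat_map { |cell| cell.gsub(/<p.+?p>/, '').split('|') }
       .map(&:strip)
       .reject(&:empty?)
       .uniq
       .sort_by { |t| [t.length, t] }
end

# Made-up sample standing in for the inner HTML of the table cells.
sample_cells = [
  "Fair | Clear | Sand Storm",
  "<p>Note</p>Fog/Mist | Fog | Fair ",
  " | Overcast"
]

types = weather_types(sample_cells)
max_length = types.map(&:length).max
puts "#{types.size} distinct strings, longest is #{max_length} characters"
types.each { |t| puts "#{t.length} : #{t}" }
```

Sorting on the pair [length, string] makes ties come out alphabetically, so the output is the same from run to run.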

No cutting and pasting repetitive text from the web ever again!

About

QuirkeyBlog is Aaron Quint's perspective on the ongoing adventure of Code, Life, Work and the Web.

For more information check Quirkey.com
