Using Hpricot to take steps to rid the world of abusive cut and paste

I’ve been working on a little script to pull data from the extremely cool and useful weather feeds from the NOAA. One XML element, <weather>, describes the general weather conditions with a short string like “Sunny” or “Thunderstorms”. I needed some sort of spec for this field: what its maximum length was and what the possible strings were. The closest thing I could find was this, which contained all the info I needed but in a completely useless format. In the good old days (two months ago, even), I would have taken half an hour to copy and paste all the text into a flat file and then run some sort of macro over and over again to format it properly. No more.

Thanks to _why’s Hpricot, Ruby can do all the parsing and formatting for me, with very little code.

First, looking at the source of the document, I noticed that at least the formatting was consistent. All of the possible weather fields were within <td> tags with the class “graybackground” and were delimited by ‘|’. It’s this easy:

require 'rubygems'
require 'hpricot'
require 'open-uri'

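# Fetch and parse the NOAA page that lists the possible weather strings
doc = Hpricot(open("http://www.weather.gov/data/current_obs/weather.php"))

types = []
# Each matching cell holds several weather strings separated by '|'
doc.search('td.graybackground').each do |el|
  # Drop any <p>...</p> markup before splitting the cell on '|'
  types.concat el.inner_html.gsub(/<p.+p>/,'').split('|')
end
types.collect! { |t| t.strip }
# Print each non-empty string with its length, shortest first
types.sort { |a,b| a.length <=> b.length }.each do |t|
  puts t.length.to_s << " : " << t << "\n" unless t.empty?
end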

to get this output:

3 : Fog
4 : Haze
4 : Fair
4 : Snow
4 : Sand
4 : Rain
4 : Hail
4 : Dust
5 : Smoke
5 : Windy
5 : Clear
7 : Drizzle
8 : Rain Fog
8 : Overcast
8 : Fog/Mist
8 : Snow Fog
9 : Rain Snow
9 : Snow Rain
10 : Sand Storm
10 : Heavy Snow
. . .
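
Since the whole point was to find the maximum length and the set of possible strings, here is a rough sketch of how the same split-and-strip step could answer that directly. It runs on a made-up sample instead of the live page: the `cells` array and the `weather_types` helper are hypothetical names of mine, and the sample markup is invented (only the weather strings come from the output above).

```ruby
# Hypothetical sample standing in for the inner_html of two table cells.
# The weather strings appear in the output above; the markup is made up.
cells = [
  "<p>Header</p>Fog | Haze | Heavy Snow",
  "Rain Fog | Overcast | Fog | "
]

# Hypothetical helper: same gsub/split/strip steps as the script above,
# plus dropping empty and duplicate entries.
def weather_types(cells)
  cells
    .flat_map { |html| html.gsub(/<p.+p>/, '').split('|') }
    .map(&:strip)
    .reject(&:empty?)
    .uniq
end

types = weather_types(cells)
longest = types.max_by(&:length)
puts "#{types.size} distinct strings, longest is #{longest.inspect} (#{longest.length} chars)"
```

With the real page you would presumably feed it the `inner_html` of each matching cell instead of the sample array.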

No cutting and pasting repetitive text from the web ever again!

About

QuirkeyBlog is Aaron Quint's perspective on the ongoing adventure of Code, Life, Work and the Web.
