Sunday, November 12, 2006

WWW::Mechanize get current page

I have always done a lot of screen scraping, and in whatever language I was working with I would typically build a framework to get the job done. I built one for Java, which was a nightmare. Next I created one in PHP. It was a lot simpler, but it took too much time to really do right. When I moved to Ruby I was surprised to find the WWW::Mechanize library. It did everything I had been building into those other frameworks. The nice thing about Mechanize is that it takes care of following redirects and parses the HTML into an easy-to-follow structure. Something I always built into my frameworks was the ability to pseudo-submit forms on the page, typically in this form (PHP example):

$cForm = $page->forms[2];
$cForm->login = 'bob';
$cForm->password = 'testpass';
$cForm->submit();

You can do very similar things in Mechanize, but the thing that stumped me for too long was how to get the current URL of the page. It turns out it isn't that hard, but it is poorly documented.

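(Ruby example)

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)'
page = agent.get('http://example.com/')  # placeholder URL; fetch the page that holds the form
form = page.forms[1]                      # second form on the page
form.fields.find { |f| f.name == 'location' }.value = 'MT'
page1 = agent.submit(form, form.buttons.first)
agent.page.uri.to_s                       # URL the agent ended up at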

Note the last line. It should always return the URL of the page the agent is "at" in the browser paradigm, even after multiple redirects.
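
To make "the page the agent is at" a bit more concrete, here is a rough sketch of the redirect bookkeeping I used to hand-roll in my own frameworks. It is not Mechanize's code, just plain Net::HTTP from the standard library. The names final_uri, max_hops and fetch are my own inventions for illustration, and fetch only exists so the request step can be swapped out.

```ruby
require 'net/http'
require 'uri'

# Follow redirects by hand and return the URI we finally land on.
# NOTE: final_uri, max_hops and fetch are hypothetical names for this sketch;
# none of them are part of Mechanize.
def final_uri(start_url, max_hops: 10, fetch: Net::HTTP.method(:get_response))
  uri = URI.parse(start_url)
  max_hops.times do
    response = fetch.call(uri)
    location = response['location']
    # A non-3xx response (or a 3xx without a Location header) means we have arrived.
    return uri unless response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the URI we just requested.
    uri = URI.join(uri.to_s, location)
  end
  raise "gave up after #{max_hops} redirects"
end
```

Calling final_uri('http://example.com/').to_s plays roughly the same role as agent.page.uri.to_s above: whatever hops happened along the way, you get back the address you ended up at. Mechanize does this tracking for you, which is exactly why I stopped writing it myself.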
