WWW::Mechanize get current page
Sunday, November 12, 2006

So I have always done a lot of screen scraping, and in whatever language I was working in at the time I would build a framework to get the job done. I built one for Java, which was a nightmare. Next I created one in PHP. It was a lot simpler, but it took too much time to really do right. When I moved to Ruby I was surprised to find the WWW::Mechanize library. It did everything I had been building into those other frameworks. The nice thing about Mechanize is that it takes care of following redirects and parses the HTML into an easy-to-follow structure. Something I would always build into my frameworks was the ability to pseudo-submit forms on the page, typically in the form of (PHP example):
// grab the third form on the page, fill it in, and submit it
$cForm = $page->forms[2];
$cForm->login = 'bob';
$cForm->password = 'testpass';
$cForm->submit();
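For comparison, here is a minimal sketch of the parsed structure Mechanize hands back; the URL is just a placeholder, not a site from this post:
(ruby example)
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://example.com/')        # placeholder URL; redirects are followed automatically
page.links.each { |link| puts link.text }      # every link on the page, already parsed
page.forms.each { |form| puts form.action }    # every form, ready to fill in and submit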
You can do very similar form handling in Mechanize, but the thing that stumped me for too long was how to get the current URL of the page. It turns out it isn't that hard, but it is poorly documented.
(ruby example)
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)'
page = agent.get('http://example.com/search')   # placeholder URL; the original snippet starts from an already-fetched page
form = page.forms[1]                            # the second form on the page
form.fields.find { |f| f.name == 'location' }.value = 'MT'
page = agent.submit(form, form.buttons.first)
agent.page.uri.to_s
Note the last line. This should always return the URL of the page the agent is "at" in the browser paradigm, even after multiple redirects.
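To see that concretely, here is a minimal sketch; the redirecting URL is hypothetical, just something that answers with a 302 pointing somewhere else:
(ruby example)
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.get('http://example.com/old-location')   # hypothetical URL that redirects elsewhere
puts agent.page.uri.to_s                       # prints the final URL after the redirect, not the one requested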
Labels: mechanize, php, rails, ruby, screen scraping, www, www::mechanize