
Ruby web crawling with Anemone

I recently bumped into Anemone, a site-crawling gem written by Chris Kite. We at Humbucker have our own high-performance crawlers, written in various languages and geared towards different objectives, but Chris' solution is an easy-to-set-up, easy-to-use alternative that is worth a look.

As an example, I wrote a script that takes a domain and downloads every page, saving the website's folder tree to disk. Bear in mind that Anemone keeps downloaded pages in memory by default, or in one of the non-relational DBs all the cool kids use, like MongoDB, so the folder tree bit is not really needed. I just did it because I happened to need to download a whole site and view it locally.

Here goes:

%w[rubygems anemone fileutils uri].each { |r| require r }
site_root = "" # Set this to the root URL of the site to crawl
# Name the root folder after the site's host
folder = URI.parse(site_root).host
Anemone.crawl(site_root) do |anemone|
  anemone.on_every_page do |page|
    filename = page.url.request_uri.to_s
    filename = "/index.html" if filename == "/" # Make sure the file name is valid
    folders = filename.split("/")
    filename = folders.pop
    FileUtils.mkdir_p(File.join(".", folder, folders)) # Create the current subfolder
    print "Downloading '#{page.url}'..."
    File.open(File.join(".", folder, folders, filename), "w") { |f| f.write(page.body) }
    puts "done."
  end
end
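
The path handling in the script can also be pulled into a couple of small helpers, which makes it easier to test without crawling anything. What follows is only a sketch under a few assumptions: local_path_for and save_page are made-up names (they are not part of Anemone), example.com in the usage note is a placeholder, and unlike the script it drops the query string and treats any URL ending in "/" as that folder's index.html.

```ruby
require "uri"
require "fileutils"

# Hypothetical helper (not part of Anemone): map a page URL to a local
# file path under root_dir, mirroring the site's folder tree.
def local_path_for(url, root_dir = ".")
  uri = URI.parse(url.to_s)
  path = uri.path.to_s
  path = "/" if path.empty?
  # Treat directory-style URLs ("/", "/blog/") as ".../index.html"
  path += "index.html" if path.end_with?("/")
  # Drop empty and relative segments so the result stays under root_dir
  segments = path.split("/").reject { |s| s.empty? || s == "." || s == ".." }
  segments = ["index.html"] if segments.empty?
  File.join(root_dir, uri.host.to_s, *segments)
end

# Hypothetical helper: write body to path, creating parent folders first.
# Returns false (and writes nothing) when body is nil.
def save_page(path, body)
  return false if body.nil?
  FileUtils.mkdir_p(File.dirname(path))
  File.open(path, "w") { |f| f.write(body) }
  true
end
```

Inside the on_every_page block, the two calls would presumably combine as save_page(local_path_for(page.url), page.body). For instance, local_path_for("http://example.com/blog/") gives "./example.com/blog/index.html".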

Categories: Code, Tricks.


3 Responses

  1. I tried it but I get the following error:

    undefined method `body' for # (NoMethodError)

    If I use another attribute (code, url, ...) it works fine.

    Could you tell me where the problem might be?

    • Hi Fernando, body is in fact a valid page attribute in Anemone, so I am not really sure why what you are describing is happening.

      Try rescuing the f.write(page.body) call so that you can see which operations are actually failing. Replace the line that writes page.body in this post's snippet with:

        File.open(File.join(".", folder, folders, filename), "w") do |f|
          begin
            f.write(page.body)
          rescue Exception => e
            puts "An error has occurred while processing #{page.url}:"
            puts e.message
          end
        end

  2. Thanks for the demonstration. There is very little info about Anemone on the web, so thanks for adding more to it.
