Monday, October 21, 2013

Crawling the Web with Cassandra and Nutch


So, you want to harvest a massive amount of data from the internet? What better storage mechanism than Cassandra? This is easy to do with Nutch.

People often use HBase behind Nutch. This works, but it may not be an ideal solution if you are (or want to be) a Cassandra shop. Fortunately, Nutch 2+ uses the Gora abstraction layer to access its data storage mechanism, and Gora supports Cassandra. Thus, with a few tweaks to the configuration, you can use Nutch to harvest content directly into Cassandra.
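As a rough sketch of what those tweaks tend to look like in a Nutch 2.x checkout, the storage class is set in conf/nutch-site.xml and the Gora backend in conf/gora.properties. The host and port below (localhost:9160) are assumptions for a local single-node Cassandra, not values from this post, and your Nutch version may also need the gora-cassandra dependency enabled in ivy/ivy.xml before building.

```xml
<!-- conf/nutch-site.xml: tell Nutch to store crawl data through Gora's Cassandra store -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>
```

```properties
# conf/gora.properties: make Cassandra the default Gora datastore
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
# Assumed address of a local Cassandra node; replace with your own cluster
gora.cassandrastore.servers=localhost:9160
```

Treat these as a starting point and check the property names against the Gora and Nutch versions you are actually running.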

