Nutch is cool and pretty simple to get started with, and there are lots of tutorials on the Internet.
The problem I had was that I couldn't get Nutch to index, or even store, anchor text. I won't argue about whether that is useful; I just needed it.
Now here is the twist.
- Modify $NUTCH_HOME/conf/solrindex-mapping.xml, adding <field dest="anchor" source="anchor"/>
- Either modify $NUTCH_HOME/conf/nutch-default.xml, changing db.ignore.internal.links to false, or override it in $NUTCH_HOME/conf/nutch-site.xml
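For the second option, the override in nutch-site.xml might look roughly like the sketch below. It assumes the usual Hadoop-style property layout that the Nutch config files use; the description text is my own wording for illustration, not copied from the stock file.

```xml
<?xml version="1.0"?>
<!-- $NUTCH_HOME/conf/nutch-site.xml: values here take precedence over nutch-default.xml -->
<configuration>
  <!-- keep whatever properties you already have in this file -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
    <!-- description is illustrative wording, not the stock text -->
    <description>Keep links from the same host so their anchor text is stored.</description>
  </property>
</configuration>
```

Putting the change in nutch-site.xml rather than editing nutch-default.xml keeps the defaults untouched, which makes it easier to see later what was customized.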
This is simple, but it took me some time to figure out.