About one year ago I wrote an article for Linux Magazin about a personal search engine:
* Suchmaschinenserver im Eigenbau (Ausgabe 02/2016)
* Go Find It (Issue #186 / May 2016)
In that article I used Ubuntu 14.04 and Nutch 1.9. Now Nutch 1.13 is available. Time for a short update.
First the bad news: Ubuntu 17.04 and Debian 9 both deliver version 3.6.2 of Solr, which is not compatible with Nutch 1.13.
If you want to use the latest version of Nutch, you have to install Solr by hand.
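If you go that route, a minimal sketch could look like this. I assume Solr 5.5.0 here; check the Nutch 1.13 release notes for the exact Solr version the „indexer-solr“ plugin was built against.

# minimal sketch, assuming Solr 5.5.0 works with the Nutch 1.13 indexer-solr plugin
wget http://archive.apache.org/dist/lucene/solr/5.5.0/solr-5.5.0.tgz
tar vfx solr-5.5.0.tgz
# start Solr in the background on its default port 8983
solr-5.5.0/bin/solr start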
With Solr 3.6.2 and Nutch 1.13 you get an error like this:
org.apache.solr.common.SolrException.log org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe0 (at char #1, byte #-1)
If you came here because you googled this error message, the short answer is: your „indexer-solr“ plugin is not compatible with your Solr version. You can try to use the „indexer-solr“ plugin from Nutch 1.12 with Nutch 1.13, but at the moment I don’t know whether this solution brings any other problems with it.
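A rough sketch of that plugin swap (untested; the paths assume both binary distributions are unpacked next to each other, with apache-nutch-1.13 being your actual install):

wget http://archive.apache.org/dist/nutch/1.12/apache-nutch-1.12-bin.tar.gz
tar vfx apache-nutch-1.12-bin.tar.gz
# replace the 1.13 plugin directory with the one from 1.12
rm -r apache-nutch-1.13/plugins/indexer-solr
cp -r apache-nutch-1.12/plugins/indexer-solr apache-nutch-1.13/plugins/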
The best way for me at the moment is the following quick setup:
Install Solr
apt-get install solr-tomcat
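To check that the package really put Solr under Tomcat on port 8080 (the port used in the rest of this article; on current Debian/Ubuntu releases the service is called tomcat8):

# the Solr 3.x admin page should answer on the Tomcat port
curl -I http://localhost:8080/solr/admin/
service tomcat8 status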
Download and install Nutch
wget http://archive.apache.org/dist/nutch/1.12/apache-nutch-1.12-bin.tar.gz
tar vfx apache-nutch-1.12-bin.tar.gz
mv apache-nutch-1.12 /opt/
ln -s /opt/apache-nutch-1.12 /opt/nutch
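A quick way to verify the unpacked copy is to call the nutch script without arguments; it prints the list of available commands (JAVA_HOME as in the crawl step further down):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
# prints the usage/command list if the installation and Java setup are fine
/opt/nutch/bin/nutch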
Configure Solr for Nutch
mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig
cp /opt/nutch/conf/schema.xml /etc/solr/conf/schema.xml
/etc/init.d/tomcat8 restart
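If Tomcat does not come back up cleanly, a broken schema usually shows up in the Tomcat log (path assumes the Debian/Ubuntu tomcat8 package):

tail -n 50 /var/log/tomcat8/catalina.out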
Note: The value stored="true" for the content field is now the default and doesn’t need to be changed.
Configure Nutch
Edit or create „/opt/nutch/conf/nutch-site.xml“ with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Search Agent</value>
  </property>
  <property>
    <name>file.content.ignored</name>
    <value>false</value>
  </property>
  <property>
    <name>db.update.purge.404</name>
    <value>true</value>
  </property>
  <property>
    <name>indexer.max.title.length</name>
    <value>150</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>
  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8080/solr/</value>
  </property>
  <property>
    <name>indexingfilter.order</name>
    <value>indexer-solr</value>
  </property>
</configuration>
Example Setup
Now I create a simple example setup for the crawler. The following setup allows the crawler to crawl only my blog and not follow any „external“ links.
Edit „/opt/nutch/conf/regex-urlfilter.txt“ and add the following line:
+^(http|https)://www.mogilowski.net
At the end of the file I changed „+.“ to „-.“ to deny all other URLs.
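The relevant end of my „regex-urlfilter.txt“ then looks like this:

# accept only URLs from my blog
+^(http|https)://www.mogilowski.net
# skip everything else
-.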
Create directories and seed
Now I prepare the directories and create a seed file with the start URL.
mkdir /opt/nutch/IntranetCrawler
mkdir /opt/nutch/urls
echo "http://www.mogilowski.net" > /opt/nutch/urls/seed.txt
Let’s go
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
/opt/nutch/bin/crawl --index /opt/nutch/urls/ /opt/nutch/IntranetCrawler/ 1
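After the run you can get a quick overview of what the crawler has seen with the readdb command (assuming the crawl script created its crawldb below the crawl directory given above):

# prints counts of fetched, unfetched and gone URLs
/opt/nutch/bin/nutch readdb /opt/nutch/IntranetCrawler/crawldb -stats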
Check the results
Go to „http://YOUR_SERVER:8080/solr/“ and make a select in the Solr admin with „*:*“. You should get the results of the first run as XML.
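The same query also works from the command line; Solr 3.6 returns XML by default:

# match-all query against the select handler, limited to 10 rows
curl "http://YOUR_SERVER:8080/solr/select?q=*:*&rows=10"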
Notes
For more information about the Nutch config or how to access the Solr data, take a look at:
* Suchmaschinenserver im Eigenbau (Ausgabe 02/2016)
* Go Find It (Issue #186 / May 2016)
Security!
By default Solr is publicly accessible (read/write) without a password on all IPs!
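So make sure port 8080 is not reachable from the outside. A minimal sketch with iptables (not persistent across reboots, so put it into your firewall setup of choice; adjust if you really need remote access to the Solr admin):

# drop remote access to Tomcat/Solr on port 8080, allow only localhost
iptables -A INPUT -p tcp --dport 8080 ! -s 127.0.0.1 -j DROP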