For our recent online shop project, we needed a full-text, multi-criteria product search. Lucene, the popular Java search engine, is an ideal candidate for this functionality. To meet the project's high performance requirements, however, we had to extend its usage beyond standard full-text search. This post describes our solution, including index switching and using Lucene as a simple NoSQL database.
We use the open-source search server Solr for real-time search on data stored in a Hadoop cluster. For our terabyte-scale dataset, we had to implement distributed search across multiple Lucene index partitions (shards). This article describes our solution for managing 40 independent Solr instances without human intervention, including transparent failover and automatic backup of the Lucene index.
In the previous part of this article series, we focused on the efficient storage of log data in Hadoop. We described how to store the data in Hadoop’s MapFiles, and we tuned the configuration settings for increased data storage capacity and greater retrieval speed. Today, we discuss how to perform near real-time searches on up to 36.6 billion log messages by combining Hadoop with Lucene and Solr.