Our team has developed a system for storing and processing huge amounts of log data using Hadoop. The challenge was to handle Gigabytes of log messages every day and enable querying the 30+ Terabyte archive with instantaneous results. In this first part of our blog series we explain our motivation for using Hadoop and contrast this solution with the traditional relational database approach.
Series ‘Scalable Log Data Management with Hadoop’
In part 1 of this article series we described the various challenges of dealing with large amounts of logging data in a heavily distributed software ecosystem. After evaluating different approaches, we quickly selected Hadoop as the technology of our choice. In this article we will describe how some pitfalls we had to solve when using Hadoop to store log messages.
In the previous part of this article series we focused on the efficient storage of log data in Hadoop. We described how to store the data in Hadoop’s MapFiles, and we tweaked the configuration settings for increased data storage capacity and greater retrieval speed. Today, we discuss how to perform near real-time searches on up to 36.6 billion log messages by a clever combination of Hadoop with Lucene and Solr.
We use the open-source search server Solr for real-time search on data stored in a Hadoop cluster. For our terabyte-scale dataset, we had to implement distributed search on multiple Lucene index partitions (shards). This article describes our solution to manage 40 independent Solr instances without human interaction, including transparent failover and automatic backup of the Lucene index.