This article compares GeoMesa and GeoWave, two Hadoop-based technologies specialized in the efficient storage and retrieval of geotemporal data. Both use Apache Accumulo, a key-value store following Google's BigTable design, as their backend, and GeoTools for handling geodata.
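A typical way to work with either store is through the generic GeoTools DataStore API. The following is a minimal sketch of a combined spatial and temporal query against a GeoMesa store on Accumulo; the connection parameter keys, feature type, and attribute names are illustrative assumptions (the exact keys vary between GeoMesa versions) and are not taken from the article.

```java
import java.util.HashMap;
import java.util.Map;
import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.filter.text.ecql.ECQL;

public class GeoMesaQuerySketch {
    public static void main(String[] args) throws Exception {
        // Connection parameters for the Accumulo-backed GeoMesa store.
        // The parameter keys below are illustrative; check the GeoMesa
        // documentation for the keys of your version.
        Map<String, String> params = new HashMap<>();
        params.put("instanceId", "myInstance");     // assumed Accumulo instance name
        params.put("zookeepers", "zoo1:2181");      // assumed ZooKeeper quorum
        params.put("user", "root");
        params.put("password", "secret");
        params.put("tableName", "geomesa_catalog"); // assumed catalog table

        // Returns null if no matching DataStore factory is on the classpath.
        DataStore store = DataStoreFinder.getDataStore(params);

        // A combined spatial + temporal query in ECQL: features inside a
        // bounding box and within a one-month time window.
        SimpleFeatureCollection results = store
            .getFeatureSource("observations") // hypothetical feature type
            .getFeatures(ECQL.toFilter(
                "BBOX(geom, 8.0, 48.0, 12.0, 52.0) AND "
                + "dtg DURING 2015-01-01T00:00:00Z/2015-01-31T23:59:59Z"));

        System.out.println("matching features: " + results.size());
        store.dispose();
    }
}
```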
Over the past year, revenue from Big Data solutions rose by 66% to 73.5 billion euros worldwide and by 59% to 6.1 billion euros in Germany. One of the core technologies behind this growth is Hadoop, which forms the foundation for a broad and rich ecosystem of distributed databases, data and graph processing libraries, query and workflow engines, and much more. In one of our earlier blog posts, we described how we use Hadoop for storing log messages. Since then, a lot has happened in the Hadoop ecosystem. With the start of our new Big Data series, we want to cover these changes and show best practices from the Big Data world.
For our recent online shop project, we required a full-text, multi-criteria product search. Lucene, the popular Java search engine, is an ideal candidate for this functionality. But to meet the high performance requirements, we had to extend its usage beyond standard full-text search. This post describes our solution, including index switching and using Lucene as a simple NoSQL database.
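To illustrate the "Lucene as a simple NoSQL database" idea, here is a minimal sketch (using the Lucene 8.x-style API) that stores a record as a document and reads it back via an exact-match ID lookup; the index path, field names, and values are hypothetical and not taken from the shop project.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneAsRecordStore {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/product-index"));

        // Write a product as a Lucene document: the ID is indexed for exact
        // lookups, the remaining fields are only stored for retrieval.
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "p-4711", Field.Store.YES));
            doc.add(new StoredField("name", "USB-C cable"));
            doc.add(new StoredField("priceCents", 999));
            writer.updateDocument(new Term("id", "p-4711"), doc); // upsert by ID
        }

        // Key-based lookup: an exact-match TermQuery on the ID field behaves
        // like a primary-key read in a NoSQL store.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("id", "p-4711")), 1);
            Document found = searcher.doc(hits.scoreDocs[0].doc);
            System.out.println(found.get("name") + " / " + found.get("priceCents"));
        }
    }
}
```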
We use the open-source search server Solr for real-time search on data stored in a Hadoop cluster. For our terabyte-scale dataset, we had to implement distributed search over multiple Lucene index partitions (shards). This article describes our solution for managing 40 independent Solr instances without human intervention, including transparent failover and automatic backup of the Lucene index.
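For context on what a query across such shards looks like, here is a hedged SolrJ sketch using Solr's standard shards parameter to fan a single request out over several instances; the host names, core name, and field names are made up for illustration and do not reflect our actual management layer.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearchSketch {
    public static void main(String[] args) throws Exception {
        // Send the query to one instance and let it distribute the request
        // to all shard partitions via the "shards" parameter.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr01:8983/solr/logs").build()) { // hypothetical host/core

            SolrQuery query = new SolrQuery("message:timeout AND host:app42");
            query.set("shards",
                "solr01:8983/solr/logs,solr02:8983/solr/logs,solr03:8983/solr/logs");
            query.setRows(20);

            QueryResponse rsp = solr.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```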
In the previous part of this article series, we focused on the efficient storage of log data in Hadoop. We described how to store the data in Hadoop's MapFiles and how we tweaked the configuration settings for increased storage capacity and faster retrieval. Today, we discuss how to perform near real-time searches on up to 36.6 billion log messages through a clever combination of Hadoop, Lucene, and Solr.
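As a refresher on that storage layer, the sketch below shows the basic MapFile write-and-lookup pattern using the Hadoop 2 API; the path, key scheme (epoch milliseconds), and values are illustrative assumptions rather than our actual layout.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path("/logs/2015-06-01.map"); // hypothetical location

        // A MapFile is a sorted SequenceFile plus an index: keys must be
        // appended in ascending order, which makes lookups cheap later.
        try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                MapFile.Writer.keyClass(LongWritable.class),
                MapFile.Writer.valueClass(Text.class))) {
            writer.append(new LongWritable(1433116800000L), new Text("first log line"));
            writer.append(new LongWritable(1433116801000L), new Text("second log line"));
        }

        // Random access by key uses the in-memory index instead of a full scan.
        try (MapFile.Reader reader = new MapFile.Reader(dir, conf)) {
            Text value = new Text();
            reader.get(new LongWritable(1433116801000L), value);
            System.out.println(value);
        }
    }
}
```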
In part 1 of this article series, we described the various challenges of dealing with large amounts of logging data in a heavily distributed software ecosystem. After evaluating different approaches, we quickly settled on Hadoop as the technology of choice. In this article, we describe some of the pitfalls we had to overcome when using Hadoop to store log messages.
Our team has developed a system for storing and processing huge amounts of log data using Hadoop. The challenge was to handle gigabytes of log messages every day and to enable querying the 30+ terabyte archive with instantaneous results. In this first part of our blog series, we explain our motivation for using Hadoop and contrast this solution with the traditional relational database approach.