Geomesa vs. GeoWave: A Benchmark for Geotemporal Point Data

This article compares Geomesa and GeoWave, two Hadoop-based technologies specialized in the efficient storage and retrieval of geotemporal data. Both technologies use Apache Accumulo as a backend — a key-value store following the BigTable design — and GeoTools for handling geodata.

Although both technologies can also handle complex geo-objects such as line strings and polygons, this article only examines their performance with point data. To put the measured times into perspective, we also look at PostGIS as an established (non-Big-Data) reference system.

This article is based on the master's thesis by Marcel Jacob, “Effiziente Haltung und Abfrage geotemporaler Punktdaten im Apache Hadoop-Ökosystem” (Efficient storage and retrieval of geotemporal point data in the Apache Hadoop ecosystem), which was supervised in cooperation with the University of Leipzig and mgm technology partners.

A Benchmark to compare these systems

To compare the technologies with each other, a detailed geotemporal benchmark has been created. It contains queries from four areas:

  • Spatial: the queries specify the geo locations of the data.
  • Temporal: the queries specify points in time of the data.
  • Geotemporal: the queries specify the locations and points in time of the data.
  • General: neither temporal nor geographical aspects matter in these queries.

All in all, the benchmark includes 21 queries. Three of them will be examined more closely in this article.

  • Query 10: Find all events between 45 and 60 degrees northern latitude and between 10 and 25 degrees eastern longitude during the first three weeks of June 2015.
  • Query 18: Find the 10 geographically nearest events, with their location and date, in connection with inquiries concerning the USA between 2004 and 2014, starting from Washington (k-NN query).
  • Query 19: Find all events in which Germany is one of the actors.
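
Conceptually, Query 10 is a simple bounding-box plus time-interval filter. The following minimal Python sketch illustrates the predicate with a few invented sample events (the tuples are not the GDELT schema, just latitude, longitude, and date):

```python
from datetime import date

# Hypothetical minimal event records: (latitude, longitude, event_date).
events = [
    (52.5, 13.4, date(2015, 6, 10)),   # inside the box, inside the interval
    (48.9,  2.3, date(2015, 6, 10)),   # longitude outside the box
    (50.1, 12.0, date(2015, 6, 25)),   # inside the box, after the interval
]

def query_10(events):
    """Bounding box 45-60 deg N, 10-25 deg E; first three weeks of June 2015."""
    start, end = date(2015, 6, 1), date(2015, 6, 21)
    return [
        e for e in events
        if 45 <= e[0] <= 60 and 10 <= e[1] <= 25 and start <= e[2] <= end
    ]

print(len(query_10(events)))  # only the first event matches
```

The benchmarked systems evaluate exactly this kind of predicate, but against a spatial or geotemporal index instead of a linear scan.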

Which test data has been used?

To examine the systems' behavior with growing amounts of data, two different data sets — each in four size variants — have been used. They are based on the GDELT Event Database with up to 58 attributes per record. Data has been extracted from the complete data set until the test data in CSV format reached 1, 3.3, 10, and 33.3 GB. Additionally, a second, synthetic data set with the same sizes has been generated; it contains only the fields that are relevant for the benchmark (11 attributes per record), and its geo-coordinates are distributed according to population density.
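
A population-weighted coordinate distribution like the one used for the synthetic data can be approximated with weighted sampling. The sketch below is only an illustration of the principle, with an invented three-region weight table and jitter parameters; it is not the generator used in the thesis:

```python
import random

# Hypothetical population weights for a few region centers (lat, lon).
regions = [
    ((52.5, 13.4), 3_600_000),   # Berlin
    ((48.1, 11.6), 1_500_000),   # Munich
    ((51.3, 12.4),   600_000),   # Leipzig
]

def sample_points(n, seed=42):
    """Draw n points, region-weighted, with Gaussian jitter around each center."""
    rng = random.Random(seed)
    centers = [c for c, _ in regions]
    weights = [w for _, w in regions]
    points = []
    for _ in range(n):
        lat, lon = rng.choices(centers, weights=weights)[0]
        # Jitter so points spread around the center instead of stacking on it.
        points.append((lat + rng.gauss(0, 0.5), lon + rng.gauss(0, 0.5)))
    return points

pts = sample_points(1000)
```

Sampling this way makes dense clusters around populous regions, which is exactly the property that later distinguishes the synthetic data from the raw GDELT extract in the query measurements.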

Number of points in the data sets

| Data set       | A (1 GB raw data) | B (3.3 GB raw data) | C (10 GB raw data) | D (33.3 GB raw data) |
|----------------|-------------------|---------------------|--------------------|----------------------|
| GDELT data     | 3.6 m             | 11.9 m              | 36 m               | 119.9 m              |
| Synthetic data | 10 m              | 33.3 m              | 100.8 m            | 335.7 m              |

Results

The test system consists of three servers with:

  • Intel Core i7-4770 (4 Cores/8 Threads, 3.4 GHz)
  • 16 GB RAM
  • 256 GB SSD

Tested versions:

  • PostGIS: 2.1 (PostgreSQL 9.3)
  • Geomesa: 1.1.0-rc.6
  • GeoWave: 0.9.0-SNAPSHOT

For the Hadoop cluster, one of the servers has been used as the master node and the other two as worker nodes.

Batch-Import Times

1. Batch-Import times for GDELT data

2. Batch-Import times for synthetic data

1) With PostGIS, the data has been imported via the COPY command. The point geometry has been created with ST_MakePoint, and a spatial (GiST) index as well as a temporal (regular B-tree) index have been created using the standard configuration.
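
The PostGIS import steps can be sketched as a sequence of SQL statements. The table and column names below are assumptions for illustration, not the schema actually used in the benchmark; the snippet only assembles the statements (e.g. to feed them to psql or a database driver):

```python
# Hypothetical table and column names; the actual benchmark schema differs.
table = "gdelt_events"

statements = [
    # 1. Bulk-load the CSV data.
    f"COPY {table} FROM '/data/gdelt.csv' WITH (FORMAT csv);",
    # 2. Build a point geometry from the coordinate columns.
    f"ALTER TABLE {table} ADD COLUMN geom geometry(Point, 4326);",
    f"UPDATE {table} SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);",
    # 3. Spatial (GiST) and temporal (default B-tree) indices.
    f"CREATE INDEX {table}_geom_idx ON {table} USING GIST (geom);",
    f"CREATE INDEX {table}_date_idx ON {table} (event_date);",
]

script = "\n".join(statements)
print(script)
```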

2) For Geomesa and GeoWave, a MapReduce job has been executed in each case to import the data. The following parameters have been chosen:

| Service          | Parameter                           | Value    |
|------------------|-------------------------------------|----------|
| HDFS             | dfs.replication                     | 2        |
| HDFS             | dfs.datanode.synconclose            | true     |
| YARN / MapReduce | yarn.nodemanager.resource.memory-mb | 12288 MB |
| YARN / MapReduce | mapreduce.map.memory.mb             | 4096 MB  |
| YARN / MapReduce | mapreduce.map.java.opts             | 3584 MB  |
| Accumulo         | tserver_max_heapsize                | 4 GB     |
| Accumulo         | table.split.threshold               | 512 MB   |
| Accumulo         | tserver.memory.maps.max             | 4096 MB  |

As a consequence of the chosen parameters, five mappers could run in parallel in the test cluster.

The measurements of the batch-import times demonstrate that with growing amounts of data, GeoWave performs best among the tested technologies. Each measurement has been repeated five times. The standard deviation of Geomesa's runtimes was 9% for the smallest data set and dropped to 4.5% for the biggest; for GeoWave, these values were 5% for the smallest and 1.2% for the biggest data set. The values for PostGIS range from 2.8% for the smallest data set to 7.1% for the biggest. Further analysis would be needed to explain the larger values.

GeoWave stores the data in a single Accumulo table. Geomesa, in the tested version, internally uses three Accumulo tables and stores the data redundantly: two tables with different geotemporal indices and one with an attribute index (here on the date attribute, hence a time index). This could partly explain Geomesa's longer runtimes.

Storage Space Consumption

3. Storage space consumption for GDELT data

4. Storage space consumption for synthetic data

GeoWave requires the least disk space for storing and indexing the data. The results for Geomesa and GeoWave are given without HDFS replication. Here too, Geomesa's higher storage consumption compared to GeoWave can be partly explained by its redundant storage in several tables.

Queries of the Benchmark

The following measurements are separated into the first iteration, in which no results were present in the cache yet, and the subsequent iterations.

Query 10: Find all events between 45 and 60 degrees northern latitude and between 10 and 25 degrees eastern longitude during the first three weeks of June 2015

5. Query 10 result for GDELT data

6. Query 10 result for synthetic data

The execution times for this query remain very short for Geomesa and GeoWave across all test data, with a maximum of 7.5 seconds. PostGIS, however, needs a very long time for the synthetic data (up to 465 seconds), while for the GDELT data it provides the shortest execution times. One possible explanation is the different distribution of data points in the synthetic data. For this type of query, the time savings in the subsequent iterations are very small despite caching.

Query 18: Find the 10 geographically nearest events, with their location and date, in connection with inquiries concerning the USA between 2004 and 2014, starting from Washington (k-NN query).

7. Query 18 result for GDELT data

8. Query 18 result for synthetic data

Query 18 is probably the most interesting query of the benchmark, because it demonstrates a particular behavior of Geomesa and GeoWave with increasing amounts of data. While the query times grow continuously with PostGIS, as one would expect, this is not generally the case with Geomesa and GeoWave on the GDELT data. The reason lies in how k-NN queries are implemented in the two technologies. Geomesa is able to answer the query without problems; GeoWave, however, does not provide a ready-to-use function for it. Since Geomesa uses an iterative process that delivers its results geohash by geohash, it usually does not have to read the complete data set.

With small amounts of data, the data density in a given area is rather low, so more areas (geohashes) have to be examined before the desired k results are found. With large amounts of data, fewer areas (geohashes) tend to suffice for k results, and the result can be returned faster. Since GeoWave cannot answer k-NN queries out of the box, a custom implementation following a similar idea as Geomesa has been realized. Here too, a time saving from data set C to D can be observed, even if it is smaller than with Geomesa. This behavior could not be observed with the synthetic data, because the synthetic data is distributed much more homogeneously.
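
The iterative idea can be illustrated with a simplified sketch: instead of real geohashes, a coarse lat/lon grid stands in for the index cells, and cells are examined ring by ring around the start cell until at least k candidates have been collected. This is an assumption-laden illustration of the principle, not Geomesa's actual implementation:

```python
import math
from collections import defaultdict

CELL = 1.0  # cell size in degrees; a crude stand-in for a geohash precision

def cell_of(lat, lon):
    return (math.floor(lat / CELL), math.floor(lon / CELL))

def knn(points, start, k):
    """Examine grid cells ring by ring around the start cell until at least
    k candidates have been collected, then rank them by distance."""
    index = defaultdict(list)
    for p in points:
        index[cell_of(*p)].append(p)
    cx, cy = cell_of(*start)
    candidates, ring = [], 0
    while len(candidates) < k and ring <= 180:  # hard stop for the sketch
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) == ring:  # only the newly added ring
                    candidates.extend(index[(cx + dx, cy + dy)])
        ring += 1
    return sorted(candidates,
                  key=lambda p: (p[0] - start[0]) ** 2 + (p[1] - start[1]) ** 2)[:k]

# Dense data around the start point: k results are found after very few cells.
pts = [(38.9 + i * 0.01, -77.0 + i * 0.01) for i in range(100)]
nearest = knn(pts, (38.9, -77.0), 10)
```

The sketch shows why denser data terminates after fewer cells, which matches the observed speed-up from data set C to D. A production k-NN would also have to expand at least one additional ring to guarantee correctness near cell boundaries.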

Query 19: Find all events in which Germany is one of the actors.

9. Query 19 result for GDELT data

10. Query 19 result for synthetic data

Query 19 with its attribute filter represents a general use case for which the systems have not been optimized: it queries an attribute that has not been indexed, so the result is a full table scan. This reveals weaknesses of Geomesa and GeoWave. Since PostGIS is faster than Geomesa and GeoWave even for data set D, and without any parallelization, the potential for optimization is evident — especially if PostGIS additionally uses an index!
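
The difference between the unindexed and the indexed case can be shown in a few lines. This is a deliberately simplified sketch with invented actor codes, not the storage layout of any of the tested systems:

```python
from collections import defaultdict

# Hypothetical event records: (actor1, actor2, event_id).
events = [("GER", "USA", 1), ("FRA", "GBR", 2), ("USA", "GER", 3), ("CHN", "JPN", 4)]

def full_scan(events, country):
    # Without an attribute index every record must be read (Query 19's case).
    return [e for e in events if country in (e[0], e[1])]

# An attribute index maps value -> matching records, so the same question
# becomes a single lookup instead of a scan over the whole table.
actor_index = defaultdict(list)
for e in events:
    for actor in (e[0], e[1]):
        actor_index[actor].append(e)

print(full_scan(events, "GER") == actor_index["GER"])  # True
```

Both paths return the same result; the difference is that the scan's cost grows with the table size while the lookup's cost does not, which is exactly the gap the measurements expose.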

Conclusion

Let’s summarize the measurements for all queries:

  • It can be concluded that for small data sets — such as A — PostGIS returns results most rapidly for many queries. Geomesa and GeoWave demonstrate their strengths only when it comes to larger amounts of data.
  • In terms of geographical queries, Geomesa scores better than GeoWave. k-NN queries are easier to implement with Geomesa, since GeoWave does not support this kind of query yet. If queries involve time intervals, it is advisable for Geomesa and GeoWave to create an additional index for the time component.
  • For geotemporal queries there is no clear winner. However, a trend can be noticed: GeoWave is more performant for smaller areas and short time intervals, while Geomesa is more performant for larger areas and longer time intervals.

For the last three queries of the benchmark, which cover more commonplace tasks, PostGIS returns results most rapidly, even for large data sets. Here there is room for improvement for Geomesa and GeoWave.

If queries require grouping and sorting, this has to be realized by means of other technologies, such as Apache Spark. In addition, it could be interesting to investigate the horizontal scaling behavior of both technologies in a future article.

In the end, the choice of technology should depend on the requirements of the project. In this article, Geomesa and GeoWave have been examined for point data only. If this matches the requirements, both technologies already deliver good results.
