The Big Data technology landscape is expanding. Choosing the right frameworks from a growing range of technologies is increasingly becoming a challenge in projects. Helpful for the selection process are performance comparisons of frequently used operations such as cluster analyses. This is exactly what a cooperation between mgm and the University of Leipzig has been about. In the context of the master thesis “Skalierbares Clustering geotemporaler Daten auf verteilten Systemen” (“Scalable clustering of geotemporal data in distributed systems”) of Paul Röwer at the chair of Prof. Dr. Martin Middendorf for parallel processing and complex systems, the k-means algorithm has been implemented for four open source technologies of the Apache Software Foundation — Hadoop, Mahout, Spark, and Flink. Benchmarks have been carried out which compare runtime and scalability.
The revenue gained with Big Data solutions rose by 66% up to 73.5 billion euro world-wide and 59% up to 6.1 billion in Germany over the past year. One of the core technologies used is Hadoop which creates the base for a broad and rich eco-system containing distributed databases, data and graph processing libraries, query and workflow engines and much more. In one of our former blog posts, we have described how we use Hadoop for storing log messages. Since then, a lot has happened in the Hadoop universe and ecosystem. With the start of our new Big Data series, we want to cover those changes and show best practices in the Big Data world.