The Big Data technology landscape is expanding. Choosing the right frameworks from a growing range of technologies is increasingly becoming a challenge in projects. Helpful for the selection process are performance comparisons of frequently used operations such as cluster analyses. This is exactly what a cooperation between mgm and the University of Leipzig has been about. In the context of the master thesis “Skalierbares Clustering geotemporaler Daten auf verteilten Systemen” (“Scalable clustering of geotemporal data in distributed systems”) of Paul Röwer at the chair of Prof. Dr. Martin Middendorf for parallel processing and complex systems, the k-means algorithm has been implemented for four open source technologies of the Apache Software Foundation — Hadoop, Mahout, Spark, and Flink. Benchmarks have been carried out which compare runtime and scalability.
Read the rest of this entry » |