Searching OpenStreetMap Geospatial Data with Solr

We are currently experiencing a geospatial revolution that is changing how we navigate from A to B and how we search for locations such as a specific sight or a nearby restaurant. Geospatial search technology provides this kind of information. This article shows how commercial applications can make use of geospatial search, e.g. for real estate search (qualifying properties by their distance to the nearest kindergartens, schools, doctors, etc.), for calculating the building density of cities, and so on.

Let us consider for a moment who actually offers geospatial information. Google search sometimes finds geospatial results, but it is only partially possible to search for them specifically, even in Google Maps. And one may not want to become too dependent on a commercial provider like Google.

Like Wikipedia, the trailblazer for free knowledge, the OpenStreetMap service has been around for a long time and by now offers extensive data of excellent quality. Everyone can register at OpenStreetMap and improve the maps using a simple editor. Many volunteers have already contributed GPS tracks, so quite comprehensive information is available. A quick visit to www.openstreetmap.org gives everyone an overview of the available information, for example by checking the map of their place of birth or residence. The maps are often so detailed that they even show the outlines of houses. Why this is important will be discussed later on.

Figure: OpenStreetMap view of Nuremberg, zoomed in on the "Bundesagentur für Arbeit", clearly showing the outlines of the individual buildings. Many details in the data are not even rendered on the maps.

Please note that the data quality of OpenStreetMap is continuously improving. Knowledge of geospatial details is not lost but extended and refined all the time. As the data is also used in (licensed) commercial applications, there is a strong business interest in its ongoing maintenance. The more people use OpenStreetMap, the greater the willingness to contribute, to report mistakes, or to add further information.

An even more exciting option than the maps themselves is the opportunity to download the source data from OpenStreetMap and use it for other purposes. In the following sections, we will show how to prepare and use this data in different contexts.

Indexing Geospatial Data

The OpenStreetMap website offers the complete map data for download. However, only rectangular extracts or the data of the whole earth are available, both of which are problematic. The data of the whole earth is enormous, while the data of a rectangle is often not completely consistent, as certain areas at the edges of the rectangle overlap.

Luckily, there are also consistent extracts for whole countries like Germany, for separate federal states, or even for administrative regions like the German "Regierungsbezirke". These are, however, only created periodically and so may not always reflect the current state of the maps.

The download formats are XML (compressed) or ProtocolBuffer. The latter is an extremely compact format created by Google and originally used for message exchange; parsers exist for many widely used languages. The XML format, however, is better suited for experimenting, as it is easier for humans to read and to process semantically.

Selection of a Solution for Indexing and Search

Once the source data of OpenStreetMap is locally available, it has to be stored in a way that allows it to be searched by distance ("geospatial search"). Some database systems like PostgreSQL, Oracle, or Microsoft SQL Server already offer suitable functionality for indexing and search.

As we are mainly interested in searching (and faceting), our solution uses software that is specifically optimized for this use case: Apache Solr. Since version 4.0, Solr has offered extensive options for searching geospatial data. Solr can return the indexed OpenStreetMap data as XML source text; visualization as maps requires more infrastructure and is (so far) not part of our solution.

Creation of a Schema

Similar to relational databases, which rely on a definition of entities and relations, Solr requires a so-called "schema" for data indexing. Contrary to relational systems, however, the data in Solr is "flat", which simply put means that there is only one table.

Solr can work with different data types; in our case, "string" is the most relevant one apart from the actual coordinates, as it allows the many attributes (see below) to be mapped. The support offered for geospatial coordinates is quite exciting, and there are two different approaches:

  • LatLonType: This data type can store one (or more) points in the Solr index. A point is stored as two float values, so it can be retrieved very fast.
  • SpatialRecursivePrefixTreeFieldType: This data type is much more flexible and uses so-called geohashes to store the data. This requires more storage space but in return offers more search functionality, including the option to store geometrical objects like polygons or lines in addition to points.

We worked with the second data type in our solution, as OpenStreetMap data also contains polygons (e.g. the contours of woods), and it turned out to be quite interesting to include them in the search.

This is the part of the Solr schema responsible for geospatial search:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="geo" type="location_rpt" indexed="true" stored="true" />
<fieldType name="location_rpt"
    class="solr.SpatialRecursivePrefixTreeFieldType"
    spatialContextFactory="com.spatial4j.core.context.jts.\
                           JtsSpatialContextFactory"
    geo="true" distErrPct="0.01" maxDistErr="0.0009" units="degrees" />

We work with a maximum relative error of 1% in distances and a maximum absolute error of about 100 meters, which is absolutely sufficient for our purposes. Lowering these values leads to a significantly larger index and slower searches.

Converting the Data

Once the schema is complete, the existing OpenStreetMap data can be imported into Solr. The usual approach would be to write (or configure) a DataImportHandler for the import of XML data.

Unfortunately, this is not feasible in our case, as the OpenStreetMap XML contains various data types, some of which reference one another; the order of the import therefore matters, and the references have to be resolved. In addition, we do not want to import all the data into the Solr index but to filter it in advance (see below).

A fragment of an OpenStreetMap file looks like this:

<node id="26212605" lat="49.588555" lon="11.0014352" version="7"
      timestamp="2013-01-08T17:55:18Z" changeset="14577549" uid="479256"
      user="geodreieck4711"/>
<way id="2293021" version="12" timestamp="2013-01-08T17:55:13Z"
      changeset="14577549" uid="479256" user="geodreieck4711">
  <nd ref="26212605"/>
  <nd ref="9919314"/>
  <nd ref="2101861553"/>
  <nd ref="10443807"/>
  <tag k="bicycle" v="designated"/>
  <tag k="cycleway" v="segregated"/>
  <tag k="foot" v="designated"/>
  <tag k="highway" v="cycleway"/>
  <tag k="segregated" v="yes"/>
  <tag k="surface" v="asphalt"/>
</way>

The code shows that the nodes (i.e. the points) are referenced in so-called "ways". This kind of reference is, however, not supported by Solr, so our software needs to resolve it. As the OpenStreetMap data for Germany contains a few million nodes, resolving these references is not a trivial problem; it cannot simply be done in main memory. We chose an approach using a key-value database (LevelDB).
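
To illustrate the idea, here is a minimal two-pass sketch in Python using the plyvel LevelDB binding; the file and database names are made up, and error handling is omitted:

import xml.etree.ElementTree as ET
import plyvel

db = plyvel.DB('nodes.ldb', create_if_missing=True)

# Pass 1: store the coordinates of every node under its id.
for _, elem in ET.iterparse('germany.osm', events=('end',)):
    if elem.tag == 'node':
        db.put(elem.get('id').encode(),
               ('%s,%s' % (elem.get('lat'), elem.get('lon'))).encode())
        elem.clear()  # keep memory usage low while streaming

# Pass 2: resolve the node references of every way into coordinates.
for _, elem in ET.iterparse('germany.osm', events=('end',)):
    if elem.tag == 'way':
        coords = [db.get(nd.get('ref').encode())
                  for nd in elem.findall('nd')]
        # ... convert coords to WKT and collect the tags for indexing ...
        elem.clear()

A production version would additionally detach cleared elements from the XML root to keep the memory footprint truly constant.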

Polygons and lines are represented in Solr in the so-called Well-Known Text (WKT) format, so the respective objects have to be converted from the OpenStreetMap format to WKT. We learned in our project that this is not quite so simple: WKT uses the orientation of a polygon to decide which is its inside and which is its outside area. This does not matter in the least for OpenStreetMap, and many polygons there use an "incorrect" orientation. All polygons therefore have to be converted to the correct orientation (anti-clockwise), which can be achieved by calculating the signed area.
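
The orientation test itself is just the classic "shoelace" formula for the signed area; a sketch:

def ensure_counter_clockwise(ring):
    # ring: list of (lon, lat) tuples; the first point equals the last.
    # The shoelace sum is positive for counter-clockwise rings, so a
    # negative value means the ring has to be reversed.
    area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        area += x1 * y2 - x2 * y1
    return ring if area > 0 else ring[::-1]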

The implementation of so-called relations is even more difficult than that of the ways mentioned above. Relations can connect ways with each other and thereby create non-contiguous polygons. Luckily, there is a multi-polygon form in WKT and Solr for exactly this case. Polygons can also contain "holes". This can be mapped in WKT as well but is irrelevant for our use case (e.g. a wood may have a hole where there is a lake, but the hole does not affect any distance calculation from outside).
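
For illustration, this is roughly what the two WKT forms look like (the coordinates are made up): a multi-polygon consisting of two separate parts, and a polygon with a hole:

MULTIPOLYGON (((11.0 49.5, 11.1 49.5, 11.1 49.6, 11.0 49.5)),
              ((11.2 49.5, 11.3 49.5, 11.3 49.6, 11.2 49.5)))
POLYGON ((11.0 49.5, 11.2 49.5, 11.2 49.7, 11.0 49.7, 11.0 49.5),
         (11.05 49.55, 11.15 49.55, 11.15 49.65, 11.05 49.55))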

Filtering

OpenStreetMap contains an unbelievable amount of data that is interesting for all kinds of questions. However, some objects (e.g. the location of fire hydrants or the color of park benches) are irrelevant for qualifying an address or for vicinity search. To keep the memory requirements for indexing minimal, we filtered them out.

The same applies to certain contours that are not immediately relevant, e.g. buildings, agricultural roads, etc. The solution can, however, be configured to include these attributes as well, should the need arise.

The actual implementation of the filter consists of two steps: the objects to be considered are defined via a positive list; some of them may later be identified as irrelevant and are then removed via a negative list. As certainly not all attributes of an object are relevant, the irrelevant ones are removed dynamically, as sketched below.
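
Conceptually, the filter looks roughly like this; the concrete lists are illustrative examples, not our production configuration:

# Positive list of tag keys we index, negative list of key/value pairs
# we exclude again; both lists are examples only.
RELEVANT_KEYS = {'amenity', 'shop', 'leisure', 'landuse', 'highway'}
IRRELEVANT_PAIRS = {('amenity', 'bench'), ('amenity', 'waste_basket')}

def filter_tags(tags):
    # Keep only attributes from the positive list ...
    kept = {k: v for k, v in tags.items() if k in RELEVANT_KEYS}
    if not kept:
        return None  # nothing of interest on this object
    # ... and drop objects that match the negative list.
    if any((k, v) in IRRELEVANT_PAIRS for k, v in kept.items()):
        return None
    return kept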

The result of these steps is data ready for direct indexing by Solr.

Indexing with Apache Solr

The next step is indexing the relevant data. This is implemented as a batch job that can be parallelized, as the CPU is the limiting factor for geospatial data (especially for contours). On a standard PC with 8 GB RAM and a 2.5 GHz quad-core processor, all the data for Germany can be indexed in approximately a day.
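
Conceptually, each worker of the batch job just posts its prepared documents to Solr's JSON update handler. A minimal sketch (host, core name, and the field names beyond id and geo are assumptions based on our schema above):

import json
import requests

SOLR_UPDATE = 'http://localhost:8983/solr/osm/update'

def index_batch(docs):
    # Send one batch of documents; committing is handled separately.
    resp = requests.post(SOLR_UPDATE, data=json.dumps(docs),
                         headers={'Content-Type': 'application/json'})
    resp.raise_for_status()

# Example document: the cycleway from the OSM fragment above, with its
# resolved node coordinates serialized as WKT (coordinates shortened).
index_batch([{'id': 'way/2293021',
              'geo': 'LINESTRING (11.0014 49.5886, 11.0021 49.5890)',
              'highway': 'cycleway'}])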

We added some further optimizations like caching, tuning the commit interval, etc., but describing them would exceed the scope of this article.

Once this step is completed, we have a Solr-indexed version of OpenStreetMap data available to be used for searches.

Figure: Process flow describing the integration of OpenStreetMap data into the index server.

Examples of Search in Geospatial Data

Proximity search is a fairly basic and general kind of search that Solr implements without any problems. As an example, let's take the geospatial coordinates of our Nuremberg office (49.447206,11.102245) and search for all objects within a radius of at most 1 kilometer. This results in a Solr filter query like: {!geofilt sfield=geo pt=49.447206,11.102245 d=1}.
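
Embedded in a complete request against a (hypothetical) core named osm, the query might look like this:

http://localhost:8983/solr/osm/select
    ?q=*:*
    &fq={!geofilt sfield=geo pt=49.447206,11.102245 d=1}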

After only a short while (much less than a second), we get results; however, there are 630 of them, with only the first 10 displayed (this can be configured quite easily). The results are not sorted by distance, as this is currently a quite memory-intensive operation in Solr and not advisable for the large number of documents in the index. It is, however, easy to reduce the distance and rerun the search.

Faceting is one of the most interesting aspects of searching with Solr. It means simulating additional search criteria and showing the number of results each of them would yield. For example, we can facet on the field amenity.
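
The request simply adds the standard Solr facet parameters to the geofilt query from above (host and core name are again placeholders):

http://localhost:8983/solr/osm/select
    ?q=*:*
    &fq={!geofilt sfield=geo pt=49.447206,11.102245 d=1}
    &facet=true&facet.field=amenity&facet.mincount=1&rows=0

The facet section of the response then looks like this: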

<lst name="amenity">
  <int name="recycling">22</int>
  <int name="restaurant">22</int>
  <int name="telephone">18</int>
  <int name="pub">16</int>
  <int name="kindergarten">13</int>
  <int name="post_box">13</int>
  <int name="fast_food">9</int>
  <int name="school">9</int>
  <int name="vending_machine">8</int>
  <int name="place_of_worship">7</int>
  <int name="biergarten">5</int>
  <int name="cafe">5</int>
  <int name="university">4</int>
  <int name="bank">3</int>
  <int name="fuel">3</int>
  <int name="pharmacy">3</int>
  <int name="doctors">2</int>
  <int name="hospital">2</int>
</lst>

You can facet not only on amenity but on any other indexed field, such as landuse. In the future, it will also be possible to facet functionally, e.g. to calculate the sum of areas.

The resulting options for qualifying the data are quite extensive and can be used for exploration as well as for statistics.

Classic databases do not offer faceting like this out of the box, which is another reason why we chose Solr for our solution.

“In the vicinity of/close to”…

As said before, the data of all of Germany can be indexed on standard hardware. It is easy to check what interesting information this yields for one's own place of residence (quite often something new) or for holiday destinations. If you enter the addresses of friends, you can surprise them with your local knowledge. Depending on the accuracy needed, you only need a bit of time and a few GB of hard drive storage.

The information can also be used in other ways. Companies can, for example, collect important addresses in their vicinity (pharmacies, doctors, etc.). Under OpenStreetMap's license, this can even be offered as a service ("We prepared the most important addresses in the vicinity for your employees").

Search for Sights in the Vicinity

Other private use cases, like searching for sights in nature or for a restaurant in the vicinity, can also easily be covered. Here, faceting is once again useful, as it allows results to be qualified by type or cuisine.

Supermarkets (e.g. faceted by chain or brand) can also be listed, as many diligent contributors have added the opening times and URLs of the respective shops as well.

Search for Real Estate

Real estate search has become a highly competitive market since the first dedicated portals came online. Nearly all real estate offers are by now geospatially coded, even if that is not immediately visible on the internet (due to commissions).

OpenStreetMap data can be used to create distinguishing features. The data can be enriched during indexing: it is, for example, possible to calculate the distance to the nearest school, kindergarten, doctor, or supermarket and to add it to the search index. This allows real estate searches that only return results meeting certain criteria, e.g. only offers with a kindergarten and a supermarket within a radius of 1 km. In many cases, a two-phase process makes sense, in which the first phase is used for qualifying ("augmenting") the data, as in the sketch below.
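
A sketch of such an enrichment step, reusing the index from above (host, core name, and the probed radii are examples): since sorting by distance is expensive, it simply probes increasing radii until something is found.

import requests

SOLR_SELECT = 'http://localhost:8983/solr/osm/select'

def nearest_radius_km(lat, lon, query, radii=(0.25, 0.5, 1.0, 2.0)):
    # Return the smallest probed radius around (lat, lon) that contains
    # at least one match for the query, or None if none does.
    for d in radii:
        params = {'q': query, 'rows': 0, 'wt': 'json',
                  'fq': '{!geofilt sfield=geo pt=%f,%f d=%f}' % (lat, lon, d)}
        response = requests.get(SOLR_SELECT, params=params).json()
        if response['response']['numFound'] > 0:
            return d
    return None

# Distance band of the nearest kindergarten around our Nuremberg office:
print(nearest_radius_km(49.447206, 11.102245, 'amenity:kindergarten'))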

Statistics

The search index itself can also be used as a data source. A statistically interesting question for a discount supermarket could be, for example, how many Aldi supermarkets are located within a 1 km perimeter of a Lidl supermarket (a question very specific to Germany and parts of Europe). This can be restricted to federal states or certain cities.
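
Such a statistic reduces to one geofilt count per Lidl location, following the same query pattern as in the enrichment sketch above; the field names and the brand matching via the name field are assumptions:

import requests

SOLR_SELECT = 'http://localhost:8983/solr/osm/select'

def count_nearby(lat, lon, query, d=1.0):
    params = {'q': query, 'rows': 0, 'wt': 'json',
              'fq': '{!geofilt sfield=geo pt=%f,%f d=%f}' % (lat, lon, d)}
    return requests.get(SOLR_SELECT, params=params).json()['response']['numFound']

def parse_point(wkt):
    # Extract (lat, lon) from a stored 'POINT (lon lat)' value; this
    # assumes the points were indexed as WKT points.
    lon, lat = wkt.strip('POINT ()').split()
    return float(lat), float(lon)

lidls = requests.get(SOLR_SELECT, params={
    'q': 'name:Lidl', 'rows': 10000, 'fl': 'id,geo', 'wt': 'json'
}).json()['response']['docs']

with_aldi = sum(1 for doc in lidls
                if count_nearby(*parse_point(doc['geo']), 'name:Aldi') > 0)
print('%d of %d Lidl stores have an Aldi within 1 km' %
      (with_aldi, len(lidls)))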

Many more statistics can be derived, e.g. which cities have the smallest distance from flats to woods or parks, or the average distance from flats to care facilities, hospitals, or doctors for a certain city or set of cities.

A restaurateur looking to open a new location could, for example, check beforehand whether there are already restaurants with a similar cuisine in the area. Local population density and median income (from third-party sources) might also play an important role.

Classification of Third-Party Geospatial Data

The same process can be used to augment third-party data, e.g. data mined from Open Data portals, which have recently become ever more common and popular, and to create statistics from it.

As it can be quite difficult to prepare results in a way that allows interpretation, given the amount and diversity of the data, modern methods of artificial intelligence ("machine learning") can be used. Apache Mahout offers many ready-made algorithms which, via the underlying Hadoop platform, can also be run temporarily on cloud servers. These servers can be powered off once the calculation is done (which keeps costs down).

“Internet of Things”

Currently, everyone is talking about the "Internet of Things", in which many devices that so far lack internet access will get IP addresses. This creates further interesting opportunities.

If the location of a "thing" is known, it can register from this position and offer its services, which can then be found using the respective platform for geospatial data. Conversely, these "things" (if they are, for example, mobile) can also use a search platform to find interesting points in their current vicinity.

Nearly unlimited possibilities will arise over time and lead to many new applications unimaginable right now.

Local News

News will also be overlaid with positional data more and more. This allows the creation of services that search for news concerning a certain region or the vicinity of the current location. Interesting use cases could be traffic information on road blocks or demonstrations, or news concerning concerts and other events and venues.

Mobile Services

Mobile applications can be implemented quite easily. Imagine a user looking for a supermarket: as soon as the device recognizes (via geospatial search) that a supermarket is within a certain perimeter of the current location (and is open), it notifies the user. This type of solution can be implemented for many other facilities (like public toilets) as well.

People in Close Proximity

If people report their own location (e.g. within a closed community), they can be notified if there are other people in their vicinity. This could also be coupled with the vicinity of certain facilities. It would, for example, be possible to notify a golf player that at least two other members of his club are currently near the golf course, so that he can immediately start a game with them.

Building Density of Cities

As many buildings are also modeled in OpenStreetMap, the average distance between buildings within a city can be calculated. In correlation with green spaces, it might be possible to derive indicators for quality of living.

Rendering of Maps

OpenLayers is another powerful open-source component, often used together with OpenStreetMap for rendering and displaying maps. However, it cannot be fed directly from the Solr index but requires a PostgreSQL database, as explained in the OpenStreetMap wiki page on PostgreSQL. The installation can be worthwhile if data has to be verified or if visualization of maps is required. If this is only rarely needed, a link to OpenStreetMap's public website can be implemented with much less effort.

Summary

We showed how our solution implements a geospatial information system on common hardware using open data (OpenStreetMap) and open-source software (Apache Solr).

The system is quite flexible and can be used universally, in interactive mode as well as for mass enrichment of already existing data. Given current developments (GPS integrated into mobile phones, the Internet of Things), geospatial information systems will gain importance in the short and medium term.

With our solution, we are well prepared to meet these challenges.

7 Responses to “Searching OpenStreetMap Geospatial Data with Solr”

  1. juan says:

    Hi
    Is the source code available?

    • Dr. Christian Winkler says:

      Hi Juan,

      the source code consists of a number of different scripts and is unfortunately not available to the public. If you have specific questions, don't hesitate to contact us.

      Regards
      Christian

  2. Antonio says:

    Hi,
    thanks for your guide.
    I read it all, but I don't understand how I can import OSM data into my Solr instance and how I can index the relevant data.

    • Dr. Christian Winkler says:

      Hi Antonio,

      first you have to create an appropriate schema.xml depending on what should be in your index.

      Afterwards you have to transform the OSM data (either XML or ProtocolBuffer) to satisfy your schema and probably perform some of the transformations described in the article. This can then be indexed by Solr.

      It is not an out-of-the-box solution. Depending on your requirements, it might be a lot of effort.

      Regards
      Christian

  3. Antonio says:

    My problem is very simple.
    I must implement a geocoding (reverse geocoding) service using as input data:
    1) OSM data on the names of the streets
    2) data from postgres DB1
    3) data from postgres DB2

    input params:
    - Address (string typed by the user) – required
    - Land on which to filter the data - optional
    - City on which to filter the data - optional
    - Type of optional data (can be a road, but also the train station, bus stop TPL, etc.).

    output params:
    - Number of documents found
    - Paging Parameters (start recforPage docs)
    - Type object (road, bus stop etc.).
    - Coordinates of the centroid on the road / found object (lat long WGS84)
    - Full name of the found object (may not correspond to the input typed by the user)
    - country
    - city
    - postalcode
    - houseaddress..

    My first question is : “How can I use Apache Solr to make queries about these 3 different data sources?”
    then: “How can I transform and index the XML OSM data to satisfy my schema?”

    I’m sorry, but this is the first time I am using Apache Solr and OSM data.
    Thanks
    Antonio

    • Dr. Christian Winkler says:

      Hi Antonio,

      you do not want to perform a geospatial search but a full-text search in the OSM data. Our approach has a completely different aim and is not suitable for your requirement.

      You should try to extract the text from the OSM data and use Solr as a full-text search engine in this domain. Additional parameters like bus stops, amenities, etc. can still be kept and used for faceting.

      If you need additional consulting, feel free to contact us.

      Regards
      Christian

  4. MRM says:

    How can I index user-defined features using Apache Solr? More specifically, I have the following question:
    http://stackoverflow.com/questions/23545678/adding-search-in-django-using-apache-solr