 
 OpenStreetMap Data Facts with osm4scala
Introduction
This article is the result of a series of checks and proofs that I needed for different projects. I also used it to better understand the distribution of data in OSM PBF files. The code is public in the angelcervera/osm-facts GitHub repo.
You can browse, read and download the result of every “fact” proof, and you can also fork the project and use it for your own purposes. To access all this data I’m using angelcervera/osm4scala and Spark. At the moment, this library does not support Spark out of the box, so I extracted all OSMData blobs using one of the examples in the library and uploaded them to my home-made Hadoop cluster.
Fact 1: Blocks overlapping.
The OSM PBF format is based on blocks of data. Every block contains data of one type: ways, nodes, relations, etc. To save space, all nodes are stored as deltas from a lat/lon offset. Every block has its own offset.
This first “fact” is related to this way of saving space. If nodes are stored as deltas from a common offset per block, then grouping nearby points in the same block makes sense. This was my first assumption, so this fact checks it.
As a result, I will show that my assumption is wrong and nodes are not grouped by proximity.
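To make the storage scheme concrete, here is a minimal Python sketch of how delta-coded coordinates could be turned back into degrees. It assumes the scheme in the OSM PBF format description, where raw values are delta-coded against the previous node and converted with `1e-9 * (offset + granularity * raw)`. The function name `decode_dense` and the sample values are hypothetical, and this is not the osm4scala implementation.

```python
from itertools import accumulate

NANO = 1e-9  # assumption: coordinates are expressed in nanodegrees


def decode_dense(deltas, offset=0, granularity=100):
    """Turn delta-coded raw values into degrees.

    Each raw value is the difference from the previous one, so a running
    sum restores the absolute raw values before the block offset and
    granularity are applied.
    """
    return [NANO * (offset + granularity * raw) for raw in accumulate(deltas)]


# Three latitudes close to each other: only the first delta is large.
lats = decode_dense([620000000, 1500, -700])
```

The closer consecutive nodes are, the smaller the deltas and the fewer bytes they need, which is why grouping nodes by proximity looked like a reasonable assumption.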
In this first fact, I’m going to:
- Generate a file with blobs bounding boxes.
- Implement a small visualization with Leaflet using the previous file: Check here for Faroe Islands
In the BlobHeader, there is an indexdata field that can contain information about the following blob, such as the bounding box that contains all its geo data. But this information is not standardized and is optional.
The same problem applies to the HeaderBBox, which is optional in HeaderBlock.
So to know the bounding box, it is necessary to iterate over all records in the OsmData object.
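The iteration described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code of the Spark job: `BBox`, `bounding_box` and `overlaps` are hypothetical names, and each block is reduced to a plain sequence of `(lat, lon)` pairs.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class BBox(NamedTuple):
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def bounding_box(points: Iterable[Tuple[float, float]]) -> Optional[BBox]:
    """Scan every (lat, lon) pair of a block and keep the extremes."""
    box = None
    for lat, lon in points:
        if box is None:
            box = BBox(lat, lon, lat, lon)
        else:
            box = BBox(min(box.min_lat, lat), min(box.min_lon, lon),
                       max(box.max_lat, lat), max(box.max_lon, lon))
    return box  # None when the block contains no nodes


def overlaps(a: BBox, b: BBox) -> bool:
    """True when the two boxes share at least one point."""
    return (a.min_lat <= b.max_lat and b.min_lat <= a.max_lat and
            a.min_lon <= b.max_lon and b.min_lon <= a.max_lon)
```

If blocks grouped nodes by proximity, most pairs of boxes would not overlap. The visualization shows the opposite.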
To execute the Spark job that creates the JavaScript array used in the map:
  ./bin/spark-submit \
    --class com.acervera.osmfacts.fact1.Fact1Driver \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 5 \
    --executor-cores 4 \
    --driver-java-options='-Dosm-facts.input=hdfs:///user/angelcervera/osm/blocks/planet -Dosm-facts.local-file-js-bounding=/home/angelcervera/planet/bboxes.js' \
    ~/spark-facts-assembly-0.1-SNAPSHOT.jar
Fact 2: Unique IDs.
Every block contains a single type of entity. I assumed that Ids are unique across entities.
So in this fact I will check that:
- There are no Id duplications between blocks (example: the same node in different blocks)
- The same Id is not used by two different types of entities (example: a way and a node with the same Id)
To execute the Spark job that counts the number of duplicates:
  ./bin/spark-submit \
    --class com.acervera.osmfacts.fact2.Fact2Driver \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 5 \
    --executor-cores 4 \
    ~/spark-facts-assembly-0.1-SNAPSHOT.jar hdfs:///user/angelcervera/osm/blocks/planet
Fact 3: Not all connections between ways are at the ends of the way
To define a network, the best way is to define vertices and edges. In the case of OSM, the equivalence could be edges = ways and vertices = nodes at the ends of the way.
In this fact, we demonstrate that this is not the case in OSM files, because there are nodes used as connections between ways that are not at the ends.
  ./bin/spark-submit \
    --class com.acervera.osmfacts.fact3.Fact3Driver \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 5 \
    --executor-cores 4 \
    ~/spark-facts-assembly-0.1-SNAPSHOT.jar hdfs:///user/angelcervera/osm/blocks/planet /home/angelcervera/planet/connections_not_at_the_ends
In the demo you can see four ways (red, black, green and fuchsia) that all have their shared vertices at the ends, while the blue way contains the intersection in the middle.
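The condition being searched for can be sketched in a few lines of Python. This is an illustration only, assuming each way is an ordered list of node Ids. The name `inner_connections` and the sample ways are hypothetical and only loosely modeled on the demo.

```python
from collections import defaultdict


def inner_connections(ways):
    """ways: dict of way id -> ordered list of node ids.

    Returns the node ids that are shared by at least two ways and sit
    strictly inside (not at either end of) at least one of them.
    """
    ways_by_node = defaultdict(set)
    inner_nodes = set()
    for way_id, nodes in ways.items():
        for node in nodes:
            ways_by_node[node].add(way_id)
        inner_nodes.update(nodes[1:-1])  # drop the first and last node
    return {n for n in inner_nodes if len(ways_by_node[n]) > 1}


ways = {
    "red": [1, 2, 5],    # ends at node 5
    "black": [5, 6],     # starts at node 5
    "blue": [7, 5, 8],   # passes through node 5
}
found = inner_connections(ways)
```

Here node 5 is reported because the blue way crosses it in the middle, while node 2 is not because only the red way uses it.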
Fact 4: Small % of nodes are shared between ways.
Storing nodes as separate entities, and not as part of the way, is a good idea if a high percentage of them are shared between ways (intersections). In this fact, we discover that this share is so small that adding complexity to the format to avoid replicating less than 3% of the data makes no sense.
  ./bin/spark-submit \
    --class com.acervera.osmfacts.fact4.Fact4Driver \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 5 \
    --executor-cores 4 \
    ~/spark-facts-assembly-0.1-SNAPSHOT.jar \
    hdfs:///user/angelcervera/osm/blocks/planet
Faroe Islands metrics
- Size: 1.5 M
- Total entities: 166929
- Error: 0
- Nodes: 153468 => 91.93609259026293% of entities
- Ways: 13303 => 7.969256390441445% of entities
- Relations: 158 => 0.09465101929562868% of entities
- Intersections: 1112 => 0.7245810201475226% of nodes
Spain metrics
- Size: 569.1 M
- Total entities: 86641722
- Error: 0
- Nodes: 78742432 => 90.88281047784346% of entities
- Ways: 7637442 => 8.814970228777309% of entities
- Relations: 261848 => 0.30221929337923364% of entities
- Intersections: 1516302 => 1.9256479149640693% of nodes
Planet metrics
- Size: 33.5 G
- Total entities: 3976885170
- Error: 0
- Nodes: 3596320083 => 90.43057391068699% of entities
- Ways: 375990384 => 9.454393776222611% of entities
- Relations: 4574703 => 0.11503231309039783% of entities
- Intersections: 99191632 => 2.7581424820578184% of nodes
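The intersection percentages above are relative to the number of nodes, not to the number of entities. The following Python sketch shows one way such a figure could be computed. It assumes an intersection is a node referenced by more than one way, which is my reading of the text rather than a confirmed detail of Fact4Driver, and the name `shared_node_percentage` is hypothetical.

```python
from collections import Counter


def shared_node_percentage(ways, total_nodes):
    """ways: iterable of node-id lists; total_nodes: number of nodes defined.

    Returns (intersections, percentage of nodes that are intersections).
    """
    # set(nodes) so a way that revisits a node (a loop) counts only once.
    usage = Counter(node for nodes in ways for node in set(nodes))
    intersections = sum(1 for n in usage.values() if n > 1)
    return intersections, 100.0 * intersections / total_nodes


# Recomputing the published percentages from the published counts.
faroe_pct = 100.0 * 1112 / 153468
spain_pct = 100.0 * 1516302 / 78742432
planet_pct = 100.0 * 99191632 / 3596320083
```

The recomputed values agree with the lists above: roughly 0.72% for the Faroe Islands, 1.93% for Spain and 2.76% for the planet.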
Fact 5: Node Ids integrity
In this fact, to be sure that there are no inconsistencies between node Ids, I extract all node Ids that are part of a way and compare them with all defined nodes.
The result, as expected, is that the data set is fine.
  ./bin/spark-submit \
    --class com.acervera.osmfacts.fact5.Fact5Driver \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 5 \
    --executor-cores 4 \
    ~/spark-facts-assembly-0.1-SNAPSHOT.jar \
    hdfs:///user/angelcervera/osm/blocks/planet \
    hdfs:///user/angelcervera/osm/fact5/nodes
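Conceptually, this check is a set difference: the node Ids referenced by ways minus the node Ids that are defined. A minimal in-memory Python sketch follows, with the hypothetical name `missing_nodes`. The real job works on distributed data and does not hold everything in a single set.

```python
def missing_nodes(defined_node_ids, ways):
    """Return node ids referenced by some way but never defined.

    defined_node_ids: iterable of node ids present in the data set
    ways: iterable of node-id lists, one per way
    """
    referenced = {node for nodes in ways for node in nodes}
    return referenced - set(defined_node_ids)


# A consistent data set yields an empty result.
ok = missing_nodes([1, 2, 3, 4], [[1, 2], [2, 3, 4]])
# Node 9 is used by a way but never defined.
broken = missing_nodes([1, 2, 3], [[1, 2], [3, 9]])
```

An empty result means every node used by a way exists, which is the outcome reported for the planet file.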
Fact 6: Node Ids are declared before being used
When processing the data set sequentially, it is important that a node is defined before it is used in a way. In this fact, I check that nodes are always defined before being used.
In this fact, I’m not going to use Spark because of the sequential nature of this check. I iterate over all the data and use the Roaring Bitmaps data structure to store the defined node Ids in memory.
Even as a sequential process, osm4scala processes the full planet in less than 50 minutes.
As expected, no Ids are used before they are defined. Also, it looks like all node blocks come before way blocks.
  java \
    -Xms4G \
    -Xmx8G \
    -cp scala-facts-assembly-0.1-SNAPSHOT.jar \
    com.acervera.osmfacts.fact6.Fact6Driver \
    /home/angelcc/Downloads/osm/planet-200309.osm.pbf
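The sequential check can be sketched as a single pass that remembers every node Id seen so far. In this Python illustration a plain `set` stands in for the Roaring Bitmap, which serves the same purpose with far less memory at planet scale. The tuple layout of the entities and the name `first_undeclared_use` are hypothetical.

```python
def first_undeclared_use(entities):
    """entities: iterable, in file order, of ("node", node_id) or
    ("way", way_id, [node ids]) tuples.

    Returns (way_id, node_id) for the first reference to a node that has
    not been defined yet, or None when every node is declared before use.
    """
    defined = set()  # stand-in for a Roaring Bitmap of node ids
    for entity in entities:
        kind = entity[0]
        if kind == "node":
            defined.add(entity[1])
        elif kind == "way":
            for ref in entity[2]:
                if ref not in defined:
                    return entity[1], ref
    return None


in_order = [("node", 1), ("node", 2), ("way", 10, [1, 2])]
out_of_order = [("node", 1), ("way", 10, [1, 2]), ("node", 2)]
```

With `in_order` the function returns `None`. With `out_of_order` it reports that way 10 uses node 2 before its definition.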