Nick Stallman
2018-10-15 21:32:24 UTC
Hi
I've been an on-and-off contributor for ~9 years and have recently
taken a look at OSM again for work, prompted by Google's recent pricing changes.
I've set up a fresh server with a planet extract just fine.
In doing this I noticed that all the tools handling PBF files are horribly
inefficient. Because PBF files are stored in essentially raw chronological
order, extracting any kind of meaningful data out of them requires scanning
the entire file (even if you are only interested in a small region) and
vast quantities of random reads.
Judging from the PBF wiki page, the format work was done ~8 years ago and
had the foresight to include fields for indexing, but as far as I can find
nothing has been done with them since. Adding an index seems like a
logical step that would drastically reduce processing times for many common
operations. Some tools do build their own index or cache, but that has to
be redone for each tool and is suboptimal.
Is there any reason for this? I've done a bit of googling but I can't find
any recent discussions about this issue.
I'm a little tempted to find the time to create an indexed format myself if
needed and submit patches to the relevant tools so they can benefit from
it. An initial thought would be to sort the input PBF file by geohash so
that each PBF Blob has its own unique geohash. When the block size limit is
reached, a precision character could be added and that Blob split into 32
new Blobs. I'm not sure whether it would be beneficial to store the Blobs
for nodes, ways and relations in close proximity or not; some
experimentation is probably needed. A rough sketch of the bucketing step
follows below.
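To make the splitting idea concrete, here is a rough Python sketch. The
geohash encoder is just the standard algorithm; the block size limit
(expressed here as a node count, MAX_PER_BLOB) and the bucket layout are
stand-in assumptions of mine, since the real limit would be the encoded
Blob size:

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet

    def geohash(lat, lon, precision):
        """Minimal standard geohash encoder (interleaves lon/lat bits)."""
        lat_lo, lat_hi = -90.0, 90.0
        lon_lo, lon_hi = -180.0, 180.0
        bits = [16, 8, 4, 2, 1]
        bit, ch, even, out = 0, 0, True, []
        while len(out) < precision:
            if even:  # even bit positions refine longitude
                mid = (lon_lo + lon_hi) / 2
                if lon >= mid:
                    ch |= bits[bit]
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:     # odd bit positions refine latitude
                mid = (lat_lo + lat_hi) / 2
                if lat >= mid:
                    ch |= bits[bit]
                    lat_lo = mid
                else:
                    lat_hi = mid
            even = not even
            if bit < 4:
                bit += 1
            else:
                out.append(BASE32[ch])
                bit, ch = 0, 0
        return "".join(out)

    MAX_PER_BLOB = 8000   # assumed stand-in for the real block size limit
    MAX_PRECISION = 8     # assumed cap on geohash depth

    def place(buckets, lat, lon, payload):
        """Drop one node into its cell, splitting full cells 32 ways."""
        full = geohash(lat, lon, MAX_PRECISION)
        # Find the existing cell: the longest prefix of the full hash
        # that is already a bucket (cells start at precision 1).
        for p in range(MAX_PRECISION, 0, -1):
            if full[:p] in buckets:
                cell = full[:p]
                break
        else:
            cell = full[:1]
            buckets[cell] = []
        buckets[cell].append((lat, lon, payload))
        if len(buckets[cell]) > MAX_PER_BLOB and len(cell) < MAX_PRECISION:
            # Over the limit: split into the 32 child cells and
            # redistribute one precision level deeper.
            members = buckets.pop(cell)
            for c in BASE32:
                buckets[cell + c] = []
            for m_lat, m_lon, m_payload in members:
                child = geohash(m_lat, m_lon, len(cell) + 1)
                buckets[child].append((m_lat, m_lon, m_payload))

Each resulting bucket would then be serialized as its own Blob.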
So indexdata could simply be the geohash that the Blob covers, and in the
case of ways and relations it could also carry a preload hint listing the
geohashes of the Blobs that contain referenced IDs, so those can be loaded
into RAM before the block is processed.
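The spec leaves the contents of indexdata undefined, so the encoding is
wide open; as one illustrative possibility (the newline-separated layout
here is purely my assumption):

    def encode_indexdata(own_geohash, preload_hints=()):
        """Cell id first, then any preload hints, newline-separated."""
        return "\n".join([own_geohash, *preload_hints]).encode("ascii")

    def decode_indexdata(raw):
        """Inverse of encode_indexdata; returns (geohash, hints)."""
        own, *hints = raw.decode("ascii").split("\n")
        return own, hints

    # e.g. a ways Blob in cell "r3g" referencing nodes in two other cells:
    # encode_indexdata("r3g", preload_hints=["r3f", "r3u"])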
With this scheme, making a country extract would be trivial: Blobs selected
by their geohash could simply be copied as-is. A later step could then
filter by polygon or bounding box, if required, over the resulting,
significantly smaller file. And if the entire planet file were being
imported into PostGIS, it could be done in a single pass since everything
would be easily locatable.
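The extract step could then be little more than this sketch (assuming
fileformat_pb2 has been generated from the standard fileformat.proto with
protoc, and that indexdata holds the payload sketched above):

    import struct
    import fileformat_pb2  # protoc --python_out=. fileformat.proto

    def extract_by_prefix(src_path, dst_path, prefix):
        """Copy the header plus every Blob in the wanted geohash cell."""
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                raw_len = src.read(4)
                if len(raw_len) < 4:
                    break  # clean end of file
                (header_len,) = struct.unpack(">I", raw_len)
                header_bytes = src.read(header_len)
                header = fileformat_pb2.BlobHeader()
                header.ParseFromString(header_bytes)
                blob_bytes = src.read(header.datasize)
                cell = header.indexdata.decode("ascii").split("\n")[0]
                # A Blob matches if its cell contains the wanted prefix or
                # is contained by it; an empty cell (no index) matches
                # everything, so unindexed Blobs are conservatively kept.
                if (header.type == "OSMHeader"
                        or cell.startswith(prefix)
                        or prefix.startswith(cell)):
                    dst.write(raw_len + header_bytes + blob_bytes)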
Let me know your thoughts and if I've missed something that would explain
why this isn't already being done.
Thanks
Nick