Discussion:
[OSM-dev] Indexing of PBF files
Nick Stallman
2018-10-15 21:32:24 UTC
Permalink
Hi

I've been an on-and-off contributor for roughly nine years and have recently
taken another look at OSM for work, prompted by Google's recent pricing changes.
I've set up a fresh server with a planet extract just fine.

In doing this I noticed that all the tools handling PBF files are horribly
inefficient. Because PBF files are stored in essentially raw chronological
order, extracting any kind of meaningful data from them requires scanning
the entire file (even if you are only interested in a small region) and
performing vast quantities of random reads.

Judging from the PBF wiki page, all the work was done ~8 years ago and
included the foresight to have fields for indexing, but from what I can find
nothing has been done with them since. Adding an index seems like a
logical step that would drastically reduce processing times for many common
operations. Some tools do build their own index or cache, but it needs to
be done separately for each tool and is suboptimal.

Is there any reason for this? I've done a bit of googling but I can't find
any recent discussions about this issue.

I'm a little tempted to find the time to create an indexed format myself if
needed and submit patches to the relevant tools so they can benefit from
it. An initial thought would be to sort the input PBF file by geohash so
that each PBF Blob has its own unique geohash. Precision could be added
when the block size limit is reached, splitting that Blob into 32 new
Blobs. I'm not sure whether it would be beneficial to store the Blobs for
nodes, ways and relations in close proximity or not; some experimentation
is probably needed.
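To make the idea concrete, here is a rough Python sketch of the split-on-overflow bucketing. Everything here (Bucket, MAX_NODES, the base-32 routing) is illustrative, not taken from any existing PBF tool, and the sketch assumes every geohash is longer than any bucket prefix it reaches:

```python
# Rough sketch of the proposed geohash bucketing: nodes are routed into
# buckets keyed by geohash prefix; a full bucket splits into 32 children,
# one per base-32 geohash character of added precision.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # the geohash alphabet
MAX_NODES = 8000  # stand-in for the PBF block size limit

class Bucket:
    def __init__(self, prefix):
        self.prefix = prefix   # geohash prefix shared by everything below
        self.nodes = []        # leaf payload, one future Blob
        self.children = None   # 32 sub-buckets once this bucket splits

    def insert(self, geohash, node):
        if self.children is not None:
            # route by the next geohash character past our prefix
            self.children[geohash[len(self.prefix)]].insert(geohash, node)
            return
        self.nodes.append((geohash, node))
        if len(self.nodes) > MAX_NODES:
            self._split()

    def _split(self):
        # block size limit reached: add one character of precision and
        # redistribute the payload over 32 new child buckets
        self.children = {c: Bucket(self.prefix + c) for c in BASE32}
        for gh, node in self.nodes:
            self.children[gh[len(self.prefix)]].insert(gh, node)
        self.nodes = []

    def leaves(self):
        """Yield the leaf buckets, i.e. the Blobs to be written out."""
        if self.children is None:
            yield self
        else:
            for child in self.children.values():
                yield from child.leaves()
```

A real implementation would of course need to handle degenerate cases (many nodes sharing one exact geohash) and pick the precision cut-offs more carefully.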

So indexdata could simply be the geohash that the Blob contains, and in the
case of ways and relations it could also contain a preload hint listing the
hashes of Blobs that contain referenced IDs, so they can be loaded into RAM
prior to processing the block.
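Concretely, the per-Blob indexdata payload might be no more than the following sketch. The JSON encoding and field names are purely illustrative (my own invention); a real implementation would presumably reuse the existing protobuf machinery:

```python
# Hypothetical indexdata payload for a BlobHeader, as proposed above:
# the Blob's geohash plus preload hints naming the Blobs that hold
# referenced IDs. JSON is used only for clarity of the sketch.
import json

def encode_indexdata(geohash, preload):
    """Serialize the proposed per-Blob index record to bytes."""
    return json.dumps({"geohash": geohash, "preload": preload}).encode("utf-8")

def decode_indexdata(raw):
    """Parse the record back out of a BlobHeader's indexdata bytes."""
    return json.loads(raw.decode("utf-8"))
```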

With this scheme, making a country extract would be trivial: Blobs could
simply be copied as-is, selected by their geohash. A later step could then
filter by polygon or bounding box, if required, over the resulting
significantly smaller file. If the entire planet file was being imported
into PostGIS, it could be done in a single pass, since everything would be
easily locatable.
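As a sketch of that extract step, assuming Blobs carry geohash keys as proposed and that a covering prefix list for the country has been precomputed:

```python
# Sketch of the proposed extract: copy through, unmodified, only the
# Blobs whose geohash falls under one of the region's covering prefixes.
# blobs is assumed to be an iterable of (geohash, raw_bytes) pairs.

def extract(blobs, region_prefixes):
    """Yield Blobs as-is when their geohash starts with a region prefix."""
    for geohash, blob in blobs:
        if any(geohash.startswith(p) for p in region_prefixes):
            yield geohash, blob

# Toy example: a "country" covered by the prefixes u0 and u1.
blobs = [("u0x", b"a"), ("u1y", b"b"), ("u2z", b"c")]
kept = list(extract(blobs, ["u0", "u1"]))  # keeps the first two Blobs
```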

Let me know your thoughts and if I've missed something that would explain
why this isn't already being done.

Thanks
Nick
Frederik Ramm
2018-10-15 21:56:42 UTC
Permalink
Hi,
Post by Nick Stallman
In doing this I noticed that all the tools handling PBF files are
horribly inefficient. With PBF files being in essentially raw
chronological order it means extracting any kind of meaningful data out
of them requires scanning the entire file (even if you are only
interested in a small region) and vast quantities of random reads.
I don't think your analysis is correct. I am not aware of any tool that
processes PBFs and does random reads - they're all streaming, and in the
worst case they read the file in full three times. But no seeking. And
reading "only a region" from a PBF file is kind of a niche use case;
most people get the file that covers the area they need, and load it
into a database, where derived data structures will be built for
whatever the use case is.

The osmium command line tool is relatively good and efficient at cutting
out regions from a planet file if needed. Indexing a planet file would
only make sense if your use case involves repeatedly cutting out small
areas from a planet file.
Post by Nick Stallman
Judging from the PBF wiki page, all the work was done ~8 years ago and
included the foresight to have fields for indexing but from what I can
find nothing has been done about that since. Adding an index seems like
a logical step which would reduce processing times for many common
operations drastically.
As I said, most people take a PBF and load it into a database, and I
don't see how that processing would benefit from an index. What are the
"many common operations" you are thinking of?
Post by Nick Stallman
Some tools do make their own index or cache but
it needs to be done for each tool and is sub optimal.
I'm only aware of Overpass which is essentially a database
implementation of its own, which not only does regional cuts but also
filtering by tags, and would certainly not be able to simply replace its
own database with an indexed PBF.
Post by Nick Stallman
I'm a little tempted to find the time to create an indexed format myself
if needed and submit patches to the relevant tools so they can benefit
from it.
Again, I struggle to understand which operations and tools would
benefit; I don't think the general OSM data user struggles with the
issues an index would solve. I could imagine if you ran a custom extract
server like extract.bbbike.org then having random, regionally indexed
access to a raw data file could be beneficial but that's about the only
case I can think of.
Post by Nick Stallman
With this scheme, if you needed to make a country extract it would be
too easy, Blobs could simply be copied as-is selected by their geohash.
A later step could then filter out by polygon or bounding box if
required over the subsequent significantly smaller file. If the entire
planet file was being imported in to PostGIS then it could be done in a
single pass since everything would be easily locatable.
The planet is imported into PostGIS in a single pass even now, at least
if you use osm2pgsql.

I am running a nightly job that splits the planet into tons of country
and smaller extracts on download.geofabrik.de. It takes a couple of
hours every night. Having an indexed planet file could save a little
time in the process but I'm not sure if it would be worth it. The reason
many people download country extracts from download.geofabrik.de is
probably not that the planet file isn't indexed and therefore extracting
a region is hard - it's that the planet file is huge and they don't want
to download that much. An indexed planet file would not help these users.

Not saying you shouldn't try it but I haven't yet understood the benefits.

Bye
Frederik
--
Frederik Ramm ## eMail ***@remote.org ## N49°00'09" E008°23'33"
William Temperley
2018-10-16 20:18:08 UTC
Permalink
Hi Frederik,

The sequential-read requirement makes the PBF format difficult to use in
data-parallel processing.

When files are split into equal-sized chunks to be processed in parallel,
it is necessary to be able to seek to the beginning of the next block
(blob) to begin processing there.

This is not currently possible with the PBF format, as the file _must_ be
read sequentially to figure out where one blob ends and the next begins.
With an index, or even just a simple delimiter, it would be possible to
figure this out in a parallel-processing scenario.

My workaround was to pre-process and separate the blobs into a delimited
format before processing.

Best,

Will Temperley
Jochen Topf
2018-10-16 20:43:16 UTC
Permalink
Post by William Temperley
Requiring the sequential read makes using the pbf format difficult in data
parallel processing.
When files are split into equal sized chunks to be processed in parallel,
it is necessary to be able to seek to the beginning of the next block
(blob) to begin processing there.
This is not currently possible with the pbf format, as the file _must_ be
read sequentially to figure out where the blob ends / new one begins. With
an index, or even just a simple delimiter it would be possible to figure
this out in a parallel processing scenario.
Osmium can do this just fine. It has one thread reading the data
sequentially, figuring out where the blocks start and end and parceling
out the block decoding work to other threads. Not as simple and probably
not quite as fast as with an index pointing to those blocks, but it does
work.
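Schematically, that pipeline looks something like this sketch, where decode_block is just a placeholder for the real decompress-and-parse step (this is an illustration of the pattern, not Osmium's actual C++ implementation):

```python
# One thread discovers block boundaries sequentially; decoding of each
# block is farmed out to a pool of worker threads, and results come
# back in file order.
from concurrent.futures import ThreadPoolExecutor

def decode_block(raw):
    return raw.upper()  # placeholder for zlib-inflate + protobuf parse

def process(blocks, workers=4):
    """Serial block discovery, parallel decoding, results in file order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit in the order the reader encounters each block ...
        futures = [pool.submit(decode_block, b) for b in blocks]
        # ... and collect the results in that same order
        return [f.result() for f in futures]
```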

Indexes have the drawback that you can't stream-write the data any more;
you have to go back to write the index. Or you write the index at the end,
but then you can't stream-read any more (at least not when you want to use
the index).

Jochen
--
Jochen Topf ***@remote.org https://www.jochentopf.com/ +49-351-31778688
William Temperley
2018-10-16 21:13:09 UTC
Permalink
No, Osmium can't do what I described. The reader-thread / worker-thread
model you describe does not read the data in parallel on multiple machines,
which is what I have been doing, albeit with a preprocessing step to
separate the blocks, as they are not currently directly addressable, or
even separable, without a sequential read.

A delimiter would however solve this problem.
Mateusz Konieczny
2018-10-17 08:43:54 UTC
Permalink
Post by Frederik Ramm
The reason
many people download country extracts from download.geofabrik.de is
probably not that the planet file isn't indexed and therefor extracting
a region is hard - it's that the planet file is huge and they don't want
to download that much. An indexed planet file would not help these users.
As a user of extracts I can confirm that size is the main reason for using
them.

And download size is not the only limitation. When I work with a small area
(one that fits into my computer's RAM), an extract is preferable to
downloading a file that will force my computer to swap during any
processing, even the simplest.
