Skip to content

Worldwide aggregator

The worldwide geo-data aggregator is a first-party gispulse source, shipped in 1.9.0 (EPIC #226) and published to PyPI in 2.0.0. Rather than multiplying marketplace packages for every public dataset, GISPulse maintains a curated catalog in-repo (core/data/worldwide_catalog.yml) and a single WorldwideCatalogSource that materialises it at runtime.

Design #226, decision 2 — the worldwide catalog is free: published under gispulse.data_sources in the gispulse distribution itself, so ExtensionHub resolves it as first-party and the community gate passes with no extra code.

Concept

core/data/worldwide_catalog.yml      ← source of truth (curated YAML)


WorldwideCatalogSource               ← a first-party DeclarativeSource

              ▼ (.fetch())
LazyFetcher (4 adapters)             ← one per AccessProtocol


DuckDB / GDAL / WFS                  ← lazy execution + bbox push-down

Each YAML entry becomes a SourceEntryRef carrying the catalog's four filter axes (issue #227):

AxisValue
domainSourceDomain (base, observation, elevation, …)
payloadPayload (vector, raster, pointcloud, tiles, table)
jurisdictionISO 3166-1 / world / eu
access.protocolAccessProtocol (remote-table, ogc-features, stac, http-file, table-file)

Plus a family grouping (in metadata) for the portal gallery.

The five fetchers

These adapter families, one per AccessProtocol, subclass LazyFetcher. The DuckDB engine downloads nothing at instantiation — the scan is lazy, the bbox push-down is built on the SQL side.

ProtocolFetcherIssueNotes
remote-tableGeoParquetS3Fetcher#229 (A3)read_parquet('s3://…', hive_partitioning=true) + bbox push-down on the Overture struct. DuckDB httpfs.
ogc-featuresOGCFeaturesFetcher#230 (A4)OGC API Features / WFS. Lazy ST_Read(...) + COPY via the WFS client to materialise.
stacSTACFetcher#231 (A5)STAC search + download_asset of the COG; raster read on demand.
http-fileHttpFileFetcher#232 (A6)Public vector/raster file behind a GDAL /vsicurl/ virtual path; streamed download.
table-fileTableFileFetchern/aNon-spatial table file. Plain CSV via read_csv_auto; ZIP/XLSX materialised without invented geometry.

bbox push-down

Each fetcher implements _scan_sql(extent), which takes the rule's Extent and injects it into the DuckDB query. Measurable consequence: a département-scoped query on Overture Buildings touches a few MB, not the 2 GB world parquet.

SSRF security

Endpoints are checked structurally at load of the YAML (allow-listed scheme, non-private / non-loopback host), with no DNS resolution — the module import stays network-free. The full DNS-resolving guard (issue #199) runs per fetch inside the LazyFetcher, so an attacker cannot rebind a public hostname at load to a private IP at execution.

Schemes accepted at load time: http, https, s3.

The curated catalog

core/data/worldwide_catalog.yml ships four initial families:

FamilyTypical datasetProtocol
overture-geoparquetOverture Places / Buildingsremote-table
ogc-featuresINSPIRE & public WFSogc-features
stac-imagerySentinel-2, MS Planetarystac
opendata-frFR vectors (data.gouv)http-file

Each entry carries an immutable revision_token (for a versioned release) or null (live service) — A14 (#240) wires the freshness probe into the SourceWatcherRegistry.

Example — querying the aggregator in Python

python
from gispulse.plugins.api import get_source

worldwide = get_source("worldwide-catalog")

# List a family
overture = [
    e for e in worldwide.entries()
    if e.metadata.get("family") == "overture-geoparquet"
]

# Fetch an entry
entry = worldwide.catalog().lookup("overture-buildings")
# entry.access.protocol → AccessProtocol.REMOTE_TABLE
# entry.access.endpoint → "s3://overturemaps-us-west-2/release/..."
# entry.payload         → Payload.VECTOR
# entry.jurisdiction    → "world"

Example — adding a source to the aggregator

Adding an entry to the catalog is one YAML stanza:

yaml
# core/data/worldwide_catalog.yml
entries:
  - id: my-public-dataset
    name: My public dataset
    family: opendata-fr
    domain: environnement
    payload: vector
    jurisdiction: FR
    access:
      protocol: http-file
      endpoint: https://data.gouv.fr/.../layer.gpkg
      format: application/x-sqlite3
    revision_token: null
    metadata:
      provider: My Ministry
      license: Licence Ouverte 2.0

The entry is immediately available to entries() / catalog(). To distribute outside the repo, see the source-catalog content type of data packs — the addition then goes through a DataPackManifest rather than a fork.

CLI / portal

  • gispulse sources list — inventory of ExtensionHub sources, including worldwide-catalog.
  • gispulse sources catalog worldwide-catalog --family ogc-features — family filter (CLI).
  • Portal: "Worldwide catalog" panel with domain / payload / jurisdiction / protocol filters (CLI ↔ portal symmetry).

Code references

Published under AGPL-3.0 license.