<p><em>gpxz: Building an open global elevation dataset.</em></p>
<h1 id="api-keys">Managing API keys in django</h1>
<p><em>2024-03-10</em></p>
<ul>
<li>Keys are models with their own properties, not just strings.</li>
<li>Users may have multiple keys.</li>
<li>Deleted keys are retained in the DB.</li>
</ul>
<hr />
<h2 id="users-have-multiple-keys">Users have multiple keys</h2>
<p>At some point, one of your customers is going to want to regenerate a new API key. Maybe the old one was exposed in a breach, or they have a policy of rotating all credentials whenever an employee leaves.</p>
<p>If you only have a single key per user (like <code class="language-plaintext highlighter-rouge">user.api_key</code>) the customer won’t be able to access your API from the time the new key is generated up until they have deployed a new release with those new credentials on their end. Giving customers two keys means they can create a new key, build and deploy a new release while requests with the old key are still served, then delete the old key once they’ve confirmed it’s no longer being used.</p>
<p>Two keys are enough to get you started. But I’d still recommend against hardcoding that number (like <code class="language-plaintext highlighter-rouge">user.api_key_1, user.api_key_2</code>): as your customer-base grows, you’ll eventually get an email from someone who wants more than two keys. Perhaps unique keys for development, staging, and production, plus an extra slot to allow for rotation.</p>
<p>So to begin with, our <code class="language-plaintext highlighter-rouge">APIKey</code> model gets a foreign key <code class="language-plaintext highlighter-rouge">user</code> property (which allows a user to have many keys), an <code class="language-plaintext highlighter-rouge">id</code> separate from the access string (so as not to leak database identifiers), as well as <code class="language-plaintext highlighter-rouge">key_value</code> which will store the actual key used for authenticating with the API.</p>
<p>I recommend using <code class="language-plaintext highlighter-rouge">TimeStampedModel</code> as a base class for all django models: it adds <code class="language-plaintext highlighter-rouge">created</code> and <code class="language-plaintext highlighter-rouge">modified</code> timestamps, which are super helpful for debugging and for sorting objects in a meaningful way.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">models</span>
<span class="kn">from</span> <span class="nn">django.contrib.auth.models</span> <span class="kn">import</span> <span class="n">User</span>
<span class="kn">from</span> <span class="nn">model_utils.models</span> <span class="kn">import</span> <span class="n">TimeStampedModel</span>


<span class="k">class</span> <span class="nc">APIKey</span><span class="p">(</span><span class="n">TimeStampedModel</span><span class="p">):</span>
    <span class="nb">id</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">AutoField</span><span class="p">(</span><span class="n">primary_key</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">User</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="p">.</span><span class="n">PROTECT</span><span class="p">)</span>
    <span class="n">key_value</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">unique</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">class</span> <span class="nc">Meta</span><span class="p">:</span>
        <span class="n">ordering</span> <span class="o">=</span> <span class="p">[</span><span class="s">"created"</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">key_value</span>
</code></pre></div></div>
<h2 id="key-format">Key format</h2>
<p>A GPXZ API key looks something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ak_e6YkbUK9Fovx8aErYBPm04
</code></pre></div></div>
<p>It starts with a fixed <code class="language-plaintext highlighter-rouge">ak_</code> prefix. This helps keys stand out from other (less important) random strings.</p>
<p>The remainder is a string of random alphanumeric characters. 22 characters gives 130 bits of randomness, which is <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions">generally considered sufficient</a> to be unguessable.</p>
<p>Note this arithmetic assumes keys are validated case-sensitively!</p>
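<p>To sanity-check that entropy figure: each character is drawn uniformly from 62 alphanumerics, so each contributes log2(62) ≈ 5.95 bits. A minimal sketch using Python’s stdlib <code class="language-plaintext highlighter-rouge">secrets</code> module (for illustration only; the helper and constant names here are not from the GPXZ codebase):</p>

```python
import math
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits  # 62 characters.


def generate_key(n_chars: int = 22) -> str:
    """Generate an ak_-prefixed key from a cryptographic RNG."""
    body = "".join(secrets.choice(ALPHANUMERIC) for _ in range(n_chars))
    return "ak_" + body


# Each character contributes log2(62) ~= 5.95 bits: 22 chars gives ~131 bits.
entropy_bits = 22 * math.log2(62)
```

<p>62<sup>22</sup> ≈ 2<sup>131</sup>, comfortably past the 128-bit bar usually considered unguessable.</p>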
<p>There are many ways to generate random strings in Python, but for security-critical code it’s best to rely on a vetted library. In django there’s <code class="language-plaintext highlighter-rouge">django.utils.crypto.get_random_string</code> for this.</p>
<p>Putting that all together gives something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">django.utils.crypto</span> <span class="kn">import</span> <span class="n">get_random_string</span>
<span class="k">def</span> <span class="nf">_generate_api_key</span><span class="p">():</span>
    <span class="c1"># A fixed "ak_" prefix plus 22 random alphanumeric characters.</span>
    <span class="k">return</span> <span class="s">"ak_"</span> <span class="o">+</span> <span class="n">get_random_string</span><span class="p">(</span><span class="mi">22</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">APIKey</span><span class="p">(</span><span class="n">TimeStampedModel</span><span class="p">):</span>
    <span class="n">key_value</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">CharField</span><span class="p">(</span>
        <span class="n">max_length</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
        <span class="n">unique</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">default</span><span class="o">=</span><span class="n">_generate_api_key</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="p">...</span>
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">default=_generate_api_key</code> will fail if added in a migration: the default generator is only evaluated once, tripping the uniqueness constraint. Handling this requires some <a href="https://stackoverflow.com/q/29787853">manual migration management</a>.</p>
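<p>The pitfall is easy to demonstrate in miniature, without any django machinery: a value baked into a migration is computed once and written to every row, while a callable default is invoked fresh per object. (This is a plain-Python illustration, not actual migration code.)</p>

```python
import secrets


def _generate_api_key() -> str:
    # Stand-in for the model's default callable.
    return secrets.token_hex(16)


# At runtime django invokes the callable once per new object,
# so every row gets a fresh, unique value.
per_row_values = [_generate_api_key() for _ in range(3)]

# A naive migration evaluates the default once and backfills that single
# value into every existing row, tripping the unique constraint.
baked_in_value = _generate_api_key()
backfilled_values = [baked_in_value for _ in range(3)]
```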
<h3 id="aside-user-ids-in-api-keys">Aside: user IDs in API keys</h3>
<p>A GPXZ API key <em>actually</em> looks something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ak_5uOU_e6YkbUK9Fovx8aErYBPm04
</code></pre></div></div>
<p>The extra <code class="language-plaintext highlighter-rouge">5uOU</code> bit is a fixed random unique ID for each user. Including that in the key gives a few tiny benefits:</p>
<ul>
<li>When authenticating a request, both the user details and the key details can be queried in a single round trip to redis.</li>
<li>When we get a key attached to a Sentry error, figuring out the impacted user is a simple lookup.</li>
</ul>
<p>but also has some drawbacks:</p>
<ul>
<li>Key is longer.</li>
<li>Extra complexity on key generation and authentication, meaning more places for things to go wrong in security-critical code.</li>
<li>You have to be careful not to use any information derived from the user ID until the whole key is authenticated.</li>
</ul>
<p>In hindsight, the benefits are so small that this approach isn’t one I’d recommend.</p>
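<p>For completeness, validating a key in this format means splitting off the user segment for the lookup, while still comparing the <em>whole</em> key in constant time. The <code class="language-plaintext highlighter-rouge">parse_key</code> helper below is hypothetical, sketched just for this post:</p>

```python
import secrets


def parse_key(key: str) -> tuple[str, str]:
    """Split a hypothetical 'ak_{user_id}_{random}' key into its segments."""
    prefix, user_id, random_part = key.split("_", 2)
    if prefix != "ak":
        raise ValueError("not an API key")
    return user_id, random_part


def keys_match(presented_key: str, stored_key: str) -> bool:
    # Compare the whole key in constant time; the user-ID segment is only
    # a routing hint and must never short-circuit authentication.
    return secrets.compare_digest(presented_key, stored_key)
```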
<h2 id="track-deleted-keys">Track deleted keys</h2>
<p>Customers should be able to delete API keys, for example to prevent access in the case of a breach. But while we do want to prevent deleted keys from accessing the application, we don’t want to remove them from the database. Keeping a record of old keys helps with debugging historical issues, and with diagnosing customers who are still using outdated keys.</p>
<p>It also lets us show recently deleted GPXZ API keys, which can help a customer do their own debugging before reaching out to you!</p>
<div>
<br />
<a href="/static/blog/img/deleted-key-dashboard.png"><img src="/static/blog/img/deleted-key-dashboard.png" /></a>
<br />
<br />
</div>
<p>We’ll add a couple new properties to our <code class="language-plaintext highlighter-rouge">APIKey</code> model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timezone</span>


<span class="k">class</span> <span class="nc">APIKey</span><span class="p">(</span><span class="n">TimeStampedModel</span><span class="p">):</span>
    <span class="n">is_active</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">BooleanField</span><span class="p">(</span><span class="n">default</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">deactivated_at</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">DateTimeField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="p">...</span>

    <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="c1"># Record the deactivation time the first time a key is deactivated.</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">is_active</span> <span class="ow">and</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">deactivated_at</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">deactivated_at</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">save</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>
<p>then “deleting” a key looks like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">key</span><span class="p">.</span><span class="n">is_active</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">key</span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
</code></pre></div></div>
<p>and remember to filter to active keys only when querying for authentication, e.g. <code class="language-plaintext highlighter-rouge">APIKey.objects.filter(is_active=True, key_value=key_value)</code>.</p>
<h2 id="api-keys-arent-just-strings">API keys aren’t just strings</h2>
<p>To start with, the main properties on your API key might be <code class="language-plaintext highlighter-rouge">is_active</code> and <code class="language-plaintext highlighter-rouge">created</code>.</p>
<p>But over time, you’re likely to need to attach further functionality to keys. For example:</p>
<ul>
<li>Permission scopes to limit the access of keys, e.g. a readonly key for testing.</li>
<li>IP / hostname restrictions, e.g. a key that can safely be used in browser applications because requests from other origins won’t be allowed.</li>
<li>Rate limits, e.g. so a test key doesn’t cause quota errors in production.</li>
<li>A name for each key to keep track of all of these keys with different settings!</li>
</ul>
<p>Even if you don’t expose this functionality yet, you should convert the input key string to a full <code class="language-plaintext highlighter-rouge">APIKey</code> model as early as possible.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">api_key_str</span> <span class="o">=</span> <span class="n">find_api_key_str</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
<span class="n">api_key</span> <span class="o">=</span> <span class="n">APIKey</span><span class="p">.</span><span class="n">from_key_value</span><span class="p">(</span><span class="n">api_key_str</span><span class="p">)</span>
<span class="n">api_key</span><span class="p">.</span><span class="n">authorise</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
</code></pre></div></div>
<p>This also lets you keep all the key logic in one place (as methods on <code class="language-plaintext highlighter-rouge">APIKey</code>).</p>
<h2 id="work-with-us">Work with us</h2>
<p>When we’re not busy working on GPXZ we help other companies with Python web development. If you need someone to set up your django API, <a href="/hire">get in touch</a>.</p>
<hr />
<h1 id="eudem-download">EUDEM download</h1>
<p><em>2024-01-20</em></p>
<h2 id="tldr">TLDR</h2>
<p>Download here (23GB): <a href="https://files.gpxz.io/eudem_buffered.zip">https://files.gpxz.io/eudem_buffered.zip</a></p>
<h2 id="eu-dem">EU-DEM</h2>
<p>As of January 2024, the 25m <a href="https://www.eea.europa.eu/en/datahub/datahubitem-view/d08852bc-7b5f-4835-a776-08362e2fbf4b">EU-DEM</a> dataset is no longer available to download via the Copernicus Land Monitoring Service.</p>
<blockquote>
<p>EU-DEM is not maintained anymore by the Copernicus Land Monitoring Service. You can still request access to the archived version by contacting the service desk of the Copernicus Land Monitoring Service at copernicus@eea.europa.eu. We recommend users to check the Copernicus DEM product publicly available at 30 m spatial resolution.</p>
</blockquote>
<p>As mentioned on their website, the Copernicus DEM is a higher quality replacement! But if you still need EU-DEM (e.g. to continue with a project you already started) you can download a version of the dataset I archived earlier.</p>
<p>Note that compared to the source data, it has been buffered by a few pixels (to avoid gaps between files) and the files renamed to match the SRTM-style naming required by <a href="https://github.com/ajnisbet/opentopodata">opentopodata</a> using <a href="https://github.com/ajnisbet/opentopodata/blob/13123436cecc656af994f378ca7534b4199c9910/docs/datasets/eudem.md">this script</a>.</p>
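<p>For reference, SRTM-style names encode a tile’s south-west corner, e.g. <code class="language-plaintext highlighter-rouge">N46E006</code>. A sketch of the convention (the actual renaming was done by the script linked above):</p>

```python
import math


def srtm_tile_name(lat: float, lon: float) -> str:
    """Name a 1x1 degree tile after its south-west corner, SRTM-style."""
    lat_sw = math.floor(lat)
    lon_sw = math.floor(lon)
    ns = "N" if lat_sw >= 0 else "S"
    ew = "E" if lon_sw >= 0 else "W"
    # Latitude is zero-padded to 2 digits, longitude to 3.
    return f"{ns}{abs(lat_sw):02d}{ew}{abs(lon_sw):03d}"
```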
<p>Download a zip of the files here (23GB): <a href="https://files.gpxz.io/eudem_buffered.zip">https://files.gpxz.io/eudem_buffered.zip</a></p>
<h2 id="europe-dems-at-gpxz">Europe DEMs at GPXZ</h2>
<p><a href="/">GPXZ</a> is an API for high-resolution elevation data. We no longer use EU-DEM: we’ve found the 30m Copernicus dataset to be of higher quality as a base dataset. Plus, for many <a href="/dataset">regions</a> in Europe GPXZ has high-resolution lidar data.</p>
<p>If you need elevation data or help processing it, get in touch at <a href="mailto:andrew@gpxz.io">andrew@gpxz.io</a>!</p>
<hr />
<h1 id="raster-api">Raster API</h1>
<p><em>2023-12-05</em></p>
<p>The GPXZ API now supports 2D extracts!</p>
<p>Define a bounding box:</p>
<p><br />
<a href="/static/blog/img/lausanne-bbox.png"><img src="/static/blog/img/lausanne-bbox.png" /></a>
<em class="blog-caption">© OpenStreetMap</em>
<br />
<br /></p>
<p>and provide the bounding coordinates to the new <code class="language-plaintext highlighter-rouge">/v1/elevation/hires-raster</code> endpoint, along with your desired resolution:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">shutil</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="nn">requests</span>

<span class="c1"># Build request.
</span><span class="n">query_params</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"bbox_top"</span><span class="p">:</span> <span class="s">"46.531"</span><span class="p">,</span>
<span class="s">"bbox_bottom"</span><span class="p">:</span> <span class="s">"46.518"</span><span class="p">,</span>
<span class="s">"bbox_left"</span><span class="p">:</span> <span class="s">"6.629"</span><span class="p">,</span>
<span class="s">"bbox_right"</span><span class="p">:</span> <span class="s">"6.648"</span><span class="p">,</span>
<span class="s">"res_m"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="c1"># Metres.
</span> <span class="s">"api-key"</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"GPXZ_API_KEY"</span><span class="p">],</span>
<span class="p">}</span>
<span class="c1"># Query data.
</span><span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span>
<span class="s">"https://api.gpxz.io/v1/elevation/hires-raster"</span><span class="p">,</span>
<span class="n">params</span><span class="o">=</span><span class="n">query_params</span><span class="p">,</span>
<span class="n">stream</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
<span class="c1"># Save to file.
</span><span class="n">dest_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"data/lausanne.geotiff"</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">dest_path</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">shutil</span><span class="p">.</span><span class="n">copyfileobj</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">raw</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<p>The resulting raster can be used for 2D analysis, mapping, or offline data caching.</p>
<p><br />
<a href="/static/blog/img/lausanne-elevation.png"><img src="/static/blog/img/lausanne-elevation.png" /></a>
<em class="blog-caption">Geotiff rendered as a heatmap. The axes are in pixels.</em>
<br />
<br /></p>
<p>For more details, head over to the <a href="/docs#elevation-hires-raster">raster API documentation</a>.</p>
<hr />
<p><em>2023-11-28</em></p>
<h2 id="which-open-elevation-dataset-should-i-use">Which open elevation dataset should I use?</h2>
<p>Download all the datasets covering your domain and compare each one against known ground-truth values, using measurements of quality that align with your usecase.</p>
<h2 id="that-sounds-like-a-lot-of-work">That sounds like a lot of work.</h2>
<p>You could hire a <a href="/hire">geospatial consultant</a> to do this for you!</p>
<h2 id="ok-but-which-open-dataset-should-i-use">OK but which open dataset should I use?</h2>
<p>The best global open elevation dataset (in general, for most usecases, at time of writing) is the <a href="https://spacedata.copernicus.eu/collections/copernicus-digital-elevation-model">Copernicus Elevation Model</a>. It’s 30m resolution, and covers all latitudes. It’s of vastly higher quality than other global datasets (such as SRTM, Mapzen, Aster, AW3D30). At GPXZ we use Copernicus as our base data layer for land elevation.</p>
<p>You can download (a slightly outdated version of) the dataset from <a href="https://registry.opendata.aws/copernicus-dem/">AWS</a> and host your own API with <a href="https://github.com/ajnisbet/opentopodata">OpenTopoData</a>.</p>
<h2 id="which-open-elevation-api-should-i-use">Which open elevation API should I use?</h2>
<p>The GPXZ API serves up the best global dataset of open elevation data. We’re a paid service, but the data you get is open so you’re free to store, modify, and use the results for commercial purposes.</p>
<h2 id="which-free-open-elevation-api-should-i-use">Which free open elevation API should I use?</h2>
<p><a href="https://www.opentopodata.org">OpenTopoData</a> is our sister project: we run a free public elevation API with a small daily quota. You should probably use the Mapzen data source: it’s the best-quality source on the public API.</p>
<hr />
<h1 id="raid-monitoring">RAID disk monitoring with postfix and mailgun</h1>
<p><em>2023-11-01</em></p>
<p>The redundancy of RAID buys you time between disk failure and server failure. But a default RAID setup will happily function with a failed disk, until the next disk fails and your data is lost.</p>
<p>Email notifications need to be configured manually so you can intervene (replace a hard drive) after a disk failure.</p>
<p>At GPXZ we use RAID heavily to work with large datasets. This is how we handle monitoring of RAID arrays, using Mailgun to send email alerts.</p>
<p>These instructions use Ubuntu 20.04: different linux distributions may have config files in different locations.</p>
<h2 id="overview">Overview</h2>
<p><a href="https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm">mdadm</a> and <a href="https://linux.die.net/man/8/smartd">smartd</a> can’t send email directly: they instead pass emails to <a href="https://en.wikipedia.org/wiki/Message_transfer_agent">mail relay software</a> running on your server. This software can send emails directly, but ensuring reliable email delivery is non-trivial, and it’s extra important to have reliable delivery for critical alerts. The <a href="https://www.postfix.org/">postfix</a> mail relay software can accept emails from mdadm/smartd and pass them off to an email API service such as <a href="https://www.mailgun.com/">mailgun</a>.</p>
<p>So our setup will look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mdadm/smartd → postfix → mailgun
</code></pre></div></div>
<h2 id="postfix-email-relay">Postfix email relay</h2>
<p>Before getting started, log into the mailgun web UI. Under the SMTP tab of your domain’s settings you’ll see your mailgun SMTP domain (like <code class="language-plaintext highlighter-rouge">smtp.mailgun.org</code>) and your SMTP login (like <code class="language-plaintext highlighter-rouge">postmaster@example.com</code>).</p>
<p>First install postfix</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>postfix
</code></pre></div></div>
<p>There are some options to select during installation:</p>
<ul>
<li>Choose <code class="language-plaintext highlighter-rouge">Satellite System</code> as the mailer type.</li>
<li>Use your server’s <code class="language-plaintext highlighter-rouge">$HOSTNAME</code> as the mail name.</li>
<li>Use your mailgun’s SMTP server as the relay host (e.g., <code class="language-plaintext highlighter-rouge">smtp.mailgun.org</code>).</li>
</ul>
<p>Create the file <code class="language-plaintext highlighter-rouge">/etc/postfix/sasl_passwd</code> to store your mailgun credentials:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/postfix/sasl_passwd
</code></pre></div></div>
<p>with the following contents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{mailgun smtp domain} {mailgun smtp login}:{mailgun smtp password}
</code></pre></div></div>
<p>If you’ve never used SMTP before you may have to reset your SMTP password. This won’t impact your mailgun API key or web login password.</p>
<p>The config file might look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>smtp.mailgun.org postmaster@example.com: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxx-xxxxxxxx
</code></pre></div></div>
<p>Lock down the permissions of the credentials file then load it into postfix.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo chmod </span>600 /etc/postfix/sasl_passwd
<span class="nb">sudo </span>postmap /etc/postfix/sasl_passwd
</code></pre></div></div>
<p>Configure domain mapping</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/postfix/generic
</code></pre></div></div>
<p>with your domain and your mailgun SMTP server.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@example.com no-reply@[smtp.mailgun.org]:587
</code></pre></div></div>
<p>then load <em>that</em> into postfix</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>postmap /etc/postfix/generic
</code></pre></div></div>
<p>Finally, configure postfix by adding these lines to <code class="language-plaintext highlighter-rouge">/etc/postfix/main.cf</code> (or editing the corresponding lines for any settings that already exist). Replace <code class="language-plaintext highlighter-rouge">smtp.mailgun.org</code> with your mailgun SMTP domain.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/postfix/main.cf
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>relayhost = [smtp.mailgun.org]:587
mydestination = localhost.localdomain, localhost
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_sasl_tls_security_options = noanonymous
smtp_sasl_mechanism_filter = AUTH LOGIN
smtp_tls_note_starttls_offer = yes
smtp_generic_maps = hash:/etc/postfix/generic
</code></pre></div></div>
<p>Reload this new config into postfix with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl restart postfix
</code></pre></div></div>
<p>You should be all set up to send mail from your server! To test it’s working, you can send a test email to <code class="language-plaintext highlighter-rouge">recipient@example.com</code> with the <code class="language-plaintext highlighter-rouge">mail</code> command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Test message from postfix"</span> | mail <span class="nt">-s</span> <span class="s2">"Test message"</span> recipient@example.com
</code></pre></div></div>
<p>If you don’t get an email within a few seconds, something’s broken! Check your spam folder, the Mailgun UI, and the logs in <code class="language-plaintext highlighter-rouge">/var/log/mail*</code>.</p>
<h2 id="mdadm">mdadm</h2>
<p>Now that we can send email from our server, the next step is to tell our disk monitoring software to use it. We’ll start with mdadm, which monitors the health of your RAID array.</p>
<p>Edit the mdadm config</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/mdadm/mdadm.conf
</code></pre></div></div>
<p>and add/modify the <code class="language-plaintext highlighter-rouge">MAILADDR</code> setting to the recipient address alerts should be sent to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MAILADDR recipient@example.com
</code></pre></div></div>
<p>then do a quick test with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mdadm <span class="nt">--monitor</span> <span class="nt">--test</span> <span class="nt">--oneshot</span> /dev/md0
</code></pre></div></div>
<p>(where <code class="language-plaintext highlighter-rouge">/dev/md0</code> is a RAID array). You should get an email in your <code class="language-plaintext highlighter-rouge">recipient@example.com</code> inbox.</p>
<p>That’s all you need to do to have mdadm send email notification of any errors found during a check. However, it’s not uncommon for mdadm to be configured incorrectly and not actually be performing checks! So it’s worth confirming that regular checks are scheduled. Unfortunately this depends on your OS, but on Ubuntu 22.04 you can check for a timer entry with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl list-timers mdcheck_start
</code></pre></div></div>
<h2 id="smartd">smartd</h2>
<p>While mdadm will tell you when a disk has failed, there might be advance warning of this in the disk’s SMART statistics. smartd is a service that can monitor these statistics and alert you if any fall out of spec.</p>
<p>You may need to install smartmontools first if the <code class="language-plaintext highlighter-rouge">smartd</code> command isn’t found:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>smartmontools
</code></pre></div></div>
<p>Modify the configuration file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/smartd.conf
</code></pre></div></div>
<p>The file contains lots of commented example configurations, plus potentially an uncommented line beginning with <code class="language-plaintext highlighter-rouge">DEVICESCAN</code>. smartd can only handle a single <code class="language-plaintext highlighter-rouge">DEVICESCAN</code> directive, so comment any existing lines out then add</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICESCAN -o on -H -l error -l selftest -t -M test -m recipient@example.com
</code></pre></div></div>
<p>This directive does the following:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">-o on</code>: Enable monitoring.</li>
<li><code class="language-plaintext highlighter-rouge">-H</code>: Check SMART attributes for pre-failure conditions.</li>
<li><code class="language-plaintext highlighter-rouge">-l error -l selftest</code>: Check for errors as well as failed test results.</li>
<li><code class="language-plaintext highlighter-rouge">-t</code>: Check changes in SMART attributes.</li>
<li><code class="language-plaintext highlighter-rouge">-m recipient@example.com</code>: Send email alerts to this address.</li>
<li><code class="language-plaintext highlighter-rouge">-M test</code>: Send a test email when smartd is started.</li>
</ul>
<p>To test this setup restart the smartd service:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl restart smartd
</code></pre></div></div>
<p>You should get one email for each disk. You can leave the config setting as-is, or remove <code class="language-plaintext highlighter-rouge">-M test</code> from <code class="language-plaintext highlighter-rouge">/etc/smartd.conf</code> to get email alerts only for errors (not for service restarts).</p>The redundancy of RAID buys you time between disk failure and server failure. But a default RAID setup will happily function with a failed disk, until the next disk fails and your data is lost.Upcoming GPXZ dataset update2023-10-05T00:00:00-05:002023-10-05T00:00:00-05:00https://www.gpxz.io/blog/v2023-1<p>The new dataset is nearly here!</p>
<p>Queries to <code class="language-plaintext highlighter-rouge">api.gpxz.io</code> will return data from the new dataset beginning 2023-10-09 23:59 UTC.</p>
<p>Most customers need to do nothing: new queries will soon start to return improved data. Some customers may want to rebuild processed datasets or re-fetch cached data.</p>
<h2 id="api-changes">API changes</h2>
<p>There is only one API change: all requests will now return a <code class="language-plaintext highlighter-rouge">X-DATASET-VERSION</code> header. For now the header will have the value <code class="language-plaintext highlighter-rouge">2023.1</code> and this will be updated for future dataset releases.</p>
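<p>For example, a client that wants to detect dataset changes (to invalidate cached elevations, say) could read this header from its HTTP responses. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">headers</code> is whatever mapping your HTTP client exposes:</p>

```python
def dataset_version(headers):
    """Return the GPXZ dataset version from response headers, if present."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get('x-dataset-version')
```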
<p>The new sources will be added to the <code class="language-plaintext highlighter-rouge">/v1/elevation/sources</code> endpoint. Sources that are no longer used in the v2023.1 dataset will still appear in the endpoint.</p>
<h2 id="dataset-changes">Dataset changes</h2>
<p>This is a big update that adds new hi-res coverage areas and improves data in existing areas of hi-res coverage.</p>
<h3 id="new-hi-res-coverage">New hi-res coverage</h3>
<ul>
<li>Denmark (whole country, at 1.6m)</li>
<li>Finland (75% of the country, at 2m)</li>
<li>France (most of Metropolitan France plus most overseas territories, at 1m)</li>
<li>Germany (Bayern and NRW, at 1m)</li>
<li>Hong Kong (whole territory, at 0.5m)</li>
<li>Mexico (small parts, at 5m)</li>
<li>Norway (most of the country, at 10m)</li>
<li>Wales (whole country, at 1m)</li>
<li>Scotland (Edinburgh at 50cm; parts at 2m)</li>
</ul>
<h3 id="improved-hi-res-coverage">Improved hi-res coverage</h3>
<ul>
<li>New Zealand (more 1m lidar datasets added)</li>
<li>USA
<ul>
<li>The 10m base coverage for the USA has been updated, resolving many noise issues present outside 1m coverage zones.</li>
<li>1m lidar coverage for all of Nebraska, Iowa, Indiana, Mississippi, Tennessee.</li>
<li>90%+ 1m lidar coverage for Arkansas, Alabama, Georgia, and Florida.</li>
<li>Expanded 1m lidar coverage across the country.</li>
</ul>
</li>
<li>Canada (expanded 1m lidar coverage, notably New Brunswick and Vancouver Island).</li>
<li>GEBCO (worldwide) and EMOD (Europe) bathymetry have been updated to their latest versions (2023 and 2022, respectively).</li>
<li>England (1m lidar coverage now covers most of the country)</li>
<li>Spain (5m coverage now covers the whole country)</li>
</ul>
<h3 id="other-coverage-changes">Other coverage changes</h3>
<ul>
<li>The 10m Iceland data source has been removed, and Iceland coverage now comes from the 30m Copernicus dataset. The 10m data had <a href="/blog/iceland-10m-dtm">numerous issues</a>, so while this is technically a reduction in resolution, the new dataset has much better quality in Iceland.</li>
<li>Many more areas of localised noise/issues have been identified and removed from sources.</li>
</ul>
<h3 id="methodology-changes">Methodology changes</h3>
<ul>
<li>A new algorithm for coastline detection and lidar ocean removal fixes quality issues within 30m of the ocean (where 30m Copernicus land data was sometimes used incorrectly in places the lidar data indicates are water).</li>
</ul>
<h2 id="new-coverage">New coverage</h2>
<div class="coverage-widetainer">
<a class="coverage-map-link" href="/static/img/res-v2023.1-global.png">
<div class="coverage-map-container">
<img src="/static/img/res-v2023.1-global.png" alt="Global coverage map." />
<div class="coverage-map-legend-vtainer">
<div class="coverage-map-legend">
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #df16df"></div>
<div class="coverage-legend-label">0.5m → 2m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #ffccff"></div>
<div class="coverage-legend-label">5m → 10m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #feffe5"></div>
<div class="coverage-legend-label">30m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #dfe7ff"></div>
<div class="coverage-legend-label">110m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #eef2ff"></div>
<div class="coverage-legend-label">450m</div>
</div>
</div>
</div>
</div>
</a>
</div>
<div class="coverage-widetainer">
<a class="coverage-map-link" href="/static/img/res-v2023.1-usa.png">
<div class="coverage-map-container">
<img src="/static/img/res-v2023.1-usa.png" alt="USA coverage map." />
<div class="coverage-map-legend-vtainer">
<div class="coverage-map-legend">
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #df16df"></div>
<div class="coverage-legend-label">1m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #ffccff"></div>
<div class="coverage-legend-label">10m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #feffe5"></div>
<div class="coverage-legend-label">30m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #eef2ff"></div>
<div class="coverage-legend-label">450m</div>
</div>
</div>
</div>
</div>
</a>
</div>
<div class="coverage-widetainer coverage-grid">
<a class="coverage-map-link" href="/static/img/res-v2023.1-eu.png">
<div class="coverage-map-container">
<img src="/static/img/res-v2023.1-eu.png" alt="Europe coverage map." />
<div class="coverage-map-legend-vtainer">
<div class="coverage-map-legend">
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #df16df"></div>
<div class="coverage-legend-label">1m → 2m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #ffccff"></div>
<div class="coverage-legend-label">5m → 10m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #feffe5"></div>
<div class="coverage-legend-label">30m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #dfe7ff"></div>
<div class="coverage-legend-label">110m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #eef2ff"></div>
<div class="coverage-legend-label">450m</div>
</div>
</div>
</div>
</div>
</a>
<a class="coverage-map-link" href="/static/img/res-v2023.1-aunz.png">
<div class="coverage-map-container">
<img src="/static/img/res-v2023.1-aunz.png" alt="Australia and New Zealand coverage map." />
<div class="coverage-map-legend-vtainer">
<div class="coverage-map-legend">
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #df16df"></div>
<div class="coverage-legend-label">1m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color" style="background-color: #ffccff"></div>
<div class="coverage-legend-label">5m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #feffe5"></div>
<div class="coverage-legend-label">30m</div>
</div>
<div class="coverage-legend-entry">
<div class="coverage-legend-color coverage-legend-color-border" style="background-color: #eef2ff"></div>
<div class="coverage-legend-label">450m</div>
</div>
</div>
</div>
</div>
</a>
</div>
<p>Basemaps thanks to <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a>.</p>The new dataset is nearly here!Migrating from Open Topo Data2023-07-06T00:00:00-05:002023-07-06T00:00:00-05:00https://www.gpxz.io/blog/open-topo-data-compatibility<p>GPXZ has a new endpoint that’s compatible with the <a href="https://www.opentopodata.org/">Open Topo Data</a> elevation API.</p>
<p>To migrate from Open Topo Data to GPXZ, you can replace your Open Topo Data url</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://api.opentopodata.org/v1/<dataset_name>
</code></pre></div></div>
<p>with this GPXZ endpoint:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://api.gpxz.io/v1/elevation/otd-compat
</code></pre></div></div>
<p>Most features are supported; check out the <a href="/docs#elevation-otd">documentation</a> for more details!</p>GPXZ has a new endpoint that’s compatible with the Open Topo Data elevation API.Python task queue latency2023-06-16T00:00:00-05:002023-06-16T00:00:00-05:00https://www.gpxz.io/blog/queue-latency<p>Python’s task queue libraries differ greatly in how long it takes to get a result back.</p>
<p>Most of this variation in latency comes from two design decisions:</p>
<ul>
<li>Whether to use polling (slow) or signals/blocking (fast) to communicate with workers.</li>
<li>How much acknowledgements, heartbeats, backoff and retrying to do. These features add reliability, but also add latency due to additional round trips.</li>
</ul>
<p>For GPXZ I sometimes need to wait for task results in the context of an HTTP API request. In this case, latency is very important as there’s a user waiting for the result on the other end! Similarly, I don’t care about graceful failover, preventing double-calculation, or retries: if the task times out or fails I’ll just fail the API request. Any recovery will be too late.</p>
<p>So I wanted to know which of the Python task queue libraries would be best for this use case.</p>
<h2 id="source-code">Source code</h2>
<p>The source code for this benchmark is here: <a href="https://github.com/gpxz/queue-latency-benchmark">gpxz/queue-latency-benchmark</a>.</p>
<h2 id="results">Results</h2>
<ul>
<li>If every nanosecond counts, consider using redis blocking operations directly. In theory, you can get latency down to one RTT along your <code class="language-plaintext highlighter-rouge">client -> backend -> worker</code> path.</li>
<li>Otherwise, dramatiq+rabbitmq isn’t much slower than the theoretical limit.</li>
<li>Celery+redis is a good option if you’re already using redis.</li>
</ul>
<h2 id="candidates">Candidates</h2>
<p>I tried the most popular libraries:</p>
<ul>
<li>rq <span class="blog-queue-v">v1.15.0</span></li>
<li>huey <span class="blog-queue-v">v2.4.5</span></li>
<li>dramatiq <span class="blog-queue-v">v1.14.2</span> with both redis <span class="blog-queue-v">v7.0.11</span> and rabbitmq <span class="blog-queue-v">v3.12.0</span> backends</li>
<li>celery <span class="blog-queue-v">v5.3.0</span> with both redis <span class="blog-queue-v">v7.0.11</span> and rabbitmq <span class="blog-queue-v">v3.12.0</span> backends</li>
</ul>
<p>rq and huey only support redis <span class="blog-queue-v">v7.0.11</span> for job queues. I ran dramatiq and celery twice each: once with redis, and once with rabbitmq <span class="blog-queue-v">v3.12.0</span>.</p>
<p>All candidates used redis as their result backend. Celery is supposed to be able to store results in rabbitmq also, but I couldn’t get that to work.</p>
<h2 id="reckless-queue">Reckless queue</h2>
<p>In addition to the four mature python packages above, I wrote a quick redis task queue optimised for maximum speed. This isn’t too hard when you don’t care about persistence, acknowledgements, features, stability, or correctness.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import uuid

import redis

redis_client = redis.Redis()


class Result:
    def __init__(self, job_name):
        self.job_name = job_name
        self.result_key = f'job_result:{self.job_name}'

    def block_for_return_value(self, timeout=0):
        # brpop blocks until the worker pushes a value (timeout=0 blocks forever):
        # no polling needed.
        return redis_client.brpop(self.result_key, timeout=timeout)[1]

    def save_return_value(self, return_value):
        redis_client.rpush(self.result_key, json.dumps(return_value))


def enqueue(queue_name='default'):
    # Prepare job.
    job_name = str(uuid.uuid4())
    job = {'name': job_name}

    # Push job to queue.
    redis_client.rpush(f'queue:{queue_name}', json.dumps(job))
    return Result(job_name)


def work(queue_name='default'):
    while True:
        _, encoded_job = redis_client.brpop(f'queue:{queue_name}')
        job = json.loads(encoded_job)

        # Our worker only does one thing.
        return_value = tracked_sleep_task()
        result = Result(job['name'])
        result.save_return_value(return_value)


if __name__ == "__main__":
    work()
</code></pre></div></div>
<h2 id="setup">Setup</h2>
<p>The exact syntax was different for each queue library, but the general approach was:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>
<span class="k">def</span> <span class="nf">tracked_sleep_task</span><span class="p">():</span>
<span class="n">result</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">result</span><span class="p">[</span><span class="s">'started_at'</span><span class="p">]</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">result</span><span class="p">[</span><span class="s">'ended_at'</span><span class="p">]</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="k">return</span> <span class="n">result</span>
<span class="n">enqueued_at</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">job</span> <span class="o">=</span> <span class="n">queue</span><span class="p">.</span><span class="n">submit</span><span class="p">(</span><span class="n">tracked_sleep_task</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">job</span><span class="p">.</span><span class="n">block_for_result</span><span class="p">()</span>
<span class="n">returned_at</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
</code></pre></div></div>
<p>This gives us two different latency values:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">latency_enqueue = started_at - enqueued_at</code>: The time between the client submitting the job and the worker starting it.</li>
<li><code class="language-plaintext highlighter-rouge">latency_result = returned_at - ended_at</code>: The time between the worker finishing the job and the client getting the result.</li>
</ul>
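<p>Concretely, with some made-up timestamps, the two metrics separate queue overhead from the 1 second the task actually spends working:</p>

```python
# Hypothetical timestamps (in seconds) from a single task round trip.
enqueued_at = 100.000   # client submits the job
started_at = 100.012    # worker picks the job up
ended_at = 101.012      # worker finishes after 1s of work
returned_at = 101.015   # client receives the result

latency_enqueue = started_at - enqueued_at  # 12ms waiting in the queue
latency_result = returned_at - ended_at     # 3ms returning the result

# Total queue overhead, excluding the time spent actually working.
overhead = latency_enqueue + latency_result
```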
<p>I repeated the task loop 100 times for each library. The first result was discarded (to avoid high latencies due to connection establishment and cache population).</p>
<h2 id="enqueue-latency">Enqueue latency</h2>
<p>First the latency between submitting a job and the worker running it.</p>
<div>
<br />
<a href="/static/blog/img/q_enqueue.png"><img src="/static/blog/img/q_enqueue.png" /></a>
<em class="blog-caption">Latency to enqueue a job.</em>
<br />
<br />
</div>
<p>The redis broker in dramatiq has a number of timeouts and sleeps when the queue is idle, and these cause huge latency. These timeouts can’t be easily set with the CLI or the high-level API.</p>
<p>You could subclass the broker or hack out the sleeps if sufficiently motivated, but even then, <code class="language-plaintext highlighter-rouge">_RedisConsumer</code> uses polling to check for new jobs, so there will always be significantly more latency than with blocking approaches.</p>
<p>Let’s drop dramatiq+redis and have another look at the results:</p>
<div>
<br />
<a href="/static/blog/img/q_enqueue_fast.png"><img src="/static/blog/img/q_enqueue_fast.png" /></a>
<em class="blog-caption">Latency to enqueue a job (fast results only).</em>
<br />
<br />
</div>
<p>Huey does well but still can’t beat using the raw redis <code class="language-plaintext highlighter-rouge">rpush</code> and <code class="language-plaintext highlighter-rouge">brpop</code> commands.</p>
<h2 id="result-latency">Result latency</h2>
<p>This is the time taken to return the result from the worker.</p>
<div>
<br />
<a href="/static/blog/img/q_result.png"><img src="/static/blog/img/q_result.png" /></a>
<em class="blog-caption">Latency to return a job result.</em>
<br />
<br />
</div>
<p>Huey struggles here: it does polling on a hardcoded 100ms loop. You can manually poll for results (as I did for rq), which would make the latency similar to rq’s result.</p>
<p>rq doesn’t have an API for blocking results so you have to do your own polling. The result shown above is with 10ms polling: it’s a performance improvement on huey because you can choose your own interval. But to get down to the ~1ms median return time of the other libraries you’ll be thrashing your redis instance.</p>
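<p>The manual polling loop is simple enough to sketch. Here <code class="language-plaintext highlighter-rouge">get_result</code> is a hypothetical stand-in for however your queue exposes job results; the trade-off lives in the <code class="language-plaintext highlighter-rouge">interval</code> parameter, which bounds both your added latency and your query rate against redis:</p>

```python
import time

def poll_for_result(get_result, interval=0.01, timeout=5.0):
    """Call get_result() every `interval` seconds until it returns a value."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = get_result()
        if result is not None:
            return result
        time.sleep(interval)  # each sleep adds up to `interval` of extra latency
    raise TimeoutError('no result before timeout')

# Simulate a result that becomes available after ~30ms.
start = time.monotonic()
value = poll_for_result(lambda: 'done' if time.monotonic() - start > 0.03 else None)
```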
<p>I made a <a href="https://github.com/rq/rq/pull/1939">pull request</a> to remove the need for polling with rq: the results should be much better once that is merged.</p>
<p>Here are the results without rq and huey:</p>
<div>
<br />
<a href="/static/blog/img/q_result_fast.png"><img src="/static/blog/img/q_result_fast.png" /></a>
<em class="blog-caption">Latency to return a job result.</em>
<br />
<br />
</div>
<p>There’s no difference here between the <code class="language-plaintext highlighter-rouge">_rmq</code> and <code class="language-plaintext highlighter-rouge">_redis</code> functions, as all approaches use redis for result serialization, even if the job backend is rabbitmq.</p>
<p>Dramatiq is basically as fast as possible, which is awesome because you get a lot more from dramatiq than from my hacky script!</p>
<h2 id="total-latency">Total latency</h2>
<p>Overall, huey and dramatiq+redis aren’t the best choices for low-latency task queues. That’s a shame, as they’re the simplest to configure (no reliance on celery or rabbitmq)!</p>
<div>
<br />
<a href="/static/blog/img/q_total.png"><img src="/static/blog/img/q_total.png" /></a>
<em class="blog-caption">Total latency (not including time spent working)</em>
<br />
<br />
</div>
<p>Without those libraries there are some decent options:</p>
<div>
<br />
<a href="/static/blog/img/q_total_fast.png"><img src="/static/blog/img/q_total_fast.png" /></a>
<em class="blog-caption">Total latency (not including time spent working)</em>
<br />
<br />
</div>
<p>rq offers good performance for such a simple deployment experience, and should be a bit faster still once blocking results are released.</p>
<p>Dramatiq+rabbitmq was the fastest off-the-shelf task queue tested. Dramatiq is much easier to work with than celery, but rabbitmq is a pain to set up and run.</p>
<p>If speed is important above all else, consider a DIY approach with redis!</p>Python’s task queue libraries differ greatly in how long it takes to get a result back.How to configure ruff2023-05-31T00:00:00-05:002023-05-31T00:00:00-05:00https://www.gpxz.io/blog/ruff<p>Given the <a href="https://beta.ruff.rs/docs/rules/">smörgåsbord</a> of rules and plugins that <a href="https://github.com/charliermarsh/ruff">ruff</a> supports, it’s hard to figure out which settings are good for your average python project.</p>
<h2 id="configuration">Configuration</h2>
<p>First, configure ruff with a <code class="language-plaintext highlighter-rouge">pyproject.toml</code> file at the root of your project.</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[tool.ruff]</span>
<span class="py">line-length</span> <span class="p">=</span> <span class="mi">100</span>
<span class="py">target-version</span> <span class="p">=</span> <span class="s">"py311"</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">pyproject.toml</code> is the future of python tool configuration: it’s easy to see that your line length and python version are set consistently across your linter, auto-formatter, and import sorter.</p>
<p>All commandline options go into the config file. Then call ruff like this (perhaps in a Makefile or a github actions workflow) with the folders that need checking:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ruff check ./src ./tests
</code></pre></div></div>
<h2 id="how-rules-work">How rules work</h2>
<p>Ruff (and python linting in general) is based around rules. Each rule is known by a code: for example the <code class="language-plaintext highlighter-rouge">F841</code> rule checks for a variable that has a result saved to it but is then never used.</p>
<p>Many of these codes are standardised across different linters, so you can typically google a lint code to find a <a href="https://www.flake8rules.com/rules/F841.html">description of the rule</a>. You can also search the <a href="https://beta.ruff.rs/docs/rules/">ruff rules page</a>.</p>
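<p>For example, this contrived function would trigger F841: <code class="language-plaintext highlighter-rouge">total</code> is computed but never used, which often signals a bug (perhaps the author meant to return it):</p>

```python
def item_count(items):
    total = sum(items)  # F841: local variable `total` is assigned but never used
    return len(items)
```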
<p>The letter(s) at the start of the rule code are used to group similar rules together. For example, all rules starting with <code class="language-plaintext highlighter-rouge">DJ</code> relate to best practices when working with the django web framework. In many parts of ruff, you can use either rule codes like <code class="language-plaintext highlighter-rouge">DJ01</code> to refer to a single rule, or the prefix like <code class="language-plaintext highlighter-rouge">DJ</code> to apply to all rules in the <code class="language-plaintext highlighter-rouge">DJ</code> group.</p>
<h2 id="enabled-rules">Enabled rules</h2>
<p>Rules are enabled with the <code class="language-plaintext highlighter-rouge">select</code> option. It’s best to start with just a few rule groups enabled: run ruff and fix any issues before adding any more rules.</p>
<p>Rule groups <code class="language-plaintext highlighter-rouge">E</code> and <code class="language-plaintext highlighter-rouge">F</code> are enabled by default, and cover the most bang-for-your-buck python issues. Add a comment for every rule you put in the config file, as the letter system is easy to forget long-term.</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">select</span> <span class="p">=</span> <span class="p">[</span>
<span class="s">"E"</span><span class="p">,</span> <span class="c"># pycodestyle</span>
<span class="s">"F"</span><span class="p">,</span> <span class="c"># pyflakes</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Once any issues are fixed, you can add more groups. The list of <a href="https://beta.ruff.rs/docs/rules/">supported groups</a> is long and increasing, but many are only for specific frameworks or so picky that they’re unlikely to materially improve your code. It’s fine to choose a few that look helpful. This is a good start:</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">select</span> <span class="p">=</span> <span class="p">[</span>
<span class="s">"A"</span><span class="p">,</span> <span class="c"># prevent using keywords that clobber python builtins</span>
<span class="s">"B"</span><span class="p">,</span> <span class="c"># bugbear: security warnings</span>
<span class="s">"E"</span><span class="p">,</span> <span class="c"># pycodestyle</span>
<span class="s">"F"</span><span class="p">,</span> <span class="c"># pyflakes</span>
<span class="s">"ISC"</span><span class="p">,</span> <span class="c"># implicit string concatenation</span>
<span class="s">"UP"</span><span class="p">,</span> <span class="c"># alert you when better syntax is available in your python version</span>
<span class="s">"RUF"</span><span class="p">,</span> <span class="c"># the ruff developer's own rules</span>
<span class="p">]</span>
</code></pre></div></div>
<h2 id="ignored-rules">Ignored rules</h2>
<p>Ruff isn’t perfect: linters can throw errors for perfectly fine code, or may include rules that you disagree with.</p>
<p>Your linter works for you! The ultimate goal is to improve your code, not to make a linter happy. So ruff provides a few different ways to suppress errors.</p>
<p>You can suppress a lint error using a comment with the format <code class="language-plaintext highlighter-rouge"># noqa: <rule_code></code>. You should accompany any <code class="language-plaintext highlighter-rouge">noqa</code> override with an explanation for why the rule ought to be ignored.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Configure logging before importing the rest of the app, so import errors
# and logs are correctly handled. Suppress the import order lint rule.
</span><span class="kn">from</span> <span class="nn">myapp</span> <span class="kn">import</span> <span class="n">config</span>
<span class="n">logging</span><span class="p">.</span><span class="n">dictConfig</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">myapp</span> <span class="kn">import</span> <span class="n">backend</span><span class="p">,</span> <span class="n">models</span> <span class="c1"># noqa: E402
</span></code></pre></div></div>
<p>Ruff lint rules can also be disabled project-wide. Disable rules that you disagree with, rules that conflict with a third-party API you have no control over, or rules that would be too time-consuming to fix right now (you can always come back to them later!)</p>
<p>For example, if you use <a href="https://github.com/psf/black">black</a> or a similar code formatter, you may want to skip any format-related rules (and just trust your formatter). Rules can be disabled using the <code class="language-plaintext highlighter-rouge">ignore</code> key in <code class="language-plaintext highlighter-rouge">pyproject.toml</code>. I start most projects with:</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">ignore</span> <span class="p">=</span> <span class="p">[</span>
<span class="s">"E712"</span><span class="p">,</span> <span class="c"># Allow using if x == False, as it's not always equivalent to if x.</span>
<span class="s">"E501"</span><span class="p">,</span> <span class="c"># Suppress line-too-long warnings: trust black's judgement on this one.</span>
<span class="s">"UP017"</span><span class="p">,</span> <span class="c"># Allow timezone.utc instead of datetime.UTC.</span>
<span class="p">]</span>
</code></pre></div></div>
<p>and for django projects add</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">ignore</span> <span class="p">=</span> <span class="p">[</span>
<span class="s">"E712"</span><span class="p">,</span> <span class="c"># Allow using if x == False, as it's not always equivalent to if x.</span>
<span class="s">"E501"</span><span class="p">,</span> <span class="c"># Suppress line-too-long warnings: trust black's judgement on this one.</span>
<span class="s">"A003"</span><span class="p">,</span> <span class="c"># Allow shadowing class attributes: django uses id.</span>
<span class="s">"B904"</span><span class="p">,</span> <span class="c"># Allow unchained exceptions: it's fine to raise 404 in django.</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Before disabling a rule project-wide, it’s helpful to run the linter as normal and read through the warnings, in case the project-level <code class="language-plaintext highlighter-rouge">ignore</code> is suppressing a useful warning.</p>
<h2 id="fixing">Fixing</h2>
<p>Unlike most linters, ruff can automatically fix your code! However, unlike code formatters, ruff performs non-cosmetic changes which may subtly change how your code functions.</p>
<p>If you have complete faith in your test suite, great! Don’t just add <code class="language-plaintext highlighter-rouge">--fix</code> to your ruff command though: that could break some code and its corresponding test in the same way, hiding the failing code behind a passing test. Instead, adopt ruff fixing like this:</p>
<ul>
<li>Run ruff with fixing <em>just</em> on your code, not on your tests: <code class="language-plaintext highlighter-rouge">ruff check --fix ./src</code>.</li>
<li>Check the un-ruffed tests still pass.</li>
<li>Now ruff all your stuff: <code class="language-plaintext highlighter-rouge">ruff check --fix ./src ./tests</code></li>
<li>Check the tests still pass.</li>
<li>Use ruff fixing going forward.</li>
</ul>
<p>If you only “mostly but not entirely” trust your test suite, a more realistic approach is to first enable ruff’s fixing (by adding <code class="language-plaintext highlighter-rouge">--fix</code> to your <code class="language-plaintext highlighter-rouge">ruff</code> command) while starting with no fixable rules in <code class="language-plaintext highlighter-rouge">pyproject.toml</code>:</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">fixable</span> <span class="p">=</span> <span class="p">[]</span>
</code></pre></div></div>
<p>No fixing will be done yet with this config. But going forward, when you get a ruff warning that you know can be unambiguously fixed on your codebase, add it to the fixable list. I often end up with these rules:</p>
<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">fixable</span> <span class="p">=</span> <span class="p">[</span>
<span class="s">"F401"</span><span class="p">,</span> <span class="c"># Remove unused imports.</span>
<span class="s">"NPY001"</span><span class="p">,</span> <span class="c"># Fix numpy types, which are removed in 1.24.</span>
<span class="s">"RUF100"</span><span class="p">,</span> <span class="c"># Remove unused noqa comments.</span>
<span class="p">]</span>
</code></pre></div></div>
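<p>For reference, here’s how the two lists fit together in one <code class="language-plaintext highlighter-rouge">pyproject.toml</code> section (a sketch assuming ruff’s 0.0.x config layout, where <code class="language-plaintext highlighter-rouge">ignore</code> and <code class="language-plaintext highlighter-rouge">fixable</code> both sit under <code class="language-plaintext highlighter-rouge">[tool.ruff]</code>; the rule selections are just the examples from above):</p>

```toml
[tool.ruff]
# Warn about everything except rules we've explicitly opted out of.
ignore = [
    "E712",  # Allow using if x == False, as it's not always equivalent to if x.
    "E501",  # Trust the formatter on line length.
]
# Only auto-fix rules we've explicitly opted in to.
fixable = [
    "F401",    # Remove unused imports.
    "RUF100",  # Remove unused noqa comments.
]
```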
<h2 id="formatting-fixing-vs-testing-checking">Formatting (fixing) vs testing (checking)</h2>
<p>So how does this fit into other parts of your python tooling?</p>
<p>With fixing enabled, ruff is playing two roles here:</p>
<ul>
<li>As a formatter we want ruff to modify (fix) our code, but don’t care if the code will pass testing.</li>
<li>As a linter we want ruff to test our code, but any issues should be raised as errors and not silently fixed (otherwise running the linter on bad code in CI might pass, as ruff might fix some warnings locally that aren’t in the repo!)</li>
</ul>
<p>Using <code class="language-plaintext highlighter-rouge">pyproject.toml</code> makes it easy to have ruff do these dual duties with a consistent configuration. I like to have a <code class="language-plaintext highlighter-rouge">make fmt</code> command for code formatting and a separate <code class="language-plaintext highlighter-rouge">make test</code> command in a <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fmt:
	isort ./src ./tests
	ruff check --fix-only ./src ./tests
	black ./src ./tests

test:
	ruff check ./src ./tests
	black --check ./src ./tests
	pytest
</code></pre></div></div>
<p>In CI you can just run <code class="language-plaintext highlighter-rouge">make test</code>.</p>
<p>If you’re not using a <code class="language-plaintext highlighter-rouge">Makefile</code> you can spell the commands out. For example, your GitHub <code class="language-plaintext highlighter-rouge">ci.yaml</code> might look like:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">build</span>
<span class="na">on</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">push</span><span class="pi">]</span>
<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">test</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v1</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Setup</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">pip install -r requirements.txt</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Lint</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">ruff check ./src ./tests</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Check formatting</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">black --check ./src ./tests</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Test</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">pytest</span>
</code></pre></div></div>
<p>In all of the above examples I have <code class="language-plaintext highlighter-rouge">ruff</code> as the first test command: you want your fastest tests first, so you get early feedback on any issues and save CI minutes when a build fails. Some kinds of issues (like python syntax errors) will be detected by basically any tool in your test chain. I typically go in the following order:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">ruff</code></li>
<li><code class="language-plaintext highlighter-rouge">black</code></li>
<li><code class="language-plaintext highlighter-rouge">isort</code></li>
<li><code class="language-plaintext highlighter-rouge">mypy</code></li>
<li><code class="language-plaintext highlighter-rouge">pytest</code> or <code class="language-plaintext highlighter-rouge">manage.py test</code></li>
</ul>
<h2 id="versioning">Versioning</h2>
<p>You should pin the versions for all your dependencies (using a tool like <a href="https://github.com/jazzband/pip-tools">pip-compile</a>).</p>
<p>But you should especially pin the version of ruff. New rules are constantly being added: without version pinning, your tests will just start failing one day, and you’ll find yourself fixing lint errors on previously-fine code while trying to rush out an unrelated quick fix.</p>
<p>Ruff doesn’t have any dependencies, so just adding</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ruff==0.0.270
</code></pre></div></div>
<p>to your <code class="language-plaintext highlighter-rouge">requirements.txt</code> file will lock things down without contributing to dependency hell.</p>
<p>Every few months, replace <code class="language-plaintext highlighter-rouge">0.0.270</code> with the <a href="https://pypi.org/project/ruff/">latest version on PyPI</a>, then fix any new issues in one sitting.</p>Given the smörgåsbord of rules and plugins that ruff supports, it’s hard to figure out which settings are good for your average python project.Hong Kong elevation data guide2023-05-25T00:00:00-05:002023-05-25T00:00:00-05:00https://www.gpxz.io/blog/hong-kong-dem-guide<p>There is high-resolution elevation data available for Hong Kong at a 50cm resolution. The data isn’t too difficult to download, and captures many of the territory’s recent land reclamation efforts.</p>
<p>If you don’t need high resolution or bare-earth terrain data, the 30m Copernicus dataset is a high-quality global alternative that’s even easier to work with.</p>
<h2 id="high-resolution-datasets">High-resolution datasets</h2>
<p>The Hong Kong government has released four DEMs with territory-wide coverage:</p>
<ul>
<li>A 2020 DSM (recording the elevations of treetops, rooftops, etc)</li>
<li>A 2020 DTM (recording the elevation of the terrain)</li>
<li>A 2010 DSM</li>
<li>A 2010 DTM</li>
</ul>
<p>All datasets are at a 50cm resolution in <a href="https://epsg.io/2326">Hong Kong 1980 Grid System</a> coordinates. The dataset is split into a few thousand tiles, though the 2020 and 2010 datasets have different tiling schemes.</p>
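<p>Since the tiles are delivered in Hong Kong 1980 grid coordinates, you’ll often want to convert between the grid and WGS84 latitude/longitude when working out which tile covers a point of interest. Here’s a minimal sketch using the third-party <code class="language-plaintext highlighter-rouge">pyproj</code> library (the easting/northing pair below is illustrative, not a specific tile corner):</p>

```python
from pyproj import Transformer

# EPSG:2326 (Hong Kong 1980 Grid) -> EPSG:4326 (WGS84).
# always_xy=True orders coordinates as (easting, northing) in
# and (longitude, latitude) out.
transformer = Transformer.from_crs("EPSG:2326", "EPSG:4326", always_xy=True)

# An illustrative grid coordinate near the middle of the territory.
lon, lat = transformer.transform(836000, 816000)
print(f"lat={lat:.4f}, lon={lon:.4f}")
```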
<h2 id="downloading-the-hong-kong-dems">Downloading the Hong Kong DEMs</h2>
<p>The datasets are available to download through the <a href="https://geodata.gov.hk/">Hong Kong Geodata Store</a>.</p>
<p>There’s no option to bulk download all of the actual elevation data. But you can download a GeoJSON file for each dataset, from which the tile URLs can be extracted:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="n">urls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"CEDD_DTM_2020_20230410.geojson"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)[</span><span class="s">"features"</span><span class="p">]:</span>
        <span class="n">urls</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">feature</span><span class="p">[</span><span class="s">"properties"</span><span class="p">][</span><span class="s">"URL"</span><span class="p">])</span>
</code></pre></div></div>
<p>then the tiles can be downloaded individually (with a time delay to avoid overloading their servers):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">shutil</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">verify</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">r</span><span class="p">:</span>
        <span class="n">filename</span> <span class="o">=</span> <span class="n">url</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'/'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">shutil</span><span class="p">.</span><span class="n">copyfileobj</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">raw</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<p>The 2020 files have parentheses in their filenames like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>9SE24B(e812n812,e813n813).tif
</code></pre></div></div>
<p>While these are technically valid filenames, they also broke some of the lower-quality software that I typically use to analyse elevation data. I’d recommend renaming them to remove the brackets!</p>
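<p>A quick way to do that renaming in bulk (a sketch; the <code class="language-plaintext highlighter-rouge">hk_tiles</code> directory name is hypothetical, so point it at wherever you downloaded the tiles):</p>

```python
from pathlib import Path

def clean_name(name: str) -> str:
    # Swap out the characters that trip up fragile GIS tooling.
    return name.replace("(", "_").replace(")", "").replace(",", "-")

# hk_tiles/ is a placeholder for your download directory.
for path in Path("hk_tiles").glob("*.tif"):
    cleaned = clean_name(path.name)
    if cleaned != path.name:
        path.rename(path.with_name(cleaned))
```

<p>With this scheme, <code class="language-plaintext highlighter-rouge">9SE24B(e812n812,e813n813).tif</code> becomes <code class="language-plaintext highlighter-rouge">9SE24B_e812n812-e813n813.tif</code>.</p>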
<h2 id="2010-vs-2020-data">2010 vs 2020 data</h2>
<p>At the terrain level, little differs between the 2010 and 2020 data: some noise reduction and processing improvements on the order of a few metres.</p>
<div class="blog-widetainer">
<br />
<a href="/static/blog/img/hk-lantau.png"><img src="/static/blog/img/hk-lantau.png" /></a>
<em class="blog-caption">A 400m square, centred at 22.2707, 113.9537</em>
<br />
<br />
</div>
<p>However, Hong Kong has done a lot of construction since 2010, and as a result the 2020 dataset has much better coverage of these infrastructure projects. Here’s the Chek Lap Kok Link, which was completed in 2020:</p>
<div class="blog-widetainer">
<br />
<a href="/static/blog/img/hk-airport.png"><img src="/static/blog/img/hk-airport.png" /></a>
<em class="blog-caption">A 800m square, centred at 22.3148, 113.9450</em>
<br />
<br />
</div>
<h2 id="dtm-vs-dsm">DTM vs DSM</h2>
<p>The DSM (Digital Surface Model) does a good job at capturing building footprints and canopy height. In the example below you can see the DSM partially picks up the powerlines running SW to NE: very impressive and not something you often see in country-wide elevation datasets! However the representation isn’t perfect (as expected for something this small), so you may want to further process the DSM to remove noise.</p>
<p>The DTM does a good job of removing buildings and getting down to the terrain: I’d call it Analysis Ready.</p>
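<p>Having matched DSM and DTM layers also means you can estimate building and canopy heights by differencing the two models. Here’s a toy sketch with made-up values (real tiles would be read into arrays with a raster library such as rasterio):</p>

```python
import numpy as np

# Made-up 2x2 grids of elevations in metres: the surface model (DSM)
# and the bare-earth model (DTM) for the same cells.
dsm = np.array([[12.0, 30.5], [8.2, 4.0]])
dtm = np.array([[4.0, 4.5], [4.2, 4.1]])

# Object height (building, tree, ...) is surface minus terrain;
# clip small negatives caused by noise in either model.
heights = np.clip(dsm - dtm, 0.0, None)
```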
<div class="blog-widetainer">
<br />
<a href="/static/blog/img/hk-leipui.png"><img src="/static/blog/img/hk-leipui.png" /></a>
<em class="blog-caption">A 700m square, centred at 22.3633, 114.1415</em>
<br />
<br />
</div>
<h2 id="licence">Licence</h2>
<p>Hong Kong elevation data is offered under a <a href="https://geodata.gov.hk/gs/?p=terms_and_conditions">permissive open licence</a>. Commercial use and modification are allowed, and attribution is required.</p>
<h2 id="elevation-api">Elevation API</h2>
<p>GPXZ is an API for high-quality global elevation data. If you need elevation data for Hong Kong, check out our API <a href="/docs">documentation</a> or reach out at <a href="mailto:andrew@gpxz.io">andrew@gpxz.io</a>!</p>There is high-resolution elevation data available for Hong Kong at a 50cm resolution. The data isn’t too difficult to download, and captures many of the territory’s recent land reclamation efforts.