Fixing sparse .xyz files

I’ve been working with the Northern Ireland DTM dataset, which is provided as .txt files like this:

293465.0000     444005.0000          1.6954
293475.0000     444005.0000          0.0746
293485.0000     444005.0000          0.1014
...

At first glance it looks like an .xyz file, a format gdal supports. However, gdal complains when parsing them:

gdal_translate -a_srs EPSG:29903 Sheet001v4.txt Sheet001v4.tif

ERROR 1: Ungridded dataset: At line 38676, too many stepY values

It turns out that .xyz files cannot be sparse: they must contain a row for every (x, y) combination on the grid, while the NI DTM files omit cells with missing data.
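You can confirm this is the problem without waiting for gdal to choke: a complete grid has exactly one row per (x, y) pair, so the row count must equal the number of unique x values times the number of unique y values. Here's a minimal sketch on toy data (`is_complete_grid` is a name I've made up, not part of any library):

```python
import pandas as pd

def is_complete_grid(df):
    # A complete .xyz grid has exactly one row per (x, y) combination.
    return len(df) == df.x.nunique() * df.y.nunique()

# Toy example: a 2x2 grid with one cell missing, like the NI DTM files.
sparse = pd.DataFrame({'x': [0.0, 10.0, 0.0],
                       'y': [0.0, 0.0, 10.0],
                       'z': [1.0, 2.0, 3.0]})
print(is_complete_grid(sparse))  # False: 3 rows, but 2 * 2 = 4 expected
```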

I’m not the first person to come across this issue: there are a couple of answers on the GIS Stack Exchange suggesting gdal_grid, but I couldn’t get any of them to work properly.

Instead, I found it easier and faster to fill in the missing cells with pandas in python:

import numpy as np
import pandas as pd


# Setup.
input_file = 'Sheet001v4.txt'
output_file = 'Sheet001v4.csv'
resolution = 10
nodata_value = -9999


# Load txt file.
df = pd.read_csv(input_file, sep=r'\s+', header=None, names=['x', 'y', 'z'])


# Figure out which x and y values are needed.
x_vals = set(np.arange(df.x.min(), df.x.max() + resolution, resolution))
y_vals = set(np.arange(df.y.min(), df.y.max() + resolution, resolution))


# For each x value, find any missing y values, and add a NODATA row.
dfs = [df]
for x in x_vals:
	y_vals_missing = y_vals - set(df[df.x == x].y)
	if y_vals_missing:
		df_missing = pd.DataFrame({'x': x, 'y': y_vals_missing, 'z': nodata_value})
		dfs.append(df_missing)


# Build full csv, and sort to xyz spec.
df = pd.concat(dfs, ignore_index=True)
df = df.sort_values(['y', 'x'])


# Check.
assert len(df) == len(x_vals) * len(y_vals)


# Output.
df.to_csv(output_file, index=False, header=False)

The resulting file can be used with gdal as normal:

gdal_translate -a_srs EPSG:29903 -a_nodata -9999 Sheet001v4.csv Sheet001v4.tif
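As an aside, the per-x loop above can also be written as a single reindex against the full grid index. This is an alternative sketch on toy data, not the version I timed; the variable names mirror the script:

```python
import numpy as np
import pandas as pd

resolution = 10
nodata_value = -9999

# Toy input in the same shape as the NI DTM files: a 2x2 grid
# with the (10, 10) cell missing.
df = pd.DataFrame({'x': [0.0, 10.0, 0.0],
                   'y': [0.0, 0.0, 10.0],
                   'z': [1.6954, 0.0746, 0.1014]})

# Build the full (y, x) index, then reindex to add NODATA rows
# for every missing cell in one shot.
full_index = pd.MultiIndex.from_product(
    [np.arange(df.y.min(), df.y.max() + resolution, resolution),
     np.arange(df.x.min(), df.x.max() + resolution, resolution)],
    names=['y', 'x'])
df_full = (df.set_index(['y', 'x'])
             .reindex(full_index, fill_value=nodata_value)
             .reset_index()[['x', 'y', 'z']])
print(df_full)
```

Because `from_product` iterates y in the outer loop and x in the inner one, the rows come out already sorted to the xyz spec, so no separate sort is needed.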