---
title: "Streaming spatial operations"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Streaming spatial operations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
has_sf <- requireNamespace("sf", quietly = TRUE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = has_sf,
  fig.width = 7,
  fig.height = 4.2,
  out.width = "100%",
  dpi = 96
)
```

This vignette is the hands-on companion to
[Out-of-core GIS with vectra](spatial.html). That article lays out the cost
model; this one runs the streaming vector verbs end to end on a real layer, so
every block below is code you can execute and check. It needs the optional
[sf](https://r-spatial.github.io/sf/) package, which supplies the geometry
engine vectra streams around.

```{r libs, message = FALSE}
library(vectra)
library(sf)
```

## The problem

A desktop GIS holds one layer in memory, runs a tool, and writes the result.
That model breaks the moment the working layer is bigger than RAM: a national
occurrence set, a continental road network, every parcel in a country. The
geometry alone runs to tens of gigabytes before any operation touches it.

vectra keeps the same toolbox but changes what stays resident. A query is pulled
through the C engine in fixed-size batches; each spatial step works on the batch
in front of it and spills the transformed batch back to disk as a fresh lazy
node. Peak memory is one batch plus whatever small layer the step compares
against, so a billion-row layer flows past a fixed budget. The verbs in this
vignette are that streaming envelope: `spatial_map()`, `spatial_filter()`,
`spatial_clip()`, `spatial_join()`, `spatial_dissolve()`, `spatial_overlay()`,
and `rasterize()`.

## How geometry travels

vectra has no geometry type. A geometry rides through the engine as hex-encoded
WKB in an ordinary string column, and the coordinate reference system is carried
on the returned node rather than written into the `.vtr` file. Topology stays
with sf and the GEOS library it links: vectra contributes the streaming, the
spill machinery, and a native fast path, not the geometry algorithms.

One streamed step is a loop. Pull a batch, decode its WKB column into an `sf`
object, run the operation, encode the result back into a WKB string column, and
append it to a run-file. When the loop finishes, the run-files become a single
lazy node you can carry on querying. The memory a run holds is

```
peak  =  one batch  +  the resident comparison layer
```

independent of how many batches stream past. The batch size for the spill is
`flush_rows` (default 500,000 transformed rows); the resident layer is the small
`y` a join or filter tests against.

To make the examples concrete, load the North Carolina counties shipped with sf
and project them to the state plane in metres. Projecting matters here: in a
planar CRS distances, areas, and buffers are Euclidean, and the recognised
operations run natively on the GEOS C API straight off the WKB column instead of
decoding each batch back to `sf`.

```{r load-nc}
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc <- st_transform(nc, 32119)        # NAD83 / North Carolina, metres
crs_nc <- st_crs(nc)
nrow(nc)
```

Write the polygons to a `.vtr` file with their geometry as a hex-WKB column.
This is the shape every streamed layer takes: attribute columns plus one string
column of WKB. In a real out-of-core workflow this file would be written once
from a large source and reused; here it is 100 counties standing in for the
billion-row case.

```{r write-poly}
f_poly <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  NAME  = nc$NAME,
  BIR74 = nc$BIR74,
  SID74 = nc$SID74,
  geometry = st_as_binary(st_geometry(nc), hex = TRUE)
), f_poly)
```

`tbl(f_poly)` now opens a lazy node over that file. Nothing is read until a
verb pulls it. Two functions bring a streamed result back to R: `collect()`
returns the underlying data.frame, geometry still a WKB string; `collect_sf()`
decodes that string and reattaches the node's CRS, giving an ordinary `sf`
object ready to plot or hand to any sf function.

## Per-feature transforms with spatial_map

The widest door is `spatial_map()`: apply any per-feature sf operation to a
streamed layer one batch at a time. Buffering, simplifying, computing centroids,
reprojecting, extracting coordinates, measuring area. The function you pass takes
one `sf` batch and returns an `sf` object, an `sfc`, or a plain data.frame. The
active geometry of the return becomes the streamed geometry.

Buffer every county centroid by 15 km. The first step builds a centroid layer
on disk, the second streams a buffer over it.

```{r map-buffer}
cent <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  NAME = nc$NAME,
  geometry = st_as_binary(st_centroid(st_geometry(nc)), hex = TRUE)
), cent)

buffered <- tbl(cent) |>
  spatial_map(~ st_buffer(.x, 15000), crs = crs_nc)
buffered
```

`buffered` is a lazy node, not a materialized layer: the buffer has not run yet.
`collect_sf()` pulls it through and decodes the geometry.

```{r map-buffer-plot}
b_sf <- collect_sf(buffered)
plot(st_geometry(nc), border = "grey80", col = NA,
     main = "County centroids buffered by 15 km")
plot(st_geometry(b_sf), border = "#3366cc", col = "#3366cc22", add = TRUE)
```

The formula syntax (`~ st_buffer(.x, 15000)`) is rlang's: `.x` is the batch. A
named function works identically and is clearer for anything longer than one
call. The return can also be a plain data.frame rather than a geometry, which
turns the spatial step into a streamed feature-summary with the geometry
dropped. That is the way to compute per-feature area over a layer too large to
hold whole.

```{r map-area}
areas <- tbl(f_poly) |>
  spatial_map(~ data.frame(NAME = .x$NAME,
                           area_km2 = as.numeric(st_area(.x)) / 1e6),
              crs = crs_nc)
head(collect(areas))
```

Because the result is a node, it chains. A buffer feeding a filter feeding a
join is three streamed steps, each spilling to its own run-files, with the CRS
carried forward so you state it once. `flush_rows` controls the spill batch:
raise it for fewer, larger temporary files when rows are small, lower it when a
single feature's geometry is heavy. The default of 500,000 suits point and
small-polygon work.

## Smooth geometry with spatial_smooth

`spatial_smooth()` rounds the corners of every line and polygon by Chaikin
corner-cutting, one batch at a time. Each pass replaces a vertex with two points
along its adjacent edges, so sharp angles become a smooth curve; more
`iterations` give a smoother result with more vertices. It is computed directly
on the coordinates with no GEOS call, and open lines keep their endpoints.

```{r smooth}
zig <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 0), c(3, 1), c(4, 0)))
f_zig <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  id = 1L, geometry = st_as_binary(st_sfc(zig), hex = TRUE)), f_zig)

tbl(f_zig) |> spatial_smooth(iterations = 3) |> collect_sf()
unlink(f_zig)
```

## Select by location with spatial_filter

The most-used vector tool keeps the features standing in a spatial relation to
another layer: points inside a study region, parcels touching a road, patches
within a buffer of the coast. `spatial_filter()` streams the large layer and
tests each batch against a small resident layer with an sf predicate, keeping or
dropping whole features. It is the spatial counterpart to `semi_join()`: rows
are filtered, never duplicated, and no columns are added, so the output carries
the input schema unchanged.

Sample 500 points across the state and store them as plain x/y columns. Point
input as coordinate columns is the headline case and the fully sf-free path:
each point is built in C and matched against the resident layer's spatial index
with no per-batch round-trip through sf.

```{r sample-points}
set.seed(1)
pts <- st_coordinates(st_sample(st_union(nc), 500))
fp <- tempfile(fileext = ".vtr")
write_vtr(data.frame(id = seq_len(nrow(pts)), x = pts[, 1], y = pts[, 2]), fp)
```

Keep the points falling inside a five-county region in the northwest.

```{r filter-region}
region <- nc[nc$NAME %in% c("Ashe", "Alleghany", "Surry", "Wilkes", "Watauga"),
             "NAME"]

inside <- tbl(fp) |>
  spatial_filter(region, coords = c("x", "y"), crs = crs_nc)
nrow(collect(inside))
```

Twenty-five of the 500 points land in the region. Plot the kept points against
the dropped ones.

```{r filter-plot}
keep_xy <- collect(inside)
plot(st_geometry(nc), border = "grey85", col = NA, main = "Select by location")
plot(st_geometry(region), border = "#cc3344", col = "#cc334411", add = TRUE)
points(pts, pch = 16, cex = 0.5, col = "grey70")
points(keep_xy$x, keep_xy$y, pch = 16, cex = 0.6, col = "#cc3344")
```

`negate = TRUE` inverts the test, keeping the 475 points outside the region. A
different `predicate` changes the relation: `st_within`, `st_covered_by`,
`st_touches`, `st_crosses`, and the rest of the sf binary predicates are all
accepted. `st_is_within_distance` takes its radius through `dist`, in CRS units,
and keeps every feature within that distance of the layer.

```{r filter-distance}
near <- tbl(fp) |>
  spatial_filter(region, predicate = st_is_within_distance,
                 coords = c("x", "y"), crs = crs_nc, dist = 30000)
nrow(collect(near))
```

Sixty points sit within 30 km of the region, more than the 25 strictly inside.
For the recognised predicates on planar data the test runs in C off the WKB
column or the raw coordinates; an unrecognised predicate, geographic data with
spherical geometry switched on, or an extra predicate argument falls back to the
per-batch sf loop, which preserves sf's exact semantics at the cost of decoding
each batch.

## Cut geometry with spatial_clip

`spatial_filter()` keeps or drops whole features; `spatial_clip()` cuts their
geometry along a mask. Clipping intersects each batch with the mask and keeps the
part inside it, the streamed equivalent of `st_intersection()` against a fixed
boundary. `erase = TRUE` keeps the part outside instead.

```{r clip}
mask_region <- st_union(st_geometry(region))

clipped <- tbl(f_poly) |> spatial_clip(mask_region, crs = crs_nc)
c_sf <- collect_sf(clipped)
nrow(c_sf)
```

Twelve counties have area inside the region's bounding shape, and each comes back
trimmed to the part that overlaps it. Plot the clipped slivers over the full
state.

```{r clip-plot}
plot(st_geometry(nc), border = "grey85", col = NA,
     main = "Counties clipped to the region")
plot(st_geometry(c_sf), border = "#2a9d5c", col = "#2a9d5c33", add = TRUE)
```

The mask is unioned once into a single resident geometry, then each batch is cut
against it in C. As with the join and filter, the cut runs natively on the WKB
column for planar data and through `st_intersection()` / `st_difference()` for
the geographic-with-s2 and coordinate-assembled cases. Erase reverses the
keep: `tbl(f_poly) |> spatial_clip(mask_region, erase = TRUE)` returns the 95
counties with any area outside the region, each trimmed to that outside part.

## Split geometry with spatial_split

`spatial_split()` cuts each feature against a small resident `blade` layer, the
QGIS "split with lines". A polygon is divided into the faces the blade carves
out, a line into the arcs between crossings, and each piece is emitted as its own
row with the source attributes copied. A feature the blade does not cross passes
through whole. With `extract = "points"` the verb returns the crossing points
instead.

```{r split}
square <- st_polygon(list(rbind(c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0))))
blade  <- st_sfc(st_linestring(rbind(c(2, -1), c(2, 5))))

f_sq <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  id = 1L, geometry = st_as_binary(st_sfc(square), hex = TRUE)), f_sq)

tbl(f_sq) |> spatial_split(blade) |> collect_sf()
unlink(f_sq)
```

## Tag features with spatial_join

To attach a layer's attributes rather than filter or cut, `spatial_join()`
streams the large left side and joins each batch against a small resident right
side. The dominant workload is tagging a huge point set with the polygon it
falls in: which county each occurrence sits in, which census tract each address
belongs to. The billion-row left stream never materializes while the polygon
layer stays in RAM.

```{r join-tag}
tagged <- tbl(fp) |>
  spatial_join(nc["NAME"], coords = c("x", "y"), crs = crs_nc)
tdf <- collect(tagged)
head(tdf[, c("id", "NAME")])
```

Every point now carries the `NAME` of its county. The default predicate is
`st_intersects` and the default is a left join, so an unmatched left row is kept
once with NA in the right columns; `left = FALSE` drops the unmatched rows for an
inner join. Colour the points by the county they were tagged with.

```{r join-plot}
tagged_sf <- st_as_sf(tdf, coords = c("x", "y"), crs = crs_nc)
plot(st_geometry(nc), border = "grey85", col = NA,
     main = "Points tagged with their county")
plot(tagged_sf["NAME"], pch = 16, cex = 0.5, add = TRUE)
```

Columns present on both sides are disambiguated with `suffix` (default
`c(".x", ".y")`), exactly as `st_join()` does. A different `join` predicate
changes the relation; `st_nearest_feature` finds the closest right feature to
each left one, which is how you snap points to the nearest road or station.

```{r join-nearest}
ncent <- st_sf(NAME = nc$NAME, geometry = st_centroid(st_geometry(nc)))
nearest <- tbl(fp) |>
  spatial_join(ncent, join = st_nearest_feature,
               coords = c("x", "y"), crs = crs_nc)
nrow(collect(nearest))
```

### When both sides are larger than RAM

The resident-`y` path assumes the right side fits in memory. When it does not,
pass `partition = grid(cellsize)` and a streamed `vectra_node` as `y`. Both
inputs are binned to a uniform grid and joined one shard at a time, so neither
side is ever fully resident. A coordinate maps to the cell

```
cell  =  floor( (coord - origin) / cellsize )
```

per axis. Each left feature is assigned to the single cell of its reference
point; each right feature is replicated to every cell its bounding box overlaps.
A left row is therefore emitted exactly once, and the result equals the resident
join for point left geometry.

```{r join-partition}
g_poly <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  NAME = nc$NAME,
  geometry = st_as_binary(st_geometry(nc), hex = TRUE)
), g_poly)

tagged2 <- tbl(fp) |>
  spatial_join(tbl(g_poly), coords = c("x", "y"), crs = crs_nc,
               partition = grid(80000))
t2 <- collect(tagged2)
sum(!is.na(t2$NAME))
```

All 500 points are tagged, matching the resident join. The `cellsize` is the one
tuning knob: large enough that one cell's features fit in memory, small enough
that the shards stay balanced. For point-in-polygon tagging any cell larger than
the polygons works; for an extended-on-extended join choose it larger than the
left features. The nearest-feature predicate is not local to one cell, so under
`partition` a left point searches its own cell and the eight around it; set
`cellsize` at least as large as the largest expected nearest distance.

### Same answer as resident sf

Streaming changes the memory profile, not the result. The streamed join returns
exactly what `st_join()` returns on the whole layer held in memory, feature for
feature. Run both on the 500-point set and compare.

```{r join-equality}
streamed <- collect(
  tbl(fp) |> spatial_join(nc["NAME"], coords = c("x", "y"), crs = crs_nc))

resident <- st_join(
  st_as_sf(collect(tbl(fp)), coords = c("x", "y"), crs = crs_nc, remove = FALSE),
  nc["NAME"], join = st_intersects)

all.equal(streamed$NAME[order(streamed$id)],
          resident$NAME[order(resident$id)])
```

The county tags match for every point. The equality is the contract: a streamed
verb is the resident sf call run batch by batch, so the choice between them is a
memory decision, not an accuracy trade-off.

### At scale

The point of streaming shows when the layer stops fitting in memory. Scatter
200,000 points across the state's bounding box and filter them to the five-county
region. Only one batch is resident at a time, so the same code that ran on 500
points runs on 200,000 with a flat memory profile.

```{r filter-scale}
set.seed(42)
bb <- st_bbox(nc)
n_big <- 2e5
big <- data.frame(id = seq_len(n_big),
                  x = runif(n_big, bb["xmin"], bb["xmax"]),
                  y = runif(n_big, bb["ymin"], bb["ymax"]))
fbig <- tempfile(fileext = ".vtr")
write_vtr(big, fbig)

kept <- tbl(fbig) |>
  spatial_filter(region, coords = c("x", "y"), crs = crs_nc) |>
  collect()
nrow(kept)
```

A few thousand of the 200,000 land in the five counties. At a billion points the
file would be larger but the resident set would not grow: the engine still pulls
one batch, tests it in C against the region's spatial index, and moves on. That
is the whole proposition. The operation that a desktop GIS can only run on a
layer it can open is the same operation, run past a fixed memory budget.

```{r cleanup-scale, include = FALSE}
unlink(fbig)
```

## Nearest features with spatial_knn

Where `spatial_join()` attaches the single nearest feature, `spatial_knn()`
returns the `k` nearest features per streamed point, one row per pair, each with
the rank (1 is nearest) and the distance. The candidate layer `y` is held
resident while the points stream.

```{r knn}
towns <- suppressWarnings(st_centroid(st_geometry(nc)))[1:5]
towns <- st_sf(town = nc$NAME[1:5], geometry = towns)

set.seed(1)
pts <- suppressWarnings(st_coordinates(st_sample(nc, 100)))
f_pts <- tempfile(fileext = ".vtr")
write_vtr(data.frame(id = seq_len(nrow(pts)), x = pts[, 1], y = pts[, 2]), f_pts)

tbl(f_pts) |>
  spatial_knn(towns, k = 2, coords = c("x", "y"), crs = crs_nc, y_id = "town") |>
  collect() |> head()
unlink(f_pts)
```

## Merge geometries with spatial_dissolve

Dissolve unions the geometries within each group into one feature, the GIS
"Dissolve" tool: counties into states, parcels into ownership blocks. Unlike the
per-batch verbs, it needs every geometry of a group together to union them, so it
rides a different tier. The input is spilled once and routed into one disjoint
shard per group in a single bounded pass; each shard is then read back and
unioned. Peak memory is the routing budget during the pass, then one group's
geometries while that group unions. Partition on a key whose groups fit in
memory.

Split the counties into two bands by their 1974 SIDS count and merge each band
into a single feature, summing births along the way.

```{r dissolve}
nc$band <- ifelse(nc$SID74 > 5, "high", "low")
fb <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  band = nc$band, BIR74 = nc$BIR74,
  geometry = st_as_binary(st_geometry(nc), hex = TRUE)
), fb)

merged <- tbl(fb) |>
  spatial_dissolve(by = "band", crs = crs_nc,
                   .fun = list(births = function(d) sum(d$BIR74)))
m_sf <- collect_sf(merged)
m_sf
```

Two features come back, one per band, each carrying the summed `births` of its
counties. The `.fun` argument is a named list of summaries; each function takes
the group's data.frame and returns one value, and the list name becomes the
output column. Plot the two bands.

```{r dissolve-plot}
plot(m_sf["band"], main = "Counties dissolved into two SIDS bands")
```

With no `by`, the whole layer dissolves into one feature, which is the fast way
to compute a layer's outline out of core. On planar data each group is unioned
natively off the WKB column; geographic data with s2 on, or an extra `st_union()`
argument such as `is_coverage = TRUE`, unions through sf instead.

## Split overlaps apart with spatial_overlay

Dissolve merges overlapping geometry; overlay splits it. `spatial_overlay()`
takes a polygon layer that overlaps itself and cuts it along every overlap into
disjoint pieces, returning one row per piece per covering polygon. Where `k`
polygons overlap, that piece appears `k` times, each row carrying one source
polygon's attributes. This is the operation a GIS exposes as "Union (single
layer)": the overlap is retained once per contributing feature rather than
dissolved away. It answers questions like "which protected areas, designated in
which years, cover this exact patch of sea".

Three overlapping squares designated in different years make the smallest
meaningful example.

```{r overlay}
sq <- function(a, b) st_polygon(list(rbind(
  c(a, 0), c(b, 0), c(b, 1), c(a, 1), c(a, 0))))
polys <- st_sf(year = c(1990L, 2010L, 2000L),
               geometry = st_sfc(sq(0, 2), sq(1, 3), sq(1.5, 3.5)))

pieces <- collect_sf(spatial_overlay(polys))
nrow(pieces)
length(unique(pieces$piece_id))
```

The three squares decompose into five disjoint pieces, returned as nine rows
because the overlapping pieces are repeated once per covering square. The
`piece_id` column keys the pieces: rows sharing an id are the same patch of
ground seen from different source polygons. Resolve the duplication with a
grouped slice. Earliest designation year wins:

```{r overlay-resolve}
first <- spatial_overlay(polys) |>
  group_by(piece_id) |>
  slice_min(year, n = 1, with_ties = FALSE) |>
  collect_sf()
nrow(first)

plot(first["year"], main = "Overlay pieces, earliest year wins")
```

Five pieces remain, one per disjoint patch, each labelled with the earliest year
that covers it. Swapping `slice_min` for `slice_max` gives the latest; any
grouped reduction works, because the pieces are an ordinary streamed node by this
point.

Three properties make the overlay trustworthy on real data. First, correctness
is checked, not assumed: for a valid decomposition the piece areas covering an
input must sum to that input's area, and the engine verifies this invariant per
batch, warning if coverage drifts past a relative `1e-4` rather than returning
silently wrong. Second, memory is bounded by tiling: a connected cluster of many
large overlapping polygons can node into far more linework than the input, so
clusters too large for the budget are tiled over their own extent and clipped,
with each feature cleaned exactly once. Third, the snapping grid is explicit:
coordinates are snapped to a grid derived from their magnitude so the
near-duplicate boundaries that overlapping polygons share coincide, which is what
keeps the pieces disjoint instead of leaving sliver faces.

Unlike the other verbs, `spatial_overlay()` takes a resident `sf` object, not a
lazy node: building the overlap graph needs the geometries in memory. The
exploded result, typically several times larger than the input, is what streams
to disk. `mem_limit` caps the peak working set and `threads` sets the parallel
overlay width; raise both for speed on a big machine, lower `mem_limit` for
tighter memory.

### A second layer

Passing a second layer `y` nodes two layers into one planar partition and
carries the attributes of the covering `x`-record and `y`-record onto each
piece. A `how` argument selects which pieces to keep: `"intersection"` (covered
by both, the default), `"union"` (every piece of either, the absent side filled
with `NA`), `"identity"` (all of `x` split by `y`), or `"symdiff"` (pieces in
exactly one layer).

```{r overlay-two}
zones <- st_sf(zone = c("A", "B"),
               geometry = st_sfc(sq(0, 1.5), sq(1.5, 3)))

inter <- spatial_overlay(polys, zones, how = "intersection") |> collect_sf()
inter
```

Each piece now carries both its `year` (from the squares) and its `zone` (from
the zone layer). `vars_y` narrows the carried `y` columns, and a name shared
with an `x` column is disambiguated with a `.x` / `.y` suffix. A file-path `y`
is read in batches with `layer_y` / `query_y`, the same way `x` is.

## Materializing and round-tripping

Three exits bring a streamed result out. `collect()` returns the data.frame with
geometry still a WKB string, useful when the next step is a
non-spatial verb or another `.vtr` write. `collect_sf()` decodes the WKB and
reattaches the node's CRS, giving an `sf` object. `write_vtr()` on a node streams
the result straight to a new `.vtr` file without ever holding it whole, so a
multi-step spatial pipeline can land its output on disk under the same fixed
memory budget it ran in.

```{r roundtrip}
out <- tempfile(fileext = ".vtr")
tbl(fp) |>
  spatial_filter(region, coords = c("x", "y"), crs = crs_nc) |>
  write_vtr(out)
nrow(collect(tbl(out)))
```

The CRS lives on the node, not in the file. A pipeline that opens projected data,
maps, filters, and joins keeps the projection because each verb carries it
forward; you state `crs =` once at the first step that needs it, or let it
inherit from the upstream node. The `.vtr` file itself stores only the WKB bytes,
so reopening a written file and calling `collect_sf()` needs the CRS supplied
again if you want it labelled.

## The cost model

Each verb states what it holds resident, so the toolbox reads as a cost model.
Three tiers cover everything here.

| Tier | Verbs | Resident set |
|---|---|---|
| Monoid fold | `spatial_map`, `spatial_filter`, `spatial_clip`, `rasterize` | one batch + small `y` |
| Partition | `spatial_dissolve`, partitioned `spatial_join` | routing budget, then one shard |
| All-to-all | `spatial_overlay` | one overlap cluster (tiled) |

The fold tier is the cheapest: one batch at a time, no spill, memory flat across
the whole stream. The partition tier spills the input once and processes it one
group or shard at a time, so its peak is set by the largest group rather than the
whole layer. The all-to-all tier is inherently resident because the operation is
global, and the overlay bounds it by tiling overlap clusters. Operations that are
global by nature and not tiled, such as Voronoi diagrams, Delaunay triangulation,
convex hulls, and neighbour graphs, are not streamed at all: `collect_sf()` the
layer and run sf or terra on it.

A second axis is whether a step runs natively or through sf. The recognised
predicates and operations run in C straight off the WKB column when the data is
planar, which means projected, or geographic with spherical geometry off, or of
unknown CRS. Geographic data with `sf::sf_use_s2()` on keeps the spherical sf
path so the answer matches sf exactly. The native path parses the resident layer
once into a spatial index and tests each batch in C with no decode; the fallback
decodes each batch to `sf`. For the large planar workloads vectra targets, the
native path is the common case, which is why the examples project up front.

Four options tune the machinery:

- `vectra.spatial_flush` (default 500,000): rows buffered before a spill flush.

- `vectra.partition_budget`: rows held while routing a dissolve or partitioned
  join before flushing shards.
- `vectra.overlay_mem_limit` (default 2 GB): the overlay's peak working-set
  budget.
- `vectra.overlay_threads` and `vectra.spatial_threads`: worker counts for the
  overlay and the per-batch GEOS loops.

## Practical guidance

A few rules of thumb, with the numbers that drive them.

**Project before you stream.** A projected CRS puts the recognised verbs on the
native GEOS path and makes distances and areas Euclidean. Geographic data with
s2 on works but decodes every batch to sf and computes on the sphere, which is
correct but slower; reach for it only when spherical accuracy over large extents
matters more than throughput.

**Size the join grid to the data, not the machine.** For point-in-polygon
tagging set `grid(cellsize)` to anything larger than your polygons; the join is
exact for point left geometry at any such size. For an extended-on-extended join
make `cellsize` larger than the left features. For a partitioned nearest-feature
join make it at least the largest expected nearest distance, because the search
only reaches the eight neighbouring cells.

**Match `flush_rows` to feature weight.** The 500,000 default suits points and
small polygons. For heavy geometry, dense coastlines, detailed admin boundaries,
lower it so a spill batch is a workable size; for millions of tiny point rows,
raise it to cut the number of temporary files.

**Partition dissolve on a key whose groups fit.** Dissolve holds one group's
geometry while it unions. Dissolving a country into one outline is fine; grouping
by a key with a single enormous group is not, because that group must be resident
to union. Split the key finer if a group blows the budget.

**Trust the overlay's coverage warning.** If `spatial_overlay()` warns that
coverage drifted past `1e-4`, the named input features are finer than the
snapping grid or invalid after it. Pass a smaller `grid =` for fine geometry, or
inspect the `coverage_offenders` attribute on the result for the worst rows. A
clean run means the pieces reconstruct the inputs exactly.

When the verbs map to tiers and resident sets, the choice of tool is a memory
decision:

| Need | Verb | Holds resident |
|---|---|---|
| Per-feature transform | `spatial_map` | one batch |
| Keep / drop by location | `spatial_filter` | one batch + locator |
| Cut geometry to a mask | `spatial_clip` | one batch + mask |
| Tag with attributes | `spatial_join` | one batch + polygons |
| Tag, both sides huge | `spatial_join(partition=)` | one grid shard |
| Merge by group | `spatial_dissolve` | one group |
| Split self-overlaps | `spatial_overlay` | one overlap cluster |

**When not to stream.** For a layer that already fits in memory, sf is simpler
and faster: there is no gain in spilling 10,000 features through run-files when
`st_join()` runs in a blink. Streaming earns its overhead when the layer is
larger than RAM, or when a step in a longer pipeline would otherwise force a full
materialization. For genuinely global operations, Voronoi, Delaunay, hulls,
neighbour graphs, there is nothing to stream: collect the layer and run the
resident tool. The streaming envelope covers the operations whose memory can be
bounded, and the cost model names the ones it cannot.

```{r cleanup, include = FALSE}
unlink(c(f_poly, cent, fp, g_poly, fb, out))
```

## Where to go next

- [Out-of-core GIS with vectra](spatial.html) for the cost model and the raster
  half of the toolbox: `zonal()`, `focal()`, `terrain()`, `warp()`,
  `polygonize()`, `contours()`, `proximity()`.
- [Larger-than-RAM strategy](large-data.html) for the spill and partition tiers
  the spatial verbs reuse.
- [Species distribution modelling](sdm.html) for an end-to-end ecological
  workflow built on these verbs.
- The function reference for every spatial verb and its arguments.
```