Handling Missing Values in Timeseries Datasets


Overview #

There are several strategies that can be used to fill missing numeric values in timeseries data:

  • using the last value to fill forwards (Last observation carried forward)
  • using the next value to fill backwards
  • using an arbitrary expression such as the median
  • filling in a specific value
  • linear interpolation
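To make two of these strategies concrete before reaching for any library, here's a minimal pure-Python sketch of forward fill and linear interpolation (the helper names are mine, not from any library):

```python
def fill_forward(vals):
    """Carry the last observed value forward over Nones (LOCF)."""
    out, last = [], None
    for v in vals:
        if v is not None:
            last = v
        out.append(last)
    return out

def interpolate_linear(vals):
    """Linearly interpolate gaps bounded by values; edge Nones stay None."""
    out = list(vals)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1  # find the end of this run of Nones
            if i > 0 and j < len(out):  # gap bounded on both sides
                lo, hi = out[i - 1], out[j]
                step = (hi - lo) / (j - i + 1)
                for k in range(i, j):
                    out[k] = lo + step * (k - i + 1)
            i = j
        else:
            i += 1
    return out

vals = [1, None, None, None, 5, None, None, 8]
print(fill_forward(vals))        # [1, 1, 1, 1, 5, 5, 5, 8]
print(interpolate_linear(vals))  # [1, 2.0, 3.0, 4.0, 5, 6.0, 7.0, 8]
```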

Some databases (notably Timescale) offer these strategies natively. For everyone else, they have to be implemented through UDFs (user-defined functions) or via some hairy SQL. In DuckDB's case though, it doesn't have to be too convoluted, since we can use either Polars or pyarrow-based UDFs. Let's explore both approaches.

Some sample data #

First, some sample data:

import duckdb

duckdb.sql("create table tbl (dt date, val int)")
duckdb.sql(
    """
    insert into tbl(dt, val) values
    ('1970-01-01', 1    ),
    ('1970-01-02', null ),
    ('1970-01-03', null ),
    ('1970-01-04', null ),
    ('1970-01-05', 5    ),
    ('1970-01-06', null ),
    ('1970-01-07', null ),
    ('1970-01-08', 8    );
    """
)

Using Polars #

Polars does have methods for filling in missing data.

Let’s use the .pl() method on DuckDB’s python client to output a Polars DataFrame, then fill in the missing values using various strategies:

import polars as pl

df = duckdb.sql("select dt, val from tbl").pl()
missing_vals = pl.col("val")
filled = df.with_columns(
    missing_vals.fill_null(strategy="forward").alias("forward"),
    missing_vals.fill_null(strategy="backward").alias("backward"),
    missing_vals.interpolate().alias("interpolated"),
    missing_vals.fill_null(pl.median("val")).alias("with_median"),
    missing_vals.fill_null(pl.lit(10)).alias("with_literal_val_10"),
)

This outputs:

┌────────────┬──────┬─────────┬──────────┬──────────────┬─────────────┬─────────────────────┐
│ dt         ┆ val  ┆ forward ┆ backward ┆ interpolated ┆ with_median ┆ with_literal_val_10 │
│ ---        ┆ ---  ┆ ---     ┆ ---      ┆ ---          ┆ ---         ┆ ---                 │
│ date       ┆ i32  ┆ i32     ┆ i32      ┆ i32          ┆ f64         ┆ i32                 │
╞════════════╪══════╪═════════╪══════════╪══════════════╪═════════════╪═════════════════════╡
│ 1970-01-01 ┆ 1    ┆ 1       ┆ 1        ┆ 1            ┆ 1.0         ┆ 1                   │
│ 1970-01-02 ┆ null ┆ 1       ┆ 5        ┆ 2            ┆ 5.0         ┆ 10                  │
│ 1970-01-03 ┆ null ┆ 1       ┆ 5        ┆ 3            ┆ 5.0         ┆ 10                  │
│ 1970-01-04 ┆ null ┆ 1       ┆ 5        ┆ 4            ┆ 5.0         ┆ 10                  │
│ 1970-01-05 ┆ 5    ┆ 5       ┆ 5        ┆ 5            ┆ 5.0         ┆ 5                   │
│ 1970-01-06 ┆ null ┆ 5       ┆ 8        ┆ 6            ┆ 5.0         ┆ 10                  │
│ 1970-01-07 ┆ null ┆ 5       ┆ 8        ┆ 7            ┆ 5.0         ┆ 10                  │
│ 1970-01-08 ┆ 8    ┆ 8       ┆ 8        ┆ 8            ┆ 8.0         ┆ 8                   │
└────────────┴──────┴─────────┴──────────┴──────────────┴─────────────┴─────────────────────┘

We can then query the result back within DuckDB; the Python client’s replacement scans let SQL reference the Polars DataFrame filled directly by its variable name:

duckdb.sql(
    """
    select
        regexp_extract(name, 'avg_(\\w+)', 1) as strategy,
        round(avg, 2) as avg
    from (
        unpivot (
        select
            avg(val) as avg_original,
            avg(forward) as avg_forward,
            avg(backward) as avg_backward,
            avg(interpolated) as avg_interpolated,
            avg(with_median) as avg_with_median,
            avg(with_literal_val_10) as avg_with_literal_val_10
        from filled
        ) on columns(*)
        into name name value avg
    )
    """
).show()

This gives:

┌─────────────────────┬────────┐
│      strategy       │  avg   │
│       varchar       │ double │
├─────────────────────┼────────┤
│ original            │   4.67 │
│ forward             │   3.38 │
│ backward            │   5.63 │
│ interpolated        │    4.5 │
│ with_median         │   4.88 │
│ with_literal_val_10 │    8.0 │
└─────────────────────┴────────┘

Using PyArrow-based UDFs #

Recently, DuckDB introduced Arrow-based UDFs. These offer an alternative to Polars for filling NULL values.

Let’s set up a couple of UDFs:

from duckdb.typing import *

import pyarrow as pa
import pyarrow.compute as pc

# define the UDFs
udfs = [
    ("fill_null_with_0", lambda vals: pc.fill_null(vals, 0)),
    ("fill_null_backward", lambda vals: pc.fill_null_backward(vals)),
    ("fill_null_forward", lambda vals: pc.fill_null_forward(vals)),
    ("fill_null_with_approx_median", lambda vals: pc.fill_null(vals, pc.approximate_median(vals))),
]

# register the UDFs
for (name, udf) in udfs:
    duckdb.create_function(
        name,
        udf,
        [INTEGER],
        INTEGER,
        type="arrow",
    )

After defining and registering the UDFs, we can now use them directly within SQL:

duckdb.sql(
    """
    select list(val) as vals from tbl union all
    select list(fill_null_with_0(val)) from tbl union all
    select list(fill_null_forward(val)) from tbl union all
    select list(fill_null_backward(val)) from tbl union all
    select list(fill_null_with_approx_median(val)) from tbl
    """
).show()

This prints:

┌─────────────────────────────────────────┐
│                  vals                   │
│                 int32[]                 │
├─────────────────────────────────────────┤
│ [1, NULL, NULL, NULL, 5, NULL, NULL, 8] │
│ [1, 0, 0, 0, 5, 0, 0, 8]                │
│ [1, 1, 1, 1, 5, 5, 5, 8]                │
│ [1, 5, 5, 5, 5, 8, 8, 8]                │
│ [1, 5, 5, 5, 5, 5, 5, 8]                │
└─────────────────────────────────────────┘

Polars + PyArrow UDFs + DuckDB #

Nothing’s stopping us from using Polars within UDFs though. Since pyarrow.compute doesn’t have linear interpolation for filling NULLs, we can use Polars within a UDF to create a Timescale-esque interpolate function:

def interpolate(vals):
    missing_vals = pl.from_arrow(vals)  # convert to polars series
    filled = missing_vals.interpolate()  # interpolate
    return filled.to_arrow()  # convert back to arrow fmt and return

duckdb.create_function(
    "interpolate", interpolate, [INTEGER], INTEGER, type="arrow"
)

duckdb.sql("select dt, val, interpolate(val) as filled from tbl").show()

This outputs:

┌────────────┬───────┬────────┐
│     dt     │  val  │ filled │
│    date    │ int32 │ int32  │
├────────────┼───────┼────────┤
│ 1970-01-01 │     1 │      1 │
│ 1970-01-02 │  NULL │      2 │
│ 1970-01-03 │  NULL │      3 │
│ 1970-01-04 │  NULL │      4 │
│ 1970-01-05 │     5 │      5 │
│ 1970-01-06 │  NULL │      6 │
│ 1970-01-07 │  NULL │      7 │
│ 1970-01-08 │     8 │      8 │
└────────────┴───────┴────────┘

Good old-fashioned SQL #

Pure SQL still packs a punch. Replacing NULLs with a literal value or with an expression is trivial in SQL.

For filling forwards/backwards, we have to use window functions with an extra modifier, ignore nulls, within the expression so that we can pick the desired non-null value. Also note the frame for each:

select
    dt,
    val,
    last_value(val ignore nulls) over (
        order by dt asc
        rows between unbounded preceding and current row
    ) as filled_forward,
    first_value(val ignore nulls) over (
        order by dt asc
        rows between current row and unbounded following
    ) as filled_backward
from tbl

As a parting shot, here’s an exercise left to the reader: implement interpolate using only SQL :-D

References/Further Reading #