CARVIEW |
Handling Missing Values in Time-series with SQL
This article is about a specific use-case that comes up often when dealing with time-series data.
This morning I read Madison Schott’s article about the LAST_VALUE
 function, where she highlights the usefulness of this lesser-known SQL function.
It inspired me to write a follow-up article about a specific use-case that comes up often when dealing with time-series data.
An Example
Let’s pretend you’re building a predictive maintenance model using sensor data.
After some wrangling, you end up with hourly data that looks like this:
Example of some preprocessed sensor data
At this point, we’ve already done some pretty significant data engineering in order to create these evenly spaced observations at every hour. How to do this is a subject for another article. However, take notice that there are some gaps in the temperature measurements. This is where LAST_VALUE
 comes to the rescue.
The reason for the missing values is usually because sensors only report when the value changes. This reduces the amount of data that the machine needs to transmit, but it creates a data problem for us to solve.
The Reason Why
If we build a model with this data directly, the accuracy is going to suffer when a particular value is missing, because there is no historical context written into the row itself. For the most accurate model possible, we should add features such as:
- Last temperature reading
- Avg temperature over past 6 hours
- Hours since temperature reading increased/decreased
- Rate of change of temperature over past 12 hours
An illustration of the types of features that will be useful for a predictive model
Our very first step should be to replace the missing values with the last known value. The reason we choose to do this first, is because the other features will become much easier to create.
For example, if we leave them missing and try to calculate a rolling average, the average will be calculated incorrectly (it will ignore the missing values and only average the non-missing).
Avg Temp Prior 4 hours (with missing)
(null + 85 + null + null) / 1 =Â 85
Avg Temp Prior 4 hours (replaced)
(84 + 85 + 85 + 85) / 4 =Â 84.75
How to Fix It
In python, we would start with forwardfill. However, doing this in SQL means that we can take advantage of the power of our data warehouse.
In SQL, we use LAST_VALUE
. See this article for a more in depth explanation.
Here is the syntax:
SELECT MACHINE_ID, OBSERVATION_DATETIME, LAST_VALUE( CASING_TEMPERATURE_F ignore NULLS ) OVER ( PARTITION BY MACHINE_ID ORDER BY OBSERVATION_DATETIME ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS LATEST_CASING_TEMPERATURE_F, LAST_VALUE( BEARING_TEMPERATURE_F ignore NULLS ) OVER ( PARTITION BY MACHINE_ID ORDER BY OBSERVATION_DATETIME ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS LATEST_BEARING_TEMPERATURE_F, LAST_VALUE(FLYWHEEL_RPM ignore NULLS) OVER ( PARTITION BY MACHINE_ID ORDER BY OBSERVATION_DATETIME ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS LATEST_FLYWHEEL_RPM, --8<-- snip --8<-- FROM hourly_machine_data
Result with the missing replaced with LAST_VALUE
There we have it!
Conclusion
Hopefully I’ve been able to shine a light on LAST_VALUE
 and it’s cousin, FIRST_VALUE
 , which are lesser known SQL Window functions.
Josh Berry (@Twitter) leads Customer Facing Data Science at Rasgo and has been in the data and analytics profession since 2008. Josh spent 10 years at Comcast where he built the data science team and was a key owner of the internally developed Comcast feature store - one of the first feature stores to hit the market. Following Comcast, Josh was a critical leader in building out Customer Facing Data Science at DataRobot. In his spare time Josh performs complex analysis on interesting topics such as baseball, F1 racing, housing market predictions, and more.
Original. Reposted with permission.
- Using SHAP Values for Model Interpretability in Machine Learning
- Tips for Handling Large Datasets in Python
- Advanced Error Handling in Python: Beyond Try-Except
- 5 Error Handling Patterns in Python (Beyond Try-Except)
- The Optimal Way to Input Missing Data with Pandas fillna()
- How to Identify Missing Data in Time-Series Datasets
Latest Posts
- Building Pure Python Web Apps with Reflex
- Here’s When You Would Choose Spreadsheets Over SQL
- 5 NotebookLM Tips to Make Your Day a Little Easier
- We Benchmarked DuckDB, SQLite, and Pandas on 1M Rows: Here’s What Happened
- Prompt Engineering Templates That Work: 7 Copy-Paste Recipes for LLMs
- A Complete Guide to Seaborn
Top Posts |
---|
- We Benchmarked DuckDB, SQLite, and Pandas on 1M Rows: Here’s What Happened
- How I Actually Use Statistics as a Data Scientist
- The Lazy Data Scientist’s Guide to Exploratory Data Analysis
- Prompt Engineering Templates That Work: 7 Copy-Paste Recipes for LLMs
- 10 Command-Line Tools Every Data Scientist Should Know
- A Gentle Introduction to TypeScript for Python Programmers
- A Complete Guide to Seaborn
- 5 Fun AI Agent Projects for Absolute Beginners
- From Excel to Python: 7 Steps Analysts Can Take Today
- A Gentle Introduction to MCP Servers and Clients