Data modeling for modern data warehouses

Activity schema is an open standard for data modeling and transformation in a data warehouse.
It is designed to make data modeling and analysis substantially simpler, faster, and more reliable.

Single Source Of Truth

Data is modeled using independent activities.

One Table

All warehouse data is in a single time-series table

Single Model Layer

All plots and analyses for BI run against a single table


A Different Way to Model

An activity schema models all data in the warehouse as a single time series table.

Data is built from independent activities instead of facts and dimensions.

Any activity can be combined with any other by using relationships in time instead of foreign keys, allowing for true ad hoc queries.

Diagram showing many data tables with dependencies between each other, with a set of plots referring on them
Model Dependencies

Traditional Modeling

Existing data modeling approaches, such as a star schema, have many layers of dependencies.

These are difficult to manage and maintain. The source of truth is not always clear, they are harder to debug, and require more documentation to use.


Activity Schema

An activity schema transforms source tables into a single, time series table called an activity stream. All downstream plots, tables, materialized views, etc used for BI are built directly from that single table, with no other dependencies.

Diagram showing a single table (the activity stream) with plots depending on it
For data engineers

Maintainable Models

Fewer models

One business concept per activity means fewer models to manage, understand, and maintain

Simple data lineage

A single data layer makes tracing data provenance and debugging far easier

Easier to build

No joins between models means no need to tie disparate source systems together

Faster updates

Time-series modeling means incremental updates (rather than full rebuilds) by default

Quickly accomodate source data changes

Changes to source data typically only affect a single activity

No data dictionaries

Fewer models, with one concept each, makes them vastly easier to document


Faster Analysis and Querying

Single source of truth

Each activity represents a single concept (like a 'page view' or 'completed order'), so it's always clear which to use

Autogenerated queries

A standard data model means that queries don't have to be written by hand

Query across all data sources

Time-based joins means any activity can be queried and combined with another without defining foreign keys

True ad-hoc querying

Because all activities are related in time, swapping one activity for another requires no structural changes to queries.

Reusable analyses

A standard data model means that any analysis can be reused across companies. A customer acquisition cost calculation for one company can be shared with another.

High performance

Queries run substantially faster against an activity stream table, which has fewer columns, requires fewer joins, and can be easily partitioned by activity or time