
Metrics

What metrics look like#

re_data metrics are currently just SQL expressions that are added to select statements run automatically by re_data.

re_data query
select metric1, metric2, metric3
from your_table
where data in time_window

These simple definitions still make it possible to create a wide variety of metrics. If the metrics you would like to monitor are more complicated than that, we advise creating dbt models that filter/join/group your data and then monitoring those models with re_data.
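For instance, a hypothetical model joining orders to customers could look like the sketch below; re_data would then monitor completed_orders like any other table (all table and column names here are illustrative, not part of re_data):

models/completed_orders.sql
select
    o.order_id,
    o.amount,
    o.created_at
from {{ ref('orders') }} o
join {{ ref('customers') }} c on o.customer_id = c.customer_id
where o.status = 'completed'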

Time-based#

re_data metrics are time-based: re_data filters all your table data to a specific time window, using a time column such as created_at. You can choose any time window with the proper config.

info

Why do we only support time-based metrics? We believe all datasets gain in quality when they have some time-based column (think creation_time, ingestion_time, etc.) added to them. This way you know when data is coming in, when it was created, etc. Without a time-based mark, it's quite hard to define metrics & anomalies properly. Let us know if you think it doesn't make sense for your use case.

Default metrics#

Default metrics are computed for all monitored tables. They can be either:

  • table-level - a metric for the whole table
  • column-level - a metric for a specific column in the table

Almost all metrics can be found in the re_data_metrics model created by re_data. Apart from that, re_data also creates models for specific metrics (as views filtering the metrics table).

Sample table for example metrics
__    title               rental_rate  rating  created_at
1     Chamber Italian     4.99         NC-17   2021-09-01T11:00:00
2     Grosse Wonderful    4.99         R       2021-09-01T12:00:00
3     Airport Pollock     4.99         R       2021-09-01T15:00:00
4     Bright Encounters   4.99         PG-13   2021-09-01T09:00:00
5     Academy Dinosaur    0.99         PG-13   2021-09-01T08:00:00
6     Ace Goldfinger      4.99         G       2021-09-01T10:00:00
7     Adaptation Holes    2.99         NC-17   2021-09-01T11:00:00
8     Affair Prejudice    2.99         G       2021-09-01T19:00:00
9     African Egg         2.99         G       2021-09-01T20:00:00
10    Agent Truman        2.99         PG      2021-09-01T07:00:00
11    Airplane Sierra     4.99         PG-13   2021-09-02T09:00:00
12    Alabama Devil       2.99         PG-13   2021-09-02T10:00:00
13    Aladdin Calendar    4.99         NC-17   2021-09-02T11:00:00
14    Alamo Videotape     0.99         G       2021-09-02T12:00:00
15    Alaska Phantom      0.99         PG      2021-09-02T13:00:00
16    Date Speed          0.99         R       2021-09-02T14:00:00
17    Ali Forever         4.99         PG      2021-09-02T15:00:00
18    Alice Fantasia      0.99         NC-17   2021-09-02T16:00:00
19    Alien Center        2.99         NC-17   2021-09-02T17:00:00

Below is a list of currently available metrics and how they are computed internally by re_data:

Default Table level metrics#

row_count#

Number of rows added to the table in a specific time range.

row_count = 10 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
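Conceptually, this is a plain count over the filtered time window. A minimal sketch of the kind of query involved (assuming created_at is the configured time column):

select count(*) as row_count
from your_table
where created_at >= '2021-09-01T00:00:00'
  and created_at < '2021-09-02T00:00:00'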

freshness#

Information about the latest record in a given time frame. Suppose we are calculating the freshness metric for the table above in the time window [2021-09-01T00:00:00, 2021-09-02T00:00:00). The latest record in that time frame appears in row 9, with created_at=2021-09-01T20:00:00. freshness is the difference, in seconds, between the end of the time window and the latest record in the time frame. For this example, freshness would be calculated as

2021-09-02T00:00:00 - 2021-09-01T20:00:00 = 14400 (4 hours)
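A minimal sketch of how such a value can be computed (Postgres-style syntax shown; the SQL re_data actually generates may differ per database):

select extract(epoch from (
    timestamp '2021-09-02 00:00:00' - max(created_at)
)) as freshness
from your_table
where created_at >= '2021-09-01T00:00:00'
  and created_at < '2021-09-02T00:00:00'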

schema_changes#

Information about schema changes in the monitored table.

Stored separately from the rest of the metrics in the re_data_schema_changes model.

caution

The schema_changes metric is different from the rest. Because information about schema changes is gathered by comparing schemas between re_data runs, this metric doesn't filter changes to the specified time window and, in fact, doesn't use the time_window settings at all.

Default Column level metrics#

min#

Minimal value appearing in a given numeric column.

min(rental_rate) = 0.99 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

max#

Maximal value appearing in a given numeric column.

max(rental_rate) = 4.99 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

avg#

Average of all values appearing in a given numeric column.

avg(rental_rate) = 3.79 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

stddev#

The standard deviation of all values appearing in a given numeric column.

stddev(rental_rate) = 1.3984117975602022 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

variance#

The variance of all values appearing in a given numeric column.

variance(rental_rate) = 1.9555555555555557 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
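All of the numeric column metrics above can be computed in a single scan. A minimal sketch using standard SQL aggregates (most databases default to sample stddev/variance, which matches the example values):

select
    min(rental_rate)      as min,
    max(rental_rate)      as max,
    avg(rental_rate)      as avg,
    stddev(rental_rate)   as stddev,
    variance(rental_rate) as variance
from your_table
where created_at >= '2021-09-01T00:00:00'
  and created_at < '2021-09-02T00:00:00'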

min_length#

Minimal length of all strings appearing in a given column.

min_length(rating) = 1 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

max_length#

Maximal length of all strings appearing in a given column.

max_length(rating) = 5 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

avg_length#

The average length of all strings appearing in a given column.

avg_length(rating) = 2.7 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
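These metrics map directly onto the database's length function. A minimal sketch:

select
    min(length(rating)) as min_length,
    max(length(rating)) as max_length,
    avg(length(rating)) as avg_length
from your_table
where created_at >= '2021-09-01T00:00:00'
  and created_at < '2021-09-02T00:00:00'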

nulls_count#

The number of null values in a given column for the specific time range.

nulls_count(rating) = 0 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

missing_count#

The number of null and empty string values in a given column for the specific time range.

missing_count(rating) = 0 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

missing_percent#

The percentage of null and empty string values in a given column for the specific time range.

missing_percent(rating) = 0 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00

nulls_percent#

The percentage of null values in a given column for the specific time range.

nulls_percent(rating) = 0 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
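A minimal sketch of how the null/missing family can be computed in one pass (illustrative SQL, not necessarily re_data's exact query):

select
    coalesce(sum(case when rating is null then 1 else 0 end), 0) as nulls_count,
    coalesce(sum(case when rating is null or rating = '' then 1 else 0 end), 0) as missing_count,
    100.0 * coalesce(sum(case when rating is null or rating = '' then 1 else 0 end), 0)
        / count(*) as missing_percent
from your_table
where created_at >= '2021-09-01T00:00:00'
  and created_at < '2021-09-02T00:00:00'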

Extra Metrics#

These metrics are provided by re_data but are not computed by default for monitored tables. It is worth noting that some of these metrics may be computationally heavy, which is why they aren't computed by default.

Extra Table Metrics#

distinct_table_rows#

This metric computes the number of distinct rows in the given table.

distinct_table_rows = 10 where time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
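A minimal sketch of one way to compute this:

select count(*) as distinct_table_rows
from (
    select distinct *
    from your_table
    where created_at >= '2021-09-01T00:00:00'
      and created_at < '2021-09-02T00:00:00'
) distinct_rows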

Extra Column Metrics#

info

regex_match_expression is resolved at runtime depending on the database in use.

match_regex#

Determines the count of values in a given column that match the specified regex. Suppose we want to check if the rating column matches a specific regular expression pattern, and we define it in our dbt_project.yml file.

  • regex must be specified for this metric to work; otherwise a compiler exception will be raised.
specifying match_regex config
vars:
  re_data:monitored:
    - tables:
        - name: sample_table
          time_filter: created_at
          metrics:
            column:
              rating:
                - match_regex:
                    regex: ([0-9]+)
select coalesce(
        sum(
            case when {{ regex_match_expression('rating', '([0-9]+)') }}
                then 1
            else 0
            end
        ), 0
    )
from your_table
where created_at between time_window_start and time_window_end
Rows in the time window matching the regex:

1     Chamber Italian     4.99         NC-17   2021-09-01T11:00:00
4     Bright Encounters   4.99         PG-13   2021-09-01T09:00:00
5     Academy Dinosaur    0.99         PG-13   2021-09-01T08:00:00
7     Adaptation Holes    2.99         NC-17   2021-09-01T11:00:00
match_regex = 4 where created_at is between 2021-09-01T00:00:00 and 2021-09-02T00:00:00

match_regex_percent#

Determines the percentage of values in a given column that match the specified regex.

Suppose we use the same configuration as for the match_regex metric above. Then:

match_regex_percent = 40 where created_at is between 2021-09-01T00:00:00 and 2021-09-02T00:00:00
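A minimal sketch of the percentage variant, reusing the regex_match_expression macro shown above (illustrative, not re_data's exact SQL):

select 100.0 * coalesce(
        sum(
            case when {{ regex_match_expression('rating', '([0-9]+)') }}
                then 1
            else 0
            end
        ), 0
    ) / count(*)
from your_table
where created_at between time_window_start and time_window_end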

not_match_regex#

Determines the count of values in a given column that do not match the specified regex.

Suppose we pass in ([0-9]+) as our regex parameter. Then:

not_match_regex = 6 where created_at is between 2021-09-01T00:00:00 and 2021-09-02T00:00:00

distinct_values#

Determines the count of distinct values in a given column.

rating  count
-----------------
PG-13   2
G       3
NC-17   2
PG      1
R       2

time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
distinct_values = 5 (PG-13, G, NC-17, PG, R)

duplicate_values#

Determines the count of distinct values in a given column that appear more than once.

rating  count
-----------------
PG-13   2
G       3
NC-17   2
PG      1
R       2

time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
duplicate_values = 4 (PG-13, G, NC-17, R)

duplicate_rows#

Determines the count of rows in a given column whose values are duplicated.

rating  count
-----------------
PG-13   2
G       3
NC-17   2
PG      1
R       2

time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
duplicate_rows = 9 (PG-13[2], G[3], NC-17[2], R[2])

unique_rows#

Determines the count of rows in a given column whose values appear exactly once.

rating  count
-----------------
PG-13   2
G       3
NC-17   2
PG      1
R       2

time window is >= 2021-09-01T00:00:00 and < 2021-09-02T00:00:00
unique_rows = 1 (PG)
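All four of these metrics can be derived from a single group by. A minimal sketch (illustrative SQL; re_data's own macros may differ):

with counts as (
    select rating, count(*) as cnt
    from your_table
    where created_at >= '2021-09-01T00:00:00'
      and created_at < '2021-09-02T00:00:00'
    group by rating
)
select
    count(*)                                         as distinct_values,
    coalesce(sum(case when cnt > 1 then 1 end), 0)   as duplicate_values,
    coalesce(sum(case when cnt > 1 then cnt end), 0) as duplicate_rows,
    coalesce(sum(case when cnt = 1 then 1 end), 0)   as unique_rows
from counts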

approx_distinct_values#

Determines the approximate distinct count of values in a given column. This metric is useful for large tables where an approximation is sufficient and query performance matters.
Note: Postgres does not support approximate distinct counts, unlike BigQuery, Snowflake, and Redshift.
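For example, Snowflake and BigQuery expose an approx_count_distinct function, while on Postgres an exact count(distinct ...) is the natural fallback (a sketch of the dialect difference, not re_data's generated SQL):

-- Snowflake / BigQuery
select approx_count_distinct(rating) from your_table

-- Postgres fallback (exact, potentially slower on large tables)
select count(distinct rating) from your_table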

Defining your metric#

Defining your own metric is very easy. You can create both table-level and column-level metrics.

Metrics can be defined in any place in your dbt project, as macros with names following the pattern: re_data_metric_(your_name)

  • Both column and table-level metrics take a dictionary called context. Any extra configuration passed to a metric in re_data:monitored is merged into the context dictionary.

# Below is the structure of a context dictionary by default
{
    "column_name": # contains the name of the column to compute the metric on. null in table metrics
    "table_name": # contains the full table name for reference in metric query definition
    "metric_name": # name of the metric being computed
    "time_filter": # time column used to filter the time window
}

Here are examples of custom metrics (one table, one column level)

macros/my_metrics.sql
{% macro re_data_metric_buy_count(context) %}
    coalesce(
        sum(
            case when event_type = 'buy'
                then 1
            else 0
            end
        ), 0
    )
{% endmacro %}

{% macro re_data_metric_distinct_count(context) %}
    count(distinct( {{ context.column_name }} ))
{% endmacro %}

Some important differences between table and column-level metrics are:

  • Table-level metrics reference column names directly.
  • Column-level metrics reference the column through the context.column_name variable (which makes them more reusable).
  • Tip: when using subqueries in custom metrics, remember to filter records to the time frame in context. You can use the in_time_window({{context.time_filter}}) macro to achieve this; the distinct_table_rows metric is a good example, as sketched below.
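To illustrate the tip above, here is a hypothetical table-level metric using a subquery. The exact SQL that in_time_window renders is an assumption here, so treat this as a sketch and consult the distinct_table_rows source for canonical usage:

macros/my_metrics.sql
{% macro re_data_metric_not_null_rows(context) %}
    (
        select count(*)
        from {{ context.table_name }}
        where {{ in_time_window(context.time_filter) }}  -- assumed to render the time-window filter
          and {{ context.column_name }} is not null
    )
{% endmacro %}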

Your metric ideas#

If you have suggestions for other metrics you would like supported, please let us know on Slack! 😊