Open-sourcing dbt-score: lint model metadata with ease!

With the rise of tools like dbt, it has become much easier to document, test and create data models in a data warehouse. In a previous blog post we highlighted how most of this can be achieved through a simple declarative approach. With that configuration, it is possible to define custom metadata properties. This results in an increase of metadata related to data models, which is great! Metadata includes documentation, tests, column types, constraints, and more. At scale it becomes increasingly difficult to manage all of it. That’s where dbt-score comes in!

The value of metadata

As an organisation grows, data teams grow and the number of people working on (and with) data models increases. This means that developing data models becomes democratised and people from various teams and backgrounds can collaborate on them. They usually work together on SQL and metadata, building upon each other’s work. This is a good thing, as people closer to the business can create their own data models, leading to faster development cycles.

So, the amount of metadata associated with data models is increasing as is the number of people developing them. This is a recipe for disaster because chaos can ensue. How will we ensure that data models across teams have high-quality metadata? Defining standards is easy, ensuring the standards are met is a different story!

Take for example the metadata of the two models below. Both are valid and can be compiled by dbt. However, the orders model lacks some metadata in comparison to customers. It is missing a test, description, team and owner.

"1" data-code-block-lang="yaml">models:
  - name: orders
    config:
      materialized: table
    columns:
      - name: customer_id
        data_type: int
      - name: customer_name
        data_type: varchar
  - name: customers
    description: Table of customers
    config:
      materialized: table
    meta:
      owner: Jane Doe
      team: Sales
    columns:
      - name: customer_id
        data_type: int
        description: Customer identifier
        tests:
          - unique
      - name: customer_name
        data_type: varchar
        description: Name of the customer

These properties can be important in various ways:

Tests are important as they ensure data in a data model is what the users expect. If tests are not added, faulty data may be used to make important business decisions.
The description of a model gives the user a clear idea of how the data can be used. It can be stored in the database and even in a data catalog.
The owner and team of a data model describes who is responsible for the data, for example, if users have questions. It can also be used to automatically notify the team of a data model when tests are failing.

These are just small examples of metadata attributes that are useful. In reality, the possibilities are endless. Therefore, it is essential to have a way to check if certain metadata attributes are set correctly. At Picnic, we took matters into our own hands and developed a new tool: dbt-score, a linter for dbt metadata.

dbt-score

dbt-score is not opinionated and is highly configurable by design. The reason is that data model requirements vary a lot per company, department or team. Depending on these requirements, teams are free to customise dbt-score. Metadata properties in the example above might be valuable for a certain team, but not another. The goal of dbt-score is to let teams decide on the standards themselves, as they know best what they need.

Linting

The approach to linting is simple: dbt-score has a set of rules defined that are evaluated on a model. dbt-score comes with a set of predefined rules that are enabled by default. A rule is a simple function that takes as input a dbt model, applies some logic, and outputs a violation when the model violates the logic defined in the rule. After all rules have been evaluated on a model, the model maturity is assessed and a score is provided. Based on the score, the model is awarded a badge. When linting a dbt project, these steps are executed for all enabled models in a dbt project. The example below shows the linting results of orders.

"2" data-code-block-lang="shell">> dbt-score lint
🚧 orders (score: 2.7)
 WARN (medium) dbt_score.rules.generic.columns_have_description: Columns lack a description: customer_id, customer_name.
 WARN (high) dbt_score.rules.generic.has_description: Model lacks a description.
 WARN (medium) dbt_score.rules.generic.has_owner: Model lacks an owner.
 WARN (medium) dbt_score.rules.generic.sql_has_reasonable_number_of_lines: SQL query too long: 238 lines (> 200).
 WARN (medium) dbt_score_rules.custom_rules.has_test: Model lacks a test.
Project score: 2.7 🚧

It is clear from the output that, considering the dbt-score project rules, the model does not meet the standards. Each rule has a short description specifying how to address the issue at hand. The linting results can be easily integrated into continuous integration systems, in order to enforce rules before models reach production.

Rules

Rules can be configured and disabled, and are also very easy to create. This enables every team to have their own set of rules applied to their models. Take a look at the rule below that checks if a model has a description. With just a couple of lines of code, dbt-score will be able to apply it to all models in a project.

"2" data-code-block-lang="python">from dbt_score import rule, RuleViolation, Model
@rule
def has_description(model: Model) -> RuleViolation | None:
    """A model should have a description."""
    if not model.get("description"):
        return RuleViolation(message="Model lacks a description.")

Badges

Badges are an important part of dbt-score to gamify the effort of adding metadata to models. Writing metadata can be fun after all! When a model is scored, a badge (one of 🥇🥈🥉🚧) is awarded based on certain thresholds. Both the thresholds and the badges can be configured per project, which can be useful if different projects have different metadata quality requirements. The badge of a model is shown in the output:

"2" data-code-block-lang="shell">🥇 customers (score: 10.0)
 OK   dbt_score.rules.generic.columns_have_description
 OK   dbt_score.rules.generic.has_description
 OK   dbt_score.rules.generic.has_owner
 OK   dbt_score.rules.generic.sql_has_reasonable_number_of_lines
 OK   dbt_score_rules.custom_rules.has_test

The badge can be displayed in other places as well, e.g. in a data catalog. This can be a big motivation for data model developers to keep improving their models. So, setting up the right badge configurations for a project is essential to get the most out of dbt-score!

Wrapping up

The metadata of dbt models doesn’t need to be messy. dbt-score provides a framework for data model developers to set standards and also ensure they are met. As every organisation has different needs and requirements for their data models, dbt-score is highly configurable.

At Picnic, we will keep developing, maintaining and using dbt-score. Interested in learning more? Take a look at the documentation website and feel free to contribute!