🔄 Slowly Changing Dimensions

SCD describes how you handle changes in dimension attributes over time. A customer moves from NYC to LA — do you overwrite, track history, or keep both?

This is one of the most-tested topics in data modeling interviews. Know all types.

Type 0 — Retain Original

Never update. The original value is preserved forever. Use for attributes that should never change (original signup date, birth date).

Type 1 — Overwrite

Replace old value with new. No history kept. Simple but you lose the ability to report "as-of" a past date. Use when history doesn't matter.
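In plain Python terms, Type 1 is just an in-place overwrite. A minimal sketch, assuming the dimension is held as a dict keyed by the natural key (illustrative structure, not a warehouse API):

```python
# Type 1 sketch: overwrite in place -- the prior value is simply lost.
dim_customer = {"C-42": {"name": "Alice", "city": "NYC"}}

def scd1_update(dim, key, attrs):
    """Overwrite attributes in place; no history survives."""
    dim[key].update(attrs)

scd1_update(dim_customer, "C-42", {"city": "LA"})
# dim_customer["C-42"]["city"] is now "LA"; "NYC" is unrecoverable.
```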

Type 2 — Add New Row

Insert a new row with effective dates. Full history preserved. The gold standard for "as-of" reporting. Most interview questions focus here.

Type 3 — Add Column

Add a "previous_value" column. Limited history (only one prior version). Rarely used — too rigid for most scenarios.

Type 4 — History Table

Keep current values in the main dimension, full history in a separate table. Used when history queries are infrequent but must be available.

Type 6 — Hybrid (1+2+3)

Combines Type 1 overwrite of "current" columns with Type 2 row versioning and Type 3 previous column. Maximum flexibility, maximum complexity.
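To make the hybrid concrete, here is a toy Python sketch (column names illustrative): Type 2 versioned rows, a Type 3 previous_city, and a Type 1 current_city that is overwritten on every historical row:

```python
# One entity, Type 2 row versioning plus Type 1/Type 3 columns.
rows = [
    {"sk": 101, "customer_id": "C-42", "city": "NYC",
     "previous_city": None, "current_city": "NYC", "is_current": True},
]

def scd6_change(rows, customer_id, new_city, next_sk):
    """Apply a Type 6 change: close the old row, insert a new version,
    and sync the Type 1 'current_city' across all rows of the entity."""
    old = next(r for r in rows
               if r["customer_id"] == customer_id and r["is_current"])
    old["is_current"] = False
    rows.append({"sk": next_sk, "customer_id": customer_id, "city": new_city,
                 "previous_city": old["city"], "current_city": new_city,
                 "is_current": True})
    # Type 1 part: every historical row reflects the latest value.
    for r in rows:
        if r["customer_id"] == customer_id:
            r["current_city"] = new_city

scd6_change(rows, "C-42", "LA", 102)
# rows[0] keeps city="NYC" (history) but now shows current_city="LA".
```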

Implementing SCD Type 2

SCD2 is the most interview-critical pattern. Here's the anatomy of a Type 2 dimension:

| sk  | customer_id | name  | city | eff_start  | eff_end    | is_current |
|-----|-------------|-------|------|------------|------------|------------|
| 101 | C-42        | Alice | NYC  | 2023-01-01 | 2024-06-14 | N          |
| 102 | C-42        | Alice | LA   | 2024-06-15 | 9999-12-31 | Y          |

When Alice moves from NYC to LA, we close the old row (set eff_end, is_current=N) and insert a new row with the updated city.

The change is applied in three steps:

1. Detect the change
2. Close the old row (set eff_end, is_current = 'N')
3. Insert the new row (new surrogate key, is_current = 'Y')

```sql
-- Close the existing record
UPDATE dim_customer
SET eff_end = '2024-06-14', is_current = 'N'
WHERE customer_id = 'C-42' AND is_current = 'Y';

-- Insert the new version
INSERT INTO dim_customer (customer_id, name, city, eff_start, eff_end, is_current)
VALUES ('C-42', 'Alice', 'LA', '2024-06-15', '9999-12-31', 'Y');
```
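The payoff of Type 2 is the "as-of" lookup. A minimal Python sketch over the rows above (ISO-8601 date strings compare correctly as plain strings, so no date parsing is needed):

```python
rows = [
    {"sk": 101, "customer_id": "C-42", "city": "NYC",
     "eff_start": "2023-01-01", "eff_end": "2024-06-14"},
    {"sk": 102, "customer_id": "C-42", "city": "LA",
     "eff_start": "2024-06-15", "eff_end": "9999-12-31"},
]

def as_of(rows, customer_id, date):
    """Return the version of the entity that was effective on `date`."""
    return next(r for r in rows
                if r["customer_id"] == customer_id
                and r["eff_start"] <= date <= r["eff_end"])

city_then = as_of(rows, "C-42", "2024-01-15")["city"]  # "NYC"
city_now = as_of(rows, "C-42", "2024-07-01")["city"]   # "LA"
```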

Surrogate vs Natural Keys

SCD Type 2 requires surrogate keys. Here's why:

🔑 Surrogate Key

  • 📌 Synthetic, auto-generated (integer sequence or hash)
  • 📌 Unique per version of an entity
  • 📌 Stable — never changes even if source key changes
  • 📌 Required for SCD2 (multiple rows per entity)
  • 📌 Enables consistent joins across time

🏷️ Natural Key

  • 📌 The business identifier (customer_id, email)
  • 📌 Comes from the source system
  • 📌 Can change (email updates, mergers)
  • 📌 Not unique in SCD2 (same customer_id, multiple rows)
  • 📌 Kept as a non-unique business key column
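The relationship between the two key types can be sketched in a few lines of Python (surrogate keys from an integer sequence; names are illustrative):

```python
from itertools import count

# A surrogate-key sequence plus a non-unique natural key column.
sk_seq = count(101)
dim = []

def new_version(customer_id, city):
    """Every version of the same natural key gets a fresh surrogate key."""
    row = {"sk": next(sk_seq), "customer_id": customer_id, "city": city}
    dim.append(row)
    return row["sk"]

# A fact row captured while Alice lived in NYC stores this surrogate key:
fact_fk = new_version("C-42", "NYC")
# Later she moves: same natural key, new surrogate key, old fact untouched.
new_version("C-42", "LA")
```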

💡 Interview Gold

"Always use surrogate keys in dimensions. Natural keys are preserved as a business key column, but the surrogate key is what fact tables reference. This decouples the warehouse from source system key changes."

Late-Arriving Dimensions & Mini-Dimensions

Two advanced SCD scenarios that separate senior engineers from juniors.

⏰ Late-Arriving Dimensions

A fact record arrives before its dimension record. Example: a sales event for a new customer whose profile hasn't loaded yet.

Solution: Insert a placeholder (inferred) dimension row with the natural key and null attributes. When the real data arrives, update the placeholder.
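A minimal sketch of the inferred-member pattern in Python (dict-based dimension, illustrative field names):

```python
dim = {}  # natural key -> dimension row

def get_or_infer(dim, customer_id):
    """Return the dimension row, inserting a placeholder if the fact
    arrived before the customer profile did."""
    if customer_id not in dim:
        dim[customer_id] = {"customer_id": customer_id,
                            "name": None, "city": None, "inferred": True}
    return dim[customer_id]

def load_profile(dim, customer_id, name, city):
    """When the real profile lands, enrich the placeholder in place."""
    row = get_or_infer(dim, customer_id)
    row.update({"name": name, "city": city, "inferred": False})

placeholder = get_or_infer(dim, "C-99")    # fact loads with a valid FK
was_placeholder = placeholder["inferred"]  # True at this point
load_profile(dim, "C-99", "Bob", "SF")     # profile arrives later
```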

🔬 Mini-Dimensions

When a dimension has rapidly-changing attributes (e.g., customer credit score, loyalty tier), putting them in the main dimension causes SCD2 row explosion.

Solution: Extract volatile attributes into a separate mini-dimension. The fact table gets a second FK pointing to the mini-dimension for the current state at transaction time.

dim_customer_profile (mini-dimension)

| profile_key | credit_band | loyalty_tier | income_band |
|-------------|-------------|--------------|-------------|
| 1           | 700-749     | Gold         | 80K-100K    |
| 2           | 750-799     | Platinum     | 100K-120K   |
| 3           | 650-699     | Silver       | 60K-80K     |

🎯 Interview Tip

"Mini-dimensions prevent SCD2 row explosion for high-cardinality, frequently-changing attributes. Banding continuous values (income, age, score) into ranges keeps the mini-dimension small and reusable."
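Banding itself is trivial to implement. A sketch with illustrative boundaries matching the table above:

```python
def credit_band(score):
    """Collapse a continuous credit score into coarse bands so the
    mini-dimension stays small (boundaries are illustrative)."""
    if score < 650:
        return "<650"
    elif score < 700:
        return "650-699"
    elif score < 750:
        return "700-749"
    elif score < 800:
        return "750-799"
    return "800+"

# Many distinct scores collapse into one mini-dimension row:
same_row = credit_band(712) == credit_band(748)  # True
```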

SCD Implementation in dbt

dbt has first-class support for SCD Type 2 via snapshots. Instead of writing manual UPDATE/INSERT logic, dbt detects changes and manages versioned rows automatically.

```sql
-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
```

dbt supports two snapshot strategies for detecting changes:

⏱️ Timestamp Strategy

  • 📌 Compares an updated_at column between runs
  • 📌 Only detects changes when the timestamp advances
  • 📌 Preferred when source has a reliable updated_at
  • 📌 More efficient — no need to compare every column

🔍 Check Strategy

  • 📌 Compares specified columns value-by-value
  • 📌 Detects changes even without a timestamp column
  • 📌 Use check_cols='all' or list specific columns
  • 📌 More expensive — hashes/compares every row
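Conceptually, the check strategy fingerprints the tracked columns and treats a changed fingerprint as a new version. A hedged Python sketch (this mirrors the idea, not dbt's internal implementation):

```python
import hashlib

def row_fingerprint(row, check_cols):
    """Hash the tracked columns; a changed hash signals a new version."""
    payload = "|".join(str(row.get(c)) for c in check_cols)
    return hashlib.md5(payload.encode()).hexdigest()

old = {"customer_id": "C-42", "name": "Alice", "city": "NYC"}
new = {"customer_id": "C-42", "name": "Alice", "city": "LA"}
cols = ["name", "city"]

changed = row_fingerprint(old, cols) != row_fingerprint(new, cols)  # True
```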

When dbt detects a change, it automatically generates these metadata columns:

| customer_id | name  | city | dbt_scd_id | dbt_valid_from | dbt_valid_to |
|-------------|-------|------|------------|----------------|--------------|
| C-42        | Alice | NYC  | abc123…    | 2023-01-01     | 2024-06-15   |
| C-42        | Alice | LA   | def456…    | 2024-06-15     | null         |

dbt_valid_to = null means the row is current. dbt_scd_id is a unique hash per version — it's your surrogate key.
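Querying a snapshot then reduces to two filters. A Python sketch over the rows above, assuming the half-open interval convention [valid_from, valid_to) implied by the table:

```python
snapshot = [
    {"customer_id": "C-42", "city": "NYC",
     "dbt_valid_from": "2023-01-01", "dbt_valid_to": "2024-06-15"},
    {"customer_id": "C-42", "city": "LA",
     "dbt_valid_from": "2024-06-15", "dbt_valid_to": None},
]

def current(snapshot):
    """dbt_valid_to IS NULL marks the live version of each entity."""
    return [r for r in snapshot if r["dbt_valid_to"] is None]

def as_of(snapshot, date):
    """Half-open interval: a row is effective from valid_from (inclusive)
    up to valid_to (exclusive), or indefinitely when valid_to is null."""
    return [r for r in snapshot
            if r["dbt_valid_from"] <= date
            and (r["dbt_valid_to"] is None or date < r["dbt_valid_to"])]

current_city = current(snapshot)[0]["city"]        # "LA"
city_in_jan = as_of(snapshot, "2024-01-01")[0]["city"]  # "NYC"
```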

🎯 Interview Callout

"How do you implement SCD2 in dbt? → Use dbt snapshots with timestamp or check strategy. dbt automatically manages valid_from/valid_to and detects changes. No manual UPDATE/INSERT logic needed."

SCD Implementation in Spark / Delta Lake

In Spark + Delta Lake, SCD Type 2 is implemented using the MERGE operation. Delta's ACID transactions guarantee the close-old + insert-new steps happen atomically.

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit

target = DeltaTable.forPath(spark, "/data/dim_customer")
source = spark.read.parquet("/data/customer_changes")

# Step 1: Identify changed records. Join on an explicit condition so both
# t.customer_id and s.customer_id remain addressable after the join.
changes = target.toDF().alias("t").join(
    source.alias("s"),
    col("s.customer_id") == col("t.customer_id")
).where("s.updated_at > t.updated_at AND t.is_current = true")

# Step 2: Stage new versions + closed records
new_versions = changes.select(
    col("s.customer_id"), col("s.name"), col("s.city"),
    col("s.updated_at").alias("eff_start"),
    lit("9999-12-31").cast("date").alias("eff_end"),
    lit(True).alias("is_current")
)

closed_records = changes.select(
    col("t.customer_id"), col("t.name"), col("t.city"),
    col("t.eff_start"),
    col("s.updated_at").alias("eff_end"),
    lit(False).alias("is_current")
)

# Step 3: MERGE -- close old + insert new atomically
target.alias("t").merge(
    closed_records.union(new_versions).alias("s"),
    "t.customer_id = s.customer_id AND t.eff_start = s.eff_start"
).whenMatchedUpdate(set={
    "eff_end": "s.eff_end",
    "is_current": "s.is_current"
}).whenNotMatchedInsertAll().execute()
```

The same logic as pure SQL MERGE (useful for Databricks SQL or Spark SQL):

```sql
MERGE INTO dim_customer t
USING staged_changes s
ON t.customer_id = s.customer_id AND t.eff_start = s.eff_start
WHEN MATCHED THEN
  UPDATE SET t.eff_end = s.eff_end, t.is_current = s.is_current
WHEN NOT MATCHED THEN
  INSERT *;
```

🔒 Delta Lake Advantage

ACID transactions make SCD2 atomic — readers never see a partially-closed old row without its new version. No "half-updated" dimensions.

⏳ Time Travel Bonus

Delta's time travel (VERSION AS OF) gives you point-in-time queries for free, complementing SCD2's own effective dates.

🎯 Interview Callout

"When should you use dbt snapshots vs Spark MERGE? → dbt snapshots are ideal for batch ELT in a warehouse (Snowflake, BigQuery, Redshift) — zero custom code. Spark MERGE is better for lakehouse architectures (Delta Lake, Iceberg) where you need fine-grained control, handle large-scale data, or run SCD logic inside a Spark pipeline."

Quiz: Test Yourself

Q1: A customer changes address. You need to report revenue by the address at the time of each order. Which SCD type?

Q2: Why does SCD Type 2 REQUIRE surrogate keys?

Q3: A fact record arrives for customer_id='C-99' but that customer doesn't exist in dim_customer yet. What do you do?

Q4: Customer credit scores change monthly. Using SCD2 for all attributes causes the dimension to grow 12x per year. What's the fix?

Q5: What is SCD Type 6 and why is it named "6"?

Q6: In dbt snapshots, what is the difference between the 'timestamp' and 'check' strategies?

Q7: Why is Delta Lake's MERGE preferred over separate UPDATE + INSERT for SCD Type 2?