🔄 Slowly Changing Dimensions

SCD describes how you handle changes in dimension attributes over time. A customer moves from NYC to LA — do you overwrite, track history, or keep both?

This is one of the most-tested topics in data modeling interviews. Know all types.

Type 0 — Retain Original

Never update. The original value is preserved forever. Use for attributes that should never change (original signup date, birth date).

Type 1 — Overwrite

Replace old value with new. No history kept. Simple but you lose the ability to report "as-of" a past date. Use when history doesn't matter.
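In plain Python terms, Type 1 is just an in-place overwrite. A minimal sketch, assuming the dimension is held as a dict keyed by the natural key (illustrative structure, not a warehouse API):

```python
# Type 1 sketch: overwrite in place -- the prior value is simply lost.
dim_customer = {"C-42": {"name": "Alice", "city": "NYC"}}

def scd1_update(dim, key, attrs):
    """Overwrite attributes in place; no history survives."""
    dim[key].update(attrs)

scd1_update(dim_customer, "C-42", {"city": "LA"})
# dim_customer["C-42"]["city"] is now "LA"; "NYC" is unrecoverable.
```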

Type 2 — Add New Row

Insert a new row with effective dates. Full history preserved. The gold standard for "as-of" reporting. Most interview questions focus here.

Type 3 — Add Column

Add a "previous_value" column. Limited history (only one prior version). Rarely used — too rigid for most scenarios.

Type 4 — History Table

Keep current values in the main dimension, full history in a separate table. Used when history queries are infrequent but must be available.

Type 6 — Hybrid (1+2+3)

Combines Type 1 overwrite of "current" columns with Type 2 row versioning and Type 3 previous column. Maximum flexibility, maximum complexity.
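To make the hybrid concrete, here is a toy Python sketch (column names illustrative): Type 2 versioned rows, a Type 3 previous_city, and a Type 1 current_city that is overwritten on every historical row:

```python
# One entity, Type 2 row versioning plus Type 1/Type 3 columns.
rows = [
    {"sk": 101, "customer_id": "C-42", "city": "NYC",
     "previous_city": None, "current_city": "NYC", "is_current": True},
]

def scd6_change(rows, customer_id, new_city, next_sk):
    """Apply a Type 6 change: close the old row, insert a new version,
    and sync the Type 1 'current_city' across all rows of the entity."""
    old = next(r for r in rows
               if r["customer_id"] == customer_id and r["is_current"])
    old["is_current"] = False
    rows.append({"sk": next_sk, "customer_id": customer_id, "city": new_city,
                 "previous_city": old["city"], "current_city": new_city,
                 "is_current": True})
    # Type 1 part: every historical row reflects the latest value.
    for r in rows:
        if r["customer_id"] == customer_id:
            r["current_city"] = new_city

scd6_change(rows, "C-42", "LA", 102)
# rows[0] keeps city="NYC" (history) but now shows current_city="LA".
```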

Implementing SCD Type 2

SCD2 is the most interview-critical pattern. Here's the anatomy of a Type 2 dimension:

| sk  | customer_id | name  | city | eff_start  | eff_end    | is_current |
|-----|-------------|-------|------|------------|------------|------------|
| 101 | C-42        | Alice | NYC  | 2023-01-01 | 2024-06-14 | N          |
| 102 | C-42        | Alice | LA   | 2024-06-15 | 9999-12-31 | Y          |

When Alice moves from NYC to LA, we close the old row (set eff_end, is_current=N) and insert a new row with the updated city.

The change is applied in three steps:

1. Detect the change
2. Close the old row (set eff_end, is_current = 'N')
3. Insert the new row (new surrogate key, is_current = 'Y')

```sql
-- Close the existing record
UPDATE dim_customer
SET eff_end = '2024-06-14', is_current = 'N'
WHERE customer_id = 'C-42' AND is_current = 'Y';

-- Insert the new version
INSERT INTO dim_customer (customer_id, name, city, eff_start, eff_end, is_current)
VALUES ('C-42', 'Alice', 'LA', '2024-06-15', '9999-12-31', 'Y');
```
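The payoff of Type 2 is the "as-of" lookup. A minimal Python sketch over the rows above (ISO-8601 date strings compare correctly as plain strings, so no date parsing is needed):

```python
rows = [
    {"sk": 101, "customer_id": "C-42", "city": "NYC",
     "eff_start": "2023-01-01", "eff_end": "2024-06-14"},
    {"sk": 102, "customer_id": "C-42", "city": "LA",
     "eff_start": "2024-06-15", "eff_end": "9999-12-31"},
]

def as_of(rows, customer_id, date):
    """Return the version of the entity that was effective on `date`."""
    return next(r for r in rows
                if r["customer_id"] == customer_id
                and r["eff_start"] <= date <= r["eff_end"])

city_then = as_of(rows, "C-42", "2024-01-15")["city"]  # "NYC"
city_now = as_of(rows, "C-42", "2024-07-01")["city"]   # "LA"
```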

Surrogate vs Natural Keys

SCD Type 2 requires surrogate keys. Here's why:

🔑 Surrogate Key

  • 📌 Synthetic, auto-generated (integer sequence or hash)
  • 📌 Unique per version of an entity
  • 📌 Stable — never changes even if source key changes
  • 📌 Required for SCD2 (multiple rows per entity)
  • 📌 Enables consistent joins across time

🏷️ Natural Key

  • 📌 The business identifier (customer_id, email)
  • 📌 Comes from the source system
  • 📌 Can change (email updates, mergers)
  • 📌 Not unique in SCD2 (same customer_id, multiple rows)
  • 📌 Kept as a non-unique business key column
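The relationship between the two key types can be sketched in a few lines of Python (surrogate keys from an integer sequence; names are illustrative):

```python
from itertools import count

# A surrogate-key sequence plus a non-unique natural key column.
sk_seq = count(101)
dim = []

def new_version(customer_id, city):
    """Every version of the same natural key gets a fresh surrogate key."""
    row = {"sk": next(sk_seq), "customer_id": customer_id, "city": city}
    dim.append(row)
    return row["sk"]

# A fact row captured while Alice lived in NYC stores this surrogate key:
fact_fk = new_version("C-42", "NYC")
# Later she moves: same natural key, new surrogate key, old fact untouched.
new_version("C-42", "LA")
```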

💡 Interview Gold

"Always use surrogate keys in dimensions. Natural keys are preserved as a business key column, but the surrogate key is what fact tables reference. This decouples the warehouse from source system key changes."

Late-Arriving Dimensions & Mini-Dimensions

Two advanced SCD scenarios that separate senior engineers from juniors.

⏰ Late-Arriving Dimensions

A fact record arrives before its dimension record. Example: a sales event for a new customer whose profile hasn't loaded yet.

Solution: Insert a placeholder (inferred) dimension row with the natural key and null attributes. When the real data arrives, update the placeholder.
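A minimal sketch of the inferred-member pattern in Python (dict-based dimension, illustrative field names):

```python
dim = {}  # natural key -> dimension row

def get_or_infer(dim, customer_id):
    """Return the dimension row, inserting a placeholder if the fact
    arrived before the customer profile did."""
    if customer_id not in dim:
        dim[customer_id] = {"customer_id": customer_id,
                            "name": None, "city": None, "inferred": True}
    return dim[customer_id]

def load_profile(dim, customer_id, name, city):
    """When the real profile lands, enrich the placeholder in place."""
    row = get_or_infer(dim, customer_id)
    row.update({"name": name, "city": city, "inferred": False})

placeholder = get_or_infer(dim, "C-99")    # fact loads with a valid FK
was_placeholder = placeholder["inferred"]  # True at this point
load_profile(dim, "C-99", "Bob", "SF")     # profile arrives later
```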

🔬 Mini-Dimensions

When a dimension has rapidly-changing attributes (e.g., customer credit score, loyalty tier), putting them in the main dimension causes SCD2 row explosion.

Solution: Extract volatile attributes into a separate mini-dimension. The fact table gets a second FK pointing to the mini-dimension for the current state at transaction time.

dim_customer_profile (mini-dimension)

| profile_key | credit_band | loyalty_tier | income_band |
|-------------|-------------|--------------|-------------|
| 1           | 700-749     | Gold         | 80K-100K    |
| 2           | 750-799     | Platinum     | 100K-120K   |
| 3           | 650-699     | Silver       | 60K-80K     |

🎯 Interview Tip

"Mini-dimensions prevent SCD2 row explosion for high-cardinality, frequently-changing attributes. Banding continuous values (income, age, score) into ranges keeps the mini-dimension small and reusable."
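Banding itself is trivial to implement. A sketch with illustrative boundaries matching the table above:

```python
def credit_band(score):
    """Collapse a continuous credit score into coarse bands so the
    mini-dimension stays small (boundaries are illustrative)."""
    if score < 650:
        return "<650"
    elif score < 700:
        return "650-699"
    elif score < 750:
        return "700-749"
    elif score < 800:
        return "750-799"
    return "800+"

# Many distinct scores collapse into one mini-dimension row:
same_row = credit_band(712) == credit_band(748)  # True
```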

SCD Implementation in dbt

dbt has first-class support for SCD Type 2 via snapshots. Instead of writing manual UPDATE/INSERT logic, dbt detects changes and manages versioned rows automatically.

```sql
-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
```

dbt supports two snapshot strategies for detecting changes:

⏱️ Timestamp Strategy

  • 📌 Compares an updated_at column between runs
  • 📌 Only detects changes when the timestamp advances
  • 📌 Preferred when source has a reliable updated_at
  • 📌 More efficient — no need to compare every column

🔍 Check Strategy

  • 📌 Compares specified columns value-by-value
  • 📌 Detects changes even without a timestamp column
  • 📌 Use check_cols='all' or list specific columns
  • 📌 More expensive — hashes/compares every row
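Conceptually, the check strategy fingerprints the tracked columns and treats a changed fingerprint as a new version. A hedged Python sketch (this mirrors the idea, not dbt's internal implementation):

```python
import hashlib

def row_fingerprint(row, check_cols):
    """Hash the tracked columns; a changed hash signals a new version."""
    payload = "|".join(str(row.get(c)) for c in check_cols)
    return hashlib.md5(payload.encode()).hexdigest()

old = {"customer_id": "C-42", "name": "Alice", "city": "NYC"}
new = {"customer_id": "C-42", "name": "Alice", "city": "LA"}
cols = ["name", "city"]

changed = row_fingerprint(old, cols) != row_fingerprint(new, cols)  # True
```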

When dbt detects a change, it automatically generates these metadata columns:

| customer_id | name  | city | dbt_scd_id | dbt_valid_from | dbt_valid_to |
|-------------|-------|------|------------|----------------|--------------|
| C-42        | Alice | NYC  | abc123…    | 2023-01-01     | 2024-06-15   |
| C-42        | Alice | LA   | def456…    | 2024-06-15     | null         |

dbt_valid_to = null means the row is current. dbt_scd_id is a unique hash per version — it's your surrogate key.
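Querying a snapshot then reduces to two filters. A Python sketch over the rows above, assuming the half-open interval convention [valid_from, valid_to) implied by the table:

```python
snapshot = [
    {"customer_id": "C-42", "city": "NYC",
     "dbt_valid_from": "2023-01-01", "dbt_valid_to": "2024-06-15"},
    {"customer_id": "C-42", "city": "LA",
     "dbt_valid_from": "2024-06-15", "dbt_valid_to": None},
]

def current(snapshot):
    """dbt_valid_to IS NULL marks the live version of each entity."""
    return [r for r in snapshot if r["dbt_valid_to"] is None]

def as_of(snapshot, date):
    """Half-open interval: a row is effective from valid_from (inclusive)
    up to valid_to (exclusive), or indefinitely when valid_to is null."""
    return [r for r in snapshot
            if r["dbt_valid_from"] <= date
            and (r["dbt_valid_to"] is None or date < r["dbt_valid_to"])]

current_city = current(snapshot)[0]["city"]        # "LA"
city_in_jan = as_of(snapshot, "2024-01-01")[0]["city"]  # "NYC"
```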

🎯 Interview Callout

"How do you implement SCD2 in dbt? → Use dbt snapshots with timestamp or check strategy. dbt automatically manages valid_from/valid_to and detects changes. No manual UPDATE/INSERT logic needed."

SCD Implementation in Spark / Delta Lake

In Spark + Delta Lake, SCD Type 2 is implemented using the MERGE operation. Delta's ACID transactions guarantee the close-old + insert-new steps happen atomically.

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit

target = DeltaTable.forPath(spark, "/data/dim_customer")
source = spark.read.parquet("/data/customer_changes")

# Step 1: Identify changed records. Join on an explicit condition so both
# t.customer_id and s.customer_id remain addressable after the join.
changes = target.toDF().alias("t").join(
    source.alias("s"),
    col("s.customer_id") == col("t.customer_id")
).where("s.updated_at > t.updated_at AND t.is_current = true")

# Step 2: Stage new versions + closed records
new_versions = changes.select(
    col("s.customer_id"), col("s.name"), col("s.city"),
    col("s.updated_at").alias("eff_start"),
    lit("9999-12-31").cast("date").alias("eff_end"),
    lit(True).alias("is_current")
)

closed_records = changes.select(
    col("t.customer_id"), col("t.name"), col("t.city"),
    col("t.eff_start"),
    col("s.updated_at").alias("eff_end"),
    lit(False).alias("is_current")
)

# Step 3: MERGE -- close old + insert new atomically
target.alias("t").merge(
    closed_records.union(new_versions).alias("s"),
    "t.customer_id = s.customer_id AND t.eff_start = s.eff_start"
).whenMatchedUpdate(set={
    "eff_end": "s.eff_end",
    "is_current": "s.is_current"
}).whenNotMatchedInsertAll().execute()
```

The same logic as pure SQL MERGE (useful for Databricks SQL or Spark SQL):

```sql
MERGE INTO dim_customer t
USING staged_changes s
ON t.customer_id = s.customer_id AND t.eff_start = s.eff_start
WHEN MATCHED THEN
  UPDATE SET t.eff_end = s.eff_end, t.is_current = s.is_current
WHEN NOT MATCHED THEN
  INSERT *;
```

🔒 Delta Lake Advantage

ACID transactions make SCD2 atomic — readers never see a partially-closed old row without its new version. No "half-updated" dimensions.

⏳ Time Travel Bonus

Delta's time travel (VERSION AS OF) gives you point-in-time queries for free, complementing SCD2's own effective dates.

🎯 Interview Callout

"When should you use dbt snapshots vs Spark MERGE? → dbt snapshots are ideal for batch ELT in a warehouse (Snowflake, BigQuery, Redshift) — zero custom code. Spark MERGE is better for lakehouse architectures (Delta Lake, Iceberg) where you need fine-grained control, handle large-scale data, or run SCD logic inside a Spark pipeline."

Quiz: Test Yourself

Q1: A customer changes address. You need to report revenue by the address at the time of each order. Which SCD type?

Q2: Why does SCD Type 2 REQUIRE surrogate keys?

Q3: A fact record arrives for customer_id='C-99' but that customer doesn't exist in dim_customer yet. What do you do?

Q4: Customer credit scores change monthly. Using SCD2 for all attributes causes the dimension to grow 12x per year. What's the fix?

Q5: What is SCD Type 6 and why is it named "6"?

Q6: In dbt snapshots, what is the difference between the 'timestamp' and 'check' strategies?

Q7: Why is Delta Lake's MERGE preferred over separate UPDATE + INSERT for SCD Type 2?