SCD describes how you handle changes in dimension attributes over time. A customer moves from NYC to LA — do you overwrite, track history, or keep both?
This is one of the most-tested topics in data modeling interviews. Know all types.
Type 0 — Retain Original
Never update. The original value is preserved forever. Use for attributes that should never change (original signup date, birth date).
Type 1 — Overwrite
Replace the old value with the new one. No history kept. Simple, but you lose the ability to report "as-of" a past date. Use when history doesn't matter.
Type 2 — Add New Row
Insert a new row with effective dates. Full history preserved. The gold standard for "as-of" reporting. Most interview questions focus here.
Type 3 — Add Column
Add a "previous_value" column. Limited history (only one prior version). Rarely used — too rigid for most scenarios.
Type 4 — History Table
Keep current values in the main dimension, full history in a separate table. Used when history queries are infrequent but must be available.
Type 6 — Hybrid (1+2+3)
Combines Type 1 overwrite of "current" columns with Type 2 row versioning and Type 3 previous column. Maximum flexibility, maximum complexity.
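The behavioral difference between the two most common types, 1 and 2, can be sketched with a toy in-memory dimension (plain Python dicts; names, dates, and the exclusive-end convention are illustrative, not a standard):

```python
import copy
from datetime import date

# Toy dimension: one row per version, keyed by a surrogate key (sk).
dim = [{"sk": 101, "customer_id": "C-42", "city": "NYC",
        "eff_start": date(2023, 1, 1), "eff_end": date(9999, 12, 31),
        "is_current": True}]

def scd1_update(dim, customer_id, city):
    """Type 1: overwrite in place -- the prior value is lost."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["city"] = city

def scd2_update(dim, customer_id, city, change_date):
    """Type 2: close the current row, then insert a new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["eff_end"] = change_date   # close the old version
            row["is_current"] = False
    dim.append({"sk": max(r["sk"] for r in dim) + 1,
                "customer_id": customer_id, "city": city,
                "eff_start": change_date, "eff_end": date(9999, 12, 31),
                "is_current": True})

dim_t1 = copy.deepcopy(dim)
scd1_update(dim_t1, "C-42", "LA")                   # still 1 row; NYC is gone
scd2_update(dim, "C-42", "LA", date(2024, 6, 15))   # 2 rows; history kept
```

Type 1 leaves one row per customer; Type 2 leaves one row per version, which is what makes "as-of" reporting possible.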
Implementing SCD Type 2
SCD2 is the most interview-critical pattern. Here's the anatomy of a Type 2 dimension:
| sk  | customer_id | name  | city | eff_start  | eff_end    | is_current |
|-----|-------------|-------|------|------------|------------|------------|
| 101 | C-42        | Alice | NYC  | 2023-01-01 | 2024-06-14 | N          |
| 102 | C-42        | Alice | LA   | 2024-06-15 | 9999-12-31 | Y          |
When Alice moves from NYC to LA, we close the old row (set eff_end, is_current=N) and insert a new row with the updated city.
Detect change → Close old row (set eff_end, flag = N) → Insert new row (new SK, flag = Y)
-- Close the existing record
UPDATE dim_customer
SET eff_end = '2024-06-14', is_current = 'N'
WHERE customer_id = 'C-42'
  AND is_current = 'Y';

-- Insert the new version
INSERT INTO dim_customer (customer_id, name, city, eff_start, eff_end, is_current)
VALUES ('C-42', 'Alice', 'LA', '2024-06-15', '9999-12-31', 'Y');
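With the versioned rows in place, "as-of" reporting is just a range predicate on the effective dates. A minimal sketch using SQLite (the table and values mirror the example above; ISO-8601 date strings compare correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        sk INTEGER, customer_id TEXT, name TEXT, city TEXT,
        eff_start TEXT, eff_end TEXT, is_current TEXT);
    INSERT INTO dim_customer VALUES
        (101, 'C-42', 'Alice', 'NYC', '2023-01-01', '2024-06-14', 'N'),
        (102, 'C-42', 'Alice', 'LA',  '2024-06-15', '9999-12-31', 'Y');
""")

# "As-of" lookup: which version was in effect on 2024-03-01?
(city,) = conn.execute("""
    SELECT city FROM dim_customer
    WHERE customer_id = 'C-42'
      AND ? BETWEEN eff_start AND eff_end
""", ("2024-03-01",)).fetchone()
# city == 'NYC' -- the address at that point in time
```

The same table answers "current state" queries via `is_current = 'Y'`, which is why the flag is kept alongside the dates.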
Surrogate vs Natural Keys
SCD Type 2 requires surrogate keys. Here's why:
🔑 Surrogate Key
📌 Synthetic, auto-generated (integer sequence or hash)
📌 Unique per version of an entity
📌 Stable — never changes even if source key changes
📌 Required for SCD2 (multiple rows per entity)
📌 Enables consistent joins across time
🏷️ Natural Key
📌 The business identifier (customer_id, email)
📌 Comes from the source system
📌 Can change (email updates, mergers)
📌 Not unique in SCD2 (same customer_id, multiple rows)
📌 Kept as a non-unique business key column
💡 Interview Gold
"Always use surrogate keys in dimensions. Natural keys are preserved as a business key column, but the surrogate key is what fact tables reference. This decouples the warehouse from source system key changes."
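One common way to mint version-level surrogate keys is to hash the natural key together with the version's start date (the same idea behind dbt's dbt_scd_id, shown later). A sketch, with md5 as an arbitrary choice of hash:

```python
import hashlib

def version_sk(natural_key: str, eff_start: str) -> str:
    """Hash-style surrogate key: unique per (entity, version) and
    independent of the source system's own key sequence."""
    return hashlib.md5(f"{natural_key}|{eff_start}".encode()).hexdigest()

sk_v1 = version_sk("C-42", "2023-01-01")  # NYC version
sk_v2 = version_sk("C-42", "2024-06-15")  # LA version
# Same customer, two versions -> two distinct surrogate keys,
# and the keys are deterministic across reloads.
```

Integer sequences work equally well; the hash variant is popular in ELT tools because it needs no central counter.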
Late-Arriving Dimensions & Mini-Dimensions
Two advanced SCD scenarios that separate senior engineers from juniors.
⏰ Late-Arriving Dimensions
A fact record arrives before its dimension record. Example: a sales event for a new customer whose profile hasn't loaded yet.
Solution: Insert a placeholder (inferred) dimension row with the natural key and null attributes. When the real data arrives, update the placeholder.
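The placeholder ("inferred member") pattern fits in a few lines; the helper name and row shape below are made up for illustration:

```python
dim = {}  # natural_key -> dimension row

def get_or_infer(dim, natural_key):
    """Return the row for natural_key, inserting an inferred
    placeholder (null attributes) if it hasn't loaded yet."""
    if natural_key not in dim:
        dim[natural_key] = {"sk": len(dim) + 1, "customer_id": natural_key,
                            "name": None, "city": None, "is_inferred": True}
    return dim[natural_key]

# A fact arrives before the customer's profile:
fact = {"customer_id": "C-99", "amount": 50.0}
fact["customer_sk"] = get_or_infer(dim, "C-99")["sk"]

# Later, the real profile loads: overwrite the placeholder's attributes
# (a Type 1 update), keeping the same surrogate key so the fact still joins.
dim["C-99"].update({"name": "Bob", "city": "SF", "is_inferred": False})
```

Keeping the surrogate key stable is the crux: the fact row never needs to be revisited once the real attributes arrive.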
🔬 Mini-Dimensions
When a dimension has rapidly-changing attributes (e.g., customer credit score, loyalty tier), putting them in the main dimension causes SCD2 row explosion.
Solution: Extract volatile attributes into a separate mini-dimension. The fact table gets a second FK pointing to the mini-dimension for the current state at transaction time.
dim_customer_profile (mini-dimension)
| profile_key | credit_band | loyalty_tier | income_band |
|-------------|-------------|--------------|-------------|
| 1           | 700-749     | Gold         | 80K-100K    |
| 2           | 750-799     | Platinum     | 100K-120K   |
| 3           | 650-699     | Silver       | 60K-80K     |
🎯 Interview Tip
"Mini-dimensions prevent SCD2 row explosion for high-cardinality, frequently-changing attributes. Banding continuous values (income, age, score) into ranges keeps the mini-dimension small and reusable."
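Banding itself is a plain range lookup. A sketch with cut-offs matching the mini-dimension table above (the ranges are illustrative, not a credit-scoring standard):

```python
def credit_band(score: int) -> str:
    """Map a raw credit score onto a mini-dimension band."""
    for low, high in [(650, 699), (700, 749), (750, 799)]:
        if low <= score <= high:
            return f"{low}-{high}"
    return "other"

# Hundreds of distinct raw scores collapse into a handful of bands,
# so mini-dimension rows are few and heavily reused.
bands = {credit_band(s) for s in range(650, 800)}
# bands == {'650-699', '700-749', '750-799'}
```

Because every possible combination of bands pre-exists in the mini-dimension, an attribute change means swapping the fact's FK, never inserting a new dimension row.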
SCD Implementation in dbt
dbt has first-class support for SCD Type 2 via snapshots. Instead of writing manual UPDATE/INSERT logic, dbt detects changes and manages versioned rows automatically.
dbt supports two snapshot strategies for detecting changes:
⏱️ Timestamp Strategy
📌 Compares an updated_at column between runs
📌 Only detects changes when the timestamp advances
📌 Preferred when source has a reliable updated_at
📌 More efficient — no need to compare every column
🔍 Check Strategy
📌 Compares specified columns value-by-value
📌 Detects changes even without a timestamp column
📌 Use check_cols='all' or list specific columns
📌 More expensive — hashes/compares every row
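Concretely, a timestamp-strategy snapshot uses dbt's snapshot block; the schema name, source reference, and column names below are assumptions for illustration:

```sql
{% snapshot dim_customer_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```

Running `dbt snapshot` builds and incrementally maintains the versioned table, including its metadata columns.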
When dbt detects a change, it automatically generates these metadata columns:
| customer_id | name  | city | dbt_scd_id | dbt_valid_from | dbt_valid_to |
|-------------|-------|------|------------|----------------|--------------|
| C-42        | Alice | NYC  | abc123…    | 2023-01-01     | 2024-06-15   |
| C-42        | Alice | LA   | def456…    | 2024-06-15     | null         |
dbt_valid_to = null means the row is current. dbt_scd_id is a unique hash per version — it's your surrogate key.
🎯 Interview Callout
"How do you implement SCD2 in dbt? → Use dbt snapshots with timestamp or check strategy. dbt automatically manages valid_from/valid_to and detects changes. No manual UPDATE/INSERT logic needed."
SCD Implementation in Spark / Delta Lake
In Spark + Delta Lake, SCD Type 2 is implemented using the MERGE operation. Delta's ACID transactions guarantee the close-old + insert-new steps happen atomically.
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit
target = DeltaTable.forPath(spark, "/data/dim_customer")
source = spark.read.parquet("/data/customer_changes")
# Step 1: Identify changed records
changes = target.toDF().alias("t").join(
    source.alias("s"), col("t.customer_id") == col("s.customer_id")
).where("s.updated_at > t.eff_start AND t.is_current = true")
# Step 2: Stage new versions + closed records
new_versions = changes.select(
col("s.customer_id"), col("s.name"), col("s.city"),
col("s.updated_at").alias("eff_start"),
lit("9999-12-31").cast("date").alias("eff_end"),
lit(True).alias("is_current")
)
closed_records = changes.select(
col("t.customer_id"), col("t.name"), col("t.city"),
col("t.eff_start"),
col("s.updated_at").alias("eff_end"),
lit(False).alias("is_current")
)
# Step 3: MERGE — close old + insert new atomically
target.alias("t").merge(
closed_records.union(new_versions).alias("s"),
"t.customer_id = s.customer_id AND t.eff_start = s.eff_start"
).whenMatchedUpdate(set={
"eff_end": "s.eff_end",
"is_current": "s.is_current"
}).whenNotMatchedInsertAll().execute()
The same logic expressed as a pure SQL MERGE (useful for Databricks SQL or Spark SQL):
MERGE INTO dim_customer t
USING staged_changes s
ON t.customer_id = s.customer_id AND t.eff_start = s.eff_start
WHEN MATCHED THEN UPDATE SET
t.eff_end = s.eff_end,
t.is_current = s.is_current
WHEN NOT MATCHED THEN INSERT *;
🔒 Delta Lake Advantage
ACID transactions make SCD2 atomic — readers never see a partially-closed old row without its new version. No "half-updated" dimensions.
⏳ Time Travel Bonus
Delta's time travel (VERSION AS OF) gives you point-in-time queries for free, complementing SCD2's own effective dates.
🎯 Interview Callout
"When should you use dbt snapshots vs Spark MERGE? → dbt snapshots are ideal for batch ELT in a warehouse (Snowflake, BigQuery, Redshift) — zero custom code. Spark MERGE is better for lakehouse architectures (Delta Lake, Iceberg) where you need fine-grained control, handle large-scale data, or run SCD logic inside a Spark pipeline."
Quiz: Test Yourself
Q1: A customer changes address. You need to report revenue by the address at the time of each order. Which SCD type?
Q2: Why does SCD Type 2 REQUIRE surrogate keys?
Q3: A fact record arrives for customer_id='C-99' but that customer doesn't exist in dim_customer yet. What do you do?
Q4: Customer credit scores change monthly. Using SCD2 for all attributes causes the dimension to grow 12x per year. What's the fix?
Q5: What is SCD Type 6 and why is it named "6"?
Q6: In dbt snapshots, what is the difference between the 'timestamp' and 'check' strategies?
Q7: Why is Delta Lake's MERGE preferred over separate UPDATE + INSERT for SCD Type 2?