πŸ”₯ The Interview Gauntlet

15 tricky scenario questions that separate senior data engineers from the rest. No warm-up β€” every question is interview-grade.

⚠️ Rules of Engagement

Read each scenario carefully. Pick your answer before revealing the solution. In a real interview, you'd need to explain your reasoning β€” practice saying it out loud.

Round 1: Rapid Fire

Q1: Revenue jumped 3x after joining to a bridge table. What went wrong?

Q2: A customer moved from NYC→LA. After applying SCD Type 1, all their past orders show revenue in LA. Is this correct?

Q3: A dashboard sums monthly account balances across 12 months and shows $600K. The actual balance is $50K. What's wrong?

Q4: Facts for product P-999 arrived but dim_product has no P-999 row yet. Your pipeline must not fail. What do you do?

Q5: An analyst asks you to add an avg_unit_price column to the fact table. What's your response?

Round 2: Design Challenges

🏒 Challenge 1: E-Commerce Order Model

"Design a dimensional model for an e-commerce company. They need to track orders, returns, and customer lifetime value. Orders can have multiple items, promotions, and shipping methods."

Grain: One row per order line item.

Fact table: fact_order_lines (order_line_id, order_date_key, customer_key, product_key, promotion_key, shipping_key, flag_key, order_number [degenerate dim], quantity, unit_price, discount_amount, line_total)

Dimensions: dim_date, dim_customer (SCD2 for address changes), dim_product, dim_promotion, dim_shipping_method, dim_order_flags (junk dim: is_gift, is_rush, payment_type)

Returns: Separate fact_returns with return_date_key, original_order_line FK, reason_key, refund_amount. Don't put returns in fact_order_lines β€” different grain (one return per line, but returns can be partial).

CLV: Build as a derived periodic snapshot: fact_customer_monthly (customer_key, month_key, cumulative_revenue, order_count, months_since_first_order).

πŸ₯ Challenge 2: Hospital Patient Tracking

"Design a model to track patient encounters. A patient can have multiple diagnoses per visit, a doctor can see multiple patients, and you need to report on average length of stay by diagnosis."

Grain: One row per patient encounter.

Fact table: fact_encounter (encounter_key, admit_date_key, discharge_date_key [role-playing dim_date], patient_key, attending_doctor_key, department_key, length_of_stay_days, total_charges)

Many-to-many diagnoses: bridge_encounter_diagnosis (encounter_key, diagnosis_key, weight_factor, is_primary_flag). SUM(charges * weight_factor) gives correct per-diagnosis revenue.

Dimensions: dim_patient (SCD2 for address/insurance changes), dim_doctor, dim_diagnosis (ICD codes, category, body system), dim_department, dim_date

Avg length of stay: SUM(length_of_stay_days * weight_factor) / COUNT(DISTINCT encounter_key) β€” weighted by diagnosis bridge.

πŸ“Š Challenge 3: SaaS Subscription Metrics

"Design a model for a SaaS company that needs to track MRR, churn, expansions, and contractions. Subscriptions have plan changes, upgrades, and cancellations over time."

Fact table 1 (accumulating snapshot): fact_subscription (subscription_key, customer_key, plan_key, created_date_key, activated_date_key, cancelled_date_key, current_mrr, status). Updated as milestones occur.

Fact table 2 (periodic snapshot): fact_mrr_monthly (customer_key, month_key, beginning_mrr, new_mrr, expansion_mrr, contraction_mrr, churn_mrr, ending_mrr). One row per customer per month. Semi-additive: can SUM across customers, not across months.

Fact table 3 (transactional): fact_subscription_events (event_key, subscription_key, event_date_key, event_type [new/upgrade/downgrade/cancel], old_plan_key, new_plan_key, mrr_delta).

Dimensions: dim_customer (SCD2), dim_plan (features, tier, price), dim_date.

Round 3: Gotcha Questions

Q6: Two teams built separate star schemas for sales and support. They want to join on customer. What's the likely problem?

Q7: You need to track which students are eligible for which scholarship programs. There's no dollar amount. What do you build?

Q8: dim_customer has 50 geography-related columns (city, state, country, region, timezone, etc.) that are shared with dim_store. How do you refactor?

Q9: You need SCD2 history for as-of reporting, BUT also want to easily filter all of a customer's orders by their CURRENT city. Which SCD type?

Q10: Credit scores change weekly. Full SCD2 would create millions of rows. How do you track score at time of loan application?

Round 4: Final Five

Q11: "How do you prevent double-counting in a data model?" β€” Give the interviewer a structured 3-point answer.

Q12: "Design a fact table for daily warehouse inventory levels." What's your first sentence?

Q13: An analyst is confused by 3 date FKs in the fact table. How do you make the model more self-service?

Q14: "You just deployed a model change. How do you validate it?" Give a structured answer.

Q15: "Explain Kimball vs Inmon." What's the one-sentence distinction?

πŸ† Common Pitfalls Cheat Sheet

❌ Undefined Grain

If you can't state what one row represents in one sentence, stop and fix it before writing any DDL.

❌ Pre-Computed Ratios

Never store avg_price or margin% in fact tables. Store components (revenue, cost, quantity) and compute at query time.

❌ Summing Semi-Additives

Balances, inventory counts, and headcounts can't be summed across time. Use AVG, LAST_VALUE, or filter to single period.

❌ Missing Bridge Tables

Many-to-many without a bridge = row explosion = wrong totals. Always check cardinality before joining.

❌ Natural Keys as PKs

Natural keys change (email, product SKU). Use surrogate keys in dimensions. Keep the natural key as a business key column.

❌ Non-Conformed Dimensions

"Marketing's customer count doesn't match Sales." Same entity, different definitions. One conformed dimension fixes this.

πŸŽ“ You've Completed the Course!

You've covered data modeling fundamentals, fact/dimension tables, star and snowflake schemas, SCDs, advanced patterns, and survived the gauntlet. You're ready to ace your data modeling interview.