📊 One Big Table (OBT)

Every data modeling course teaches you to normalize. The OBT pattern says: forget all that — pre-join everything into a single, massively wide, completely denormalized table.

An OBT takes your entire star schema — fact table plus every dimension — and materializes it into one flat table. No joins. No lookups. Just one table to rule them all.

In the age of columnar cloud warehouses — Snowflake, BigQuery, ClickHouse, Redshift — OBT has become a surprisingly practical pattern. A query touching 3 columns out of 200 only scans those 3 columns.

OBT: 1 table, 0 joins, pre-joined from:

  • 📌 fact_sales
  • 📌 dim_customer
  • 📌 dim_product
  • 📌 dim_store
  • 📌 dim_date
```sql
-- Materialize an OBT from a star schema
CREATE TABLE analytics.obt_sales AS
SELECT
    f.order_id,
    f.order_date,
    f.quantity,
    f.revenue,
    -- Customer dimension
    c.customer_name,
    c.customer_segment,
    c.customer_region,
    -- Product dimension
    p.product_name,
    p.product_category,
    p.brand,
    -- Store dimension
    s.store_name,
    s.store_city,
    s.store_state,
    -- Date dimension
    d.day_of_week,
    d.month_name,
    d.quarter,
    d.fiscal_year
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_store    s ON f.store_key    = s.store_key
JOIN dim_date     d ON f.date_key     = d.date_key;
```
| Aspect | OBT Advantage | OBT Drawback |
|---|---|---|
| Query simplicity | No joins; any analyst can query | |
| Performance | Columnar scans only needed cols | Expensive if SELECT * |
| Onboarding | One table to learn | Column explosion (200+ cols) |
| Storage | Columnar compression mitigates | 2–10× more raw storage |
| Freshness | Materialized for speed | Stale until next rebuild |
| Updates | | Full rebuild on dim change |
📖 Analogy

OBT is like a fully printed report binder. A star schema is a filing cabinet where you pull invoices, customer info, and product specs from separate drawers and staple them together. OBT is the pre-stapled binder — every page has all context already assembled. More paper (storage), but anyone can grab it and read instantly.

💡 Interview Insight

"OBT is a fully denormalized table where all dimensions are pre-joined into the fact table. It works in columnar engines because they only scan columns your query references — so a 200-column table querying 3 columns is just as fast. I'd use OBT for read-heavy dashboards and self-serve BI. The tradeoff is storage redundancy and rebuild cost. It's a serving-layer pattern, not a storage-layer pattern."

Activity Schema

Modern apps emit hundreds of event types. In Kimball, each gets its own fact table — fact_page_views, fact_purchases, fact_signups. With 50+ event types, you get 50+ tables and 50+ pipelines.

The Activity Schema takes a radically different approach: one unified fact table for all event types.

❌ Traditional (50 tables)

  • 📌 fact_page_views
  • 📌 fact_purchases
  • 📌 fact_signups
  • 📌 fact_refunds
  • 📌 fact_clicks … × 50

✅ Activity Schema (1 table)

  • 📌 activity_id (UUID)
  • 📌 entity_id (who)
  • 📌 activity_type (what)
  • 📌 ts (when)
  • 📌 feature_json (details)
```sql
-- Activity Schema table definition
CREATE TABLE activity_stream (
    activity_id    STRING,         -- Unique event ID
    entity_id      STRING,         -- Who did it
    activity_type  STRING,         -- What happened
    ts             TIMESTAMP,      -- When
    revenue_impact DECIMAL(18,2),  -- Revenue if applicable
    feature_json   JSON            -- Everything else
);

-- User journey in one query
SELECT entity_id, activity_type, ts
FROM activity_stream
WHERE entity_id = 'user_12345'
ORDER BY ts;

-- Extract from JSON for a specific event type
SELECT
    entity_id,
    ts,
    JSON_EXTRACT_SCALAR(feature_json, '$.product_id') AS product_id
FROM activity_stream
WHERE activity_type = 'add_to_cart'
  AND ts >= CURRENT_DATE - INTERVAL '7' DAY;
```
| Aspect | Benefit | Drawback |
|---|---|---|
| Flexibility | New event = new value, no schema change | No schema enforcement per event |
| Simplicity | One table, one pipeline | Overloaded columns lose meaning |
| Funnel analysis | Trivial: all events co-located | JSON parsing slower than typed cols |
| Data quality | | Typo in activity_type = silent loss |
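Because the Activity Schema co-locates every event for an entity in one ordered stream, a funnel is a single pass over that stream. A minimal Python sketch of the idea (the event rows and step names are invented for illustration):

```python
from collections import defaultdict

# Hypothetical activity stream rows: (entity_id, activity_type, ts)
events = [
    ("u1", "page_view", 1), ("u1", "add_to_cart", 2), ("u1", "purchase", 3),
    ("u2", "page_view", 1), ("u2", "add_to_cart", 4),
    ("u3", "page_view", 2),
]

def funnel(events, steps):
    """Count how many entities reach each step of the funnel, in order."""
    progress = defaultdict(int)  # entity_id -> index of next expected step
    for entity, activity, _ts in sorted(events, key=lambda e: e[2]):
        i = progress[entity]
        if i < len(steps) and activity == steps[i]:
            progress[entity] = i + 1
    counts = [0] * len(steps)
    for reached in progress.values():
        for i in range(reached):
            counts[i] += 1
    return dict(zip(steps, counts))

print(funnel(events, ["page_view", "add_to_cart", "purchase"]))
# {'page_view': 3, 'add_to_cart': 2, 'purchase': 1}
```

With 50 separate fact tables, the same funnel would need a UNION or multiple joins before this ordering step is even possible.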
📖 Analogy

Activity Schema is like a universal journal. Instead of separate notebooks for work, personal, and exercise, you write everything in one journal with a tag on each entry. Flexible and easy to carry — but reviewing only your exercise history means flipping past a lot of irrelevant entries.

💡 Interview Insight

"Activity Schema stores all event types in a single table with entity_id, activity_type, timestamp, and a JSON column. Proposed by Ahmed Elsamadisi at Narrator. Excellent for user journey analysis and funnel conversion. The tradeoff is weaker governance: no schema contract per event type, JSON parsing adds cost, and a typo in activity_type creates silent data quality issues."

📐 Wide Tables & Columnar Storage

For decades, wider tables meant slower queries: in a row-store database, reading one column from a 200-column table means loading all 200 columns of every row, so wide-table patterns were impractical.

Columnar storage flipped the equation. In Snowflake, BigQuery, or ClickHouse, each column is stored independently. A query touching 3 out of 200 columns reads only those 3. The other 197 have zero I/O cost.

🐌 Row Store (PostgreSQL)

  • 📌 All columns stored together per row
  • 📌 SELECT city, SUM(revenue) reads ALL columns
  • 📌 NULLs consume space in each row
  • 📌 Wide tables = massive I/O waste

🚀 Columnar Store (Snowflake)

  • 📌 Each column stored independently on disk
  • 📌 SELECT city, SUM(revenue) reads ONLY 2 cols
  • 📌 RLE: NULL × 1,000,000 stored as one entry
  • 📌 Dictionary encoding: repeated strings → tiny ints

📦 Run-Length Encoding

Long consecutive runs of the same value (including NULL) are stored as a single entry. A column that is NULL for 1M rows collapses to "NULL × 1,000,000", so sparse columns are nearly free.

📖 Dictionary Encoding

Low-cardinality columns (e.g., 5 regions across 100M rows) are replaced with integer lookups: a column of repeated strings becomes tiny integers plus a 5-entry dictionary.
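Both encodings can be sketched in a few lines of Python. This is a toy model of the idea, not any engine's actual on-disk format:

```python
def rle_encode(column):
    """Collapse consecutive runs of the same value into (value, count) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(column):
    """Replace repeated values with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

# A sparse column: one value and 999,999 NULLs -> just 2 run entries
sparse = ["0.3"] + [None] * 999_999
print(rle_encode(sparse))    # [('0.3', 1), (None, 999999)]

# A low-cardinality column: repeated strings -> tiny ints + 3-entry dictionary
regions = ["EU", "US", "EU", "EU", "APAC", "US"]
print(dict_encode(regions))  # (['EU', 'US', 'APAC'], [0, 1, 0, 0, 2, 1])
```

The million-row sparse column compresses to two tuples, which is why adding hundreds of mostly-NULL columns to a wide table costs almost nothing in a columnar store.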

| Scenario | Normalized Model | Wide Table |
|---|---|---|
| Dashboard query (3 cols) | 4 joins across 5 tables | Direct scan, 3 cols only |
| Ad-hoc exploration | Must know schema relationships | One table, autocomplete |
| Adding new attribute | ALTER TABLE or new table + join | ALTER TABLE ADD COLUMN |
| Storage (sparse data) | Compact per table, join overhead | RLE/dictionary → NULLs ~free |
| Write performance | Fast (narrow inserts) | Slower (wide inserts) |
| Transactional updates | Efficient (one table) | Expensive (wide row) |
💡 Interview Insight

"Columnar engines store each column independently, so a query touching 3 of 300 columns reads only those 3. Combined with RLE (making NULLs essentially free) and dictionary encoding (replacing repeated strings with tiny integers), wide sparse tables are storage-efficient. Columnar storage decouples table width from query cost — you pay for what you read, not what exists."

🔄 EAV — The Flexibility Trap

EAV (Entity-Attribute-Value) stores data vertically instead of horizontally. Instead of one column per attribute, you have three columns: entity_id, attribute_name, and attribute_value.

📋 Relational (Horizontal)

  • 📌 id=1 | name=Shirt | color=Blue | weight=0.3
  • 📌 id=2 | name=Drill | color=Red | voltage=120
  • 📌 2 rows × typed columns
  • 📌 Simple: WHERE color='Blue' AND weight>1

🔄 EAV (Vertical)

  • 📌 (1, 'name', 'Shirt'), (1, 'color', 'Blue')
  • 📌 (1, 'weight', '0.3'), (1, 'fabric', 'Cotton')
  • 📌 8 rows × 3 string columns
  • 📌 Nightmare: self-joins + CAST
```sql
-- In a normal table: simple and fast
SELECT product_id, name
FROM products
WHERE color = 'Blue' AND weight_kg > 1.0;

-- In EAV: a nightmare of self-joins
SELECT e.entity_id
FROM product_attributes e
JOIN product_attributes c
  ON e.entity_id = c.entity_id
 AND c.attribute_name = 'color'
 AND c.attribute_value = 'Blue'
JOIN product_attributes w
  ON e.entity_id = w.entity_id
 AND w.attribute_name = 'weight'
 AND CAST(w.attribute_value AS DECIMAL) > 1.0
WHERE e.attribute_name = 'name';
```
| Problem | Why It Hurts |
|---|---|
| Type coercion | Everything is VARCHAR; CAST at query time |
| No constraints | Can't enforce "weight must be positive" |
| Query complexity | Self-joins or pivoting; 10× more complex SQL |
| Performance | Self-joins on large EAV tables are expensive |
| BI incompatibility | Tableau/Looker expect columnar data |
| Indexing | Can't index mixed-type attribute_value |

✅ When EAV Is Right

EAV is the right choice when attributes are truly unknowable at design time (CMS form builders), when there are thousands of attributes with only a tiny subset per entity, or when you never query across attributes.

🔄 Modern Alternative

JSON/JSONB in PostgreSQL or VARIANT in Snowflake provide EAV flexibility with better query ergonomics, type support, and indexing.
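As a sketch of that alternative, SQLite's JSON1 functions can stand in for PostgreSQL JSONB or Snowflake VARIANT here (assuming a SQLite build with JSON1 enabled, as in most modern Python installs); the table and attribute names are invented:

```python
import sqlite3

# A JSON attributes column instead of an EAV triple table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, name TEXT, attrs TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Shirt", '{"color": "Blue", "weight_kg": 0.3, "fabric": "Cotton"}'),
        (2, "Drill", '{"color": "Red", "weight_kg": 1.8, "voltage": 120}'),
    ],
)

# One flat predicate replaces the EAV self-joins and CASTs
rows = conn.execute(
    """
    SELECT product_id, name
    FROM products
    WHERE json_extract(attrs, '$.color') = 'Red'
      AND json_extract(attrs, '$.weight_kg') > 1.0
    """
).fetchall()
print(rows)  # [(2, 'Drill')]
```

Each product keeps its own attribute set with no schema change, yet the query stays a single-table scan with typed comparisons.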

📖 Analogy

EAV is like storing your wardrobe as a spreadsheet with three columns: "Item ID", "Property", "Value." Your shirt becomes four rows (color=blue, size=M, fabric=cotton, sleeve=long). Great for inventory — but try answering "what blue cotton shirts in size M?" and you're doing gymnastics instead of looking in your closet.

💡 Interview Insight

"EAV is maximally flexible — adding a new attribute is just inserting a row. But it's terrible for analytics: self-joins or pivoting to reconstruct rows, all values are strings requiring runtime casting, and you lose schema enforcement and indexability. I'd use EAV only when attributes are truly dynamic and user-defined. Even then, I'd first consider JSON columns as a better alternative."

🧠 Feature Stores & Point-in-Time Correctness

ML models don't just need data — they need data as it existed at a specific point in time. This is the fundamental difference between analytics modeling and ML modeling.

Point-in-Time Correctness prevents data leakage — when training data includes information that wouldn't have been available at prediction time.

Data Leakage Timeline
✅ Jan 1 – Feb 28: Use this data
❌ Mar 1 – Mar 20: Future data!
Mar 1 = Prediction Point → Features must use only data before this
```sql
-- Feature table: entity key + timestamp + features
CREATE TABLE features.customer_daily (
    customer_id           STRING,
    feature_ts            TIMESTAMP,  -- When features were computed
    orders_last_30d       INT,
    revenue_last_30d      DECIMAL(12,2),
    days_since_last_order INT,
    support_tickets_30d   INT,
    loyalty_tier          STRING,
    PRIMARY KEY (customer_id, feature_ts)
);

-- Point-in-time join: get the latest features BEFORE the prediction date
SELECT
    pe.customer_id,
    pe.prediction_date,
    pe.label_churned,
    f.*
FROM prediction_events pe
LEFT JOIN features.customer_daily f
  ON pe.customer_id = f.customer_id
 AND f.feature_ts = (
       SELECT MAX(f2.feature_ts)
       FROM features.customer_daily f2
       WHERE f2.customer_id = pe.customer_id
         AND f2.feature_ts <= pe.prediction_date
     );
```
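The correlated subquery implements an "as of" lookup: latest feature row at or before the prediction timestamp. The same logic as a small Python function, using toy integer timestamps and a hypothetical feature dict:

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (feature_ts, features),
# sorted ascending by timestamp (days as ints for simplicity)
history = [
    (10, {"orders_last_30d": 2}),
    (20, {"orders_last_30d": 5}),
    (30, {"orders_last_30d": 1}),
]

def point_in_time_lookup(history, prediction_ts):
    """Return the latest features with feature_ts <= prediction_ts.

    Anything after prediction_ts would be data leakage, so it is excluded.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, prediction_ts)
    return history[i - 1][1] if i > 0 else None

print(point_in_time_lookup(history, 25))  # {'orders_last_30d': 5}
print(point_in_time_lookup(history, 5))   # None (no features existed yet)
```

Note that the lookup at ts=25 returns the ts=20 row, not the "fresher" ts=30 row; using the latter is exactly the leakage the pattern prevents.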

📚 Offline Store

  • 📌 Data warehouse / data lake
  • 📌 Latency: seconds to minutes
  • 📌 Full history (all timestamps)
  • 📌 Use case: model training with PIT joins
  • 📌 Batch updates (hourly/daily)

⚡ Online Store

  • 📌 Redis, DynamoDB, Bigtable
  • 📌 Latency: < 10 milliseconds
  • 📌 Latest values only
  • 📌 Use case: real-time model serving
  • 📌 Streaming / near-real-time updates
📖 Analogy

A Feature Store is like a library with two rooms. The archive room (offline store) has every edition of every book ever published — researchers study how knowledge evolved over time. The reference desk (online store) has only the latest edition — visitors grab the current answer fast. Same books, different access patterns.

💡 Interview Insight

"The critical principle is point-in-time correctness — features must reflect only data available before the prediction timestamp to prevent data leakage. Feature tables use a composite key of entity_id + feature_timestamp. For training, use ASOF joins. Architecture has offline store (warehouse, full history, batch training) and online store (Redis, latest values, low-latency serving). The biggest mistake is computing features without temporal boundaries — that introduces leakage."

🧭 Decision Matrix — When to Use What

There is no single "best" data model. Each pattern optimizes for different axes. Senior data engineers don't argue about which model is best; they ask "best for what?"

PatternQuery SimplicitySchema FlexGovernanceStorage Eff.Update CostML-Ready
3NF⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Star Schema⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
OBT⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Activity Schema⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Wide Table⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
EAV⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Data Vault⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
| Scenario | Recommended Pattern | Why |
|---|---|---|
| Executive dashboard (5 KPIs) | OBT | Max simplicity, zero SQL expertise |
| E-commerce catalog (10K+ attrs) | EAV / JSON cols | Truly dynamic attributes |
| Multi-source enterprise DWH | Data Vault → Star | Integration + audit + BI serving |
| Product analytics (clickstream) | Activity Schema | All events in one place |
| ML churn prediction | Feature Store (PIT) | Point-in-time correctness |
| Self-serve for 200 analysts | Star + OBT marts | Governed metrics + exploration |
| IoT sensor data (100+ types) | Wide Table | Sparse cols compress well |
| Regulatory reporting | 3NF / Data Vault | Max integrity + audit trail |
| Startup MVP (3-person team) | OBT / Star Schema | Ship fast, refactor later |

📐 The Layered Reality

Mature orgs use different patterns at different layers: Data Vault (raw integration) → Star Schema (semantic) → OBT (serving) → Feature Store (ML).

🎯 The Right Answer

Usually 2–3 patterns aligned to your team's capabilities. Worst mistake: one pattern everywhere. Second worst: six patterns when two would suffice.

📖 Analogy

Choosing a data modeling pattern is like choosing a vehicle. A 3NF sedan is reliable for daily commutes. A star schema SUV handles most terrain. An OBT bus carries everyone without them driving. A Data Vault freight train moves massive volumes. You don't argue which vehicle is "best" — you pick the right one for the journey.

💡 Interview Insight

"I start with the use case, not the pattern. OLTP → 3NF. Enterprise integration → Data Vault. Analyst BI → Star Schema. Self-serve → OBT. Events → Activity Schema. ML → Feature Store. In practice, mature platforms use multiple patterns in layers. The worst mistake is picking one and forcing everything into it."

✅ Quiz: Test Yourself

Q1: What is the primary reason OBT works well in columnar databases like Snowflake?

Q2: In the Activity Schema pattern, what is the main tradeoff for having all events in one table?

Q3: Why is run-length encoding (RLE) particularly beneficial for wide tables with many NULL columns?

Q4: What is the biggest problem with using EAV for analytical queries?

Q5: What is "data leakage" in ML feature engineering?

Q6: What is the difference between the offline store and online store in a feature store?

Q7: A 3-person data team needs to serve dashboards to 50 non-technical users. Best pattern?

Q8: In a mature data platform, what is the recommended approach to modeling patterns?