📊 One Big Table (OBT)

Every data modeling course teaches you to normalize. The OBT pattern says: forget all that — pre-join everything into a single, massively wide, completely denormalized table.

An OBT takes your entire star schema — fact table plus every dimension — and materializes it into one flat table. No joins. No lookups. Just one table to rule them all.

In the age of columnar cloud warehouses — Snowflake, BigQuery, ClickHouse, Redshift — OBT has become a surprisingly practical pattern. A query touching 3 columns out of 200 only scans those 3 columns.

OBT: 1 table, 0 joins, pre-joined from:

  • 📌 fact_sales
  • 📌 dim_customer
  • 📌 dim_product
  • 📌 dim_store
  • 📌 dim_date
```sql
-- Materialize an OBT from a star schema
CREATE TABLE analytics.obt_sales AS
SELECT
    f.order_id,
    f.order_date,
    f.quantity,
    f.revenue,
    -- Customer dimension
    c.customer_name,
    c.customer_segment,
    c.customer_region,
    -- Product dimension
    p.product_name,
    p.product_category,
    p.brand,
    -- Store dimension
    s.store_name,
    s.store_city,
    s.store_state,
    -- Date dimension
    d.day_of_week,
    d.month_name,
    d.quarter,
    d.fiscal_year
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_store    s ON f.store_key    = s.store_key
JOIN dim_date     d ON f.date_key     = d.date_key;
```
| Aspect | OBT Advantage | OBT Drawback |
|---|---|---|
| Query simplicity | No joins; any analyst can query | |
| Performance | Columnar scans only needed cols | Expensive if SELECT * |
| Onboarding | One table to learn | Column explosion (200+ cols) |
| Storage | Columnar compression mitigates | 2–10× more raw storage |
| Freshness | Materialized for speed | Stale until next rebuild |
| Updates | | Full rebuild on dim change |
📖 Analogy

OBT is like a fully printed report binder. A star schema is a filing cabinet where you pull invoices, customer info, and product specs from separate drawers and staple them together. OBT is the pre-stapled binder — every page has all context already assembled. More paper (storage), but anyone can grab it and read instantly.

💡 Interview Insight

"OBT is a fully denormalized table where all dimensions are pre-joined into the fact table. It works in columnar engines because they only scan columns your query references — so a 200-column table querying 3 columns is just as fast. I'd use OBT for read-heavy dashboards and self-serve BI. The tradeoff is storage redundancy and rebuild cost. It's a serving-layer pattern, not a storage-layer pattern."

Activity Schema

Modern apps emit hundreds of event types. In Kimball, each gets its own fact table — fact_page_views, fact_purchases, fact_signups. With 50+ event types, you get 50+ tables and 50+ pipelines.

The Activity Schema takes a radically different approach: one unified fact table for all event types.

❌ Traditional (50 tables)

  • 📌 fact_page_views
  • 📌 fact_purchases
  • 📌 fact_signups
  • 📌 fact_refunds
  • 📌 fact_clicks … × 50

✅ Activity Schema (1 table)

  • 📌 activity_id (UUID)
  • 📌 entity_id (who)
  • 📌 activity_type (what)
  • 📌 ts (when)
  • 📌 feature_json (details)
```sql
-- Activity Schema table definition
CREATE TABLE activity_stream (
    activity_id    STRING,         -- Unique event ID
    entity_id      STRING,         -- Who did it
    activity_type  STRING,         -- What happened
    ts             TIMESTAMP,      -- When
    revenue_impact DECIMAL(18,2),  -- Revenue if applicable
    feature_json   JSON            -- Everything else
);

-- User journey in one query
SELECT entity_id, activity_type, ts
FROM activity_stream
WHERE entity_id = 'user_12345'
ORDER BY ts;

-- Extract from JSON for a specific event type
SELECT
    entity_id,
    ts,
    JSON_EXTRACT_SCALAR(feature_json, '$.product_id') AS product_id
FROM activity_stream
WHERE activity_type = 'add_to_cart'
  AND ts >= CURRENT_DATE - INTERVAL '7' DAY;
```
| Aspect | Benefit | Drawback |
|---|---|---|
| Flexibility | New event = new value, no schema change | No schema enforcement per event |
| Simplicity | One table, one pipeline | Overloaded columns lose meaning |
| Funnel analysis | Trivial: all events co-located | JSON parsing slower than typed cols |
| Data quality | | Typo in activity_type = silent loss |
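Because the Activity Schema co-locates every event for an entity in one ordered stream, a funnel is a single pass over that stream. A minimal Python sketch of the idea (the event rows and step names are invented for illustration):

```python
from collections import defaultdict

# Hypothetical activity stream rows: (entity_id, activity_type, ts)
events = [
    ("u1", "page_view", 1), ("u1", "add_to_cart", 2), ("u1", "purchase", 3),
    ("u2", "page_view", 1), ("u2", "add_to_cart", 4),
    ("u3", "page_view", 2),
]

def funnel(events, steps):
    """Count how many entities reach each step of the funnel, in order."""
    progress = defaultdict(int)  # entity_id -> index of next expected step
    for entity, activity, _ts in sorted(events, key=lambda e: e[2]):
        i = progress[entity]
        if i < len(steps) and activity == steps[i]:
            progress[entity] = i + 1
    counts = [0] * len(steps)
    for reached in progress.values():
        for i in range(reached):
            counts[i] += 1
    return dict(zip(steps, counts))

print(funnel(events, ["page_view", "add_to_cart", "purchase"]))
# {'page_view': 3, 'add_to_cart': 2, 'purchase': 1}
```

With 50 separate fact tables, the same funnel would need a UNION or multiple joins before this ordering step is even possible.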
📖 Analogy

Activity Schema is like a universal journal. Instead of separate notebooks for work, personal, and exercise, you write everything in one journal with a tag on each entry. Flexible and easy to carry — but reviewing only your exercise history means flipping past a lot of irrelevant entries.

💡 Interview Insight

"Activity Schema stores all event types in a single table with entity_id, activity_type, timestamp, and a JSON column. Proposed by Ahmed Elsamadisi at Narrator. Excellent for user journey analysis and funnel conversion. The tradeoff is weaker governance: no schema contract per event type, JSON parsing adds cost, and a typo in activity_type creates silent data quality issues."

📐 Wide Tables & Columnar Storage

For decades, wider tables meant slower queries: in a row-store database, reading one column from a 200-column table means loading all 200 columns of every row, so wide-table patterns were impractical.

Columnar storage flipped the equation. In Snowflake, BigQuery, or ClickHouse, each column is stored independently. A query touching 3 out of 200 columns reads only those 3. The other 197 have zero I/O cost.

🐌 Row Store (PostgreSQL)

  • 📌 All columns stored together per row
  • 📌 SELECT city, SUM(revenue) reads ALL columns
  • 📌 NULLs consume space in each row
  • 📌 Wide tables = massive I/O waste

🚀 Columnar Store (Snowflake)

  • 📌 Each column stored independently on disk
  • 📌 SELECT city, SUM(revenue) reads ONLY 2 cols
  • 📌 RLE: NULL × 1,000,000 stored as one entry
  • 📌 Dictionary encoding: repeated strings → tiny ints

📦 Run-Length Encoding

Long consecutive runs of the same value (including NULL) are stored as a single entry. A column that is NULL for 1M rows collapses to "NULL × 1,000,000", so sparse columns are nearly free.

📖 Dictionary Encoding

Low-cardinality columns (e.g., 5 regions across 100M rows) are replaced with integer lookups: a column of repeated strings becomes tiny integers plus a 5-entry dictionary.
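Both encodings can be sketched in a few lines of Python. This is a toy model of the idea, not any engine's actual on-disk format:

```python
def rle_encode(column):
    """Collapse consecutive runs of the same value into (value, count) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(column):
    """Replace repeated values with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

# A sparse column: one value and 999,999 NULLs -> just 2 run entries
sparse = ["0.3"] + [None] * 999_999
print(rle_encode(sparse))    # [('0.3', 1), (None, 999999)]

# A low-cardinality column: repeated strings -> tiny ints + 3-entry dictionary
regions = ["EU", "US", "EU", "EU", "APAC", "US"]
print(dict_encode(regions))  # (['EU', 'US', 'APAC'], [0, 1, 0, 0, 2, 1])
```

The million-row sparse column compresses to two tuples, which is why adding hundreds of mostly-NULL columns to a wide table costs almost nothing in a columnar store.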

| Scenario | Normalized Model | Wide Table |
|---|---|---|
| Dashboard query (3 cols) | 4 joins across 5 tables | Direct scan, 3 cols only |
| Ad-hoc exploration | Must know schema relationships | One table, autocomplete |
| Adding new attribute | ALTER TABLE or new table + join | ALTER TABLE ADD COLUMN |
| Storage (sparse data) | Compact per table, join overhead | RLE/dictionary → NULLs ~free |
| Write performance | Fast (narrow inserts) | Slower (wide inserts) |
| Transactional updates | Efficient (one table) | Expensive (wide row) |
💡 Interview Insight

"Columnar engines store each column independently, so a query touching 3 of 300 columns reads only those 3. Combined with RLE (making NULLs essentially free) and dictionary encoding (replacing repeated strings with tiny integers), wide sparse tables are storage-efficient. Columnar storage decouples table width from query cost — you pay for what you read, not what exists."

🔄 EAV — The Flexibility Trap

EAV (Entity-Attribute-Value) stores data vertically instead of horizontally. Instead of one column per attribute, you have three columns: entity_id, attribute_name, and attribute_value.

📋 Relational (Horizontal)

  • 📌 id=1 | name=Shirt | color=Blue | weight=0.3
  • 📌 id=2 | name=Drill | color=Red | voltage=120
  • 📌 2 rows × typed columns
  • 📌 Simple: WHERE color='Blue' AND weight>1

🔄 EAV (Vertical)

  • 📌 (1, 'name', 'Shirt'), (1, 'color', 'Blue')
  • 📌 (1, 'weight', '0.3'), (1, 'fabric', 'Cotton')
  • 📌 8 rows × 3 string columns
  • 📌 Nightmare: self-joins + CAST
```sql
-- In a normal table: simple and fast
SELECT product_id, name
FROM products
WHERE color = 'Blue' AND weight_kg > 1.0;

-- In EAV: a nightmare of self-joins
SELECT e.entity_id
FROM product_attributes e
JOIN product_attributes c
  ON e.entity_id = c.entity_id
 AND c.attribute_name = 'color'
 AND c.attribute_value = 'Blue'
JOIN product_attributes w
  ON e.entity_id = w.entity_id
 AND w.attribute_name = 'weight'
 AND CAST(w.attribute_value AS DECIMAL) > 1.0
WHERE e.attribute_name = 'name';
```
| Problem | Why It Hurts |
|---|---|
| Type coercion | Everything is VARCHAR; CAST at query time |
| No constraints | Can't enforce "weight must be positive" |
| Query complexity | Self-joins or pivoting; 10× more complex SQL |
| Performance | Self-joins on large EAV tables are expensive |
| BI incompatibility | Tableau/Looker expect columnar data |
| Indexing | Can't index mixed-type attribute_value |

✅ When EAV Is Right

EAV is the right choice when attributes are truly unknowable at design time (CMS form builders), when there are thousands of attributes with only a tiny subset per entity, or when you never query across attributes.

🔄 Modern Alternative

JSON/JSONB in PostgreSQL or VARIANT in Snowflake provide EAV flexibility with better query ergonomics, type support, and indexing.
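As a sketch of that alternative, SQLite's JSON1 functions can stand in for PostgreSQL JSONB or Snowflake VARIANT here (assuming a SQLite build with JSON1 enabled, as in most modern Python installs); the table and attribute names are invented:

```python
import sqlite3

# A JSON attributes column instead of an EAV triple table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, name TEXT, attrs TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Shirt", '{"color": "Blue", "weight_kg": 0.3, "fabric": "Cotton"}'),
        (2, "Drill", '{"color": "Red", "weight_kg": 1.8, "voltage": 120}'),
    ],
)

# One flat predicate replaces the EAV self-joins and CASTs
rows = conn.execute(
    """
    SELECT product_id, name
    FROM products
    WHERE json_extract(attrs, '$.color') = 'Red'
      AND json_extract(attrs, '$.weight_kg') > 1.0
    """
).fetchall()
print(rows)  # [(2, 'Drill')]
```

Each product keeps its own attribute set with no schema change, yet the query stays a single-table scan with typed comparisons.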

📖 Analogy

EAV is like storing your wardrobe as a spreadsheet with three columns: "Item ID", "Property", "Value." Your shirt becomes four rows (color=blue, size=M, fabric=cotton, sleeve=long). Great for inventory — but try answering "what blue cotton shirts in size M?" and you're doing gymnastics instead of looking in your closet.

💡 Interview Insight

"EAV is maximally flexible — adding a new attribute is just inserting a row. But it's terrible for analytics: self-joins or pivoting to reconstruct rows, all values are strings requiring runtime casting, and you lose schema enforcement and indexability. I'd use EAV only when attributes are truly dynamic and user-defined. Even then, I'd first consider JSON columns as a better alternative."

🧠 Feature Stores & Point-in-Time Correctness

ML models don't just need data — they need data as it existed at a specific point in time. This is the fundamental difference between analytics modeling and ML modeling.

Point-in-Time Correctness prevents data leakage — when training data includes information that wouldn't have been available at prediction time.

Data Leakage Timeline
✅ Jan 1 – Feb 28: Use this data
❌ Mar 1 – Mar 20: Future data!
Mar 1 = Prediction Point → Features must use only data before this
```sql
-- Feature table: entity key + timestamp + features
CREATE TABLE features.customer_daily (
    customer_id           STRING,
    feature_ts            TIMESTAMP,  -- When features were computed
    orders_last_30d       INT,
    revenue_last_30d      DECIMAL(12,2),
    days_since_last_order INT,
    support_tickets_30d   INT,
    loyalty_tier          STRING,
    PRIMARY KEY (customer_id, feature_ts)
);

-- Point-in-time join: get the latest features BEFORE the prediction date
SELECT
    pe.customer_id,
    pe.prediction_date,
    pe.label_churned,
    f.*
FROM prediction_events pe
LEFT JOIN features.customer_daily f
  ON pe.customer_id = f.customer_id
 AND f.feature_ts = (
       SELECT MAX(f2.feature_ts)
       FROM features.customer_daily f2
       WHERE f2.customer_id = pe.customer_id
         AND f2.feature_ts <= pe.prediction_date
     );
```
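The correlated subquery implements an "as of" lookup: latest feature row at or before the prediction timestamp. The same logic as a small Python function, using toy integer timestamps and a hypothetical feature dict:

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (feature_ts, features),
# sorted ascending by timestamp (days as ints for simplicity)
history = [
    (10, {"orders_last_30d": 2}),
    (20, {"orders_last_30d": 5}),
    (30, {"orders_last_30d": 1}),
]

def point_in_time_lookup(history, prediction_ts):
    """Return the latest features with feature_ts <= prediction_ts.

    Anything after prediction_ts would be data leakage, so it is excluded.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, prediction_ts)
    return history[i - 1][1] if i > 0 else None

print(point_in_time_lookup(history, 25))  # {'orders_last_30d': 5}
print(point_in_time_lookup(history, 5))   # None (no features existed yet)
```

Note that the lookup at ts=25 returns the ts=20 row, not the "fresher" ts=30 row; using the latter is exactly the leakage the pattern prevents.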

📚 Offline Store

  • 📌 Data warehouse / data lake
  • 📌 Latency: seconds to minutes
  • 📌 Full history (all timestamps)
  • 📌 Use case: model training with PIT joins
  • 📌 Batch updates (hourly/daily)

⚡ Online Store

  • 📌 Redis, DynamoDB, Bigtable
  • 📌 Latency: < 10 milliseconds
  • 📌 Latest values only
  • 📌 Use case: real-time model serving
  • 📌 Streaming / near-real-time updates
📖 Analogy

A Feature Store is like a library with two rooms. The archive room (offline store) has every edition of every book ever published — researchers study how knowledge evolved over time. The reference desk (online store) has only the latest edition — visitors grab the current answer fast. Same books, different access patterns.

💡 Interview Insight

"The critical principle is point-in-time correctness — features must reflect only data available before the prediction timestamp to prevent data leakage. Feature tables use a composite key of entity_id + feature_timestamp. For training, use ASOF joins. Architecture has offline store (warehouse, full history, batch training) and online store (Redis, latest values, low-latency serving). The biggest mistake is computing features without temporal boundaries — that introduces leakage."

🧭 Decision Matrix — When to Use What

There is no single "best" data model. Each pattern optimizes for different axes. Senior data engineers don't argue about which model is best; they ask "best for what?"

PatternQuery SimplicitySchema FlexGovernanceStorage Eff.Update CostML-Ready
3NF⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Star Schema⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
OBT⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Activity Schema⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Wide Table⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
EAV⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Data Vault⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
| Scenario | Recommended Pattern | Why |
|---|---|---|
| Executive dashboard (5 KPIs) | OBT | Max simplicity, zero SQL expertise |
| E-commerce catalog (10K+ attrs) | EAV / JSON cols | Truly dynamic attributes |
| Multi-source enterprise DWH | Data Vault → Star | Integration + audit + BI serving |
| Product analytics (clickstream) | Activity Schema | All events in one place |
| ML churn prediction | Feature Store (PIT) | Point-in-time correctness |
| Self-serve for 200 analysts | Star + OBT marts | Governed metrics + exploration |
| IoT sensor data (100+ types) | Wide Table | Sparse cols compress well |
| Regulatory reporting | 3NF / Data Vault | Max integrity + audit trail |
| Startup MVP (3-person team) | OBT / Star Schema | Ship fast, refactor later |

📐 The Layered Reality

Mature orgs use different patterns at different layers: Data Vault (raw integration) → Star Schema (semantic) → OBT (serving) → Feature Store (ML).

🎯 The Right Answer

Usually 2–3 patterns aligned to your team's capabilities. Worst mistake: one pattern everywhere. Second worst: six patterns when two would suffice.

📖 Analogy

Choosing a data modeling pattern is like choosing a vehicle. A 3NF sedan is reliable for daily commutes. A star schema SUV handles most terrain. An OBT bus carries everyone without them driving. A Data Vault freight train moves massive volumes. You don't argue which vehicle is "best" — you pick the right one for the journey.

💡 Interview Insight

"I start with the use case, not the pattern. OLTP → 3NF. Enterprise integration → Data Vault. Analyst BI → Star Schema. Self-serve → OBT. Events → Activity Schema. ML → Feature Store. In practice, mature platforms use multiple patterns in layers. The worst mistake is picking one and forcing everything into it."

✅ Quiz: Test Yourself

Q1: What is the primary reason OBT works well in columnar databases like Snowflake?

Q2: In the Activity Schema pattern, what is the main tradeoff for having all events in one table?

Q3: Why is run-length encoding (RLE) particularly beneficial for wide tables with many NULL columns?

Q4: What is the biggest problem with using EAV for analytical queries?

Q5: What is "data leakage" in ML feature engineering?

Q6: What is the difference between the offline store and online store in a feature store?

Q7: A 3-person data team needs to serve dashboards to 50 non-technical users. Best pattern?

Q8: In a mature data platform, what is the recommended approach to modeling patterns?