🏗️ What Is Data Modeling?

Data modeling is the process of structuring data so it can be queried reliably and efficiently. A good model makes metrics consistent, reduces complexity for analysts, and aligns storage with query patterns.

A bad model leads to inconsistent KPIs, confusing joins, expensive queries, and engineers spending weeks debugging why revenue is "off by $2M."

💡 Interview Gold

"Data modeling is designing how data is stored, related, and accessed — optimized for the questions the business needs to answer. It's the blueprint of your warehouse."

Three Levels of Data Models

Interviewers love asking about model layers. Know these cold:

📐 Conceptual Model

High-level business entities and relationships. No technical details. "Customers place Orders that contain Products." Stakeholders and architects use this.

📋 Logical Model

Attributes, data types, primary/foreign keys — but platform-independent. Defines "what" without "how." Think ER diagrams with full column lists.

⚙️ Physical Model

The actual implementation: table DDL, indexes, partitioning, compression. Platform-specific — Snowflake vs BigQuery vs Redshift choices live here.

Conceptual (business entities) → Logical (attributes & keys) → Physical (DDL & indexes)
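The physical level is the only one that touches real DDL. A minimal sketch of what that looks like, using an in-memory SQLite table — the `orders` schema and index name here are illustrative, not from any real system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Physical model: concrete DDL with types and a primary key.
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT    NOT NULL,
        amount      REAL    NOT NULL
    )
""")
# Indexes are a physical-model decision: they exist to serve query patterns.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

cur.execute("INSERT INTO orders VALUES (1, 42, '2024-01-15', 99.5)")
row = cur.execute(
    "SELECT amount FROM orders WHERE customer_id = 42"
).fetchone()
print(row[0])  # 99.5
```

The logical model would stop at "orders have an id, a customer, a date, an amount"; the index and the SQLite-specific type choices only appear once you go physical.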

OLTP vs OLAP

This distinction drives every modeling decision. Get it wrong and your warehouse will feel like a production database — slow and unusable for analytics.

🔄 OLTP (Transactional)

  • ✅ Optimized for writes (INSERT/UPDATE)
  • ✅ Highly normalized (3NF)
  • ✅ Row-oriented storage
  • ✅ Small, frequent transactions
  • ⚠️ Terrible for analytical queries
  • 📌 Example: Order entry system

📊 OLAP (Analytical)

  • ✅ Optimized for reads (SELECT/AGG)
  • ✅ Denormalized (star/snowflake)
  • ✅ Columnar storage
  • ✅ Complex aggregations over millions of rows
  • ⚠️ Not designed for transactional writes
  • 📌 Example: Revenue dashboards

🎯 Interview Tip

"OLTP protects data integrity during writes. OLAP accelerates reads for decision-making. A data warehouse typically ingests from OLTP sources and restructures data into OLAP-friendly models."
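The row-vs-columnar point above can be made concrete with a toy sketch in plain Python (hypothetical data; real engines add compression and vectorized execution on top of this layout difference):

```python
# OLTP-style row storage: each record lives together -> cheap single-row writes.
rows = [
    {"order_id": 1, "customer_id": 42, "amount": 100.0},
    {"order_id": 2, "customer_id": 7,  "amount": 250.0},
    {"order_id": 3, "customer_id": 42, "amount": 50.0},
]

# OLAP-style columnar storage: each column lives together -> cheap column scans.
cols = {
    "order_id":    [1, 2, 3],
    "customer_id": [42, 7, 42],
    "amount":      [100.0, 250.0, 50.0],
}

# Aggregating "amount" touches every field of every record in row storage...
total_from_rows = sum(r["amount"] for r in rows)
# ...but only one contiguous list in columnar storage.
total_from_cols = sum(cols["amount"])

print(total_from_rows, total_from_cols)  # 400.0 400.0
```

Same data, same answer — but the columnar layout reads only the bytes the aggregation needs, which is why analytical warehouses choose it.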

Normalization vs Denormalization

This trade-off is at the heart of every data modeling interview. Normalization reduces redundancy; denormalization speeds up queries.

1NF — First Normal Form

No repeating groups. Each cell holds a single atomic value. No arrays in columns.

2NF — Second Normal Form

1NF + every non-key column depends on the entire primary key (no partial dependencies).

3NF — Third Normal Form

2NF + no transitive dependencies. Every non-key column depends only on the key.
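A quick sketch of walking one record toward 3NF, using hypothetical order data in plain Python (a relational database would enforce this with keys, but the dependency logic is the same):

```python
# Violates 1NF ("products" is a list) and 3NF (customer_city depends on
# customer_id, a transitive dependency through a non-key column).
raw = {
    "order_id": 1,
    "customer_id": 42,
    "customer_city": "Berlin",
    "products": ["SKU-1", "SKU-2"],
}

# 1NF: one atomic value per cell -> one row per order line.
order_lines = [
    {"order_id": 1, "product_sku": "SKU-1"},
    {"order_id": 1, "product_sku": "SKU-2"},
]

# 3NF: customer_city moves to a customers table keyed by customer_id,
# so every non-key column depends only on its own table's key.
orders = {1: {"customer_id": 42}}
customers = {42: {"city": "Berlin"}}

# City is now reached through the key chain and stored exactly once.
city = customers[orders[1]["customer_id"]]["city"]
print(city)  # Berlin
```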

Denormalized (Star Schema)

Intentionally redundant. Customer city stored in the dimension, not in a separate city table. Fewer joins = faster queries.
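The "fewer joins" claim can be sketched with a hypothetical fact/dimension pair in plain Python — note how the redundant `city` inside `dim_customer` lets "revenue by city" resolve with a single lookup per fact row:

```python
# dim_customer intentionally stores city inline: redundant, but join-free.
dim_customer = {
    42: {"name": "Ada",  "city": "Berlin"},
    7:  {"name": "Alan", "city": "London"},
}
fact_orders = [
    {"customer_id": 42, "amount": 100.0},
    {"customer_id": 7,  "amount": 250.0},
    {"customer_id": 42, "amount": 50.0},
]

# One fact -> dimension lookup per row, no separate city table to chain through.
revenue_by_city = {}
for f in fact_orders:
    city = dim_customer[f["customer_id"]]["city"]
    revenue_by_city[city] = revenue_by_city.get(city, 0.0) + f["amount"]

print(revenue_by_city)  # {'Berlin': 150.0, 'London': 250.0}
```

In a fully normalized design the same question would route through an extra `cities` table; the star schema trades that join away for a little redundancy in the dimension.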

💡 Interview Gold

"In OLTP, we normalize to protect write integrity. In OLAP, we denormalize because join cost > storage cost. Modern columnar warehouses compress denormalized data extremely well, making redundancy nearly free."

Data Modeling in the Modern Stack

Interviewers increasingly ask: "Where does modeling fit in a lakehouse?" Know the landscape:

🏢 Traditional Warehouse

ETL → star schema in Redshift/Snowflake/BigQuery. Models are pre-computed and tightly governed. Kimball or Inmon methodology.

🏕️ Data Lake

Raw files on S3/ADLS in Parquet/Delta. Schema-on-read. Risk of becoming a "data swamp" without modeling discipline.

🏠 Lakehouse

Delta Lake / Iceberg on the lake with ACID transactions. Brings warehouse-like modeling to lake storage. Best of both worlds.

Raw / Bronze (as-is from source) → Cleaned / Silver (deduplicated, typed) → Modeled / Gold (star schemas, metrics)

🎯 Interview Tip

"Even in a lakehouse, dimensional modeling matters. The medallion architecture (Bronze → Silver → Gold) is just the pipeline. The Gold layer still needs proper star schemas, conformed dimensions, and defined grain."
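A minimal medallion sketch in plain Python — the records are hypothetical, and a real pipeline would land each layer as Delta Lake or Iceberg tables rather than lists, but the layer responsibilities are the same:

```python
# Bronze: as-is from source, duplicates and raw strings included.
bronze = [
    {"order_id": "1", "amount": "100.0"},
    {"order_id": "1", "amount": "100.0"},   # duplicate ingest
    {"order_id": "2", "amount": "250.0"},
]

# Silver: deduplicated and typed.
seen, silver = set(), []
for rec in bronze:
    if rec["order_id"] not in seen:
        seen.add(rec["order_id"])
        silver.append({"order_id": int(rec["order_id"]),
                       "amount": float(rec["amount"])})

# Gold: modeled for consumption -- here, a single revenue metric with a
# defined grain (one number per load), ready for a dashboard.
gold_revenue = sum(r["amount"] for r in silver)
print(gold_revenue)  # 350.0
```

Bronze preserves the source, Silver makes the data trustworthy, and Gold is where the dimensional modeling from earlier in this article actually happens.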

Quiz: Test Yourself

Q1: An interviewer asks "What is data modeling?" Which answer is strongest?

Q2: Your team's dashboard scans 500M rows to compute monthly revenue. Which model type should back it?

Q3: Which model layer includes column names, data types, and keys but NOT indexes or partitioning?

Q4: Why is denormalization more acceptable in modern columnar warehouses than in traditional row-store OLTP systems?

Q5: In a medallion (Bronze/Silver/Gold) lakehouse, where does dimensional modeling primarily happen?