Data modeling is the process of structuring data so it can be queried reliably and efficiently. A good model makes metrics consistent, reduces complexity for analysts, and aligns storage with query patterns.
A bad model leads to inconsistent KPIs, confusing joins, expensive queries, and engineers spending weeks debugging why revenue is "off by $2M."
"Data modeling is designing how data is stored, related, and accessed — optimized for the questions the business needs to answer. It's the blueprint of your warehouse."
Interviewers love asking about model layers. Know these cold:
Conceptual model: high-level business entities and relationships, with no technical detail. "Customers place Orders that contain Products." Stakeholders and architects use this.
Logical model: attributes, data types, primary/foreign keys, but platform-independent. Defines the "what" without the "how." Think ER diagrams with full column lists.
Physical model: the actual implementation: table DDL, indexes, partitioning, compression. Platform-specific; Snowflake vs. BigQuery vs. Redshift choices live here.
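As a sketch of how the three layers converge on an implementation, here is a minimal physical model in SQLite (standing in for a real warehouse engine; all table and column names such as `dim_customer` are illustrative, not from the text):

```python
import sqlite3

# In-memory SQLite stands in for a warehouse engine; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual: Customers place Orders that contain Products.
-- Logical:    entities gain keys and typed attributes.
-- Physical:   platform-specific DDL (here, SQLite syntax).
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL,
    city          TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    list_price   REAL
);
CREATE TABLE fct_order_line (
    order_id     INTEGER,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    net_amount   REAL
);
""")
```

On a real platform the physical layer would also carry distribution keys, partitioning, and clustering choices; SQLite simply has none of those knobs.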
The OLTP vs. OLAP distinction drives every modeling decision. Get it wrong and your warehouse will feel like a production database: normalized for writes, slow and unusable for analytics.
"OLTP protects data integrity during writes. OLAP accelerates reads for decision-making. A data warehouse typically ingests from OLTP sources and restructures data into OLAP-friendly models."
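A minimal sketch of that ingest-and-restructure flow, with SQLite standing in for both the OLTP source and the warehouse (the schema and sample data are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# OLTP side: normalized tables protect write integrity.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id),
                     amount REAL);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'Berlin'), (2, 'Globex', 'Paris')")
conn.execute("INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0)")

# Warehouse ingest: restructure into an OLAP-friendly wide table
# so analytical reads need no joins.
conn.execute("""
CREATE TABLE orders_wide AS
SELECT o.id AS order_id, c.name AS customer_name, c.city, o.amount
FROM orders o JOIN customers c ON c.id = o.customer_id
""")
revenue_by_city = conn.execute(
    "SELECT city, SUM(amount) FROM orders_wide GROUP BY city ORDER BY city"
).fetchall()
print(revenue_by_city)  # [('Berlin', 150.0), ('Paris', 75.0)]
```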
This trade-off is at the heart of every data modeling interview. Normalization reduces redundancy; denormalization speeds up queries.
1NF: no repeating groups. Each cell holds a single atomic value; no arrays in columns.
2NF: 1NF plus every non-key column depends on the entire primary key (no partial dependencies).
3NF: 2NF plus no transitive dependencies. Every non-key column depends on the key, the whole key, and nothing but the key.
Denormalized: intentionally redundant. Customer city stored in the dimension, not in a separate city table. Fewer joins = faster queries.
"In OLTP, we normalize to protect write integrity. In OLAP, we denormalize because join cost > storage cost. Modern columnar warehouses compress denormalized data extremely well, making redundancy nearly free."
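A toy illustration of both sides of the trade-off, again in SQLite. It extends the customer-city example above with a city-to-country transitive dependency; all names and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 3NF (OLTP side): city lives in its own table, removing the
# transitive dependency customer -> city -> country.
conn.executescript("""
CREATE TABLE city (city_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT,
                       city_id INTEGER REFERENCES city(city_id));
""")
conn.execute("INSERT INTO city VALUES (1, 'Berlin', 'DE'), (2, 'Paris', 'FR')")
conn.execute("INSERT INTO customer VALUES (1, 'Acme', 1), (2, 'Globex', 2)")

# Denormalized dimension (OLAP side): city and country copied into each row.
conn.execute("""
CREATE TABLE dim_customer AS
SELECT cu.customer_id, cu.name, ci.name AS city, ci.country
FROM customer cu JOIN city ci ON ci.city_id = cu.city_id
""")
# The analytical query now needs no join at all:
rows = conn.execute(
    "SELECT name, city FROM dim_customer ORDER BY customer_id"
).fetchall()
print(rows)  # [('Acme', 'Berlin'), ('Globex', 'Paris')]
```

The redundancy cost is real but small: repeated city strings compress extremely well in a columnar store, which is exactly the point of the quote above.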
Interviewers increasingly ask: "Where does modeling fit in a lakehouse?" Know the landscape:
Cloud warehouse: ETL → star schema in Redshift/Snowflake/BigQuery. Models are pre-computed and tightly governed. Kimball or Inmon methodology.
Data lake: raw files on S3/ADLS in Parquet/Delta. Schema-on-read. Risk of becoming a "data swamp" without modeling discipline.
Lakehouse: Delta Lake / Iceberg on the lake with ACID transactions. Brings warehouse-like modeling to lake storage. Best of both worlds.
"Even in a lakehouse, dimensional modeling matters. The medallion architecture (Bronze → Silver → Gold) is just the pipeline. The Gold layer still needs proper star schemas, conformed dimensions, and defined grain."
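A deliberately tiny, hypothetical medallion pipeline in plain Python, with no real lake storage. The cleaning rules and the city-level grain are invented for illustration, but the shape matches the quote: Bronze holds raw records, Silver cleans and types them, Gold aggregates at a defined grain:

```python
from collections import defaultdict

# Bronze: raw, untyped records exactly as ingested (one row is malformed).
bronze = [
    {"order_id": "10", "city": " berlin", "amount": "100.0"},
    {"order_id": "11", "city": "Berlin ", "amount": "50.0"},
    {"order_id": "12", "city": "paris",   "amount": "bad"},
]

def to_silver(rows):
    """Clean and type the records; drop rows that fail conversion."""
    out = []
    for r in rows:
        try:
            out.append({"order_id": int(r["order_id"]),
                        "city": r["city"].strip().title(),
                        "amount": float(r["amount"])})
        except ValueError:
            continue  # a real pipeline would quarantine these, not drop them
    return out

def to_gold(rows):
    """Aggregate to the declared grain: one row per city."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["city"]] += r["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'Berlin': 150.0}
```

The framework (Spark, dbt, plain SQL) changes; the discipline of defining the grain and conforming dimensions at the Gold layer does not.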