Fables and Facts
Do you know the difference between dimensional modeling truth and fiction?By Margy Ross & Ralph Kimball
According to Merriam-Webster, fables are fictitious statements. Unfortunately, fables about dimensional modeling circulate throughout our industry. These false claims and assertions are a distraction, especially if you're trying to align a team. In this column, we'll describe the root misunderstandings that perpetuate these myths so you understand why they're as unfounded as fairy tales about two-headed creatures.
Not all Dimensional Models are Created Equal
We review a lot of dimensional models in our work. They often illustrate best practice design principles from our Toolkits and columns. However, not all supposed dimensional models are designed appropriately. Some blatantly violate core dimensional modeling tenets. Given the abysmal sample star schemas found in seemingly authoritative books and training presentations, this comes as no surprise. However, you shouldn't lump all dimensional models into a "bad" category based on misguided imposters.
Most of the fabled assertions are rooted in several basic mistakes regarding dimensional modeling best practices. Dimensional modeling can't be blamed if its fundamental concepts aren't embraced. Likewise, criticisms lobbed by individuals who don't understand its key premises need to be taken with a grain of salt. Once we clarify these misunderstandings, you'll be prepared to personally distinguish fables from facts.
Focus on Measurement Processes, not Departmental Reports
We advocate a four-step approach for designing dimensional models. The first step is to identify the business process, followed by declaring the grain, then selecting the dimensions and facts. Nowhere do we recommend specifying the business's top 10 report layouts or queries.
If requirements are gathered by focusing exclusively on report or query templates, you're susceptible to modeling data to produce a specific report, rather than capturing the key metrics and related dimensions for analysis. Obviously, it's important to consider business usage when designing dimensional models. The dimension attributes must support the BI environment's filtering and labeling requirements. Robust dimension attributes translate into nearly endless analytic slicing-and-dicing combinations. However, don't blindly focus on a top-10 list in isolation because priorities and "hot" reports will inevitably evolve.
Instead of concentrating on specific reports or departmental needs in a vacuum, we suggest focusing the dimensional design on the most critical performance measurement process. In doing so, you can put the following fables to rest.
FABLE: Data marts are built to address a specific business report or application. When the business needs a new report, another star schema is built.
FACT: Data marts with dimensional models should be built around physical measurement processes or events. A fact table row is created when a measurement occurs. The associated dimension attributes reflect contextual characteristics and hierarchies. If the business identifies a new report based on the same measurement process, there's no need to build a new mart, model, or schema. Measurement processes are relatively stable in most organizations; the analytics performed against these metrics are more fluid.
FABLE: Dimensional models are departmental solutions. When a different department needs access to the data, a new star schema is built and labeled with the department's vocabulary. Data marts require multiple extracts from the same source data repeatedly.
FACT: Dimensional models shouldn't be departmentally bound. A fact table representing a fundamental measurement process need only have one physical instance that's shared across business functions or departments. There's no reason to create multiple extracts from the same source. Metrics resulting from the invoicing process, for example, are made available in a single dimensional model for access across the enterprise; there's no reason to replicate invoice performance metrics in separate departmental solutions for finance, marketing, and sales. Even if these departmental solutions were sourced from the same repository, they likely use similar, but slightly different naming conventions, definitions, and business rules, defeating the promise of a single version of the truth. The departmental approach is highly vulnerable to inconsistent, nonintegrated point solutions. We've never advocated this approach.
FABLE: You can't incorporate new data sources without rebuilding the original star schema or creating separate fact tables or data marts.
FACT: If the new data source is another capture system for an existing measurement process in the BI environment, then the new data can be gracefully combined with the original data without altering any existing reporting applications, presuming the granularity is the same. If the new data source is at a different grain representing a new measurement process, then a new fact table must be created. This has nothing to do with dimensional modeling. Any data representation would create a new entity when a new table with different keys is introduced.
FABLE: With dimensional modeling, the fact table is forced to a single grain that is inflexible.
FACT: Having the discipline to create fact tables with a single level of detail assures that measurements aren't inappropriately double counted. A table with mixed-grain facts can only be queried by a custom application knowledgeable about the varying levels of detail, effectively ruling out ad hoc exploration. If measurements naturally exist at different grains, then the most foolproof design establishes a fact table for each level. Far from being inflexible, this approach protects existing applications from breaking or recoding as changes occur.
Begin with Atomic Details, Not Summarized Data
Some claim that data marts with dimensional models are intended for managerial, strategic analysis and, therefore, should be populated with summarized data, not operational details. We strongly disagree. Dimensional models should be populated with atomic data so business users can ask very precise questions. Even if users don't care about the details of a single transaction, their "question of the moment" involves summarizing the details in unpredictable ways. Database administrators may presummarize some information, either physically or via materialized views, to avoid on-the-fly summarization with every query. However, these aggregates are performance-tuning complements to the atomic level, not replacements. If you create dimensional models with atomic details, the following fables are nonissues.
FABLE: Star schemas and dimensional models presuppose the business question. When the requirements change, the model must be modified.
FACT: When you presummarize information, you've presupposed the business question. However, dimensional models with atomic data are independent of the business question as users can roll up or drill down ad infinitum. They answer new, previously unspecified questions without database changes. Obviously, business requirements are key input to any DW/BI initiative.
FABLE: Star schemas and dimensional models are only appropriate when there's a predictable pattern of usage. Dimensional models aren't appropriate for exploratory queries.
FACT: Both normalized and dimensional models contain the same information and data relationships; both are capable of answering exactly the same questions, albeit with varying difficulty. Dimensional models naturally represent the "physics" of a measurement event; fact tables contain the measurements and dimension tables contain the context. A single dimensional model based on the most atomic data is capable of answering all possible questions against that data.
FABLE: Dimensional models aren't scalable. If detailed data is stored in a dimensional data mart, performance will be degraded. Data marts only contain recent information and are restricted from storing history.
FACT: Dimensional star schemas are extremely scalable. It isn't unusual for modern fact tables to have billions of rows corresponding to the billions of measurement transactions captured. Million-row dimension tables are common. Dimensional models should contain as much history as required to address the business requirements. There's nothing about dimensional modeling that prohibits the storage of substantial history.
FABLE: Dimensional models aren't extensible and are unable to address future needs of the data warehouse.
FACT: Dimensional models that express data at the lowest level of detail deliver maximum flexibility and extensibility. Users can summarize the atomic data any which way. Likewise, atomic data can be extended with additional attributes, measures, or dimensions without disrupting existing reports and queries.
FABLE: A dimensional model can't support complex data. It eliminates many-to-many relationships between entities, allowing only many-to-one relationships. A dimensional model can be created from an entity-relationship (ER) model; however, an ER model can't be created from a dimensional model.
FACT: The logical content of dimensional models and normalized models are identical. Every data relationship expressed in one model can be accurately expressed in the other model. Dimensional models are always based on fact tables, which are completely general many-to-many relationships. A dimensional model is a form of an ER model with unnecessary snowflaking (normalization of dimension attributes) suppressed.
Integration is the Goal, Not Normalization
Some people believe normalization solves the data integration challenge. Normalizing data contributes nothing to integration, except forcing data analysts to confront the inconsistencies across data sources.
Data integration is a process apart from any specific modeling approach. It requires identifying incompatible labels and measures used by the organization, then reaching consensus to establish and administer common labels and measures enterprise-wide. In dimensional modeling, these labels and measures reside in conformed dimensions and conformed facts, respectively. As represented in the bus architecture, conformed dimensions are the integration "glue" across measurement business processes. Conformed dimensions are typically built and maintained as centralized persistent master data during ETL, then reused across dimensional models to enable data integration and ensure semantic consistency.
FABLE: Dimensional modeling concepts like conformed dimensions put an undue burden on the ETL effort.
FACT: Data integration depends on standardized labels, values, and definitions. It's hard work to reach organizational consensus and implement the corresponding ETL system rules, but you can't dodge the effort, regardless of whether you're dealing with a normalized or dimensional model.
FABLE: Dimensional modeling isn't appropriate when there are more than two unique source systems due to the complexities of integrating data from multiple sources.
FACT: The challenges of data integration have nothing to do with the modeling approach. Paradoxically, dimensional modeling and the bus architecture reveal the labels and measures of a business so clearly that an organization has no choice but address the integration problems directly.
FABLE: Changes to dimension attributes are only an issue for dimensional models.
FACT: Every data warehouse must deal with time variance. When the characteristic of an entity like customer or product changes, we need a systematic approach for recording the change. Dimensional modeling uses a standard technique known as slowly changing dimensions (SCDs). When normalized models step up to the issue of time variance, they typically add timestamps to the entities. These timestamps serve to capture every entity change (just like a type 2 SCD does), but without using a surrogate key for each new row, the query interface must issue a double-barreled join that constrains both the natural key and timestamp between every pair of joined tables, putting an unnecessary, unfriendly burden on every reporting application or query.
FABLE: Multiple data marts can't be integrated. They're built bottoms up, catering to the needs of an individual, not the needs of an enterprise. Data mart chaos is the inevitable outcome.
FACT: It's definitely a struggle to integrate data marts that have been built as departmental, standalone solutions that haven't been architected with conformed dimensions. That's precisely why we advise against this approach! Chaos won't result if you use the bus architecture for the enterprise framework of conformed dimensions, then tackle incremental development based on business measurement processes. Organizational and cultural obstacles are inevitable as consistent definitions, business rules, and practices are established across the enterprise. The technology is the easy part.
This column clarifies the realities surrounding common dimensional modeling myths. As some of you know, the Kimball Design Tips newsletter discusses a new fable each month. If you'd like to receive a monthly tip and technique from the Kimball Group, please visit www.kimballgroup.com to register. The newsletter's sole purpose is to deliver practical insights and news about Kimball University classes to DW/BI professionals. The names and email addresses on the list will not be shared with anyone.
Ralph Kimball founder of the Kimball Group, teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. He has four best-selling data warehousing books in print, including the newly released The Data Warehouse ETL Toolkit (Wiley, 2004).
Margy Ross is president of the Kimball Group and instructor with Kimball University. She cowrote The Data Warehouse Lifecycle Toolkit (Wiley, 1998) and The Data Warehouse Toolkit, 2nd Edition.
© 2006 CMP Media LLC