domingo, 16 de dezembro de 2007

Fables and Facts: Do you know the difference between dimensional modeling truth and fiction? (M. Ross, R. Kimball, 2004)


Outubro 2004

Fables and Facts

Do you know the difference between dimensional modeling truth and fiction?
By Margy Ross & Ralph Kimball

According to Merriam-Webster, fables are fictitious statements. Unfortunately, fables about dimensional modeling circulate throughout our industry. These false claims and assertions are a distraction, especially if you're trying to align a team. In this column, we'll describe the root misunderstandings that perpetuate these myths so you understand why they're as unfounded as fairy tales about two-headed creatures.

Not all Dimensional Models are Created Equal

We review a lot of dimensional models in our work. They often illustrate best practice design principles from our Toolkits and columns. However, not all supposed dimensional models are designed appropriately. Some blatantly violate core dimensional modeling tenets. Given the abysmal sample star schemas found in seemingly authoritative books and training presentations, this comes as no surprise. However, you shouldn't lump all dimensional models into a "bad" category based on misguided imposters.

Most of the fabled assertions are rooted in several basic mistakes regarding dimensional modeling best practices. Dimensional modeling can't be blamed if its fundamental concepts aren't embraced. Likewise, criticisms lobbed by individuals who don't understand its key premises need to be taken with a grain of salt. Once we clarify these misunderstandings, you'll be prepared to personally distinguish fables from facts.

Focus on Measurement Processes, not Departmental Reports

We advocate a four-step approach for designing dimensional models. The first step is to identify the business process, followed by declaring the grain, then selecting the dimensions and facts. Nowhere do we recommend specifying the business's top 10 report layouts or queries.

If requirements are gathered by focusing exclusively on report or query templates, you're susceptible to modeling data to produce a specific report, rather than capturing the key metrics and related dimensions for analysis. Obviously, it's important to consider business usage when designing dimensional models. The dimension attributes must support the BI environment's filtering and labeling requirements. Robust dimension attributes translate into nearly endless analytic slicing-and-dicing combinations. However, don't blindly focus on a top-10 list in isolation because priorities and "hot" reports will inevitably evolve.

Instead of concentrating on specific reports or departmental needs in a vacuum, we suggest focusing the dimensional design on the most critical performance measurement process. In doing so, you can put the following fables to rest.

FABLE: Data marts are built to address a specific business report or application. When the business needs a new report, another star schema is built.

FACT: Data marts with dimensional models should be built around physical measurement processes or events. A fact table row is created when a measurement occurs. The associated dimension attributes reflect contextual characteristics and hierarchies. If the business identifies a new report based on the same measurement process, there's no need to build a new mart, model, or schema. Measurement processes are relatively stable in most organizations; the analytics performed against these metrics are more fluid.

FABLE: Dimensional models are departmental solutions. When a different department needs access to the data, a new star schema is built and labeled with the department's vocabulary. Data marts require multiple extracts from the same source data repeatedly.

FACT: Dimensional models shouldn't be departmentally bound. A fact table representing a fundamental measurement process need only have one physical instance that's shared across business functions or departments. There's no reason to create multiple extracts from the same source. Metrics resulting from the invoicing process, for example, are made available in a single dimensional model for access across the enterprise; there's no reason to replicate invoice performance metrics in separate departmental solutions for finance, marketing, and sales. Even if these departmental solutions were sourced from the same repository, they likely use similar, but slightly different naming conventions, definitions, and business rules, defeating the promise of a single version of the truth. The departmental approach is highly vulnerable to inconsistent, nonintegrated point solutions. We've never advocated this approach.

FABLE: You can't incorporate new data sources without rebuilding the original star schema or creating separate fact tables or data marts.

FACT: If the new data source is another capture system for an existing measurement process in the BI environment, then the new data can be gracefully combined with the original data without altering any existing reporting applications, presuming the granularity is the same. If the new data source is at a different grain representing a new measurement process, then a new fact table must be created. This has nothing to do with dimensional modeling. Any data representation would create a new entity when a new table with different keys is introduced.

FABLE: With dimensional modeling, the fact table is forced to a single grain that is inflexible.

FACT: Having the discipline to create fact tables with a single level of detail assures that measurements aren't inappropriately double counted. A table with mixed-grain facts can only be queried by a custom application knowledgeable about the varying levels of detail, effectively ruling out ad hoc exploration. If measurements naturally exist at different grains, then the most foolproof design establishes a fact table for each level. Far from being inflexible, this approach protects existing applications from breaking or recoding as changes occur.

Begin with Atomic Details, Not Summarized Data

Some claim that data marts with dimensional models are intended for managerial, strategic analysis and, therefore, should be populated with summarized data, not operational details. We strongly disagree. Dimensional models should be populated with atomic data so business users can ask very precise questions. Even if users don't care about the details of a single transaction, their "question of the moment" involves summarizing the details in unpredictable ways. Database administrators may presummarize some information, either physically or via materialized views, to avoid on-the-fly summarization with every query. However, these aggregates are performance-tuning complements to the atomic level, not replacements. If you create dimensional models with atomic details, the following fables are nonissues.

FABLE: Star schemas and dimensional models presuppose the business question. When the requirements change, the model must be modified.

FACT: When you presummarize information, you've presupposed the business question. However, dimensional models with atomic data are independent of the business question as users can roll up or drill down ad infinitum. They answer new, previously unspecified questions without database changes. Obviously, business requirements are key input to any DW/BI initiative.

Page 2

FABLE: Star schemas and dimensional models are only appropriate when there's a predictable pattern of usage. Dimensional models aren't appropriate for exploratory queries.

FACT: Both normalized and dimensional models contain the same information and data relationships; both are capable of answering exactly the same questions, albeit with varying difficulty. Dimensional models naturally represent the "physics" of a measurement event; fact tables contain the measurements and dimension tables contain the context. A single dimensional model based on the most atomic data is capable of answering all possible questions against that data.

FABLE: Dimensional models aren't scalable. If detailed data is stored in a dimensional data mart, performance will be degraded. Data marts only contain recent information and are restricted from storing history.

FACT: Dimensional star schemas are extremely scalable. It isn't unusual for modern fact tables to have billions of rows corresponding to the billions of measurement transactions captured. Million-row dimension tables are common. Dimensional models should contain as much history as required to address the business requirements. There's nothing about dimensional modeling that prohibits the storage of substantial history.

FABLE: Dimensional models aren't extensible and are unable to address future needs of the data warehouse.

FACT: Dimensional models that express data at the lowest level of detail deliver maximum flexibility and extensibility. Users can summarize the atomic data any which way. Likewise, atomic data can be extended with additional attributes, measures, or dimensions without disrupting existing reports and queries.

FABLE: A dimensional model can't support complex data. It eliminates many-to-many relationships between entities, allowing only many-to-one relationships. A dimensional model can be created from an entity-relationship (ER) model; however, an ER model can't be created from a dimensional model.

FACT: The logical content of dimensional models and normalized models are identical. Every data relationship expressed in one model can be accurately expressed in the other model. Dimensional models are always based on fact tables, which are completely general many-to-many relationships. A dimensional model is a form of an ER model with unnecessary snowflaking (normalization of dimension attributes) suppressed.

Integration is the Goal, Not Normalization

Some people believe normalization solves the data integration challenge. Normalizing data contributes nothing to integration, except forcing data analysts to confront the inconsistencies across data sources.

Data integration is a process apart from any specific modeling approach. It requires identifying incompatible labels and measures used by the organization, then reaching consensus to establish and administer common labels and measures enterprise-wide. In dimensional modeling, these labels and measures reside in conformed dimensions and conformed facts, respectively. As represented in the bus architecture, conformed dimensions are the integration "glue" across measurement business processes. Conformed dimensions are typically built and maintained as centralized persistent master data during ETL, then reused across dimensional models to enable data integration and ensure semantic consistency.

FABLE: Dimensional modeling concepts like conformed dimensions put an undue burden on the ETL effort.

FACT: Data integration depends on standardized labels, values, and definitions. It's hard work to reach organizational consensus and implement the corresponding ETL system rules, but you can't dodge the effort, regardless of whether you're dealing with a normalized or dimensional model.

FABLE: Dimensional modeling isn't appropriate when there are more than two unique source systems due to the complexities of integrating data from multiple sources.

FACT: The challenges of data integration have nothing to do with the modeling approach. Paradoxically, dimensional modeling and the bus architecture reveal the labels and measures of a business so clearly that an organization has no choice but address the integration problems directly.

FABLE: Changes to dimension attributes are only an issue for dimensional models.

FACT: Every data warehouse must deal with time variance. When the characteristic of an entity like customer or product changes, we need a systematic approach for recording the change. Dimensional modeling uses a standard technique known as slowly changing dimensions (SCDs). When normalized models step up to the issue of time variance, they typically add timestamps to the entities. These timestamps serve to capture every entity change (just like a type 2 SCD does), but without using a surrogate key for each new row, the query interface must issue a double-barreled join that constrains both the natural key and timestamp between every pair of joined tables, putting an unnecessary, unfriendly burden on every reporting application or query.

FABLE: Multiple data marts can't be integrated. They're built bottoms up, catering to the needs of an individual, not the needs of an enterprise. Data mart chaos is the inevitable outcome.

FACT: It's definitely a struggle to integrate data marts that have been built as departmental, standalone solutions that haven't been architected with conformed dimensions. That's precisely why we advise against this approach! Chaos won't result if you use the bus architecture for the enterprise framework of conformed dimensions, then tackle incremental development based on business measurement processes. Organizational and cultural obstacles are inevitable as consistent definitions, business rules, and practices are established across the enterprise. The technology is the easy part.

Learn more

This column clarifies the realities surrounding common dimensional modeling myths. As some of you know, the Kimball Design Tips newsletter discusses a new fable each month. If you'd like to receive a monthly tip and technique from the Kimball Group, please visit to register. The newsletter's sole purpose is to deliver practical insights and news about Kimball University classes to DW/BI professionals. The names and email addresses on the list will not be shared with anyone.

Ralph Kimball founder of the Kimball Group, teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. He has four best-selling data warehousing books in print, including the newly released The Data Warehouse ETL Toolkit (Wiley, 2004).

Margy Ross is president of the Kimball Group and instructor with Kimball University. She cowrote The Data Warehouse Lifecycle Toolkit (Wiley, 1998) and The Data Warehouse Toolkit, 2nd Edition.

© 2006 CMP Media LLC

5 comentários:

Anônimo disse...

With the ability to actually alter the texture and condition of
the hair with clever and fashion-forward hair styling products, Rusk Being products are sure to intrigue the hair product lover in
all of us. As far as medical treatment is concerned, it's often suggested to consult your doctor before any further proceeding. The product should be one that is made for your type of hair, whether it be dry, oily, curly or straight.

Feel free to surf to my webpage ... hair products

Anônimo disse...

For instance, they may select a type of lettuce that has purple coloring
in it as well as green. So, I propose to you that you might want to consider
doing your own organic gardening right in your own back yard.
In some cases, the quantity of materials for making compost in the amounts desired
(an annual layer 2 inches deep across the garden) cannot be

Feel free to surf to my page; conjunctive

Anônimo disse...

Most of these senior citizens are not comfortable with the
use of internet or are incapable of visiting the State Comptrollers office and thus for them newspaper
listings does the job. The advert designs can range from
being simple text affairs that are very basic, which also tend to be less expensive,
to the more expensive, fully graphic designed adverts with photo quality images that can take up a whole page of the newspaper.
Information regarding different websites that contain large
chunk of databases of unclaimed properties
and money are also provided in different newspapers. That’s
why, while drafting ads for newspapers selling or
buying a property, make your USP as your headline. The contempt he held for America had little to do with politics and democracy.

Check out my page - click here

Anônimo disse...

Pilots complete flights for their airline, using their simulator, either online (using a network such as Sim - Miles) or offline, and then file a pilot report.
If you make mistakes and you use a simulator, then
it will cost you nothing. The End of Military Flight Sim
Pro With this program you can take advantage of the military aspects of the game, landing and takeoff of
an airplane taking includes carrier. We would like to invite you to join
our virtual airline in Ireland if you have some time that you can spend with is
to fly online. 3) Physical parameters: The simulators are almost real, but
actually they are on the ground, I mean around the sea level.

Anônimo disse...

And while investigators say the murder victim is linked
to a "feud," they are not sharing the details about that
part of the investigation. First, ready mixed concrete in Sunderland can help you get your construction
job done faster. Repairing larger cracks or holes in driveways
takes a little more time and preparation.