Monday, December 24, 2007
COMPUTERWORLD - IBM buys database company Solid
"IBM announced on Friday (December 21) that it is acquiring Solid Information Technology, a maker of high-performance databases and a close partner of IBM rival MySQL.
Solid makes an embedded database with an in-memory engine, meaning it can store and retrieve data from main memory, delivering faster performance than traditional disk-based systems.
That makes it popular for applications that require very fast processing times, such as call routing in a telephone network or stock quoting. Solid's customers include Cisco Systems, Siemens, TeliaSonera and Nokia, according to its website.
IBM agreed to acquire the company for an undisclosed sum and expects to close the deal in the first quarter of 2008, according to the company. IBM also says the deal will strengthen its database line by adding real-time data access capabilities. Oracle, an IBM rival, acquired its own in-memory database company, TimesTen, two years ago.
IBM's acquisition can be seen as a setback for MySQL, since it marks the loss of independence of another company that provides high-performance transaction processing for the MySQL database."
Folha Online - Informática - FBI prepares the world's largest biometric database - 12/22/2007
"FBI prepares the world's largest biometric database
from Folha Online
from France Presse, in Washington
The US federal police force is developing an unprecedented biometric database that already holds tens of millions of photographs and fingerprints, FBI spokesman Richard Kolko announced on Saturday.
When complete, the project should give the US government a unique ability to identify crime, trafficking or terrorism suspects anywhere in the world, Kolko said.
'We are constantly looking to update and improve the information-gathering process,' Kolko said. He added that the system is also being designed with 'privacy protection' in mind.
The FBI is preparing to sign a 10-year contract to expand its database, which is stored in a huge underground complex in West Virginia (eastern US) that already holds 55 million sets of fingerprints."
Sunday, December 16, 2007
Fables and Facts: Do you know the difference between dimensional modeling truth and fiction? (M. Ross, R. Kimball, 2004)
October 2004
http://www.intelligententerprise.com/showArticle.jhtml?articleID=49400912
Fables and Facts
Do you know the difference between dimensional modeling truth and fiction?
By Margy Ross & Ralph Kimball

According to Merriam-Webster, fables are fictitious statements. Unfortunately, fables about dimensional modeling circulate throughout our industry. These false claims and assertions are a distraction, especially if you're trying to align a team. In this column, we'll describe the root misunderstandings that perpetuate these myths so you understand why they're as unfounded as fairy tales about two-headed creatures.
Not all Dimensional Models are Created Equal
We review a lot of dimensional models in our work. They often illustrate best practice design principles from our Toolkits and columns. However, not all supposed dimensional models are designed appropriately. Some blatantly violate core dimensional modeling tenets. Given the abysmal sample star schemas found in seemingly authoritative books and training presentations, this comes as no surprise. However, you shouldn't lump all dimensional models into a "bad" category based on misguided imposters.
Most of the fabled assertions are rooted in several basic mistakes regarding dimensional modeling best practices. Dimensional modeling can't be blamed if its fundamental concepts aren't embraced. Likewise, criticisms lobbed by individuals who don't understand its key premises need to be taken with a grain of salt. Once we clarify these misunderstandings, you'll be prepared to personally distinguish fables from facts.
Focus on Measurement Processes, not Departmental Reports
We advocate a four-step approach for designing dimensional models. The first step is to identify the business process, followed by declaring the grain, then selecting the dimensions and facts. Nowhere do we recommend specifying the business's top 10 report layouts or queries.
If requirements are gathered by focusing exclusively on report or query templates, you're susceptible to modeling data to produce a specific report, rather than capturing the key metrics and related dimensions for analysis. Obviously, it's important to consider business usage when designing dimensional models. The dimension attributes must support the BI environment's filtering and labeling requirements. Robust dimension attributes translate into nearly endless analytic slicing-and-dicing combinations. However, don't blindly focus on a top-10 list in isolation because priorities and "hot" reports will inevitably evolve.
Instead of concentrating on specific reports or departmental needs in a vacuum, we suggest focusing the dimensional design on the most critical performance measurement process. In doing so, you can put the following fables to rest.
FABLE: Data marts are built to address a specific business report or application. When the business needs a new report, another star schema is built.
FACT: Data marts with dimensional models should be built around physical measurement processes or events. A fact table row is created when a measurement occurs. The associated dimension attributes reflect contextual characteristics and hierarchies. If the business identifies a new report based on the same measurement process, there's no need to build a new mart, model, or schema. Measurement processes are relatively stable in most organizations; the analytics performed against these metrics are more fluid.
FABLE: Dimensional models are departmental solutions. When a different department needs access to the data, a new star schema is built and labeled with the department's vocabulary. Data marts require multiple extracts from the same source data repeatedly.
FACT: Dimensional models shouldn't be departmentally bound. A fact table representing a fundamental measurement process need only have one physical instance that's shared across business functions or departments. There's no reason to create multiple extracts from the same source. Metrics resulting from the invoicing process, for example, are made available in a single dimensional model for access across the enterprise; there's no reason to replicate invoice performance metrics in separate departmental solutions for finance, marketing, and sales. Even if these departmental solutions were sourced from the same repository, they likely use similar, but slightly different naming conventions, definitions, and business rules, defeating the promise of a single version of the truth. The departmental approach is highly vulnerable to inconsistent, nonintegrated point solutions. We've never advocated this approach.
FABLE: You can't incorporate new data sources without rebuilding the original star schema or creating separate fact tables or data marts.
FACT: If the new data source is another capture system for an existing measurement process in the BI environment, then the new data can be gracefully combined with the original data without altering any existing reporting applications, presuming the granularity is the same. If the new data source is at a different grain representing a new measurement process, then a new fact table must be created. This has nothing to do with dimensional modeling. Any data representation would create a new entity when a new table with different keys is introduced.
FABLE: With dimensional modeling, the fact table is forced to a single grain that is inflexible.
FACT: Having the discipline to create fact tables with a single level of detail assures that measurements aren't inappropriately double counted. A table with mixed-grain facts can only be queried by a custom application knowledgeable about the varying levels of detail, effectively ruling out ad hoc exploration. If measurements naturally exist at different grains, then the most foolproof design establishes a fact table for each level. Far from being inflexible, this approach protects existing applications from breaking or recoding as changes occur.
Begin with Atomic Details, Not Summarized Data
Some claim that data marts with dimensional models are intended for managerial, strategic analysis and, therefore, should be populated with summarized data, not operational details. We strongly disagree. Dimensional models should be populated with atomic data so business users can ask very precise questions. Even if users don't care about the details of a single transaction, their "question of the moment" involves summarizing the details in unpredictable ways. Database administrators may presummarize some information, either physically or via materialized views, to avoid on-the-fly summarization with every query. However, these aggregates are performance-tuning complements to the atomic level, not replacements. If you create dimensional models with atomic details, the following fables are nonissues.
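To make the distinction concrete, here is a minimal sketch (table and column names are invented for illustration, not taken from the article) in which a monthly aggregate is derived entirely from the atomic fact table. Summary questions can hit the small aggregate, while any unanticipated, detailed question still goes to the atomic rows.

```python
import sqlite3

# A sketch of an aggregate as a performance-tuning complement to the atomic
# fact table, not a replacement for it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Atomic grain: one row per product per day
    CREATE TABLE sales_fact (
        date_key     INTEGER,     -- assumed YYYYMMDD, e.g. 20071224
        product_key  INTEGER,
        sales_amount REAL
    );
    INSERT INTO sales_fact VALUES (20071224, 1, 100.0), (20071226, 1, 50.0);

    -- Presummarized monthly table, derived entirely from the atomic rows
    CREATE TABLE sales_month_agg AS
    SELECT date_key / 100 AS month_key,     -- 20071224 -> 200712
           product_key,
           SUM(sales_amount) AS sales_amount
    FROM sales_fact
    GROUP BY month_key, product_key;
""")

# Monthly questions can be answered from the small aggregate...
print(conn.execute("SELECT * FROM sales_month_agg").fetchall())   # [(200712, 1, 150.0)]
# ...but any precise, unanticipated question still goes to the atomic table.
print(conn.execute("SELECT * FROM sales_fact WHERE date_key = 20071226").fetchall())
```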
FABLE: Star schemas and dimensional models presuppose the business question. When the requirements change, the model must be modified.
FACT: When you presummarize information, you've presupposed the business question. However, dimensional models with atomic data are independent of the business question as users can roll up or drill down ad infinitum. They answer new, previously unspecified questions without database changes. Obviously, business requirements are key input to any DW/BI initiative.
FABLE: Star schemas and dimensional models are only appropriate when there's a predictable pattern of usage. Dimensional models aren't appropriate for exploratory queries.
FACT: Both normalized and dimensional models contain the same information and data relationships; both are capable of answering exactly the same questions, albeit with varying difficulty. Dimensional models naturally represent the "physics" of a measurement event; fact tables contain the measurements and dimension tables contain the context. A single dimensional model based on the most atomic data is capable of answering all possible questions against that data.
FABLE: Dimensional models aren't scalable. If detailed data is stored in a dimensional data mart, performance will be degraded. Data marts only contain recent information and are restricted from storing history.
FACT: Dimensional star schemas are extremely scalable. It isn't unusual for modern fact tables to have billions of rows corresponding to the billions of measurement transactions captured. Million-row dimension tables are common. Dimensional models should contain as much history as required to address the business requirements. There's nothing about dimensional modeling that prohibits the storage of substantial history.
FABLE: Dimensional models aren't extensible and are unable to address future needs of the data warehouse.
FACT: Dimensional models that express data at the lowest level of detail deliver maximum flexibility and extensibility. Users can summarize the atomic data any which way. Likewise, atomic data can be extended with additional attributes, measures, or dimensions without disrupting existing reports and queries.
FABLE: A dimensional model can't support complex data. It eliminates many-to-many relationships between entities, allowing only many-to-one relationships. A dimensional model can be created from an entity-relationship (ER) model; however, an ER model can't be created from a dimensional model.
FACT: The logical content of dimensional models and normalized models is identical. Every data relationship expressed in one model can be accurately expressed in the other model. Dimensional models are always based on fact tables, which are completely general many-to-many relationships. A dimensional model is a form of an ER model with unnecessary snowflaking (normalization of dimension attributes) suppressed.
Integration is the Goal, Not Normalization
Some people believe normalization solves the data integration challenge. Normalizing data contributes nothing to integration, except forcing data analysts to confront the inconsistencies across data sources.
Data integration is a process apart from any specific modeling approach. It requires identifying incompatible labels and measures used by the organization, then reaching consensus to establish and administer common labels and measures enterprise-wide. In dimensional modeling, these labels and measures reside in conformed dimensions and conformed facts, respectively. As represented in the bus architecture, conformed dimensions are the integration "glue" across measurement business processes. Conformed dimensions are typically built and maintained as centralized persistent master data during ETL, then reused across dimensional models to enable data integration and ensure semantic consistency.
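As a rough illustration of conformed dimensions acting as integration glue, the sketch below (all names are hypothetical) builds one master customer dimension during ETL and resolves the natural keys of two different measurement processes through the same key map, so both fact tables carry identical customer labels.

```python
# A minimal sketch of conformed-dimension reuse during ETL: one master customer
# dimension is built once, and every fact table load resolves its source-system
# customer IDs through the same key map.

customer_dim = {}   # surrogate_key -> cleaned, agreed-upon attributes
key_map = {}        # natural key from the source system -> surrogate_key

def conform_customer(natural_key, attributes):
    """Return the shared surrogate key, creating the dimension row on first sight."""
    if natural_key not in key_map:
        surrogate = len(customer_dim) + 1
        key_map[natural_key] = surrogate
        customer_dim[surrogate] = attributes
    return key_map[natural_key]

# Two different measurement processes load against the SAME dimension.
invoice_facts, shipment_facts = [], []

def load_invoice(natural_customer, amount):
    key = conform_customer(natural_customer, {"name": natural_customer})
    invoice_facts.append({"customer_key": key, "invoice_amount": amount})

def load_shipment(natural_customer, cases):
    key = conform_customer(natural_customer, {"name": natural_customer})
    shipment_facts.append({"customer_key": key, "cases_shipped": cases})

load_invoice("CUST-42", 1200.00)
load_shipment("CUST-42", 35)
# Both rows now carry the same customer_key, so invoicing and shipping metrics
# can be compared side by side with consistent customer labels.
```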
FABLE: Dimensional modeling concepts like conformed dimensions put an undue burden on the ETL effort.
FACT: Data integration depends on standardized labels, values, and definitions. It's hard work to reach organizational consensus and implement the corresponding ETL system rules, but you can't dodge the effort, regardless of whether you're dealing with a normalized or dimensional model.
FABLE: Dimensional modeling isn't appropriate when there are more than two unique source systems due to the complexities of integrating data from multiple sources.
FACT: The challenges of data integration have nothing to do with the modeling approach. Paradoxically, dimensional modeling and the bus architecture reveal the labels and measures of a business so clearly that an organization has no choice but to address the integration problems directly.
FABLE: Changes to dimension attributes are only an issue for dimensional models.
FACT: Every data warehouse must deal with time variance. When the characteristic of an entity like customer or product changes, we need a systematic approach for recording the change. Dimensional modeling uses a standard technique known as slowly changing dimensions (SCDs). When normalized models step up to the issue of time variance, they typically add timestamps to the entities. These timestamps serve to capture every entity change (just like a type 2 SCD does), but without using a surrogate key for each new row, the query interface must issue a double-barreled join that constrains both the natural key and timestamp between every pair of joined tables, putting an unnecessary, unfriendly burden on every reporting application or query.
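The contrast described above can be sketched in SQL. The two queries below use invented table and column names: the first assumes a Type 2 SCD dimension with surrogate keys, the second shows the double-barreled join that a timestamp-only normalized design forces on every query.

```python
# A sketch (hypothetical schema) contrasting the two join styles.

# Type 2 SCD with surrogate keys: each fact row already points at the exact
# historical version of the customer, so the join is a simple equijoin.
surrogate_key_join = """
    SELECT c.customer_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN customer_dim c ON c.customer_key = f.customer_key
    GROUP BY c.customer_name;
"""

# Normalized entity with begin/end timestamps: every join must constrain both
# the natural key AND the date range -- the "double-barreled" join above.
double_barreled_join = """
    SELECT c.customer_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN customer c
      ON c.customer_natural_key = f.customer_natural_key
     AND f.transaction_date >= c.row_effective_date
     AND f.transaction_date <  c.row_expiration_date
    GROUP BY c.customer_name;
"""
```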
FABLE: Multiple data marts can't be integrated. They're built bottom-up, catering to the needs of an individual, not the needs of an enterprise. Data mart chaos is the inevitable outcome.
FACT: It's definitely a struggle to integrate data marts that have been built as departmental, standalone solutions that haven't been architected with conformed dimensions. That's precisely why we advise against this approach! Chaos won't result if you use the bus architecture for the enterprise framework of conformed dimensions, then tackle incremental development based on business measurement processes. Organizational and cultural obstacles are inevitable as consistent definitions, business rules, and practices are established across the enterprise. The technology is the easy part.
Learn more
This column clarifies the realities surrounding common dimensional modeling myths. As some of you know, the Kimball Design Tips newsletter discusses a new fable each month. If you'd like to receive a monthly tip and technique from the Kimball Group, please visit www.kimballgroup.com to register. The newsletter's sole purpose is to deliver practical insights and news about Kimball University classes to DW/BI professionals. The names and email addresses on the list will not be shared with anyone.
Ralph Kimball, founder of the Kimball Group, teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. He has four best-selling data warehousing books in print, including the newly released The Data Warehouse ETL Toolkit (Wiley, 2004).
Margy Ross is president of the Kimball Group and instructor with Kimball University. She cowrote The Data Warehouse Lifecycle Toolkit (Wiley, 1998) and The Data Warehouse Toolkit, 2nd Edition.
© 2006 CMP Media LLC
Bringing up Supermarts (R. Kimball, 1998)
DBMS, January 1998
One of the most widely debated issues in data warehousing is how to go about planning the warehouse construction. Do we build the whole data warehouse all at once from a central, planned perspective (the monolithic approach) or do we opportunistically build separate subject areas whenever we feel like it (the stovepipe approach)? In this article I set out to demolish two powerful myths. First, nobody believes in a totally monolithic approach, and yet nobody defends a totally stovepipe approach either. All the leading data warehouse practitioners use some kind of architectural step-by-step approach to build an enterprise data warehouse. This article describes a specific variation of that step-by-step approach called "supermarts." A supermart is a data mart that has been carefully built with a disciplined architectural framework.
The second myth I want to demolish is the view, left over from an earlier phase of data warehouse development, that a data mart must be restricted to a highly aggregated subset of a nonqueryable data warehouse. This view of data marts is the source of many problems and misunderstandings. You will see that a supermart is naturally a complete subset of the overall data warehouse and that every useful supermart must be based on the most granular (atomic) data that can possibly be collected and stored.
The Planning Crisis
The task of planning an enterprise data warehouse is daunting. The newly appointed manager of the data warehousing effort in a large enterprise is faced with two huge and seemingly unrelated challenges. On the one hand, the manager is supposed to understand the content and location of the most complicated asset owned by the enterprise: the legacy data. Somehow (usually overnight) the new data warehouse manager is supposed to become an authority on exactly what is contained in all those VSAM, ISAM, IMS, DB2, and Oracle tables. Every field in every table must be understood. The data warehouse manager must be able to retrieve any such element of data and, if necessary, clean it up and correct it. If all this weren't enough, the data warehouse manager is also supposed to understand exactly what keeps management awake at night. The data warehouse is expected to contain the exact data needed to answer everyone's burning questions. Of course, the data warehouse manager is "free" to drop in on senior management at any time to discuss current corporate priorities. Just make sure you get this data warehouse done pretty soon.
The pressure of this daunting task has built up to the point where it has a name: data mart. Regardless of specific definitions, the phrase "data mart" means avoiding the impossibility of tackling the enterprise data warehouse-planning job all at once. Data warehouse planners take refuge in carving off a little piece of the whole data warehouse, bringing it to completion, and calling it a data mart.
Unfortunately, in many cases building separate data marts rather than a single data warehouse has become an excuse for ignoring any kind of design framework that might tie the data marts together. Vendors' marketing claims for a "data mart in a box" and a "15-minute data mart" are pandering to the market's need for a simple solution, but these claims are a real disservice to the data warehousing manager who must make these data marts fit together into a coherent whole.
Isolated stovepipe data marts that cannot be tied together usefully are the bane of the data warehouse movement. They are much worse than a simple lost opportunity for analysis. Stovepipe data marts perpetuate incompatible views of the enterprise. Stovepipe data marts enshrine the reports that cannot be compared with each other. And stovepipe data marts become legacy implementations in their own right, where, by their very existence, they block the development of an integrated enterprise data warehouse. Some people have begun calling these legacy data marts "lega marts." So if building the data warehouse all at once is too daunting, and building it as isolated pieces defeats the overall goal, what is to be done?
Data Marts with an Architecture
The answer to this dilemma is to start the data warehouse planning process with a short overall architecture phase that has finite and specific goals. Next, follow this architecture phase with a step-by-step implementation of separate data marts where each implementation step closely adheres to the architecture. In this way the data warehouse manager gets the best of both worlds. The architecture phase produces specific guidelines that the separate data mart development teams can follow, and the data mart development teams can work fairly independently and asynchronously. As the separate data marts come online, they will fit together like the pieces of a puzzle. We call these coordinated data marts supermarts. At some point, enough supermarts exist to make good on the promise of an integrated enterprise data warehouse.
To succeed in building an enterprise data warehouse you must inevitably perform the following two steps: First, create a surrounding architecture that defines the scope and implementation of the complete data warehouse, and second, oversee the construction of each piece of the complete data warehouse. Now stop and consider the second step. The biggest task in constructing a data warehouse is designing the extract system. This is the mechanism that gets the data from a specific legacy system and moves it into the data staging area, where it can be transformed into the various load record images needed by the final database that presents the data for querying. Since the implementation of the extract logic is largely specific to each original data source, it really doesn't matter whether you think of the task as a whole or break it into pieces. Either way, you have to put your pants on one leg at a time. You will in effect be implementing your data marts one at a time no matter how you plan your project.
Conformed Dimensions and Standard Fact Definitions
In the architecture phase that precedes the implementation of any data mart, the goals are to produce a master suite of conformed dimensions and standardize the definitions of facts. I am assuming you have a proper dimensional design for all the data marts. Any given data mart is assumed to consist of a set of fact tables, each with a multipart key made up of dimension key components (foreign keys joining to the dimension tables). The fact tables all contain zero or more facts that represent measurements taken at each combination of the dimension key components. Every fact table is surrounded by a "halo" of dimension tables, where the dimension keys are primary keys in each respective dimension table. This design is also called a star join. If you are not familiar with dimensional modeling, there are many references on this subject. See, for example, my article "A Dimensional Modeling Manifesto," DBMS, August 1997.
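For readers who want to see this structure in code, here is a minimal sketch of a star join (the schema and names are invented, not drawn from the article): a fact table whose multipart key is composed of foreign keys to the surrounding dimension tables, plus a typical query that constrains the dimensions and aggregates the facts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, region TEXT);

    -- The fact table's multipart key is made up of foreign keys to the dimensions.
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim(date_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        store_key    INTEGER REFERENCES store_dim(store_key),
        sales_amount REAL,
        units_sold   INTEGER,
        PRIMARY KEY (date_key, product_key, store_key)
    );
""")

# A typical star-join query: constrain the dimensions, aggregate the facts.
star_join = """
    SELECT d.month, p.category, SUM(f.sales_amount) AS revenue
    FROM sales_fact f
    JOIN date_dim    d ON d.date_key    = f.date_key
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    WHERE s.region = 'West'
    GROUP BY d.month, p.category;
"""
conn.execute(star_join)   # runs; returns no rows here because no data was loaded
```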
A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined. Generally this means that a conformed dimension is identical in each data mart. Examples of obvious conformed dimensions include Customer, Product, Location, Deal (Promotion), and Calendar (Time). (See Figure 1.) A major responsibility of the central data warehouse design team is to establish, publish, maintain, and enforce the conformed dimensions.
Establishing a conformed dimension is a very significant step. A conformed Customer dimension is a master table of customers with a clean customer key and many well-maintained attributes describing each customer. It is likely that the conformed Customer dimension is an amalgamation and distillation of data from several legacy systems and possibly outside sources. The address fields in the Customer dimension, for instance, should constitute the best mailable address that is known for each customer anywhere within the enterprise. It is often the responsibility of the central data warehouse team to create the conformed Customer dimension and provide it as a resource to the rest of the enterprise, both for legacy and data warehouse use. I described this special role for the central data warehouse team in more detail in my Data Warehouse Architect column in DBMS, June 1997.
The conformed Product dimension is the enterprise's agreed-upon master list of products, including all product attributes and all product rollups such as category and department. A good Product dimension, like a good Customer dimension, should have at least 50 separate textual attributes.
Ideally, the conformed Location dimension should be based on specific points on the map, like specific street addresses or even precise latitudes and longitudes. Specific points in space roll up to every conceivable geographic hierarchy, including city-county-state-country, as well as ZIP codes and idiosyncratic sales territories and sales regions.
The conformed Calendar dimension will almost always be a table of individual days, spanning a decade or more. Each day will have many useful attributes drawn from the legal calendars of the various states and countries the enterprise deals with, as well as special fiscal calendar periods and marketing seasons relevant only to internal managers.
Conformed dimensions are enormously important to the data warehouse. Without a strict adherence to conformed dimensions, the data warehouse cannot function as an integrated whole. If a dimension like Customer or Product is used in a nonconformed way, then either the separate data marts cannot be used together or, worse yet, attempts to use them together will produce incorrect results. To state this more positively, conformed dimensions make it possible for a single dimension table to be used against multiple fact tables in the same database space, for consistent user interfaces and data content whenever the dimension is used, and for a consistent interpretation of attributes and, therefore, rollups across data marts.
Designing Conformed Dimensions
The task of identifying and designing the conformed dimensions should take a few weeks. Most conformed dimensions will naturally be defined at the most granular level possible. The grain of the Customer dimension will be the individual customer. The grain of the Product dimension will be the lowest level at which products are tracked in the source systems. The grain of the Time dimension will usually be individual day. Those systems also tracking specific Time-of-Day should probably make Time-of-Day a separate dimension.
Conformed dimensions should almost always have an anonymous data warehouse key that is not a production system key from one of the legacy systems. There are many reasons for the data warehouse keys to be independent from production. The administrative goals of the production systems are not the same as those of the data warehouse. Sooner or later, the production system will step on the data warehouse, either by reusing the same key or by changing the administrative assumptions in some way. Also, the data warehouse has to produce generalized keys for various situations, including the problem of slowly changing dimensions. For a discussion of slowly changing dimensions, see my April 1996 DBMS column.
The Data Staging Area
A good conformed dimension is often a combination of attributes from multiple sources. In the data staging area of the data warehouse, the contributions from these multiple sources are merged under the overall data warehouse key for each record, such as Customer. The main activities of the data staging area are sorting, cleaning, looking up, checking relationships, and merging. An interesting and provocative view is that these data staging activities are not a very good match for relational databases. Data staging is more about processing sequential files on a single pass under program control. In a future article, I will propose some technologies and architectures for data staging that avoid the investment and overhead of a full-blown relational DBMS for this step of the extract pipeline.
Taking the Pledge
If the central data warehouse team succeeds in defining and providing a set of master conformed dimensions for the enterprise, it is extremely important for the separate data mart teams actually to use these dimensions. The commitment to use the conformed dimensions is much more than a technical decision. It is a business-policy decision that is key to making the enterprise data warehouse function. The use of the conformed dimensions should be supported at the highest executive levels. This issue should be a sound bite for the enterprise CIO.
Using data effectively in a large enterprise is intimately connected to how the enterprise is organized and how it communicates internally. The data warehouse is the vehicle for delivering the data to all the affected parties. Changing the way an enterprise is organized, how it communicates, and how it uses its data assets is mainline business reengineering. The CIO should make all the separate data mart teams "take the pledge" always to use the conformed dimensions.
Permissible Variations of Conformed Dimensions
It is possible to create a subset of a conformed dimension table for certain data marts if you know that the domain of the associated fact table only contains that subset. For example, the master Product table can be restricted to just those products manufactured at a particular location if the data mart in question pertains only to that location. We could call this a simple data subset, because the reduced dimension table preserves all the attributes of the original dimension and exists at the original granularity.
A rollup data subset systematically removes both rows and columns from the original dimension table. For example, it is common to roll the Calendar table up from individual days to months. In this case we may keep only the record describing the first day of each month, but we must also remove all those attributes, like Day-of-Week and Holiday-Flag, that only make sense at a daily grain.
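A small sketch of such a rollup data subset, with hypothetical attribute names: the month-grain calendar keeps one row per month and drops the attributes that only make sense daily.

```python
# Derive a month-grain calendar from the conformed day-grain calendar by
# keeping the first day seen for each month and dropping day-grain attributes.

daily_calendar = [
    {"date_key": 19980101, "month": "1998-01", "day_of_week": "Thursday", "holiday_flag": True},
    {"date_key": 19980102, "month": "1998-01", "day_of_week": "Friday",   "holiday_flag": False},
    {"date_key": 19980201, "month": "1998-02", "day_of_week": "Sunday",   "holiday_flag": False},
]

monthly_calendar = {}
for day in daily_calendar:
    # First row per month wins; day-level attributes are deliberately not copied.
    monthly_calendar.setdefault(day["month"], {"month_key": day["date_key"],
                                               "month": day["month"]})

print(list(monthly_calendar.values()))
# [{'month_key': 19980101, 'month': '1998-01'}, {'month_key': 19980201, 'month': '1998-02'}]
```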
Perhaps you are wondering how to create queries in an environment where the conformed dimensions can be subsetted. Which dimension table should be used where? Actually, it is much simpler than it sounds. Each dimension table is paired with its companion fact table in a particular data mart. Any application that drills across data marts must inevitably use multipass SQL to query each supermart separately and in sequence. It is usually the case that a separate SQL query is generated for each column in a drill-across report. The beauty of using conformed dimensions is that the report will run to completion if and only if the dimension attributes used in the report are found in each dimension table. Since the dimensions are conformed, the business answers are guaranteed to be consistent. The numbers will also be comparable if we have established standard fact definitions.
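A rough sketch of the multipass, drill-across pattern described above (table names and result values are invented): one query per data mart, each grouped by the same conformed attribute, with the result sets then merged on that shared row header.

```python
# One SQL query per report column, each against its own mart, each grouped by
# the SAME conformed Product attribute.
shipments_query = """
    SELECT p.category, SUM(f.ship_cases) AS ship_cases
    FROM shipments_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category;
"""
orders_query = """
    SELECT p.category, SUM(f.order_amount) AS order_amount
    FROM orders_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category;
"""

def drill_across(result_sets):
    """Merge per-mart result sets ({row_header: value}) into one report."""
    report = {}
    for column_name, rows in result_sets.items():
        for row_header, value in rows.items():
            report.setdefault(row_header, {})[column_name] = value
    return report

# Pretend results of running the two queries against their respective marts:
report = drill_across({
    "ship_cases":   {"Snacks": 1200,    "Beverages": 800},
    "order_amount": {"Snacks": 56000.0, "Beverages": 41000.0},
})
# Because Product is conformed, 'Snacks' means exactly the same thing in both columns.
```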
Establishing the Standard Fact Definitions
I have talked thus far about the central task of setting up conformed dimensions to tie supermarts together. This is 80 percent of the up-front architectural effort. The remaining 20 percent is establishing standard fact definitions.
Fortunately, the task of identifying the standard fact definitions is done at the same time as the identification of the conformed dimensions. You need standard fact definitions when you use the same terminology across data marts and when you build single reports that drill across the data marts as described in the previous section.
Examples of facts that must be standardized include Revenue, Profit, Standard Prices, and Standard Costs. The underlying equations that derive these measures must be the same if they are to be called the same thing. These standard fact definitions need to be defined in the same dimensional context and with the same units of measurement from data mart to data mart. Revenues and profits need to be reported in the same time periods and in the same geographies.
Sometimes a fact has a natural unit of measure in one fact table and another natural unit of measure in another fact table. For example, the flow of a product down a typical manufacturing value chain may best be measured in shipping cases at the manufacturer but should be measured in scanned units at the retailer. Even if all the dimensional considerations have been correctly taken into account, it would be difficult to use these two incompatible units of measure in one drill-across report. The usual solution to this is to refer the user to a conversion factor buried in the product dimension table and hope the user can find the conversion factor and use it correctly. This is unacceptable overhead in my opinion. The correct solution is to carry the fact in both shipping cases and in scanned units in the manufacturer's table or in a view of the manufacturer's table. That way a report can easily glide down the value chain, picking off comparable facts.
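As a small illustration of carrying a fact in both units of measure, the sketch below (hypothetical names and conversion factor) computes the second unit once at load time from a factor stored in the product dimension, so reports never have to hunt for the conversion themselves.

```python
# Compute the equivalent scanned units at load time, so the manufacturer's
# fact row carries both units of measure side by side.

product_dim = {
    101: {"product_name": "Cola 12-pack", "scan_units_per_case": 24},
}

def load_shipment_fact(product_key, ship_cases):
    factor = product_dim[product_key]["scan_units_per_case"]
    return {
        "product_key": product_key,
        "ship_cases": ship_cases,                      # manufacturer's natural unit
        "equivalent_scan_units": ship_cases * factor,  # retailer's natural unit
    }

row = load_shipment_fact(101, 50)
# {'product_key': 101, 'ship_cases': 50, 'equivalent_scan_units': 1200}
```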
If it is difficult or impossible to standardize a fact exactly, then care must be taken to give the different interpretations different names. We must distinguish Month End Revenue from Billing Cycle Revenue. The most serious mistake is to call both of these facts Revenue.
The Importance of Granularity
The conformed dimensions will usually be granular because each record in these tables most naturally corresponds to a single description of a customer, a product, a promotion, or a day. This makes it quite easy to bring in the associated fact tables at the intersection of all these granular dimensions. In other words, the base-level fact tables in each data mart should be at the natural lowest levels of all the constituent dimensions.
There is tremendous power and resiliency in granular fact-table data. By expressing the bedrock data of the data mart at the lowest grain, the data mart becomes almost impervious to surprises or changes. Such a granular fact table can be gracefully extended by adding newly sourced facts, newly sourced dimension attributes, and whole dimensions. When I say "gracefully extended," I mean specifically that all old queries and applications continue to run after a graceful change has been made; no fact tables have to be dropped and reloaded, and no keys have to be changed. This notion of graceful extension is one of the strong characteristics of the dimensional modeling approach. I discussed this in detail in the September 1997 "Manifesto" article mentioned earlier.
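A minimal sketch of graceful extension, using an invented schema: an existing query keeps returning the same answer after a newly sourced fact column is added to the granular fact table, with no reload and no key changes.

```python
import sqlite3

# Graceful extension: an old query keeps running unchanged after a new fact
# column is added to the granular fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER, sales_amount REAL)")
conn.execute("INSERT INTO sales_fact VALUES (19980105, 101, 99.0)")

old_query = "SELECT product_key, SUM(sales_amount) FROM sales_fact GROUP BY product_key"
before = conn.execute(old_query).fetchall()

# Extend the fact table with a newly sourced fact; existing rows simply get NULL.
conn.execute("ALTER TABLE sales_fact ADD COLUMN promotion_cost REAL")

after = conn.execute(old_query).fetchall()
assert before == after   # nothing dropped, nothing reloaded, no keys changed
```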
When the fact tables are granular, they serve as the natural destination for current operational data that may be extracted frequently from the operational systems. The current rolling snapshot of an operational system finds a happy home in the granular fact table defined at the Account by Day by Transaction level. In this issue, my regular column (See Data Warehouse Architect) describes two companion tables, the Transaction and Snapshot fact tables, that should be the bedrock foundation of many data marts.
A new and growing justification for extremely granular data is the desire to do data mining and to understand customer behavior. Data mining is generally much less effective on aggregated data. Well, suppose we agree with all this but we don't have enough room on our data mart server to store the big granular fact tables. Does this invalidate the whole approach of this article?
No! This dilemma is simply a matter of terminology. Let us broaden the definition of data mart (or supermart) to include all the fact tables and dimension tables associated with a business process, regardless of where they are physically stored. We are moving so rapidly into a network-oriented view of our data warehouses that tying the definition of a data mart to a specific hardware box is unnecessarily restricting. The queries against our data marts are increasingly being handled by a navigator layer that sorts out the physical location of the data based on the details of the user's request. If the user calls for relatively aggregated data, perhaps that data is on a local server. But if the user calls for more detailed data, we switch to a larger centralized machine at another location. There are many reasons for encouraging this kind of indirection at the physical storage level, because it gives the back-room DBAs much more flexibility to mix and match hardware without worrying about tweaking the users' applications.
Higher-Level Data Marts
This article has mostly focused on first-level data marts that are recognizable images of legacy applications. In other words, if we have an Orders system, then we have an Orders data mart. If we have a Payments and Collections system, then we have a Payments and Collections data mart.
I recommend starting with these kinds of first-level data marts because I believe this minimizes the risk of signing up to an implementation that is too ambitious. Most of the risk comes from biting off too big an extract programming job. Also, in many cases an efficiently implemented first-level data mart will provide users with enough interesting data to keep them happy and quiet while the data mart teams keep working on harder issues.
After several first-level data marts have been implemented, it is reasonable to combine these data marts into a second-level data mart. The classic example of a second-level data mart is the Profitability data mart, where separate components of revenue and cost are combined to allow a complete view of profitability. Profitability data marts at a granular level are tremendously exciting because they can be rolled up through the Customer dimension to generate customer profitability. They can be rolled up through the Product dimension to generate product profitability. And they can be rolled up through the Promotion dimension to generate promotion profitability. The lesson here is to be disciplined and not try to bring up a complete Profitability data mart on the first try. Otherwise you will drown in extract programming as you struggle to source all the separate components of revenue and cost. If you are absolutely forced to bring up profitability in your first data mart then you may need to assign costs with simple rule-of-thumb allocations rather than by doing the complete job of sourcing all the underlying cost detail. Later, when you have time to source the cost detail correctly, you can bring up an Activity Based Profitability second-level data mart to supersede your first-level data mart.
Rescuing Stovepipes
Can you rescue your stovepipes and convert them into supermarts? You can do this only if the dimensions used in the stovepipes can be mapped one-to-one or one-to-many with the proper conformed dimensions. If so, then the stovepipe dimension can be replaced with the conformed dimension. In some cases the conformed dimension can gracefully inherit some of the special attributes of the stovepipe dimension. But usually a stovepipe data mart has one or more dimensions that cannot easily be mapped into the conformed dimensions. The stovepipe sales geographies may be incompatible with the conformed sales geographies. Be careful about assuming that you can decompose something simple like stovepipe weeks back into conformed dimension days. While this sounds simple logically, you are probably making an unfounded assumption that you can spread a stovepipe week over weekdays and then turn around and add up again into months. As difficult as it is to admit, most of the time the stovepipe data marts must be shut down and rebuilt in the proper conformed dimensional framework.
When You Don't Need Conformed Dimensions
If your customers and products are disjoint and you don't manage your separate business lines together, then there is little point in building a data warehouse that tightly integrates these businesses. For example, if you are a conglomerate whose subsidiaries span food businesses, hardware businesses, and services, it probably doesn't make sense to think of your customers as leveraging your brand name from product to product or service to service. Even if you tried to build a set of conformed Product and Customer dimensions in a single data warehouse spanning all the lines of business, most of the reports would end up weirdly "diagonalized," with the data from each business line in rows and columns not shared by the other lines. In this case your data warehouse would mimic your management structure, and you would build separate and self-contained data warehouses for each subsidiary.
Clear Vision
In this article, I have described a rational approach to planning an enterprise data warehouse. We have achieved the best of two worlds. We have created an architectural framework with supermarts that guides the overall design, but we have divided the problem into bite-sized chunks that can be implemented by real human beings. Along the way, we have adjusted the conventional definition of the data mart, calling it a supermart. We see now that a supermart is a complete subset of the overall data warehouse, physically distributed across multiple hardware and software platforms, and always based on the most granular data that we can possibly extract from our legacy systems. Every supermart is a family of similar tables sharing conformed dimensions. These conformed dimensions have a uniform interpretation across the enterprise. Finally, we see the overall data warehouse for what it is: a collection of separately implemented supermarts bound together with a powerful architecture based on conformed dimensions and standardized facts.
Figure 1: An example of two data marts with conformed dimensions (Time, Customer, Product, and Promotion). These two data marts can work together to produce a combined report with Product and Promotion as row headers.
Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996). You can reach Ralph through his Web page at http://www.rkimball.com/.
DBMS and Internet Systems (http://www.dbmsmag.com/)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Dimensional Relational vs. OLAP: The Final Deployment Conundrum
The choice between deploying relational tables or OLAP cubes is not a trivial matter. Weigh these 34 pros and cons of each approach early in the design of your extract-transform-load system.
By Ralph Kimball
April 27, 2007
It has become fashionable to regard the final deployment step of building an extract-transform-load (ETL) system as a mere tactical choice between delivering dimensional relational tables or OLAP cubes into the end-user environment. But is this choice quite so superficial? Should we defer thinking about this choice until just before roll-out? In this article we take a hard look at this final deployment conundrum and urge you to resolve this question very early in the design process.

Business intelligence (BI) developers have largely accepted the premise that data is most user-friendly when delivered in a dimensional format. If you define the data warehouse as the platform for all forms of BI, then the last step of the ETL process in the data warehouse is exposing the data in a dimensional format. Many BI developers have recognized that a properly designed set of dimensional relational tables can be transformed into an OLAP cube in a virtual one-to-one mapping. For various reasons explained below, I recommend that all OLAP cubes be built directly from dimensional models. The dimension tables in such a relational schema become the OLAP cube dimensions, often referred to as the edges of the cube. The fact tables from the relational schema provide the content of the OLAP cube cells.
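The one-to-one mapping can be sketched conceptually (this is not any vendor's OLAP API; all names and values are invented): the dimension tables supply the cube's edges, and each fact row lands in the cell addressed by its dimension keys.

```python
# Dimension tables become the cube's edges; fact rows populate the cells.
date_dim     = {1: "2007-Q1", 2: "2007-Q2"}
product_dim  = {10: "Widgets", 11: "Gadgets"}
customer_dim = {100: "Acme", 101: "Globex"}

fact_rows = [
    # (date_key, product_key, customer_key, sales_amount)
    (1, 10, 100, 5000.0),
    (2, 10, 100, 6200.0),
    (1, 11, 101, 1800.0),
]

# The "cube": each cell is addressed by one member from each edge.
cube = {}
for date_key, product_key, customer_key, amount in fact_rows:
    cell = (date_dim[date_key], product_dim[product_key], customer_dim[customer_key])
    cube[cell] = cube.get(cell, 0.0) + amount

print(cube[("2007-Q1", "Widgets", "Acme")])   # 5000.0
```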
While there can be some small semantic differences between the relational dimensions and the OLAP cube dimensions, the profound overlap between these two approaches has made it tempting to regard the final deployment choice as a tactical maneuver executed at the very end of the data warehouse development. Worse yet, the argument is sometimes made that BI applications can be switched between relational and OLAP implementations because of this similarity.
Not so fast! Under the right circumstances, the ETL pipeline can be mostly insulated from the final deployment choice, but the relational vs. OLAP choice is a multi-faceted decision with lots of issues to consider. Let's look at the advantages and disadvantages of both choices before we jump on either bandwagon.
Dimensional Relational Advantages
- Relational database structures are largely vendor independent, and dimensional structures in particular are easily ported. However, ETL command-line scripts and proprietary code like PL/SQL may not be very portable.
- All the main relational DBMS's have high-volume bulk data loaders, and they're especially effective if you turn off transaction logging.
- A wide variety of SQL-generating tools from many vendors are able to directly access the data. Usually these tools can be repurposed to point at a new relational DBMS.
- SQL expertise is widely available in the marketplace, since the main features of SQL have long been standardized, and SQL is routinely taught at the college level.
- Hand-coded SQL is generally readable, although the SQL emitted by high-end BI tools is overwhelming and cannot reasonably be changed by the BI applications development staff.
- There are many different ways to control DBMS performance, including schema designs, indexes, aggregates and materialized views. Dimensional relational structures, because of their predictable characteristics, have well-understood performance tuning techniques, although these techniques vary somewhat by vendor.
- Relational databases are extremely stable and are suitable for serious archiving and backup.
- Relational database structures are not vulnerable to catastrophic invalidation, unlike OLAP cubes, which may involuntarily rebuild themselves if an SCD Type 1 change is made to a dimension.
- Database sizes are pretty much unlimited, and single dimensional fact tables in the multi-terabyte range are increasingly common.
- High-end relational databases can join many large tables.
- Hybrid (mixed workload) applications involving both querying and updating are usually easy to construct.
Relational Disadvantages
- SQL is a truly horrible language for powerful analysis and complex applications.
- SQL is severely asymmetric: you can perform complex constraints and calculations generally only WITHIN records, not across records.
- In spite of performance tuning capabilities, it is still too easy to lose control of performance.
OLAP Advantages
- Generally much better performance than relational when the cubes are designed correctly, with less need to play complex performance tuning games compared to relational.
- Much more powerful analytic capabilities than relational. MDX, for example, has more powerful semantics for traversing ragged unbalanced hierarchies, such as organization charts, than does SQL.
- Vendor-supplied OLAP tools for reporting and querying have historically been superior to relational tools, although I'm impressed by the steady investment relational tool vendors have made in improving ease of use and feature sets.
- OLAP doesn't suffer from the symmetry problem that limits SQL.
- The decision to load OLAP directly from dimensional relational tables doesn't affect the ETL backroom very much: the deployment step occurs at the very end.
- Certain loading scenarios can be very fast.
- Certain vertical industries, especially financial services, have developed awesome OLAP solutions.
- OLAP encourages more complex security scenarios, especially for ad hoc access. By comparison, it's difficult to set up a relational database to protect detailed data (sales by sales rep), but provide more open access to summarized data (sales by region). This is especially true for ad hoc access on the relational side. Security is significantly more powerful on OLAP because of the semantics about parents and children inherent in the access languages.
OLAP Disadvantages
- The big objection to OLAP is the proprietary, non-standard character of each vendor's OLAP offering.
- Don't expect to port an OLAP implementation to another vendor's product; plan to discard everything, including all your application development.
- There is no accepted, universally implemented access language for OLAP, although Microsoft's MDX is the closest thing to a standard access language. Significantly, Oracle has not embraced MDX, preferring to rely on SQL for all forms of database access.
- MDX in its full glory may be too complex for IT personnel to write by hand, or understand a complex application. But to be fair, I have looked at incomprehensible SQL many times!
- Historically there has been much less industry expertise in MDX than in SQL although Microsoft has successfully sponsored a cottage industry of training organizations.
- OLAP application development expertise is fragmented by vendor.
- OLAP cubes can be catastrophically invalidated if you are not careful, i.e. a Type 1 SCD change to a dimension may cause all the cubes using that dimension to rebuild!
- OLAP cubes are not considered stable enough for serious archiving and backup: this is a strong reason for creating a set of dimensional relational tables duplicating the content of the cube for those purposes.
- OLAP vendors have certain size limits not present in relational implementations including possibly the number of members in a dimension, the number of distinct values at various levels in a hierarchy and the overall size of the cube.
- When you must rebuild a cube, it may be a VERY time-consuming process.
Equally Easy in Either Approach
The basic foundations of dimensional modeling, including slowly changing dimensions, the three fundamental grains of fact tables, and achieving integration through conformed dimensions, are equally easy to deliver with either approach, especially since these dimensional structures are carried to the very last step of the ETL pipeline.
Making the Final Choice
So how do senior management and the enterprise BI system designers resolve the final deployment conundrum: Dimensional relational or OLAP? As I hope you appreciate, there is no slam dunk answer since there are significant advantages and disadvantages for both approaches. But let's consider two extremes. If you are a large distributed enterprise with a number of different database vendors and you are struggling to establish more commonality across your BI deployments, as well as creating an enterprise wide pool of BI development expertise without being beholden to any single vendor, then I recommend dimensional relational. On the other hand, if you are looking for the most potent single-subject solution with high performance and killer analytics, and you are confident that you can source the development expertise you need, then I recommend OLAP supported by underlying dimensional relational schemas. Otherwise, it depends.
Ralph Kimball, founder of the Kimball Group, teaches dimensional data warehouse and ETL design through Kimball University and reviews large warehouses. He has four best-selling data warehousing books in print, including The Data Warehouse ETL Toolkit. Write to him at ralph@kimballgroup.com.
The Problem with Dimensional Modeling
Information Management: Charting the Course
- Bill Inmon
- DM Review Magazine, May 2000
The problem with dimensional modeling is... well there really isn't a problem with dimensional modeling. In terms of capturing requirements for DSS analysis, dimensional modeling is undoubtedly the best technique for design that there is. Hundreds of organizations have used dimensional modeling successfully and are quite satisfied with the results. There are classes and conferences where dimensional modeling is the focus and the attendees walk away with a happy smile on their face.
So whatever could be the problem with dimensional modeling? In fact, is there really a problem here?
In order to understand what the issues are, consider the results of a dimensional modeling exercise. Typically the database design that results from a dimensional model is called a star join (or sometimes a snowflake structure). A star join is made up of one or more fact tables coupled with any number of related dimensions. A star join is built by gathering and assimilating user requirements. A star join represents the optimal denormalization of data that best suits the requirements of the users.
The starting point for the design of the star join and dimensional modeling is the process of requirements gathering. Users are interviewed and requirements are synthesized from the users' statements. Once the requirements are gathered, an assimilation of requirements from all users interviewed takes place, leading to the determination of the shape and content of the star join. The database designer knows how to build the star join because of the requirements analysis that precedes the design activities.
The result of the star join when built properly is that the shape and content of the star join are optimal for the requirements of the users that have participated in the requirements-gathering process. Star joins are great for representing the views of people who are of a like mind, but different groups of people want their own star join.
While the star join fits the needs of the users who have been interviewed, not surprisingly the star join does not fit the needs of all users. For a variety of reasons, there will be users who do not have input into the dimensional modeling process. Indeed, the problem with a star join design is that it optimizes the access of data for one group of people at the expense of everyone else. A star join with one shape and set of contents is optimal for one group of people and another star join with another shape and contents is optimal for another group of people. Because the star join is shaped around user requirements and because user requirements vary from one type of user to the next, not surprisingly different star joins are optimal for different types of users.
The advocates of dimensional modeling would like to believe that a single star join can be created for the optimization of everyone's requirements. Such is not the case at all. The world has long ago discovered that the single database for all purposes was pie in the sky. The single database to serve all purposes and all requirements simply has never been a reality.
There are a host of very good reasons why different organizations need their own unique star joins and cannot share a star join with another organization. Some of these reasons are:
- Sequencing of data. Finance likes to see data sequenced one way; marketing likes to see data sequenced another way.
- Data definitions. Sales defines a sale as closed business; accounting defines a sale as booked business.
- Granularity. Finance looks at things in terms of monthly revenue; accounting looks at things in terms of quarterly revenue.
- Geography. Sales looks at things in terms of sales territories by ZIP code; marketing looks at things in terms of states.
- Products. Engineering looks at things in terms of existing products; sales looks at things in terms of future products.
- Data relationships. Finance looks at the relationship between a customer and a sale; accounting looks at things through the eyes of an account/sales relationship.
- Time. Accounting looks at the world through scheduled closing dates; finance looks at the world through calendar dates.
- Technologies. One star join (from finance) is in one technology and another star join (from marketing) is in another technology.
- Sources of data. One source system feeds one star join while another source system feeds another star join.
- Management. Management looks at things through the organization chart; investors look at things through the marketplace.
The differences in the way that parts of the business operate are much more vast and much more profound than the obvious examples presented here.
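To make the granularity and definition differences above concrete, here is a minimal SQL sketch of two star joins that answer the same revenue question at different grains. All table and column names are invented for illustration; nothing here comes from an actual design.

```sql
-- Finance wants revenue by month (illustrative names only)
CREATE TABLE finance_revenue_fact (
    month_key    INTEGER,
    account_key  INTEGER,
    revenue      DECIMAL(14,2),
    PRIMARY KEY (month_key, account_key)
);

-- Accounting wants booked business by quarter: the same enterprise, a differently shaped star
CREATE TABLE accounting_revenue_fact (
    quarter_key    INTEGER,
    account_key    INTEGER,
    booked_revenue DECIMAL(14,2),
    PRIMARY KEY (quarter_key, account_key)
);
```

Neither shape can be derived from the other without loss, which is exactly why each group ends up asking for its own star join.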
The way that the different departments of the organization conduct their business is like the six blind men describing an elephant. Each blind man is touching a different part of the elephant and describing it, and to a casual observer there appears to be no cohesive pattern in the six very different descriptions. Having different organizations care for different aspects of the business is simply the way business runs, and there are good reasons for different departments looking at the data and the business in very different lights. This need to look at the same business in different ways results in the need for very different star joins for different departments. The star join that is optimal for one department is hardly optimal (or even recognizable) for another department. The result is that, because of the way business is run, different departments require different star joins.
When each department has its own star join which has been designed specifically for it, the world is at peace. But herein lies a host of problems.
The problems that arise with star joins and dimensional modeling do not become apparent until multiple star joins appear on the horizon. (The problems of multiple star joins cohabiting in the same environment are well documented and will be only lightly touched upon here, in the interest of not repeating what has already been discussed at length in the industry.)
When there are multiple independent star join environments, the same detailed data appears in each star join. There is no reconcilability of data, and new star joins require the same amount of work for creation as older star joins. As a result:
- Each star join contains much of the same detailed data as every other star join. Star joins grow unnecessarily large because each star join thinks that it has to go back and gather the same detailed data that all the other star joins have already gathered.
- The results obtained from each star join are inconsistent with the results obtained from every other star join. Furthermore, the ability to reconcile the differences is almost nil.
- New star joins require as much work to create as the original star joins. There is no foundation to build on when each star join is built independently.
- The interface to the supporting applications that feed the star joins becomes unmanageable.
In short, simply doing dimensional modeling as a basis for data warehouse design leads down a dark path when multiple star joins are considered. It is never apparent that there is a problem with star joins when you are looking at just one star join. But when you look at multiple star joins, the limitations of dimensional modeling become apparent.
Does this mean that dimensional modeling is invalid as a database design technique for data warehousing? The answer is: not at all. Dimensional modeling and star joins fit the bill very nicely for data marts. In fact, if I had to design a data mart tomorrow, I would not consider using any other approach.
But when it comes to the foundation data, that's another story. Figure 1 shows that the foundation data of the data warehouse requires an altogether different treatment than dimensional modeling. The data warehouse, which is the proper foundation for all DSS activity, including star joins, requires very granular, very flexible data. The ideal structure for the data warehouse is normalized data. The normalized data can be bent and shaped any old way.
Trying to superimpose a star join on the data warehouse implies that the data warehouse is going to be optimal for the support of one set of users at the expense of everyone else.
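A small SQL sketch of this idea, with invented table names: granular, normalized foundation tables can be reshaped into whatever star a particular department needs, without committing the warehouse itself to any one shape.

```sql
-- Normalized foundation tables (illustrative)
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer,
    sale_date   DATE,
    amount      DECIMAL(12,2)
);

-- One department's star join, derived from the granular data;
-- another department could derive a differently shaped star from the same foundation.
CREATE TABLE finance_monthly_sales_fact AS
SELECT EXTRACT(YEAR FROM s.sale_date)  AS sale_year,
       EXTRACT(MONTH FROM s.sale_date) AS sale_month,
       s.customer_id,
       SUM(s.amount)                   AS revenue
FROM sale s
GROUP BY EXTRACT(YEAR FROM s.sale_date),
         EXTRACT(MONTH FROM s.sale_date),
         s.customer_id;
```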
Not only does dimensional modeling not fit the data warehouse, dimensional modeling does not fit in many other places in the data warehouse/corporate information factory environment, such as:
- exploration warehouses,
- near-line storage,
- project warehouses and
- parts of the operational data store (ODS).
In short, dimensional modeling really only fits the data mart environment, where requirements for processing are known before the infrastructure is built.
The advocates of dimensional modeling are quite proud of their pioneering efforts and well they should be. They have rightfully advanced the industry in a proper direction. But in their zeal to convert people to their way of thinking, they seem to have forgotten that there are other valid approaches to solving the database design problem for the corporate information factory. The dimensional modeling approach fits in a few places in the DSS environment, but it does not fit everywhere. The dimensional modelers have crafted a fine hammer. Now everything looks like a nail to them.
Bill Inmon is universally recognized as the father of the data warehouse. He has more than 35 years of database technology management experience and data warehouse design expertise. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for many major computing associations. For more information, visit http://www.inmongif.com/ and http://www.inmoncif.com/. Inmon may be reached at (303) 681-6772.
A Dimensional Modeling Manifesto
Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is different from, and contrasts with, entity-relation modeling (ER). This article points out the many differences between the two techniques and draws a line in the sand. DM is the only viable technique for databases that are designed to support end-user queries in a data warehouse. ER is very useful for the transaction capture and the data administration phases of constructing a data warehouse, but it should be avoided for end-user delivery.
What is ER?
ER is a logical design technique that seeks to remove the redundancy in data. Imagine that we have a business that takes orders and sells products to customers. In the early days of computing (long before relational databases) when we first transferred this data to a computer, we probably captured the original paper order as a single fat record with many fields. Such a record could easily have been 1,000 bytes distributed across 50 fields. The line items of the order were probably represented as a repeating group of fields embedded in the master record. Having this data on the computer was very useful, but we quickly learned some basic lessons about storing and manipulating data. One of the lessons we learned was that data in this form was difficult to keep consistent because each record stood on its own. The customer's name and address appeared many times, because this data was repeated whenever a new order was taken. Inconsistencies in the data were rampant, because all of the instances of the customer address were independent, and updating the customer's address was a messy transaction.
Even in the early days, we learned to separate out the redundant data into distinct tables, such as a customer master and a product master -- but we paid a price. Our software systems for retrieving and manipulating the data became complex and inefficient because they required careful attention to the processing algorithms for linking these sets of tables together. We needed a database system that was very good at linking tables. This paved the way for the relational database revolution, where the database was devoted to just this task.
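A minimal sketch of that separation in SQL (table and column names are invented for illustration): the fat order record becomes master and detail tables, so the customer's address lives in exactly one place.

```sql
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    street      VARCHAR(100),
    city        VARCHAR(50)
);

CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    description VARCHAR(100)
);

CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer,  -- the address is stored once, not per order
    order_date  DATE
);

CREATE TABLE order_line (
    order_id    INTEGER REFERENCES sales_order,
    line_number INTEGER,
    product_id  INTEGER REFERENCES product,
    quantity    INTEGER,
    PRIMARY KEY (order_id, line_number)
);
```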
The relational database revolution bloomed in the mid 1980s. Most of us learned what a relational database was by reading Chris Date's seminal book on the subject, An Introduction to Database Systems (Addison-Wesley), first published in the mid-1970s. As we paged through Chris's book, we worked through all of his Parts, Suppliers, and Cities database examples. It didn't occur to most of us to ask whether the data was completely "normalized" or whether any of the tables could be "snowflaked," and Chris didn't develop these topics. In my opinion, Chris was trying to explain the more fundamental concepts of how to think about tables that were relationally joined. ER modeling and normalization were developed in later years as the industry shifted its attention to transaction processing.
The ER modeling technique is a discipline used to illuminate the microscopic relationships among data elements. The highest art form of ER modeling is to remove all redundancy in the data. This is immensely beneficial to transaction processing because transactions are made very simple and deterministic. The transaction of updating a customer's address may devolve to a single record lookup in a customer address master table. This lookup is controlled by a customer address key, which defines uniqueness of the customer address record and allows an indexed lookup that is extremely fast. It is safe to say that the success of transaction processing in relational databases is mostly due to the discipline of ER modeling.
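Against the hypothetical normalized schema sketched above, such an address change really does devolve to a single keyed row (the key value is invented):

```sql
-- one indexed lookup, one row touched
UPDATE customer
SET street = '12 New Street',
    city   = 'Springfield'
WHERE customer_id = 4711;
```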
However, in our zeal to make transaction processing efficient, we have lost sight of our original, most important goal. We have created databases that cannot be queried! Even our simple order-taking example creates a database of dozens of tables that are linked together by a bewildering spider web of joins. (See Figure 1.) All of us are familiar with the big chart on the wall of the IS database designer's cubicle. The ER model for the enterprise has hundreds of logical entities! High-end systems such as SAP have thousands of entities. Each of these entities usually turns into a physical table when the database is implemented. This situation is not just an annoyance; it is a showstopper:
- End users cannot understand or remember an ER model. End users cannot navigate an ER model. There is no graphical user interface (GUI) that takes a general ER model and makes it usable by end users.
- Software cannot usefully query a general ER model. Cost-based optimizers that attempt to do this are notorious for making the wrong choices, with disastrous consequences for performance.
- Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data.
Ever since the beginning of the relational database revolution, IS shops have noticed this problem. Many of them that have tried to deliver data to end users have recognized the impossibility of presenting these immensely complex schemas to end users, and many of these IS shops have stepped back to attempt "simpler designs." I find it striking that these "simpler" designs all look very similar! Almost all of these simpler designs can be thought of as "dimensional." In a natural, almost unconscious way, hundreds of IS designers have returned to the roots of the original relational model because they know the database cannot be used unless it is packaged simply. It is probably accurate to say that this natural dimensional approach was not invented by any single person. It is an irresistible force in the design of databases that will always appear when the designer places understandability and performance as the highest goals. We are now ready to define the DM approach.
What is DM?
DM is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional, and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. (See Figure 2.) This characteristic "star-like" structure is often called a star join. The term star join dates back to the earliest days of relational databases.
A fact table, because it has a multipart primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or "facts," that occur for the combination of keys that define each record. In Figure 2, the facts are Dollars Sold, Units Sold, and Dollars Cost. The most useful facts in a fact table are numeric and additive. Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.
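Because Figure 2 is not reproduced here, the following SQL sketch is only an approximation of the kind of retail star it describes: the Dollars Sold, Units Sold, and Dollars Cost facts and the Flavor and AdType attributes come from the text, while the dimension names and keys are assumptions.

```sql
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INTEGER
);

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    flavor       VARCHAR(30),   -- e.g. 'Lemon'
    category     VARCHAR(50)
);

CREATE TABLE store_dim (
    store_key  INTEGER PRIMARY KEY,
    store_name VARCHAR(100),
    state      VARCHAR(2)
);

CREATE TABLE promotion_dim (
    promotion_key  INTEGER PRIMARY KEY,
    promotion_name VARCHAR(100),
    ad_type        VARCHAR(30)  -- e.g. 'Radio'
);

-- the fact table's primary key is the combination of its foreign keys
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim,
    product_key   INTEGER REFERENCES product_dim,
    store_key     INTEGER REFERENCES store_dim,
    promotion_key INTEGER REFERENCES promotion_dim,
    dollars_sold  DECIMAL(12,2),
    units_sold    INTEGER,
    dollars_cost  DECIMAL(12,2),
    PRIMARY KEY (date_key, product_key, store_key, promotion_key)
);
```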
Dimension tables, by contrast, most often contain descriptive textual information. Dimension attributes are used as the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set. In Figure 2, we constrain on the Lemon flavored products via the Flavor attribute in the Product table, and on Radio promotions via the AdType attribute in the Promotion table. It should be obvious that the power of the database in Figure 2 is proportional to the quality and depth of the dimension tables.
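A query against the hypothetical schema above shows the pattern: the constraints and the row headers both come from dimension attributes, and the facts are simply added up.

```sql
SELECT p.flavor,
       pr.ad_type,
       SUM(f.dollars_sold) AS dollars_sold,
       SUM(f.units_sold)   AS units_sold
FROM sales_fact f
JOIN product_dim   p  ON f.product_key   = p.product_key
JOIN promotion_dim pr ON f.promotion_key = pr.promotion_key
WHERE p.flavor   = 'Lemon'   -- constraint from the Product dimension
  AND pr.ad_type = 'Radio'   -- constraint from the Promotion dimension
GROUP BY p.flavor, pr.ad_type;
```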
The charm of the database design in Figure 2 is that it is highly recognizable to the end users in the particular business. I have observed literally hundreds of instances where end users agree immediately that this is "their business."
DM vs. ER
Obviously, Figure 1 and Figure 2 look quite different. Many designers react to this by saying, "There must be less information in the star join," or "The star join is only used for high-level summaries." Both of these statements are false.
The key to understanding the relationship between DM and ER is that a single ER diagram breaks down into multiple DM diagrams. Think of a large ER diagram as representing every possible business process in the enterprise. The master ER diagram may have Sales Calls, Order Entries, Shipment Invoices, Customer Payments, and Product Returns, all on the same diagram. In a way, the ER diagram does itself a disservice by representing on one diagram multiple processes that never coexist in a single data set at a single consistent point in time. It's no wonder the ER diagram is overly complex. Thus the first step in converting an ER diagram to a set of DM diagrams is to separate the ER diagram into its discrete business processes and to model each one separately.
The second step is to select those many-to-many relationships in the ER model containing numeric and additive nonkey facts and to designate them as fact tables. The third step is to denormalize all of the remaining tables into flat tables with single-part keys that connect directly to the fact tables. These tables become the dimension tables. In cases where a dimension table connects to more than one fact table, we represent this same dimension table in both schemas, and we refer to the dimension tables as "conformed" between the two dimensional models.
The resulting master DM model of a data warehouse for a large enterprise will consist of somewhere between 10 and 25 very similar-looking star join schemas. Each star join will have four to 12 dimension tables. If the design has been done correctly, many of these dimension tables will be shared from fact table to fact table. Applications that drill down will simply be adding more dimension attributes to the SQL answer set from within a single star join. Applications that drill across will simply be linking separate fact tables together through the conformed (shared) dimensions. Even though the overall suite of star join schemas in the enterprise dimensional model is complex, the query processing is very predictable because at the lowest level, I recommend that each fact table should be queried independently.
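A sketch of drilling across, reusing the hypothetical tables above plus an invented shipments_fact that shares the conformed product_dim: each fact table is queried on its own, and the two answer sets are then merged on the conformed attribute.

```sql
WITH sales AS (
    SELECT p.category, SUM(f.dollars_sold) AS dollars_sold
    FROM sales_fact f
    JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY p.category
),
shipments AS (
    SELECT p.category, SUM(s.units_shipped) AS units_shipped
    FROM shipments_fact s                    -- invented second fact table
    JOIN product_dim p ON s.product_key = p.product_key
    GROUP BY p.category
)
SELECT sales.category, sales.dollars_sold, shipments.units_shipped
FROM sales
JOIN shipments ON sales.category = shipments.category;
```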
The Strengths of DM
The dimensional model has a number of important data warehouse advantages that the ER model lacks. First, the dimensional model is a predictable, standard framework. Report writers, query tools, and user interfaces can all make strong assumptions about the dimensional model to make the user interfaces more understandable and to make processing more efficient. For instance, because nearly all of the constraints set up by the end user come from the dimension tables, an end-user tool can provide high-performance "browsing" across the attributes within a dimension via the use of bit vector indexes. Metadata can use the known cardinality of values in a dimension to guide the user-interface behavior. The predictable framework offers immense advantages in processing. Rather than using a cost-based optimizer, a database engine can make very strong assumptions about first constraining the dimension tables and then "attacking" the fact table all at once with the Cartesian product of those dimension table keys satisfying the user's constraints. Amazingly, by using this approach it is possible to evaluate arbitrary n-way joins to a fact table in a single pass through the fact table's index. We are so used to thinking of n-way joins as "hard" that a whole generation of DBAs doesn't realize that the n-way join problem is formally equivalent to a single sort-merge. Really.
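One way to picture that evaluation order in SQL, again against the hypothetical schema above: resolve each dimension constraint to a set of keys first, then make a single pass over the fact table. This is only a conceptual sketch of the strategy described, not a claim about any particular database engine.

```sql
SELECT SUM(f.dollars_sold) AS dollars_sold
FROM sales_fact f
WHERE f.product_key IN
        (SELECT product_key FROM product_dim WHERE flavor = 'Lemon')
  AND f.promotion_key IN
        (SELECT promotion_key FROM promotion_dim WHERE ad_type = 'Radio');
-- conceptually: the dimension constraints produce small key lists, and their combinations
-- are applied to the fact table's composite index in one pass
```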
A second strength of the dimensional model is that the predictable framework of the star join schema withstands unexpected changes in user behavior. Every dimension is equivalent. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independent of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical.
A third strength of the dimensional model is that it is gracefully extensible to accommodate unexpected new data elements and new design decisions. When we say gracefully extensible, we mean several things. First, all existing tables (both fact and dimension) can be changed in place by simply adding new data rows in the table, or the table can be changed in place with a SQL alter table command. Data should not have to be reloaded. Graceful extensibility also means that no query tool or reporting tool needs to be reprogrammed to accommodate the change. And finally, graceful extensibility means that all old applications continue to run without yielding different results. In Figure 2, I labeled the schema with the numbers 1 through 4, indicating where you can, respectively, make the following graceful changes to the design after the data warehouse is up and running (a small SQL sketch follows the list) by:
- Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table
- Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record
- Adding new, unanticipated dimensional attributes
- Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.
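A minimal sketch of changes 1 through 3 in SQL, using the hypothetical schema from earlier; the new column names and the weather dimension are invented for illustration.

```sql
-- (1) a new, unanticipated fact, consistent with the grain of the existing fact table
ALTER TABLE sales_fact ADD COLUMN dollars_discount DECIMAL(12,2);

-- (3) a new, unanticipated attribute on an existing dimension
ALTER TABLE product_dim ADD COLUMN package_type VARCHAR(30);

-- (2) a completely new dimension: add its key to the fact table with a default
--     pointing at a single catch-all row, so every existing fact record still has
--     exactly one value of the new dimension
CREATE TABLE weather_dim (
    weather_key INTEGER PRIMARY KEY,
    condition   VARCHAR(30)
);
INSERT INTO weather_dim VALUES (0, 'Not recorded');
ALTER TABLE sales_fact ADD COLUMN weather_key INTEGER DEFAULT 0 REFERENCES weather_dim;
```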
A fourth strength of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:
- Slowly changing dimensions, where a "constant" dimension such as Product or Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment. See my DBMS article of April 1996 on slowly changing dimensions; a small SQL illustration appears after this list.
- Heterogeneous products, where a business such as a bank needs to track a number of different lines of business together within a single common set of attributes and facts, but at the same time it needs to describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.
- Pay-in-advance databases, where the transactions of a business are not little pieces of revenue, but the business needs to look at the individual transactions as well as report on revenue on a regular basis. For this and the previous bullet, see my DBMS article of December 1995, the insurance company case study.
- Event-handling databases, where the fact table usually turns out to be "factless." See my DBMS article of September 1996 on factless fact tables.
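The referenced articles give the actual techniques; as a rough illustration only, here is one common response to a slowly changing Product attribute under the assumed schema above, often called a "Type 2" change: keep the old dimension row and insert a new row under a new surrogate key. The key and attribute values below are invented.

```sql
-- the product's category has changed; the old dimension row is left untouched
INSERT INTO product_dim (product_key, product_name, flavor, category)
VALUES (20417, 'Frosty Cola 12 oz', 'Cola', 'Diet Soft Drinks');  -- new surrogate key

-- fact rows loaded from this point forward carry product_key = 20417;
-- fact rows loaded earlier keep the old key, and so keep the old description
```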
A final strength of the dimensional model is the growing body of administrative utilities and software processes that manage and use aggregates. Recall that aggregates are summary records that are logically redundant with base data already in the data warehouse, but they are used to enhance query performance. A comprehensive aggregate strategy is required in every medium- and large-sized data warehouse implementation. To put it another way, if you don't have aggregates, then you are potentially wasting millions of dollars on hardware upgrades to solve performance problems that could be otherwise addressed by aggregates.
All of the aggregate management software packages and aggregate navigation utilities depend on a very specific single structure of fact and dimension tables that is absolutely dependent on the dimensional model. If you don't adhere to the dimensional approach, you cannot benefit from these tools. Please see my DBMS articles on aggregate navigation and the various products serving aggregate navigation in the September 1995 and August 1996 issues.
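To make the aggregate idea concrete, here is one possible summary table over the hypothetical retail schema sketched earlier; an aggregate navigator's job is to rewrite qualifying queries against such a table instead of the base fact table. The table name and its grain are invented for illustration.

```sql
-- category-by-month summary: logically redundant with sales_fact, kept only for speed
CREATE TABLE sales_category_month_agg AS
SELECT d.year_number,
       d.month_name,
       p.category,
       SUM(f.dollars_sold) AS dollars_sold,
       SUM(f.units_sold)   AS units_sold
FROM sales_fact f
JOIN date_dim    d ON f.date_key    = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY d.year_number, d.month_name, p.category;
```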
Myths About DM
A few myths floating around about dimensional modeling deserve to be addressed. Myth number one is "Implementing a dimensional data model will lead to stovepipe decision-support systems." This myth sometimes goes on to blame denormalization for supporting only specific applications that therefore cannot be changed. This myth is a short-sighted interpretation of dimensional modeling that has managed to get the message exactly backwards! First, we have argued that every ER model has an equivalent set of DM models that contain the same information. Second, we have shown that even in the presence of organizational change and end-user adaptation, the dimensional model extends gracefully without altering its form. It is in fact the ER model that whipsaws the application designers and the end users!
A source of this myth, in my opinion, is the designer who is struggling with fact tables that have been prematurely aggregated. For instance, the design in Figure 2 is expressed at the individual sales-ticket line-item level. This is the correct starting point for this retail database because this is the lowest possible grain of data. There just isn't any further breakdown of the sales transaction. If the designer had started with a fact table that had been aggregated up to weekly sales totals by store, then there would be all sorts of problems in adding new dimensions, new attributes, and new facts. However, this isn't a problem with the design technique, this is a problem with the database being prematurely aggregated.
Myth number two is "No one understands dimensional modeling." This myth is absurd. I have seen hundreds of excellent dimensional designs created by people I have never met or had in my classes. A whole generation of designers from the packaged-goods retail and manufacturing industries has been using and designing dimensional databases for the last 15 years. I personally learned about dimensional models from existing A.C. Nielsen and IRI applications that were installed and working in such places as Procter & Gamble and The Clorox Company as early as 1982.
Incidentally, although this article has been couched in terms of relational databases, nearly all of the arguments in favor of the power of dimensional modeling hold perfectly well for proprietary multidimensional databases such as Oracle Express and Arbor Essbase.
Myth number three is "Dimensional models only work with retail databases." This myth is rooted in the historical origins of dimensional modeling but not in its current-day reality. Dimensional modeling has been applied to many different business areas including retail banking, commercial banking, property and casualty insurance, health insurance, life insurance, brokerage customer analysis, telephone company operations, newspaper advertising, oil company fuel sales, government agency spending, and manufacturing shipments.
Myth number four is "Snowflaking is an alternative to dimensional modeling." Snowflaking is the removal of low-cardinality textual attributes from dimension tables and the placement of these attributes in "secondary" dimension tables. For instance, a product category can be treated this way and physically removed from the low-level product dimension table. I believe that this method compromises cross-attribute browsing performance and may interfere with the legibility of the database, but I know that some designers are convinced that this is a good approach. Snowflaking is certainly not at odds with dimensional modeling. I regard snowflaking as an embellishment to the cleanliness of the basic dimensional model. I think that a designer can snowflake with a clear conscience if this technique improves user understandability and improves overall performance. The argument that snowflaking helps the maintainability of the dimension table is specious. Maintenance issues are indeed leveraged by ER-like disciplines, but all of this happens in the operational data store, before the data is loaded into the dimensional schema.
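For concreteness, a snowflaked version of the hypothetical Product dimension above would look roughly like this; the trade-off described is that browsing flavor and category together now costs an extra join.

```sql
-- low-cardinality category attributes moved out to a secondary dimension table
CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

-- the product dimension then carries a key instead of the textual category column
ALTER TABLE product_dim ADD COLUMN category_key INTEGER REFERENCES category_dim;
```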
The final myth is "Dimensional modeling only works for certain kinds of single-subject data marts." This myth is an attempt to marginalize dimensional modeling by individuals who do not understand its fundamental power and applicability. Dimensional modeling is the appropriate technique for the overall design of a complete enterprise-level data warehouse. Such a dimensional design consists of families of dimensional models, where each family describes a business process. The families are linked together in an effective way by insisting on the use of conformed dimensions.
In Defense of DM
Now it's time to take off the gloves. I firmly believe that dimensional modeling is the only viable technique for designing end-user delivery databases. ER modeling defeats end-user delivery and should not be used for this purpose.
ER modeling does not really model a business; rather, it models the micro relationships among data elements. ER modeling does not have "business rules," it has "data rules." Few if any global design requirements in the ER modeling methodology speak to the completeness of the overall design. For instance, does your ER CASE tool try to tell you if all of the possible join paths are represented and how many there are? Are you even concerned with such issues in an ER design? What does ER have to say about standard business modeling situations such as slowly changing dimensions?
ER models are wildly variable in structure. Tell me in advance how to optimize the querying of hundreds of interrelated tables in a big ER model. By contrast, even a big suite of dimensional models has an overall deterministic strategy for evaluating every possible query, even those crossing many fact tables. (Hint: You control performance by querying each fact table separately. If you actually believe that you can join many fact tables together in a single query and trust a cost-based optimizer to decide on the execution plan, then you haven't implemented a data warehouse for real end users.)
The wild variability of the structure of ER models means that each data warehouse needs custom, handwritten and tuned SQL. It also means that each schema, once it is tuned, is very vulnerable to changes in the user's querying habits, because such schemas are asymmetrical. By contrast, in a dimensional model all dimensions serve as equal entry points to the fact table. Changes in users' querying habits don't change the structure of the SQL or the standard ways of measuring and controlling performance.
ER models do have their place in the data warehouse. First, the ER model should be used in all legacy OLTP applications based on relational technology. This is the best way to achieve the highest transaction performance and the highest ongoing data integrity. Second, the ER model can be used very successfully in the back-room data cleaning and combining steps of the data warehouse. This is the ODS, or operational data store.
However, before data is packaged into its final queryable format, it must be loaded into a dimensional model. The dimensional model is the only viable technique for achieving both user understandability and high query performance in the face of ever-changing user questions.
Ralph Kimball was coinventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (John Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.
Figure 1
--A high-level overview of an ER model. Each box on the diagram is actually many entities. Each business process shown is probably a separate legacy application. The equivalent DM design would isolate each business process and surround it with just its relevant dimensions.
Figure 2
--A detailed dimensional model for retail point of sales. Numbers 1 through 4 show places where the design may be gracefully extended.
Dimensional modeling resources:
- The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, Ralph Kimball, John Wiley, 1996.
- OLAP Solutions: Building Multidimensional Information Systems, Eric Thomsen, John Wiley, 1997.
- Planning and Designing the Data Warehouse, Ramon Barquin and Herb Edelstein, chapter 10, Prentice Hall PTR, 1996.
- Building a Data Warehouse for Decision Support, Vidette Poe, Prentice Hall, 1995.
DBMS and Internet Systems (http://www.dbmsmag.com/)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED