|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
It is 09:49 PST on Friday 01/31/2025 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
April 5, 2003 Drilling across means asking for the same row headers from another fact table The three fundamental themes that make up the soul of every data warehouse are drilling down, drilling across, and handling time. In Part One of "The Soul of the Data Warehouse," I showed that drilling down was nothing more than adding a row header, any row header, to an existing query. Although we often grouse about SQL's limitations as a report writer, when it comes to drilling down, SQL gracefully expands an existing query whenever a new row header is added. This simple result led to the powerful realization that when data is organized in a symmetrical, predictable fashion starting at the most atomic level, all queries and applications benefit. If drilling down is the most fundamental maneuver in a data warehouse, drilling across is a close second. From the perspective of an answer set, drilling across adds more data to an existing row. Note that this result isn't what you get from a UNION of rows from separate queries. It's better described as the column accretion from separate queries. Drilling across by adding another measured fact to the SELECT list from the existing fact table mentioned in the query is a trivial accomplishment. What's more interesting and important is adding another measured fact from a new fact table. The issues raised by this simple view of drilling across are at the heart of data warehouse architecture. These issues boil down to an observation and a choice. Drill-across observation: The new fact table called for in the drill-across operation must share certain dimensions with the fact table in the original query. Certain dimensions will be named in the original query because they contribute row headers. Remember that these row headers are the basis of the grouping that creates the answer-set row. These dimensions will appear in the FROM clause of the SQL code and will be joined to the fact table through the relationship of a foreign key to primary key. The new fact table must also support exactly these same row headers, or the context of the answer-set row is meaningless. Drill-across choice: Either send a single, simultaneous SQL request to the two fact tables or send two separate requests. Although sending a single SQL request to the two fact tables seems cleaner, this choice can become a showstopper. Sending a single request means mentioning both fact tables in the FROM clause of the SQL code and joining both fact tables in some way to the common dimension tables I just discussed. This commingling of two fact tables in the same SQL statement causes these problems:
Implementing Drill-Across: If you've followed the logic of the observation and the choice, the architecture to support drill-across begins to emerge. I use some modern data warehouse words to describe the two key aspects of this architecture:
The simplest definition of conformed dimensions is that two instances of a conformed dimension are identical. So if two fact tables have a "Customer" dimension, then "Customer" is conformed if the two dimensions are exactly the same. But this definition is unnecessarily restrictive. Here's the precise definition of conformed dimensions: Two dimensions are conformed if the fields that you use as common row headers have the same domain. When you bring two separate queries together in a drill-across operation, both queries must have the same number of row headers, arranged from left to right in the same order. All the rest of the columns (the computed facts) in the two queries, by definition, are not row headers. In other words, an independent examination of both queries shows that neither query has rows that duplicate the same row headers. To put it another way, the row headers form a unique key for each row of the answer set. To sort/merge the two queries, you must sort them the same way. At this point, it becomes possible to merge the rows of the two queries together in a single pass. The resulting merged answer set has a single set of row headers plus the combined set of computed facts returned from both queries. Because traditional sort/merge is the same as an outer join, it is possible for a row in the final merged answer set to have nulls for either the first set of computed facts or the second set, but not both! Once you've visualized the sort/merge step in drilling across, you really understand conformed dimensions. With conformed dimensions, the only thing you care about is matching row headers. If the contents of the respective fields you're using for the sort/merge are drawn from the same domains, then the match makes sense. If you try to match row headers from two dimensions that aren't conformed, you're guaranteed to get garbage. The sort/merge will fail, and the SQL engine will post the results from the two queries on separate lines Ñ and probably in separate sorting locations in the merged answer set. Amazing Magic: In my classes, I sometimes describe conformed dimensions as either dimensions that are exactly equal (the trivial case) or dimensions where "one is a subset of the other." For example, a brand dimension may be a subset of a more detailed product dimension. In this case, you can drill across two fact tables, one at the brand level with a brand dimension (such as a forecast), and the other at a detailed product level with a product dimension (such as sales transactions). Assume that the product dimension is a nice flat table containing the low cardinality brand attribute. If the row header of the two queries is simply "brand," then some amazing magic takes place. The engine automatically aggregates both data sets to the brand level, which is exactly the right level for the drill-across. If the names of the brands are drawn from the same domain, you can complete the drill-across (of forecast vs. actual) by confidently merging the rows with the same brand names. Many commercial query and report-writing tools perform this drill-across operation. You can see that it's possible to conform two dimensions even when they have some incompatible fields. You just need to be careful to avoid using these incompatible fields as row headers in drill-across queries. Not avoiding it lets a dimension contain some private fields that are meaningful only to a local user group. Centralizing vs. Conforming One final thought: In a recent guest column, my colleague Margy Ross pointed out that the decision to physically centralize a data warehouse has very little to do with conformed dimensions. (See Resources below.) Her point was that if you combine two data sets in a drill-across operation, you have to label them the same way, which is true whether the data sets are tightly administered on a single piece of hardware by one DBA or are loosely administered by remote IT organizations that merely agree to create a set of overlapping labels. Ralph Kimball (founder of the Ralph Kimball Group) co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Warehouse Toolkit, Second Edition (Wiley, 2002). He teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach him through his Web site, www.ralphkimball.com. RESOURCES: Related Article at IntelligentEnterprise.com: "To Be or Not To Be Centralized," by Margy Ross, Feb. 1, 2003 "The Soul of the Data Warehouse, Part One: Drilling Down," March 20, 2003
This Page has been accessed " 4332" times. |