Thursday, January 24, 2013

Data Warehouse design


Line of business:
LOB is a general term which often refers to a set of one or more highly related products which service a particular customer transaction or business need.

Data Warehouse:
Analyzing data from databases that support line-of-business (LOB) applications is usually not an easy task. The normalized
relational schema used for an LOB application can consist of thousands of tables. Naming conventions are frequently not
enforced. Therefore, it is hard to discover where the data you need for a report is stored.

Common solution to these problems is to create a data warehouse (DW). A DW is a centralized data silo for an enterprise that contains merged, cleansed, and historical data. 
Queries often involve reading huge amounts of data and require large scans. To support such queries, it is imperative to use an appropriate physical design for a DW.

Common problem designing DWH:

  1. Normalization is a process in which you define entities in such a way that a single table represents exactly one entity. The goal is to have a complete and non-redundant schema. In a database that supports an LOB application for an enterprise, you might finish with thousands of tables!  
  2. Finding the appropriate tables and columns you need for a report can be painful in a normalized database simply because of the number of tables involved.
  3. In addition, a query that joins 10 tables, as would be required in reporting sales by    countries and years, would not be very fast.
Resolution:
The Star and Snowflake schemas are both simplified and narrative. A data warehouse should use Star and/or Snowflake designs
Start Schema:
It resembles a star. There is a single central table, called a fact table, surrounded by multiple tables called dimensions.
The fact table is connected to all the dimensions with foreign keys. Usually, all foreign keys taken together uniquely identify each row in the fact table, and thus collectively form a unique
key, so you can use all the foreign keys as a composite primary key of the fact table. You can also add a simpler key. The fact table is on the “many” side of its relationships with the dimensions.
If you were to form a proposition from a row in a fact table, you might express it with a sentence such as, “Customer A purchased product B on date C in quantity D for amount E.”
This proposition is a fact; this is how the fact table got its name.
Dimensions with connections to multiple fact tables are called shared or conformed dimensions.

Snowflake Schema:
Star schema with normalized dimensions is called a Snowflake schema. Queries on a Star schema are simpler and faster than queries on a Snowflake schema, because they involve fewer joins.
You should use a Snowflake schema only for quick POC projects.

Auditing and Lineage:
For every update, you should audit who made the update, when it was made, and how many rows were transferred to each dimension and fact table in your DW. you might want to know where
each row in a dimension and/or fact table came from and when it was added. In such cases, you must add appropriate columns to the dimension and fact tables. Such detailed auditing information
is also called lineage in DW terminology.

Download SQL Queries from link to build DWH.

2 comments:

  1. Replies
    1. This comment has been removed by a blog administrator.

      Delete