Sunday, 17 November 2013

Data Warehouse Introduction and Basic Concepts-OLTP,Facts,Dimensions,OLAP,Star Schema,Snowflake Schema

Data Warehouse Introduction and Basic Concepts
What is data warehouse 
Data warehousing is the science of storing data for the purpose of meaningful future analysis. A subject-oriented, integrated, time-variant, non volatile collection of data used in support of management decision-making processes.

Data warehouse key features :
Subject Oriented
This means a data warehouse has a defined scope and focusing on particular subject matter and it only stores data under that scope. So for example, if the sales team of your company is creating a data warehouse - the data warehouse by definition is required to contain data related to sales (and not the data related to production management for example).
This means that data once stored in the data warehouse are not removed or deleted from it and always stay there no matter what.
This means that the data stored in a data warehouse make sense. Fact and figures are related to each other and they are integrable and projects a single point of truth.
Time variant 
This means that data is not constant, as new and new data gets loaded in the warehouse, data warehouse also grows in size.

Key Note
Note here, Non-volatile means that the data once loaded in the warehouse will not get deleted later. Time-variant means the data will change with respect to time.

Data warehouse Architecture :

What are the benefits of data warehouse :
A data warehouse helps to integrate data and store them historically so that we can analyze different aspects of business including, performance analysis, trend, prediction etc. over a given time frame and use the result of our analysis to improve the efficiency of business processes.
Why Data Warehouse is used :
For a long time in the past and also even today, Data warehouses are built to facilitate reporting on different key business processes of an organization, known as KPI. Data warehouses also help to integrate data from different sources and show a single-point-of-truth values about the business measures.
Data warehouse can be further used for data mining which helps trend prediction, forecasts, pattern recognition etc. 
What is the difference between OLTP and OLAP :
OLTP is the transaction system that collects business data. Whereas OLAP is the reporting and analysis system on that data. OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.
Key Note:
In a departmental shop, when we pay the prices at the check-out counter, the sales person at the counter keys-in all the data into a "Point-Of-Sales" machine. That data is transaction data and the related system is a OLTP system. 
On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place purchase order for them. Such report will come out from OLAP system
What is data mart :
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in data warehouse and each department may have separate data marts. These data marts can be built on top of the data warehouse.
What is ER model:
ER model or entity-relationship model is a particular methodology of data modeling wherein the goal of modeling is to normalize the data by reducing redundancy. This is different than dimensional modeling where the main goal is to improve the data retrieval mechanism.
What is dimensional modeling :
Dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from dimension tables that qualify the data. The goal of Dimensional model is not to achieve high degree of normalization but to facilitate easy and faster data retrieval.
Ralph Kimball is one of the strongest proponents of this very popular data modeling technique which is often used in many enterprise level data warehouses.
What is dimension :
A dimension is something that qualifies a quantity (measure).
For an example, consider this: If I just say… “20kg”, it does not mean anything. But if I say, "20kg of Rice (Product) is sold to Tamatam(customer) on 5th April (date)", then that gives a meaningful sense. These product, customer and dates are some dimension that qualified the measure - 20kg.
Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.
What is Fact :
A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical values that can be aggregated.
What are additive, semi-additive and non-additive measures:
Non-additive Measures
Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of non-additive fact is any kind of ratio or percentage. Example, 5% profit margin, revenue to asset ratio etc. A non-numerical data can also be a non-additive measure when that data is stored in fact tables, e.g. some kind of varchar flags in the fact table.
Semi Additive Measures
Semi-additive measures are those where only a subset of aggregation function can be applied. Let’s say account balance. A sum() function on balance does not give a useful result but max() or min() balance might be useful. Consider price rate or currency rate. Sum is meaningless on rate; however, average function might be useful.
Additive Measures
Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is Sales Quantity etc.
Schemas :
A schema is a collection of database objects, including tables, views, indexes, and synonyms. 

There is a variety of ways of arranging schema objects in the schema models designed for data warehousing. The most common data-warehouse schema model is a star schema.

However, a significant but smaller number of data warehouses use third-normal-form (3NF) schemas, or other schemas which are more highly normalized than star schemas. These 3NF data warehouses are typically very large data warehouses, which are used primarily for loading data and for feeding data marts. These data warehouses are not typically used for heavy end-user query workloads. 
Star Schema
The star schema is the simplest data warehouse schema. It is called a star schema because the diagram of a star schema resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables. 

This schema is used in data warehouse models where one centralized fact table references number of dimension tables so as the keys (primary key) from all the dimension tables flow into the fact table (as foreign key) where measures are stored. This entity-relationship diagram looks like a star, hence the name. 
Consider a fact table that stores sales quantity for each product and customer on a certain time. Sales quantity will be the measure here and keys from customer, product and time dimension tables will flow into the fact table. 
A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of lookup tables. Each lookup table is joined to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other.
Cost-based optimization recognizes star queries and generates efficient execution plans for them. (Star queries are not recognized by rule-based optimization.)
A typical fact table contains keys and measures.
For example, a simple fact table might contain the measure Sales, and keys Time, Product, and Market. In this case, there would be corresponding dimension tables for Time, Product, and Market. The Product dimension table, for example, would typically contain information about each product number that appears in the fact table.
A measure is typically a numeric or character column, and can be taken from one column in one table or derived from two columns in one table or two columns in more than one table.
The main advantages of star schemas :
Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
Provides highly optimized performance for typical data warehouse queries. 
Architecture :

Example :

Snowflake Schema
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake.

May have more than 1 dimension table for each dimension due to the further normalization of each dimension table.  Dimension tables are in the third normal form (3NF).
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.

For example, a product dimension table in a star schema might be normalized into a Product table but in a snowflake schema it is in normalized into  Product_Category  table , Product_Manufacturer table. While this saves space, it increases the number of dimension tables and requires more foreign key joins.
The result is more complex queries and reduced query performance

Architecture :


Star schema vs. Snowflake schema

Star SchemaSnowflake Schema
UnderstandabilityEasier for business users and analysts to query data.May be more difficult for business users and analysts due to number of tables they have to deal with.
Dimension tableOnly have one dimension table for each dimension that groups related attributes. Dimension tables are not in the third normal form.May have more than 1 dimension table for each dimension due to the further normalization of each dimension table.  Dimension tables are in the third normal form (3NF).
Query complexityThe query is very simple and easy to understandMore complex query due to multiple foreign key joins between dimension tables
Query performanceHigh performance. Database engine can optimize and boost the query performance based on predictable framework.More foreign key joins therefore longer execution time of query in compare with star schema
When to useWhen dimension tables store relative small number of rows, space is not a big issue we can use star schema.When dimension tables store large number of rows with redundancy data and space is such an issue, we can choose snowflake schema to save space.
Foreign Key JoinsFewer JoinsHigher number of joins
Data warehouse systemWork best in any data warehouse / data martBetter for small data warehouse/ data mart

No comments:

Post a Comment

Hi User, Thank You for Visiting My Blog. Please Post Your Feedback/Comments/Query.

Subscribe to Blog Posts by Email

ExcelKingdom-Popular Posts

ExcelKingdom-Random Posts