Tuesday, 13 August 2013

Understanding Drill down analysis in Star Schema

Problem Statement:  Assuming the users of the Auto Sales data warehouse is the Sales departments.
The Executives might want to do the analysis as follows:
1.       How many Sedan class cars were sold in the Northern Zone for the year 2006?
2.       Quarter wise Analysis of the above query.
3.       Quarter and State (in the Zone) wise analysis of the above query.
4.       Quantity of cars by make within the Sedan Class (e.g.: Swift, Maruti Dx etc.) by State (in the Zone) and by Quarter for the year 2006.

We see that in every query, the level of granularity increases.


Characteristics of Dimensional tables

1.       Primary Key of the table uniquely identifies each row.
2.       Large number of attributes.
3.       Attributes are generally textual in nature.
4.       Attributes in the dimensional table are not directly related to other attributes of the table.
5.       Dimensional table is not Normalized or Low level of normalization (for efficient query performance, query should pick the attribute and directly go to fact table)
6.       Compared to the fact table, it has fewer records.
7.       Facilitates drilling down and rolling up of data since the data is hierarchal. E.g.: Year, Month, Day etc.

 Characteristics of a Fact Table

1.       Concatenated Key:  A row in the fact table relates to a combination of rows from all dimensional tables.
2.       Data Grain: Level of detail for the measurement or metrics.
3.       Fully Additive Measures: The attributes, if can be summed up by simple addition are known as fully additive. When we run a query to aggregate the measures in a fact table, the output will be correct only, if the measures are fully additive in nature.
4.       Semi additive Measures:  These are attributes like percentage etc. which have been derived from the additive measures.
5.       Fewer attributes but more rows compared to dimensional tables.
6.       Sparse Data: for a particular combination of dimensional table, it might be the case that the fact table doesn’t have any data. E.g. : Form the month of February 2006, there might be no sale of Maruti Swift in Rajasthan, hence no entry would be present in the fact table This is an example of sparse data.
7.       Degenerate Dimensions: When we are selecting facts and dimensions from the operational systems, there may be some attributes, which are neither measures (facts) nor belong to dimension tables. E.g.: invoice number, Order number. They are useful for analysis and kept in the fact table.

The Fact less fact table

Assuming that a fact table is made for recording the attendance of the students. Dimensions of this model would be student, course and time.  If the student is present then it will be recorded as ‘1’ in the fact table.  The presence of any entry in the fact table would represent that that the student is present and the fact hence would not contain any specific measure i.e. it is fact less. Such types of situations arise when the fact is measuring /recording events.

Conclusion

The entity relationship model is not suitable for Decision support System. Dimensional modeling is apt for designing Decision support systems as they facilitate more analysis. It not only is easy for users to understand but also optimizes navigation and is most suitable for query processing. Hierarchies within different dimensions facilitate Drill down and roll up analysis of data, providing more flexibility for analysis.

Principles of Dimensional Modeling

Overview

The requirements definition completely drives the data design for the data warehouse.

Data design consists of putting together the data structures. A group of data elements form a data structure. Logical data design includes determination of the various data elements that are needed and combination of the data elements into structures of data. Logical data design also includes establishing the relationships among the data structures.  An essential component of this document is the set of information package diagrams. Remember that these are information matrices showing the metrics, business dimensions, and the hierarchies within individual business dimensions. The information package diagrams form the basis for the logical data design for the data warehouse. The data design process results in a dimensional data model.


So far we have formed the fact table and the dimension tables. How should these tables be arranged in the dimensional model? What are the relationships and how should we mark the relationships in the model? The dimensional model should primarily facilitate queries and analyses. What would be the types of queries and analyses? These would be queries and analyses where the metrics inside the fact table are analyzed across one or more dimensions using the dimension table attributes. Before we decide how to arrange the fact and dimension tables in our dimensional model and mark the relationships, let us go over what the dimensional model needs to achieve and what its purposes are. Here are some of the criteria for combining the tables into a dimensional model.
1.       The model should provide the best data access.
2.       The whole model must be query-centric.
3.       It must be optimized for queries and analyses.
4.       The model must show that the dimension tables interact with the fact table.
5.       It should also be structured in such a way that every dimension can interact equally with the fact table.

The model should allow drilling down or rolling up along dimension hierarchies. With these requirements, we find that a dimensional model with the fact table in the middle and the dimension tables arranged around the fact table satisfies the conditions. In this arrangement, each of the dimension tables has a direct relationship with the fact table in the middle. This is necessary because every dimension table with its attributes must have an even chance of participating in a query to analyze the attributes in the fact table. Such an arrangement in the dimensional model looks like a star formation, with the fact table at the core of the star and the dimension tables along the spikes of the star. The dimensional model is therefore called a STAR schema.

FACT and DIMENSIONAL Tables in Start Schema representation




Case Study

Problem Statement:
Let us say that the goal is to analyze sales for an automobile company e.g.  :  Maruti Suzuki. We want to build a data warehouse that will allow the user to analyze automobile sales in a number of ways. The output information from the data warehouse shall facilitate the following types of analysis on the sales of the car.
1.       How many Alto cars are sold in the state of Karnataka, during the marriage seasons?
2.       Which car maximized the profits for the hatchback sections? Were these profits less than or more than compared to the last quarter?
3.       Which locations shall more manufacturing units be setup to cater to the demands?

We shall try to identify the business dimensions, analyzing the problem statement.

1.       Product :  
a.       What kind of cars Maruti Suzuki is manufacturing? Examples would be Maruti 800, Maruti Zen, Wagon R, SX4, Swift and Alto.
b.      What would be the product line of each of these cars? – For instance, hatchback, sedan, SUV, Sports or Luxury. Where would each of the products fall into?
c.       What are the colors (interior as well as exterior) in which each model is available in?
d.      The first model year of each model?

2.       Dealer :
a.       Name of the Dealer?
b.      Location of the dealer? i.e. State and City
c.       Single Brand flag?  Does the dealer sells only Maruti Suzuki cars exclusively or he sells cars of Toyota, Ford as well? This can be answered in Yes or No hence we use a flag.
d.      Date of first operation?

3.       Customer Demographics :
a.       Name of the Customer?
b.      Gender?
c.       Income range?
d.      Marital Status?
e.      Vehicles owned?

4.       Payment Method  :
a.       Finance Type – Bought on Loan or Paid fully?
b.      Term of Loan in month.
c.       Agent facilitating the finances
d.      Interest rate of the loan
5.       Time: This is a common dimension. It can be interpreted as follows in the context of the problem.
a.       Date
b.      Month
c.       Quarter
d.      Year
e.      Day of month
f.        Season –{Winter,Summer,Autumn,Spring}
g.       Holiday Flag

But using these business dimensions, what exactly are the users analyzing? What numbers are they analyzing? The numbers the users analyze are the measurements or metrics that measure the success of their departments. These are the facts that indicate to the users how their departments are doing in fulfilling their departmental objectives. In the case of the automaker, these metrics relate to the sales. These are the numbers that tell the users about their performance in sales. These are numbers about the sale of each individual automobile. The set of meaningful and useful metrics for analyzing automobile sales is as follows:

1.       Actual sale price - The price at which a particular was sold to the customer
2.       MSRP sale price - Manufactured Suggested retail price of a product is the price which the manufacturer recommends that the retailer sell the product. The intention was to help to standardize prices among locations. While some stores always sell at, or below, the suggested retail price, others do so only when items are on sale or closeout/clearance.
3.       Dealer add-ons
4.       Dealer credits
5.       Dealer invoice
6.       Amount of down payment by the customer
7.       Amount financed


Information package diagram


PRODUCT
DEALER
CUSTOMER DEMOGRAPHICS
PAYMENT  METHOD
TIME
Model Name
Name of the dealer
Customer Name
Finance Type
Year
Product Line
Location of the dealer
Marital Status
Term (Months)
Quarter
Launching Year
Single brand flag
Gender
Agent Name
Month
Interior colors
Date of First operation
Income range
Interest Rate
Date
Exterior Colors
Vehicles owned
Holiday Flag
Season
FACTS : Actual Sale Price, MSRP Price, Dealer add-ons , Dealer credits, Dealer Invoice , Amount Financed , Amount of down payment , Quantity Sold




Information package: Automaker Sales

Dimensional Modeling: Basic Concepts

Difference between OLTP and Decision Support Systems

The comparison can be well understood by looking at the above figure which depicts a moving car.
The OLTP systems are operational systems, which participate in day to day activities of the business. If business is the Car, OLTP systems are the wheels of the car, which help it move by loading data in the dbases about a single entity or “make” the wheels of the car turn. Examples would be
1.       Take an order
2.       Process  claim
3.       Make a shipment
4.       Generate an invoice
5.       Reserve an airline seat
The technological implementation would be a Java or .NET based web form at the front end, which is used as an input to store data in the backend which could be any dbase.
The Decision support system, on the other hand is more analytical in nature. With reference to our example, it would be “watching” the wheels of the car turn and predicting the behavior of car at different speeds and on different surfaces. In Business terminology, decision support systems would do the following:
1.       Show the top selling product for a particular product in a particular region
2.       Alert the manager, when a particular store sells a particular product, below a certain quantity.
Since the output data in the decision support systems is used for analytical purposes. It follows a different data model as compared to the traditional OLTP systems.  This is called as Dimensional Modeling.
We do need different types of decision support systems to provide strategic information. The type of information needed for strategic decision making is different from that available from operational systems. We need a new type of system environment for the purpose of providing strategic information for analysis, discerning trends, and monitoring performance.

Let us examine the desirable features and processing requirements of this new type of system environment. Let us also consider the advantages of this type of system environment designed for strategic information.

A New Type of System Environment

The desired features of the new type of system environment are:
1.       Database designed for analytical tasks - (achieved through Dimensional modeling)
2.       Data from multiple applications- (achieved through ETL tools like Informatica, Abinitio)
3.       Easy to use and conducive to long interactive sessions by users
4.       Read-intensive data usage
5.       Direct interaction with the system by the users without IT assistance (achieved through reporting  tools like Cognos )
6.       Content updated periodically and stable (achieved through Dimensional modeling)
7.       Content to include current and historical data (achieved through Dimensional modeling)
8.       Ability for users to run queries and get results online (achieved through reporting  tools like Cognos )
9.       Ability for users to initiate reports (achieved through reporting  tools like Cognos )

This new system environment that users desperately need to obtain strategic information happens to be the new paradigm of data warehousing. Enterprises that are building data warehouses are actually building this new system environment. This new environment is kept separate from the system environment supporting the day-to-day operations. The data warehouse essentially holds the business intelligence for the enterprise to enable strategic decision making.

Defining Business Requirements: Key to a successful data warehouse


The new methodology for determining requirements for a data warehouse system is based on business dimensions. It flows out of the need of the users to base their analysis on business dimensions. The new concept incorporates the basic measurements and the business dimensions along which the users analyze these basic measurements. Using the new methodology, you come up with the measurements and the relevant dimensions that must be captured and kept in the data warehouse. You come up with what is known as an information package for the specific subject.
 Our primary goal in the requirements definition phase is to compile information packages for all the subjects for the data warehouse. Once we have firmed up the information packages, we’ll be able to proceed to the other phases.
Essentially, information packages enable you to:
·         Define the common subject areas
·         Design key business metrics
·         Decide how data must be presented
·         Determine how users will aggregate or roll up
·         Decide the data quantity for user analysis or query
·         Decide how data will be accessed

The Requirement gathering phase involves extensive client interaction and interviews between the Vendor Team with the senior management, middle management, Business Analysts and the IT department of the client. Executives will give you a sense of direction and scope for your data warehouse. They are the ones closely involved in the focused area. The key departmental managers are the ones that report to the executives in the area of focus. Business analysts are the ones who prepare reports and analyses for the executives and managers. The operational system DBAs and IT applications staff will give you information about the data sources for the warehouse.

How to enable Version Control in Informatica 9.5



1. Log into the Admin Console
2. Click on the repository > Properties
3. Under repository Properties > Edit (Operating mode)


4. Choose the mode as Exclusive and enable Version Control as shown


Static-Deployment Groups


1. Used to migrate code between two different repositories only.
2. Cannot be used to migrate code between folders of the same repository
3. Login to Repository manager.
4. Click on the repository Service
5. Goto > Tools > Deployment > Groups




6. Create a Deployment group and add permissions
7. Click the repository object to be added to the deployment group.
8. Go to Tools > Add to Deployment group ..
9. Repeat the process for all repository objects across folders.
10. Drag and Drop the static deployment group to the target repository.

Informatica Installation on Linux- (Distributed Environment)

Challenges

1              Setting up of DISPLAY variable                  
2              Setting up of Environment variables for oracle client installation                                               
3              Setting up of Kernel Parameters                                              
4              Downloading oracle specific packages                                   
5              Installing oracle client on the linux box                                  

6              Making soft links to the shared objects 

Risks     
 1              Firewalls of the Linux box need to be opened to install oracle specific packages using yum install                   <packagename>
 2              Check if the RS needs to be re-assigned shared objects via soft links at Linux level every time the                  machine is re-started           

Top Technologies/Tools to work for in Indian IT industry



1 Salesforce.com – One of the fastest growing technologies. Provides the platform for custom development of applications on the cloud. It will gradually end the reign of legacy databases like Oracle.

2MDM – Another hot skill to have. Clients are looking for MDM Gurus who could sync the massive volume of enterprise data. Tools: Informatica MDM, Dell Boomi MDM

3 SAP – Consider yourself lucky if you are getting into any of the SAP modules. Excellent technology to work to have a lasting career in IT.

4Data Warehouse – Development Projects in DWH are gradually decreasing. Excellent place for Prod-Support projects and can ensure good job security upto 5-7 years. Tools- Informatica, Abinitio, DataStage


Business Intelligence Reporting – This is front end to DWH. Tools- Cognos, SAP BO