MSBI-SSIS(SQL Server Integration Services) Architecture and ETL Process
SSIS is one of the most powerful features which were introduced in SQL Server 2005.
SSIS is an ETL (extract transform and load features) tool.
SSIS can be defined in various ways which are listed below.
- ETL(extract transform and load features) tool.
- Control flow engine.
- Application platform.
- High-performance data transformation pipeline.
- Data import/export wizard.
SSIS ( SQL Server Integration Services ) is one of technologies which is in high demand currently in IT industry.
All organisations (be it small, medium or large) have to maintain their data in different databases like spreadsheet, flat files, RDMS systems etc.
Sometimes we need to access data from different sources and need to modify the data according to business needs. Need for accessing the data from multiple data sources and need for performing broad range of data migration tasks and transforming them according to the business needs has been very much in demand now.
In 2000, Microsoft released DTS (Data Transformation Services) to transform data from one source to different destination.
After 5 years in 2005, Microsoft launched another version of DTS with more powerful features. This is named as SSIS. SSIS is more powerful than DTS and its user friendly. SSIS has wide range of new features added to it. SSIS is part of SQL SERVER Business Intelligence Studio.
Following Services are part of SQL SERVER Business Intelligence Studio.:
SSIS (SQL Server Integration Services )
SSRS (SQL SERVER Reporting Services)
SSAS (SQL SERVER Analysis Services)
As already mentioned SSIS is the successor of DTS. But, DTS and SSIS have very few things in common. DTS code is completed rewritten to design and build SSIS.
Lots of new features are added in SSIS. Some of the new features in SSIS are listed below.
Latest features in SSIS:
- For Each loop containers
- Package configurations
- Property expressions
- Better Active X controllers
- Script tasks
- Control flow and Data flow are separated to provide better design and to make project/code more readable and understandable
- Event Handlers
- Precedence constraints are improvised to handle conditional checks/Conditional Expressions.
- Pivot and Unpivot.
SSIS Architecture :
Building Blocks of SSIS:
Control Flow: Work flow or Process flow in SSIS is known as Control flow. Control flow consists if one of more tasks that will be executed when SSIS package runs.
Data Flow: Data Flow in SSIS defines/indicates how data should flow from one data source to other destination. It holds information about data source, data destination and data transformation.
Even Handlers: Event handlers are the tasks that are executed when some event occurs during SSIS package execution. Ex: If some error occurs during SSIS package execution, then event handler can be programmed to run to ignore that error and continue to the next step.
Package Explorer: Package Explorer contains complete information about the Variables, Precedence constrains, Event handlers, Connection Managers, Log providers, Executable.
Precedence Constraints: Precedence Constraints links various tasks in SSIS. In simple words it is the arrow marks that connect 2 difference tasks in SSIS. Based on the direction of Precedent Constraints tasks will be executed in order. In other words Precedence constraints are needed for ordering / organising the control flow in SSIS.
Connection Managers: Connection Managers contain information that is needed to connect to various data sources and to data destinations.
Toolbox: Toolbox contains is collection of Control Flow items, Maintenance plan Tasks, Data Flow Sources, Data Flow Destinations and Data Flow Transformations. In simple words Toolbox in SSIS contains different tasks/ containers. Task is nothing but a work unit to perform certain job/work/action.
Variables: Variables parameters store information that can be used by Containers/tasks in SSIS during the SSIS package execution.
Advantages of SSIS:
- SSIS has a very user friendly Graphical User Interface (GUI) through which difficult tasks can also be done with very much ease.
- We can perform different types of tasks (like loading data, extracting data, renaming file, sending mails, sending files through FTP , Data mining and lot more) in a single SSIS package without any manual intervention.
- SSIS package can be scheduled to run at a given time as per the business needs.
- SSIS can be used to connect to different data sources (like flat file, MS Access, Excel, SQLS ERVER, SYBASE, MYSQL, ORACLE etc.,). In simple words using SSIS we can connect to almost all the external data sources.
- Deploying SSIS package is very easy.
SSIS is an ETL tool (Extract, Transform and Load) which is very much needed for the Data warehousing applications. Also SSIS is used to perform the operations like loading the data based on the need, performing different transformations on the data like doing calculation’s (Sum, Average, etc.) and to define a workflow of how the process should flow and perform some tasks on the day to day activity.
Prior to SSIS, Data Transformation Services (DTS) in SQL Server 2000 performs the tasks with fewer features. With the introduction of SSIS in SQL Server 2005 many new features can be used. To develop your SSIS package you need to get installed with the SQL Server Business Intelligence Development Studio which will be available as client tool when installing SQL Server Management Studio (SSMS).
SSMS Vs. BIDS
SSMS provides different options to develop your SSIS package starting with Import and Export wizard with which you can copy the data from one server to the other or from one data source to the other. With these wizards we can create a structure on how the data flow should happen and make a package and deploy it based on our needs to execute in any environment.
Business Intelligence Development Studio (BIDS) is a tool which can be used to develop the SSIS packages. BIDS is available with SQL Server as an interface which provides the developers to work on the work flow of the process that can be made step by step. Once the BIDS is installed with the SQL Server installation we can locate it and start our process as shown in the steps below.
ETL Process :
ETL stands for Extract Transform and Load. These are simple day-to-day words we use in our daily lives. The figure below depicts ETL in real world scenario.
E - Extract data: from various homogeneous or non-homogeneous source systems. Data could be stored in any of the following forms though not limited to them: Flat file, Database, XML, Web queries etc. When we can have sources of such variety, the job of extraction is to fetch the data from these sources and make it available for the next step.
T - Transform Data: As already discussed that the data are coming from various sources and we cannot assume that the data is structured in the same way across all the sources. Therefore, we need to transform the data to a common format so that the other transformations can be done on them. Once we have the data we need to perform various activities like:
- Data cleansing
- Mandatory check
- Data type check
- Check for foreign key constraints
- Check for business rules and apply business rules
- Creation of surrogate keys
- Sorting the data
- Aggregating the data
- Transposing the data
- Trim the data to remove blanks.
L - Load Data: Once the transformations are done and the data takes the form as per the requirement, we have to load the data to the destination systems. The destinations can also be as varied as the sources. Once the data reaches the destination, it is consumed by other systems, which either stores it as historical data, generate reports out of it, build modes to take business decisions etc.