
Using SQL Server 2005 Integration Services : Working with Integration Services Packages (part 2) - Data Flow

Data Flow

One of the Integration Services tasks is the Data Flow task. Its function is to support ETL scenarios in which data enters the data flow from one or more sources, is transformed in the middle of the flow, and is then consumed by destinations at the end of the flow. As the data goes through the flow, the flow itself can be split and merged, and the data can be conditionally routed through the multiple flow paths. The Data Flow task is such a powerful and flexible part of Integration Services that it has its own object model, and the Integration Services designer has a separate design surface and toolbox to facilitate working visually with data flows. Figure 2 shows the data flow design surface.

Figure 2. The data flow design surface and toolbox
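
The Data Flow task can also be created and configured programmatically through the Integration Services object model. The following C# sketch is an illustration only: it assumes the SSIS 2005 managed assemblies are referenced, and the task moniker ("DTS.Pipeline.1") may vary by version. It creates a package, adds a Data Flow task, and obtains the MainPipe interface used to build the flow.

using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

class DataFlowSketch
{
    static void Main()
    {
        // Create an empty package and add a Data Flow (pipeline) task.
        Package package = new Package();
        Executable exec = package.Executables.Add("DTS.Pipeline.1");

        // The task is hosted by a TaskHost; its InnerObject exposes the
        // data flow object model (MainPipe) used in the later sketches.
        TaskHost taskHost = (TaskHost)exec;
        taskHost.Name = "Data Flow Task";
        MainPipe pipeline = (MainPipe)taskHost.InnerObject;

        System.Console.WriteLine("Components in the new data flow: " +
            pipeline.ComponentMetaDataCollection.Count);
    }
}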

Anatomy of a Data Flow

A data flow consists of components connected by paths. Each component is a point of operation on the data as the data flows from component to component. Components that operate on the data are called transforms. Components can also be sources that generate the data at the start of a flow, or destinations that consume the data at the end of a flow. Those components are called source adapters and destination adapters, respectively. Components have collections of inputs and outputs, and paths are used to connect the output of one component to the input of another. Inputs and outputs have collections of columns that describe the data flowing from the output to the input. The data that flows through the components is held in memory buffers organized as rows of columns, just as data is represented in a table in a relational database.
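
To make these terms concrete, here is a hedged C# sketch of how components and paths appear in the data flow object model. It assumes the pipeline variable (a MainPipe) from the previous sketch, and the ProgIDs for the stock OLE DB adapters are assumptions that can differ between versions.

// Add a source and a destination component to the layout.
IDTSComponentMetaData90 source = pipeline.ComponentMetaDataCollection.New();
source.ComponentClassID = "DTSAdapter.OLEDBSource.1";        // assumed ProgID
CManagedComponentWrapper sourceInstance = source.Instantiate();
sourceInstance.ProvideComponentProperties();                 // creates inputs, outputs, properties

IDTSComponentMetaData90 destination = pipeline.ComponentMetaDataCollection.New();
destination.ComponentClassID = "DTSAdapter.OLEDBDestination.1";
CManagedComponentWrapper destinationInstance = destination.Instantiate();
destinationInstance.ProvideComponentProperties();

// A path connects the output of one component to the input of another.
IDTSPath90 path = pipeline.PathCollection.New();
path.AttachPathAndPropagateNotifications(source.OutputCollection[0],
                                         destination.InputCollection[0]);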

Source Adapters

Source adapters are data flow components that generate data for the start of a flow. The data can be generated by reading from an external source, such as a database, or it can be created internally by the source adapter. Source adapters include:

  • OLE DB source adapter Works with the OLE DB Connection Manager to read data into the flow using an OLE DB provider. (A configuration sketch follows this list.)

  • XML source adapter Reads data into the flow from XML files. This adapter is unique in Integration Services because it is the only stock source adapter that supports multiple non-error outputs.
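
As a rough illustration of how a source adapter is configured in code, the sketch below wires the OLE DB source from the previous sketch to a connection manager and a table. The connection string and table name are placeholders, and the property names (AccessMode, OpenRowset) are those of the stock OLE DB adapters.

// Assumes "package", "source", and "sourceInstance" from the earlier sketches.
ConnectionManager cm = package.Connections.Add("OLEDB");
cm.Name = "Source Connection";
cm.ConnectionString = "Provider=SQLNCLI;Data Source=(local);" +
                      "Initial Catalog=AdventureWorks;Integrated Security=SSPI;";

// Point the adapter's runtime connection at the connection manager.
source.RuntimeConnectionCollection[0].ConnectionManagerID = cm.ID;
source.RuntimeConnectionCollection[0].ConnectionManager =
    DtsConvert.ToConnectionManager90(cm);

// Read an entire table or view (AccessMode 0) into the flow.
sourceInstance.SetComponentProperty("AccessMode", 0);
sourceInstance.SetComponentProperty("OpenRowset", "[Sales].[SalesOrderHeader]");

// Let the adapter connect and build its output column metadata.
sourceInstance.AcquireConnections(null);
sourceInstance.ReinitializeMetaData();
sourceInstance.ReleaseConnections();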

Destination Adapters

Destination adapters are data flow components that consume data at the end of a data flow. The components can insert the data into a database or simply end the data flow without doing anything with the data. Destination adapters include:

  • OLE DB destination adapter The counterpart to the OLE DB source adapter. This adapter inserts data from the flow into a database that can be accessed using an OLE DB provider. (A configuration sketch follows this list.)

  • Data Reader destination adapter Makes the data at the end of a flow available to other clients using Microsoft ADO.NET.
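
The destination side is configured in much the same way. The following sketch is again illustrative only, with placeholder names: it points the OLE DB destination from the earlier sketches at a target table and maps the columns arriving on its input to the table's columns by name.

// Assumes "package", "destination", and "destinationInstance" from the earlier sketches.
ConnectionManager destCm = package.Connections.Add("OLEDB");
destCm.Name = "Destination Connection";
destCm.ConnectionString = "Provider=SQLNCLI;Data Source=(local);" +
                          "Initial Catalog=Staging;Integrated Security=SSPI;";
destination.RuntimeConnectionCollection[0].ConnectionManagerID = destCm.ID;
destination.RuntimeConnectionCollection[0].ConnectionManager =
    DtsConvert.ToConnectionManager90(destCm);

destinationInstance.SetComponentProperty("AccessMode", 0);     // table or view
destinationInstance.SetComponentProperty("OpenRowset", "[dbo].[OrderStage]");

destinationInstance.AcquireConnections(null);
destinationInstance.ReinitializeMetaData();                    // builds the external columns
destinationInstance.ReleaseConnections();

// Select each upstream column on the destination's input and map it to the
// external (table) column with the same name.
IDTSInput90 input = destination.InputCollection[0];
IDTSVirtualInput90 virtualInput = input.GetVirtualInput();
foreach (IDTSVirtualInputColumn90 vColumn in virtualInput.VirtualInputColumnCollection)
{
    IDTSInputColumn90 inputColumn = destinationInstance.SetUsageType(
        input.ID, virtualInput, vColumn.LineageID, DTSUsageType.UT_READONLY);
    IDTSExternalMetadataColumn90 externalColumn =
        input.ExternalMetadataColumnCollection[inputColumn.Name];
    destinationInstance.MapInputColumn(input.ID, inputColumn.ID, externalColumn.ID);
}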

Transforms

Transforms are used within the data flow to perform operations on the data, direct data through the flow, or modify the flow itself. Operations can be row-based or set-based. For example, a transform that performs row-based operations is the Copy Column transform. Given a column in a row, the transform needs access to only one row at a time because it copies the data for that column into another column in the same row. An example of a transform that performs set-based operations is the Aggregate transform. To aggregate rows based on data in a column, this transform needs access to the entire set of rows. Transforms that perform row-based operations are called synchronous in Integration Services because the operation occurs on one row at a time, so the count and order of rows that flow out of the transform are the same as, or synchronous with, the rows that flow in. Likewise, transforms that perform set-based operations are called asynchronous because their operations generate rows that flow out of the transform independently of, or asynchronously with, the rows that flow in. It is worth noting that asynchronous transforms can generate rows based on either subsets or the entire set of input rows.

Synchronous, or row-based, transforms include:

  • Derived Column transform Applies a user-defined expression to the data in a column.

  • Character Mapping transform Applies a user-selected, string-related operation to the data in a column.

Asynchronous, or set-based, transforms include:

  • Aggregate transform Aggregates data in a set of rows.

  • Sort transform Sorts a set of rows based on columns that are designated as sort keys.

Some transforms are designed to affect the flow of the data by splitting or merging it. These transforms include:

  • Conditional Split transform Applies a user-defined expression to the data and, based on the result, directs each row to one of multiple output flows. This transform is synchronous.

  • Union All transform Combines data from multiple incoming flows into one outgoing flow. This transform is asynchronous. Figure 3 shows a data flow containing multiple flow paths.

Figure 3. Data flow that combines two sources and is then transformed and loaded into a database table

Data Flow Layout

The Data Flow task does not need to know the operational details of its components for it to execute. To the data flow engine, each component is a black box. All that is needed from each component is the correct setting of its metadata, which includes the component’s properties, inputs, outputs, and columns. For example, the Aggregate transform indicates to the data flow that its operation is asynchronous by setting its output’s SynchronousInputID property to 0. How the component’s metadata is set up and how the components are connected to each other is called the data flow layout.
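
As a small illustration of how this distinction is visible in the metadata, the following sketch (which assumes the pipeline variable from the earlier sketches) walks the layout and reports whether each output is synchronous or asynchronous by checking its SynchronousInputID property.

foreach (IDTSComponentMetaData90 component in pipeline.ComponentMetaDataCollection)
{
    foreach (IDTSOutput90 output in component.OutputCollection)
    {
        // SynchronousInputID == 0 marks an asynchronous output: the component
        // fills its own buffers rather than reusing its input's buffer.
        string kind = (output.SynchronousInputID == 0) ? "asynchronous" : "synchronous";
        System.Console.WriteLine("{0} / {1} ({2}): {3}",
            component.Name, output.Name, output.ID, kind);
    }
}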

More Info

All objects in the layout—including components, inputs, outputs, paths, columns, and properties—are identified with an integer that is unique within that layout. These object IDs are used for identifying objects, and appear in parentheses in object identification strings (for example, ‘Component “OLE DB Source” (34)’).


Data Flow Validation

Validation takes place while you design a package and again before a data flow executes. Validation ensures that the data flow layout is correctly constructed and that its components are configured correctly. Layout validation is done by the data flow engine, and component validation is done by the components themselves. The data flow engine does not allow execution to start if any part of layout or component validation fails.
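
In code, both levels of validation can be invoked explicitly. The sketch below, assuming the package and sourceInstance variables from the earlier sketches, validates the whole package and then asks one component to validate its own metadata.

// Package-level validation covers the runtime and the data flow layout.
DTSExecResult result = package.Validate(null, null, null, null);
System.Console.WriteLine("Package validation: " + result);

// A component can report that it is valid, broken, or in need of new metadata.
DTSValidationStatus status = sourceInstance.Validate();
if (status == DTSValidationStatus.VS_NEEDSNEWMETADATA)
{
    // The component can repair its own metadata, for example after a schema change.
    sourceInstance.ReinitializeMetaData();
}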

Data Flow Execution

Given the many possible layouts, how does the data flow engine know how to execute the flow? Before execution begins, the data flow task first analyzes the layout to construct execution trees.

Execution Trees

Each tree describes a segment of the layout for which only one buffer is needed. In general, the segment described by a tree has the following pattern: The segment starts at an asynchronous output, such as the output of a source adapter, where rows are added to the buffer. That segment continues on the path from the output and includes any synchronous components along the way before the segment ends. Finally, the segment ends at the first input that isn’t synchronous with any output, such as the input of a destination adapter where the rows added to the buffer by the initial output that started the segment are consumed.

More Info

Along with destination adapters, asynchronous components such as the Aggregate end execution trees with their inputs. However, they also start new trees with their asynchronous outputs.


Execution Plans

Once all the trees are constructed, they are used to generate an execution plan. This plan determines how the data buffers are queued and passed to the components, and how the data flow engine's worker threads are allocated across the layout. The execution trees and plan can be logged, via the Logging dialog box in the designer, to help with debugging and performance tuning (as shown in Figure 4).

Figure 4. Configuring the data flow task to log its execution trees and execution plan
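
Logging can also be turned on programmatically. The following sketch is based on the runtime logging API and assumes the package and taskHost variables from the earlier sketches; the log provider ProgID and the log file path are assumptions and may need adjusting.

// A FILE connection manager tells the text file log provider where to write.
ConnectionManager logFile = package.Connections.Add("FILE");
logFile.Name = "Log Connection";
logFile.ConnectionString = @"C:\Temp\dataflow.log";

LogProvider provider = package.LogProviders.Add("DTS.LogProviderTextFile.1");
provider.ConfigString = logFile.Name;

// Enable logging on the Data Flow task and request only the two pipeline events.
package.LoggingMode = DTSLoggingMode.Enabled;
taskHost.LoggingMode = DTSLoggingMode.Enabled;
taskHost.LoggingOptions.SelectedLogProviders.Add(provider);
taskHost.LoggingOptions.EventFilterKind = DTSEventFilterKind.Inclusion;
taskHost.LoggingOptions.EventFilter =
    new string[] { "PipelineExecutionTrees", "PipelineExecutionPlan" };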

Execution Phases

As the data flow is executed, components are given the chance to respond to different phases of execution. This gives the data flow engine and components a chance to perform any additional processing needed for the execution. Here, in order, are the execution phases; a skeletal component that overrides them appears after this list.

  • PrepareForExecute The data flow engine and components can perform any one-time preparation needed before execution starts.

  • PreExecute As with PrepareForExecute, the data flow engine and components can perform any one-time preparation needed before execution starts. PreExecute occurs closer to the main execution of the data flow, so any work that requires acquiring and holding an expensive or shared resource should happen in PreExecute rather than PrepareForExecute, to minimize how long the resource is held. Also, any connections needed by components are acquired after PrepareForExecute and before PreExecute, so any preparation that requires connections should occur in PreExecute instead of PrepareForExecute.

  • PrimeOutput Source adapters and asynchronous transforms are primed with data buffers via PrimeOutput. However, they handle the buffers differently. Source adapters immediately start adding data to those buffers. As those buffers become full, the data flow engine pushes them downstream to be consumed by other components. Asynchronous transforms do not immediately add data to their buffers. Instead, they hold onto the buffers and end their PrimeOutput phase. Because asynchronous components, like the Aggregate, add output data based on their input data, they must wait until they get enough input data to start adding rows to their cached buffers.

  • ProcessInput Destination adapters and transforms are asked to process their input data via ProcessInput. For asynchronous components, whose output buffers are cached in PrimeOutput, data is added to those buffers when enough input data has been processed.

  • PostExecute This phase counterbalances PreExecute and is used for any post-execution processing, usually releasing any resources acquired in PreExecute. Connections that were made before PreExecute are also released after PostExecute.

  • Cleanup This phase counterbalances PrepareForExecute and is used for any necessary cleanup. Once the execution phases are complete, the data flow task signals to the Integration Services runtime that it has either succeeded or failed in its execution.
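
For custom managed components, these phases surface as overridable methods on the PipelineComponent base class. The skeleton below is a hypothetical synchronous transform shown only to illustrate where each phase lands; a real component would also implement design-time methods such as ProvideComponentProperties, and only source adapters and components with asynchronous outputs receive PrimeOutput.

using Microsoft.SqlServer.Dts.Pipeline;

[DtsPipelineComponent(DisplayName = "Phase Logger", ComponentType = ComponentType.Transform)]
public class PhaseLogger : PipelineComponent
{
    public override void PrepareForExecute()
    {
        // One-time preparation; connections have not been acquired yet.
    }

    public override void PreExecute()
    {
        // Acquire expensive or shared resources here; connections are available.
    }

    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        // Called repeatedly as buffers arrive on the input.
        while (buffer.NextRow())
        {
            // Row-by-row work goes here.
        }
    }

    public override void PostExecute()
    {
        // Release the resources acquired in PreExecute.
    }

    public override void Cleanup()
    {
        // Counterpart of PrepareForExecute.
    }
}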

Viewing Data in the Flow During Execution

While the data flow task executes in the designer, you can see the data passing from one component to another by using data viewers, which is helpful for debugging when you need to see how the components transform the data. To add one, double-click the path between the two components whose data you want to view, select Data Viewers, and then click Add. Four types of data viewers can display the data in different forms; the grid format is the most traditional because it displays the data in tabular form.

One thing to keep in mind when using data viewers is that when they access the data flow buffers to display the data, they pause the data flow. To allow the flow to continue, you must click the Play button on the viewer. This means that if the row you are trying to view is near the end of the flow, you might have to click through a lot of pauses before seeing it. One imperfect workaround is to fast-forward through the data flow and pause again closer to the row you want. You can do this by clicking the Detach button to detach the viewer and let more rows through. When enough rows have gone through, click the Attach button to attach the viewer again. Figure 5 shows the data viewer in action.

Figure 5. Data viewer showing the data during execution


Handling Component Errors Within the Data Flow

Some components support error outputs, which are shown as red outputs in the designer. The term error output is somewhat misleading, however, because what constitutes an error is component specific. In other words, these are outputs the component uses to push into the data flow the rows it could not successfully process. For example, the Lookup component might consider a failed lookup an error and route that row to its error output. Component errors in Integration Services are divided into two groups: truncation errors and non-truncation errors. (An example of a truncation error is string data that is too long for the component and must be trimmed.) These two categories exist because in many business scenarios, truncations either are not considered errors or must be handled differently from other errors. Most components that support error outputs also offer the option to fail on an error or truncation, ignore it, or route the affected row to their error outputs. You set this option by modifying the component's error or truncation disposition on the column where the error or truncation can occur. Figure 6 shows what a data flow can look like when error outputs are connected.

Figure 6. A source adapter with its error output connected
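
Dispositions are set per column in the component's metadata. The sketch below, which assumes the source variable from the earlier sketches and uses a placeholder column name, redirects rows with conversion errors on one column to the error output and ignores truncations on that column.

IDTSOutput90 sourceOutput = source.OutputCollection[0];
foreach (IDTSOutputColumn90 column in sourceOutput.OutputColumnCollection)
{
    if (column.Name == "CustomerName")   // placeholder column name
    {
        column.ErrorRowDisposition = DTSRowDisposition.RD_RedirectRow;
        column.TruncationRowDisposition = DTSRowDisposition.RD_IgnoreFailure;
    }
}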
