Data Flow
One of the Integration
Services tasks is the Data Flow task. Its function is to support ETL
scenarios in which data enters the data flow from one or more
sources, is transformed in the middle of the flow, and is then consumed
by destinations at the end of the flow. As the data moves through the
flow, the flow itself can be split and merged, and the data can be
conditionally routed through multiple flow paths. The Data Flow task
is such a powerful and flexible part of Integration Services that it
has its own object model, and the Integration Services designer has a
separate design surface area and toolbox to facilitate working visually
with data flows. Figure 2 shows the data flow design surface.
Anatomy of a Data Flow
A data flow consists of
components connected by paths. Each component is a point of operation on
the data as the data flows from component to component. Components that
operate on the data are called transforms.
Components can also be sources that generate the data at the start of a
flow, or destinations that consume the data at the end of a flow. Those
components are called source adapters and destination adapters,
respectively. Components have collections of inputs and outputs, and
paths are used to connect the output of one component to the input of
another. Inputs and outputs have collections of columns that describe
the data flow from the output to the input. The data that flows through
the components are held in memory buffers represented as rows of
columns, just as data is represented in a table in a relational
database.
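The same anatomy is visible when a data flow is built programmatically rather than in the designer. The following C# sketch is only illustrative: it assumes the managed object model and pipeline wrapper interfaces of a SQL Server 2008-era installation (IDTSComponentMetaData100 and related types), and the stock component class IDs it uses vary between versions. It adds a Data Flow task to a package, creates a source and a destination component, and connects an output of one to an input of the other with a path; connection managers and column mapping are omitted for brevity.

using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

class LayoutSketch
{
    static void Main()
    {
        // Create a package and add a Data Flow (pipeline) task to it.
        Package package = new Package();
        TaskHost taskHost = (TaskHost)package.Executables.Add("STOCK:PipelineTask");
        MainPipe pipeline = (MainPipe)taskHost.InnerObject;

        // Each component is described to the engine only by its metadata:
        // properties, inputs, outputs, and columns.
        IDTSComponentMetaData100 source = pipeline.ComponentMetaDataCollection.New();
        source.ComponentClassID = "DTSAdapter.OLEDBSource";           // version-dependent class ID
        source.Instantiate().ProvideComponentProperties();

        IDTSComponentMetaData100 destination = pipeline.ComponentMetaDataCollection.New();
        destination.ComponentClassID = "DTSAdapter.OLEDBDestination"; // version-dependent class ID
        destination.Instantiate().ProvideComponentProperties();

        // A path connects the output of one component to the input of another.
        IDTSPath100 path = pipeline.PathCollection.New();
        path.AttachPathAndPropagateNotifications(
            source.OutputCollection[0], destination.InputCollection[0]);
    }
}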
Source Adapters
Source adapters are
data flow components that generate data for the start of a flow. The
data can be generated by reading from an external source, such as a
database, or it can be created internally by the source adapter. Source
adapters include:
OLE DB source adapter Works with the OLE DB Connection Manager to read data into the flow using an OLE DB provider.
XML source adapter
Reads data into the flow from XML files. This adapter is unique in
Integration Services because it is the only stock source adapter that
supports multiple non-error outputs.
Destination Adapters
Destination adapters are
data flow components that consume data at the end of a data flow. The
components can insert the data into a database or simply end the data
flow without doing anything with the data. Destination adapters include:
OLE DB destination adapter
The counterpart to the OLE DB source adapter. This adapter inserts data
from the flow into a database that can be accessed using an OLE DB
provider.
Data Reader destination adapter Makes the data at the end of a flow available to other clients using Microsoft ADO.NET.
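As a rough sketch of how a client might read from the DataReader destination, the following C# fragment uses the DtsConnection and DtsCommand classes from the Microsoft.SqlServer.Dts.DtsClient ADO.NET provider. The dtexec-style connection string, the package path, and the destination name "DataReaderDest" are placeholders, and the exact pattern may differ between versions.

using System;
using System.Data;
using Microsoft.SqlServer.Dts.DtsClient;

class DataReaderDestinationClient
{
    static void Main()
    {
        using (DtsConnection connection = new DtsConnection())
        {
            // The connection string points at the package to execute.
            connection.ConnectionString = @"-f C:\Packages\Extract.dtsx";
            connection.Open();

            // The command text is the name of the DataReader destination
            // component inside the package's data flow.
            DtsCommand command = new DtsCommand(connection);
            command.CommandText = "DataReaderDest";

            // Executing the command runs the package and streams the rows
            // that reach the destination back to the client.
            using (IDataReader reader = command.ExecuteReader(CommandBehavior.Default))
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader[0]);
                }
            }
        }
    }
}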
Transforms
Transforms are used
within the data flow to perform operations on the data, direct data
through the flow, or modify the flow itself. Operations can be row-based
or set-based. For example, a transform that performs row-based
operations is the Copy Column transform. Given a column in a row, the
transform needs access to only one row at a time because it copies the
data for that column into another column in the same row. An example of a
transform that performs set-based operations is the Aggregate
transform. To aggregate rows based on data in a column, this transform
needs access to the entire set of rows. Transforms that perform
row-based operations are called synchronous in Integration Services
because the operation occurs on one row at a time, such that the count
and order of the rows that flow out of the transform are the same as, or
synchronous with, the rows that flow in. Likewise, transforms that
perform set-based operations are called asynchronous because their
operations generate rows that flow out of the transform independently
of, or asynchronously with, the rows that flow in. It is worth noting
that asynchronous transforms can generate rows based on either subsets
or the entire set of input rows. (A sketch of a simple synchronous
transform follows the lists below.)
Synchronous, or row-based, transforms include:
Derived Column transform Applies a user-defined expression to the data in a column.
Character Mapping transform Applies a user-selected, string-related operation to the data in a column.
Asynchronous, or set-based, transforms include:
Aggregate transform Aggregates the entire set of input rows based on data in a column, for example by grouping, counting, or summing.
Some transforms are designed to affect the flow of the data by splitting or merging it. These transforms include:
Conditional Split transform
Applies a user-defined expression to the data and, based on the result,
directs each row to one of multiple output flows. This transform is
synchronous.
Union All transform Combines data from multiple incoming flows into one outgoing flow. This transform is asynchronous. Figure 3 shows a data flow containing multiple flow paths.
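To make the synchronous case concrete, here is a minimal sketch of a custom row-based transform built on the managed PipelineComponent base class. It is not one of the stock transforms; it assumes a single input whose first selected column is a read/write string column, and it simply uppercases that column in place. Because it only rewrites values in the buffer it is handed, the count and order of the rows flowing out are identical to the rows flowing in.

using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

[DtsPipelineComponent(DisplayName = "Uppercase", ComponentType = ComponentType.Transform)]
public class UppercaseTransform : PipelineComponent
{
    private int columnIndex;

    // Resolve the column's position in the buffer once, before execution.
    public override void PreExecute()
    {
        IDTSInput100 input = ComponentMetaData.InputCollection[0];
        columnIndex = BufferManager.FindColumnByLineageID(
            input.Buffer, input.InputColumnCollection[0].LineageID);
    }

    // Synchronous: the transform sees one row at a time and changes values
    // in place, so the same buffer simply continues down the flow.
    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        while (buffer.NextRow())
        {
            string value = buffer.GetString(columnIndex);
            buffer.SetString(columnIndex, value.ToUpperInvariant());
        }
    }
}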
Data Flow Layout
The Data Flow task does
not need to know the operational details of its components in order to
execute them. To the data flow engine, each component is a black box. All
that is needed from each component is the correct setting of its
metadata, which includes the component’s properties, inputs, outputs,
and columns. For example, the Aggregate transform indicates to the data
flow that its operation is asynchronous by setting its output’s SynchronousInputID property to 0. How the component’s metadata is set up and how the components are connected to each other is called the data flow layout.
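In a custom managed component, for example, that piece of metadata is set when the component describes itself to the layout. The sketch below is a hypothetical component, not one from the text; it shows the two choices side by side.

using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class MetadataSketch : PipelineComponent
{
    public override void ProvideComponentProperties()
    {
        IDTSInput100 input = ComponentMetaData.InputCollection.New();
        input.Name = "Input";

        IDTSOutput100 output = ComponentMetaData.OutputCollection.New();
        output.Name = "Output";

        // Synchronous: the output reuses the input's buffer, so it points
        // SynchronousInputID at the input's ID.
        output.SynchronousInputID = input.ID;

        // Asynchronous (what the Aggregate does): the output gets its own
        // buffer, which is signaled by setting SynchronousInputID to 0.
        // output.SynchronousInputID = 0;
    }
}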
More Info
All
objects in the layout—including components, inputs, outputs, paths,
columns, and properties—are identified with an integer that is unique
within that layout. These object IDs are used for identifying objects,
and appear in parentheses in object identification strings (for example,
‘Component “OLE DB Source” (34)’).
Data Flow Validation
Validation takes place when you design a package and again before a
data flow executes. Validation ensures that the data flow layout is
correctly constructed and that its components are configured correctly.
Layout validation is done by the data flow engine, and component
validation is done by the components themselves. The data flow engine
does not allow execution to start if any part of layout or component
validation fails.
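For a custom managed component, component validation amounts to overriding Validate and reporting problems back to the engine. The following sketch is a hypothetical example, not a stock component, and the one-input rule it enforces is arbitrary.

using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class ValidationSketch : PipelineComponent
{
    public override DTSValidationStatus Validate()
    {
        // The engine validates the layout; the component validates itself.
        // Here the component refuses to run unless it has exactly one input.
        if (ComponentMetaData.InputCollection.Count != 1)
        {
            bool cancel;
            ComponentMetaData.FireError(0, ComponentMetaData.Name,
                "The component requires exactly one input.", "", 0, out cancel);
            return DTSValidationStatus.VS_ISCORRUPT;
        }

        return base.Validate();
    }
}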
Data Flow Execution
Given the many possible
layouts, how does the data flow engine know how to execute the flow?
Before execution begins, the data flow task first analyzes the layout to
construct execution trees.
Execution Trees
Each tree describes a
segment of the layout for which only one buffer is needed. In general,
the segment described by a tree has the following pattern: The segment
starts at an asynchronous output, such as the output of a source
adapter, where rows are added to the buffer. The segment then continues
along the path from that output, taking in any synchronous components
along the way. Finally, the segment ends at the first input that isn’t
synchronous with any output, such as the input of a destination adapter,
where the rows added to the buffer by the output that started the
segment are consumed.
More Info
Along
with destination adapters, asynchronous components such as the
Aggregate end execution trees with their inputs. However, they also
start new trees with their asynchronous outputs.
Execution Plans
Once all the trees are
constructed, they are used to generate an execution plan. This plan
determines how the data buffers are queued and passed to the components,
and how the data flow engine’s worker threads are allocated across the
layout. The execution trees and the plan can be logged via the Logging
dialog box in the designer (as shown in Figure 4); analyzing this output
helps with debugging and with improving performance.
Execution Phases
As the data flow is
executed, components are given the chance to respond to different phases
of execution. This gives the data flow engine and components a chance
to perform any additional processing needed for the execution. Here, in
order, are the execution phases.
PrepareForExecute The data flow engine and components can perform any one-time preparation needed before execution starts.
PreExecute As with PrepareForExecute, the data flow engine and components can perform any one-time preparation needed before execution starts. PreExecute occurs closer to the main execution of the data flow, so you should use PreExecute instead of PrepareForExecute
for any work that requires acquiring and holding onto an expensive or
shared resource, to minimize the time the resource is held. Also, any connections needed by components are acquired after PrepareForExecute and before PreExecute, so any preparation that requires connections should occur in PreExecute instead of PrepareForExecute.
PrimeOutput Source adapters and asynchronous transforms are primed with data buffers via PrimeOutput.
However, they handle the buffers differently. Source adapters
immediately start adding data to those buffers. As those buffers become
full, the data flow engine pushes them downstream to be consumed by
other components. Asynchronous transforms do not immediately add data to
their buffers. Instead, they hold onto the buffers and end their PrimeOutput
phase. Because asynchronous components, like the Aggregate, add output
data based on their input data, they must wait until they get enough
input data to start adding rows to their cached buffers.
ProcessInput Destination adapters and transforms are asked to process their input data via ProcessInput. For asynchronous components, whose output buffers are cached in PrimeOutput, data is added to those buffers when enough input data has been processed. (A sketch of this pattern appears after this list.)
PostExecute This phase counterbalances PreExecute and is used for any post-execution processing, usually releasing any resources acquired in PreExecute. Connections that were made before PreExecute are also released after PostExecute.
Cleanup This phase counterbalances PrepareForExecute
and is used for any necessary cleanup. Once the execution phases are
complete, the data flow task signals to the Integration Services runtime
that it has either succeeded or failed in its execution.
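To illustrate the PrimeOutput and ProcessInput phases described in the list above, here is a minimal sketch of a hypothetical asynchronous transform built on the managed PipelineComponent base class. It assumes one asynchronous output whose single integer column sits at buffer index 0 (in a real component that index would be resolved in PreExecute). The component caches its output buffer in PrimeOutput and adds a row only after it has seen its entire input.

using Microsoft.SqlServer.Dts.Pipeline;

public class RowCountAsyncTransform : PipelineComponent
{
    private PipelineBuffer outputBuffer;   // cached during PrimeOutput
    private int rowCount;

    // The engine primes the asynchronous output with its own buffer here.
    // Unlike a source adapter, the transform cannot fill it yet, so it
    // simply holds onto the buffer and returns.
    public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
    {
        outputBuffer = buffers[0];
    }

    // Input rows arrive here. Only when the whole input set has been seen
    // (EndOfRowset) does the component add output rows to its cached buffer.
    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        while (buffer.NextRow())
        {
            rowCount++;
        }

        if (buffer.EndOfRowset)
        {
            outputBuffer.AddRow();
            outputBuffer.SetInt32(0, rowCount);   // column index 0 assumed
            outputBuffer.SetEndOfRowset();
        }
    }
}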
Viewing Data in the Flow During Execution
While the data flow
task executes in the designer, you can see the data passing from one
component to another by using data viewers, which is great for debugging
if you need to see how the components transform the data. To do this,
double-click on a path between the two components where you want to view
the data, select Data Viewers, and then click Add. Four types of data
viewers can be added to display the data in different forms, with the
grid format being the most traditional because it displays the data in
tabular form.
One thing to keep in
mind when using the data viewers is that when they access the data flow
buffers to display the data, they actually pause the data flow. To allow
the flow to continue, you must click the Play button on the viewer.
This means if the row in the data flow that you are trying to view is
near the end of the flow, you might have to click through a lot of
pauses before seeing the row. One workaround, although not perfect, is to
fast-forward through the data flow and pause again closer to the row you
want. You can do this by clicking the Detach button, which detaches the
viewer and lets rows pass through without pausing. When enough rows have
gone through, click the Attach button to attach the viewer again. Figure 5 shows the data viewer in action.
Handling Component Errors Within the Data Flow
Some components support error outputs, which are represented by red outputs in the designer. The term error output,
however, is misleading because what constitutes an error is
component-specific. In other words, these are outputs the component uses
to push rows that it could not successfully process into the data flow.
For example, the Lookup component might
consider a failed lookup an error and then route that row to its error
output. Component errors in Integration Services are divided into two
groups: truncation errors and non-truncation errors. (An example of a
truncation error is when some string data used by the component is too
long and needs to be trimmed.) These two categories were created because
in many business scenarios, truncations are either not considered
errors or must be handled differently from other errors. Most components
that support error outputs also have the option to fail on an error or
truncation, ignore the error or truncation, or route the row with the
error or truncation to their error outputs. You can set this option by
modifying the component’s error or truncation disposition on the column
where the error or truncation can occur. Figure 6 shows what a data flow can look like when error outputs are connected.
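Programmatically, this choice surfaces as row dispositions on the columns in a component’s metadata. The sketch below is a hypothetical helper, assuming you already have hold of the component’s IDTSComponentMetaData100 (for example, from a layout built as in the earlier sketch) and that its first input column supports both dispositions.

using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

static class ErrorDispositionSketch
{
    public static void RedirectErrorsForFirstColumn(IDTSComponentMetaData100 component)
    {
        IDTSInputColumn100 column =
            component.InputCollection[0].InputColumnCollection[0];

        // Route rows whose value fails processing to the error output...
        column.ErrorRowDisposition = DTSRowDisposition.RD_RedirectRow;

        // ...but treat truncations as acceptable and keep those rows in the flow.
        column.TruncationRowDisposition = DTSRowDisposition.RD_IgnoreFailure;

        // The remaining option, RD_FailComponent, stops execution on an error.
    }
}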