Traditional BLOB Strategies
Prior to SQL Server 2008, there were two traditional solutions for combining structured table data with unstructured BLOB
data: either keep BLOBs in the database (with all your structured table
data), or store them outside the database (in the file system). In the
former case, BLOBs are stored right inside the database. In the latter
case, the database merely holds references (or, possibly, references
are derived from other values in the database) that point to locations
in the file system where the BLOBs actually live. Each of these
strategies has pros and cons with respect to storage, manageability,
performance, and programming complexity that we’ll discuss—but neither
of them are intrinsically native to the core database engine.
You can, of course, simply store BLOB data directly in the columns of your database tables. You do this by declaring a column as a varbinary(max) data type, which will allow it to store a single BLOB up to 2 gigabytes (GB) in size.
Important
You should no longer use the image data type that was used to store BLOBs prior to Microsoft SQL Server 2005. The varbinary(max) data type should be used instead of image, which has been deprecated and may be removed in a future version of SQL Server.
When BLOB
data is stored directly inside of tables this way, it is very tightly
integrated with the database. It’s easy to access BLOB data, because
the BLOB is right there in the table’s column. Because everything is
contained in a single database unit, management is also simplified.
Backup, restore, detach, copy, and attach operations on the database
include structured and BLOB data together. Transactional consistency is
another important benefit that you enjoy with this approach. Because
BLOB data is a physical part of the tables in the database, it is
eligible to participate in transactions. If you begin a transaction,
update some data, and then roll back the transaction, any BLOB data
that was updated is also rolled back. Overall, the mixture of
structured and BLOB data is handled very seamlessly with this model.
Despite
all these advantages, however, physically storing BLOBs in the database
is practical only for small-scale applications having very few/small
BLOBs. Because BLOB content is stored in-line with structured data, you
can severely impair scalability by bloating your filegroups. Query
performance will degrade rapidly as a result, because the query
processor must sift through much larger amounts of data in your tables
that are consumed with BLOB content. The BLOBs also don’t stream nearly
as efficiently when backed by SQL Server varbinary(max) columns as they would if they were held externally in the file system or on a dedicated BLOB store. Finally, there is also a 2 GB limit on the varbinary(max) data type.
Tip
If
you have modest BLOB requirements (that is, you are dealing with very
few or very small BLOBs) you should store them in the database using
the varbinary(max) data
type instead of using the file system (either directly, or via
FILESTREAM). Furthermore, you should consider caching small, frequently
accessed BLOBs rather than repeatedly retrieving them from the database
or the file system.
To counter these concerns, you can store BLOBs outside the
database and in the file system instead. With this approach, the
structured data in your relational tables merely contains path
information to the unstructured BLOB data, which is held externally as
ordinary files in the file system (alternatively, path information can
be derived from other structured data in a row). Applications use this
path information as a reference for locating and tracking the BLOB
content associated with rows in the database tables. Because they are
physically held in the file
system, a BLOB’s size is limited only by the host file system and
available disk space. This approach also delivers much better streaming
performance, because the file system is a native environment that’s
highly optimized for streaming. And because the physical database is
much smaller without the BLOBs inside it, the query processor can
continue to deliver optimal performance.
Although
physically separating structured and unstructured content this way does
address the performance concerns of BLOBs, it also raises new issues
because the data is now separated not only physically, but logically as
well. SQL Server has absolutely no awareness of the association between
data in the database and files stored externally in the file system. Their coupling exists solely at the application level. Backup,
restore, detach, copy, and attach operations on the database files
therefore include only structured table data without any of the BLOB
data that’s in the file system. You won’t get complete backups, unless
you back up the file system as well, so now it’s another administrative
burden to separately manage the file system.
Application
development against this model is also more complex because of the
extra effort required for linking between the database and the file
system. It’s up to the developer to establish and maintain the
references between structured data and external BLOBs on their own, and
according to their own custom design. Last, although perhaps most
significant, there is no unified transactional control across both the
database and the file system. Naturally, rolling back a database
transaction won’t undo changes you’ve made in the file system.
Introducing FILESTREAM
Both of the traditional BLOB solutions present tough challenges, so what do you do? With FILESTREAM,
SQL Server offers a way out of this conundrum. First, make sure you
understand that FILESTREAM is technically not a SQL Server data type.
Rather, it is implemented as an attribute that you apply to the varbinary(max)
data type—the same data type traditionally used to store BLOBs directly
inside structured tables. However, merely applying this attribute tells
SQL Server to store the BLOB in the file system rather than the table’s
structured filegroup. With the FILESTREAM attribute applied, you continue to treat the varbinary(max) column as though
its contents were stored in-line with your structured table data. Under
the covers, however, SQL Server stores and maintains the data in the
server’s local NTFS file system, separately from the structured content
of the database that remains in the normal filegroups.
With FILESTREAM, structured and unstructured data are logically connected while physically separated.
The unstructured data is configured as a special filegroup in the
database, so it’s actually considered part of the database—it is
available in all logical database operations, including queries,
transactions, and backup/restore. On disk, however, the BLOBs are
stored as individual physical files in the NTFS file system that are
created and managed automatically behind the scenes. SQL Server
establishes and maintains the link references between the structured
file groups and the file system. It knows about the unstructured BLOB
data in the file system and considers the files holding BLOB data to be
an integral part of the overall database. But the unstructured data
does not impede query performance because it is not physically stored
in-line with table data. It’s stored in the file system, which is a
native BLOB environment (and where it ostensibly belongs). Logically,
however, the database encompasses both the relational tables and the
individual BLOB files in the file
system. You therefore continue to treat BLOB data as though you were
storing it in-line, from both a development and an administrative
perspective.
Tip
Backing
up the database includes all the BLOB data from the file system in the
backup automatically. However, because the BLOB data is contained in
its own database filegroup, you can easily exclude it from backups if
desired or as needed.
The end result is that SQL Server
uses the appropriate mechanism for structured and unstructured
data—storing relational (structured) data in tables and BLOB
(unstructured) data in ordinary files—so it can deliver the best
possible performance all around. Because it does this completely
transparently, you enjoy integrated management benefits over the
database. You also enjoy simplified application development as you are
no longer burdened with the additional complexities of manually
associating the database with the file system and keeping the two in
sync. Last, by leveraging the transactional capabilities of the NTFS
file system, BLOB updates participate seamlessly with database
transactions. If you’re starting to get excited by all this, that’s the
idea! You’re now ready to dive in to some real code that puts FILESTREAM to work for you.