3.1 SQL Server and the Windows I/O Subsystem
Microsoft SQL Server is an application
that utilizes the Windows I/O subsystem. Rather than covering the
minutia of how SQL Server reads and writes from the NTFS file system,
we are going to explore the specific Windows I/O systems that will
report errors to the Windows event logs. This should aid you in
troubleshooting many storage errors.
The storage system components, shown in Figure 8,
report errors to the Windows system Event Log. SQL Server reports
errors to the Windows application log. You can either directly check
the Event Logs or use System Center to scrape the Event Logs for
The Volume Manager driver (ftdisk.sys) creates a
new I/O request packet that is passed to the Partition Manager driver
(partmgr.sys). Once the request packet is created, the Volume Manager
passes this packet to the Disk Class driver (disk.sys). Ftdisk.sys
continues to monitor for successful delivery of the request packet. If
problems are detected, then ftdisk.sys reports the errors to the system
Event Log. These ftdisk errors usually represent very serious storage
At this point the Disk Class driver passes the
storage request to either the Multipath System driver (mpio.sys) or the
Port driver (storport.sys). Multipath I/O is a Microsoft technology
that is utilized in storage area networks (SANs). Vendors can create an
MPIO device-specific module (DSM) driver that details how the Multipath
driver should load balance I/O across different storage pathways.
Microsoft offers a generic DSM that provides limited failover
capabilities. Non-SAN technologies do not use MPIO.
The HBA is the physical piece of hardware that
interfaces with disk drives and other storage devices. HBA
manufacturers create a miniport driver that interfaces with
storport.sys. Most HBA drivers will independently report communication
errors to the application Event Log.
Ideally, this entire chain of events takes no
longer than 20 milliseconds. Performance is governed by a myriad of
factors, the most important of which is latency. Both ftdisk.sys and
SQL Server time each I/O. If the round-trip duration exceeds 15 seconds
(for SQL Server or 1 minute for ftdisk.sys), then errors are reported
to the SQL logs and the Windows application Event Log. As you hopefully
noticed, a normal operation is measured in milliseconds, so one second
is an eternity.
3.2 Choosing the Right Storage Networks
The storage network can facilitate the sharing of
storage resources. Direct attached storage offers good performance for
a relatively low cost, but DAS storage can orphan performance and
capacity. Imagine several applications that grow in capacity at
different rates or are used at different times. Consolidated storage
that is attached to a SAN network enables users to share both storage
capacity and available performance.
Complex storage networks are often built using Fibre Channel (FC) technology.
spelling of “Fibre” is no accident. The creators of FC wanted to
differentiate the technology from fiber optic technologies that did not
support copper cabling at the time.
FC differs from most server network protocols in
that it is not routed. Routing enables the creation of large and
resilient networks, but routed networks require a lot of overhead to
If you are familiar with Fibre Channel you may
already be aware of routing solutions for it.
Because it is not routed, FC defines a standard
for both direct and switched storage network connections. Modern FC
networks utilize high-speed network switches to communicate.
Storage networks are not limited to Fibre
Channel. Several protocols define methods for sending storage data over
existing server IP networks. Fibre Channel Internet Protocol (FCIP)
allows Fibre Channel data frames to be encapsulated within an IP
packet. Internet Small Computer Systems Interface (iSCSI) allows the
transmission of SCSI data over IP networks.
FCIP and iSCSI transport different layers of the
storage network. Fibre Channel frames are analogous to Ethernet data
frames. SCSI is a storage control system comparable to Internet
Protocol. Transmission Control Protocol is an Internetworking protocol
and therefore has no analogue in storage networking. Emerging
technologies such as Fibre Channel Over Ethernet (FCOE) combine the
attributes of existing Fibre Channel networks with Ethernet routed
Regardless of the specific network technology
that is used to transport storage traffic, keep in mind that bandwidth
is not infinite. Excessive storage traffic not only negatively impacts
the performance of a single system, it can hamper all connected
components. Many applications must meet minimum performance
requirements spelled out in service-level agreements (SLAs). Storage network performance is critical to overall application performance.
3.3 Block-Based Storage vs. File-Based Storage
The operating system, in this case
Windows, uses NTFS to create a structure that enables it to use one or
more blocks to store files. When a server accesses a physical disk
directly, it is called block-based access. When data is accessed over a
server network, such as TCP/IP, it is called file data. Devices that
provide file access are called network-attached storage (NAS).
Disk drives store data in blocks. Each block
contains 512 bytes of data (some storage arrays use 520-byte blocks —
the extra 8 bits define a checksum used to guarantee data integrity).
Disk drive data blocks are individually numbered by the disk firmware in a scheme using what are called logical block numbers (LBNs).
best practices recommend that NTFS partitions used for data and log
files be formatted using 64K allocation clusters. This setting
maximizes efficiency by minimizing wasted space. NTFS reserves space
for the MFT$ based on the partition’s size. Each file needs a 1KB
allocation record in the MFT$. Because only a few data and log files
are stored on a partition, and 64KB clusters align with 64KB data
extents used by SQL Server, it makes perfect sense to minimize the size
of an MFT$. If the partition is used for storing many smaller files
then it should be formatted using the standard 4KB cluster size.
Running out of allocation units will result in a fragmented MFT$, thus
significantly harming file access performance on that partition.
Starting with SQL Server 2008 R2, storage
administrators have the option of using Server Message Block (SMB)
networks to access data files. Technet offers a detailed overview of
the advantages of SMB here: http://technet.microsoft.com/en-us/library/ff625695(WS.10).aspx.
SQL Server 2012 supports SMB version 3.0 which
offers improved performance over earlier versions. For more information
on configuring SQL Server 2012 with SMB 3.0 visit: http://msdn.microsoft.com/en-us/library/hh759341.aspx.
Setting up an SMB network enables you to connect to your file over a UNC path (\\server_name\ share).
This access can greatly simplify the setup of network-based storage,
although you should use caution and specifically check to ensure that
your particular system is supported, as NAS devices often are not
supported for use in this configuration.
Contrast SMB with the use of an iSCSI network.
iSCSI is a protocol used for accessing block data over a server
network. It requires the use of initiator software on the host server
and a compatible iSCSI storage target.
Both SMB and iSCSI utilize a server network to
communicate. You must ensure that the server network is low latency and
has the bandwidth available to handle the demands that will be placed
on it by either technology. Most Fibre Channel networks are dedicated
to handling only storage traffic.
If you utilize a server network to transport
block or file SQL Server traffic, it may need to be dedicated to
transferring only the storage traffic. In lieu of dedicated networks,
consider implementing Quality of Service (QoS) that puts a higher
priority on storage traffic over normal network packets.
Keep in mind that no technology provides a magic
bullet. Even robust networks can be filled with traffic. Storage
transfers are extremely sensitive to delay.
3.4 Shared Storage Arrays
Shared array controllers are primarily
responsible for logically grouping disk drives. Sharing the storage
controller enables the creation of extremely large volumes that are
protected against failure. In addition to the normal features of direct
attached storage controllers, a shared array controller provides both
storage performance and capacity.
Shared array controllers, often called SAN
controllers, offer more advanced features than direct attached systems.
The feature sets are divided into three categories:
- Efficient capacity utilization
- Storage tiering
- Data replication
Before diving into the features of SAN
arrays, however, it would be helpful to look at some of the language
that storage administrators use to describe their systems.
3.5 Capacity Optimization
It has been our experience that most
information technology professionals are not very good at predicting
the future. When asked how much performance and space they anticipate
needing over the next three to five years, administrators do their best
to come up with an accurate answer, but unfortunately real life often
belies this estimate.
Meet Bob. Bob forecasted that his new OLTP
application would start at 10GB and grow at 10GB per year over the next
five years. Just to be on the safe side, Bob asked for 100GB in
direct-attached storage. Bob’s new widget sells like hotcakes and his
database grows at 10GB per month. Seven months in, Bob realizes that he
is in trouble. He asks his storage administrators for another 500GB in
space to cover the next five years of growth.
Unfortunately, the storage and server
administrators inform Bob that other users have consumed all the space
in his data center co-location. The information technology group is
working diligently to expand the space, but it will be six months
before they can clear enough space to accommodate his storage. Bob
notes that never again will he come up short on storage and go through
the pain of expanding his system.
Moving forward Bob always asks for 10 times his
original estimate. In his next venture Bob finds that his database will
also grow at 10GB per year over 5 years but this time Bob, having
“learned his lesson” asks for 10GB a month. Unfortunately Bob’s actual
storage requirement was closer to 5GB per year.
Bob has unwittingly become his own worst enemy.
When Bob needs storage for his second application there isn’t any
storage available because Bob is simultaneously holding on to unused
storage for his first application. He has underprovisioned his storage
requirements for his second application while massively
overprovisioning his first.
Bob’s example is not unique. IT shops the world
over consistently overprovision storage. Imagine the implications; over
the life of the storage and server, companies purchase a significant
amount of excess storage that requires powering and servicing. To
combat this wasteful use of space, several storage array manufacturers
now sell a technology called thin provisioning.
Thin provisioning uses the concept of
just-in-time storage allocation within storage pools, whereby many
physical disk drives are amalgamated into one large pool. Appropriate
RAID protection is applied to the disk drives within the pool. Many
volumes can be created from each pool. Synthetic or virtual volumes are
presented to the host servers.
When a volume is created as a thin device it
allocates only a minimum amount of storage. From the perspective of the
operating system, the volume is a certain set size, but the actual data
usage within the thin pool closely matches the size of written data. As
new data is written to the volume, the storage array allocates more
physical storage to the device. This enables the storage administrator
to directly regulate the amount of storage that is used within the
Because over-forecasting is no longer
under-utilizing space, the database administrator can focus on easing
operational complexity, versus trying to optimally forecast storage.
Creating a single data file within a file group and later adding files
while maintaining performance is an extremely painful operation. If a
database is built without planning for growth and instead is
concatenated over time by adding files, then data access is not uniform.
One possible growth-planning solution is to
create several data files on the same volume. If that volume becomes
full, the original data files can be moved. This data file movement
will require downtime, but it is preferable to reloading the entire
dataset. When utilizing storage pools it is possible to create large
thin volumes that may never be fully utilized. This is possible because
the storage systems are provisioning storage only as it is needed. Many
SAN array controllers also facilitate online volume growth.
Unfortunately, many DBAs subvert the concept of
thin provisioning by fully allocating their database at creation time.
Most database administrators realize that growing a data file can be a
painful operation, so they often allocate all the space they will ever
need when the database is created. Unfortunately, for our thin pool,
SQL Server allocates data and log files by writing zeros to the entire
If the Windows Server is set to use instant file
initialization, the file will be created in a thin-pool-friendly way.
New storage will be allocated in the pool only as the data actually
The DBA can also ensure that the file is thin-pool-friendly by creating
data and log files that are only as large or slightly larger than the
data the file will contain.
files are created using zeroes for security purposes. Physically
writing zeroes to new space helps to ensure that previous data is not
accidently exposed to the new host. Check your security requirements
prior to enabling instant file initialization.
If the data file has already been created as a
large file filled with zeros, then a feature called Zero Page Reclaim
can be used on the array to reclaim the unused space. Running Zero Page
Reclaim allows the array to return the zero space to the available
storage pool so it can be allocated to other applications.
Deleting data from within a database or even
deleting files from a volume will not return free space to the thin
storage pool. In the case of reclaiming deleted file space, most
storage vendors offer a host-side tool that checks the NTFS Master File
Table and reallocates space from deleted space. If you decided to
delete space from within a SQL Server data or log file you need to run
the DBCC SHRINKFILE command to
first make the file smaller, and then run the host-side storage
reclamation tool to return space to a given thin pool.
Unfortunately, thin storage pools have a dirty
little secret. In order to optimize storage use in a world where
storage forecasting is an inexact science, it is necessary to
overprovision the thin pools. This means that storage teams must
closely monitor the growth rate at which new storage is being used.
Microsoft Windows makes a write request against a thin volume that no
longer has free space, it will blue screen the server. To prevent this,
storage administrators have a few options: Run a storage array tool
like Zero Page Reclaim to gather unused space, add more physical
storage to the thin pool, or migrate the volume to a different array
with more space. It is critical to catch growth issues before they
become a serious problem.