Blobs by themselves are quite dumb and boring to anyone not
interested in the cloud. A blob is a binary large
object. It is an arbitrary piece of data of any format, size, or
content: an image, a text file, a .zip file, a video, or just about any
sequence of ones and zeros.
Note: The name “blob” was coined by Jim Starkey at DEC. He resisted pressure from various
quarters (especially marketing) to rename the concept to something
“friendlier.” He claims the expansion “binary large object” was invented
much later because marketing found “blob” unprofessional. Visit http://www.cvalde.net/misc/blob_true_history.htm for the
full story.
Blobs become interesting when you look at how they’re used.
For example, the data that makes up YouTube’s videos (the actual bits on
disk) isn’t interesting by itself, but the funny videos those bits make up can be
valuable. A good blob/data storage mechanism must be dumb and basic, and
leave all the smartness to the application.
The Windows Azure blob service allows you to store nearly unlimited
amounts of data for indefinite periods of time. All data stored in the
service is replicated multiple times (to protect against hardware
failure), and the service is designed to be highly scalable and available.
As of this writing, the internals of how this blob service works haven’t
been made public, but the principles this service follows are similar to
the distributed systems that came before it. The best part is that you can
pay as you go, and pay only for the data you have up in the cloud. Instead
of having to sign a check for storage you may or may not use, you pay
Microsoft only for what you used during the preceding billing
cycle.
1. Using Blobs
Blobs are an important part of Windows Azure because they are
just so useful. Unlike hosted services (where you have to write code)
or tables and queues (which are part of a bigger application and
again involve some code), blobs can be used for a lot of day-to-day
computer tasks. Here’s a short sample list:
Performing backup/archival/storage in the cloud
Hosting images or other website content
Hosting videos
Handling arbitrary storage needs
Hosting blogs or static websites
The list could go on, but you get the idea. So, when should you
think of taking the blob storage plunge? The answer to that question
depends on a few candidate scenarios you should keep in mind.
1.1. Filesystem replacement
Any large data that doesn’t have a schema and doesn’t require
querying is a great candidate for blob storage. In other words, any
data for which you use a filesystem today (not a database) can be
moved easily to blob storage. In fact, you’ll find that several
filesystem concepts map almost 1:1 to blobs, and there are several
tools to help you move from one to the other.
Most organizations or services have an NFS/SMB share that stores
some unstructured data. Typically, databases don’t play well with
large unstructured data in them, so developers code up a scheme in
which the database contains pointers to actual physical files lying on
a share somewhere. Now, instead of having to maintain that filesystem,
you can switch to using blob storage.
1.2. Heavily accessed data
Large data that is sent to users without modification is
another good candidate for blob storage. Since Windows
Azure blob storage is built to be highly scalable, you can be sure
that a sudden influx of users won’t affect access.
1.3. Backup server
Although you may be unable to leave your current
filesystem/storage system, you can still find a way to get some good
use out of blob storage. Everyone needs a backup service. For example,
home users often burn DVDs, and corporations often ship tapes off to a
remote site to facilitate some sort of backup. Having a cheap and
effective way to store backups is surprisingly tricky. Blob storage
fits nicely here. Instead of using tape backups, you could store a
copy of your data in blob storage (or several copies, if you’re
paranoid).
The fact that cloud storage is great for backups isn’t lost on
the multiple product vendors making backup software. Several backup
applications now use either Amazon S3 or Windows Azure. For example,
Live Mesh from Microsoft uses blob storage internally to store and
synchronize data from users’ machines.
1.4. File-share in the cloud
Employees at Microsoft often must share files, documents, or code
with other employees. SharePoint is one alternative, and that is often
utilized. But the easiest way to share some debug logs or some build
binaries is to create a share on your local machine and open access to
it. Since most Microsoft employees have reasonably powerful machines,
you don’t have to worry about several employees accessing the data at
the same time and slowing down your machine. Since file shares can be
accessed with a UNC name, you can email links around easily.
Blob storage is similar. Want to throw up something on the Web
and give other people access to it? Don’t know how many users will
wind up accessing it, but just want something that stays up? Blob
storage is great for that. Windows Azure blob storage is a great means
to throw up arbitrary data and give it a permanent home on the
Internet, be it large public datasets, or just a funny video.
2. Pricing
In short, Windows Azure charges $0.15 per gigabyte stored per month,
$0.01 for every 10,000 storage transactions, and $0.10 per gigabyte for
ingress and $0.15 per gigabyte for egress bandwidth.
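To make these rates concrete, here is a minimal sketch of a monthly bill estimate. The rates are the ones quoted above; the function name, its parameters, and the reading of the bandwidth rates as per-gigabyte charges are assumptions made for illustration.

```python
# Rates quoted in the text, in US dollars.
STORAGE_PER_GB_MONTH = 0.15
PER_10K_TRANSACTIONS = 0.01
INGRESS_PER_GB = 0.10   # assumed to be charged per gigabyte transferred in
EGRESS_PER_GB = 0.15    # assumed to be charged per gigabyte transferred out


def monthly_bill(stored_gb, transactions, ingress_gb, egress_gb):
    """Estimate one month's storage bill in dollars (hypothetical helper)."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    requests = (transactions / 10_000) * PER_10K_TRANSACTIONS
    bandwidth = ingress_gb * INGRESS_PER_GB + egress_gb * EGRESS_PER_GB
    return round(storage + requests + bandwidth, 2)


# 100 GB stored, 1,000,000 transactions, 10 GB uploaded, 50 GB downloaded:
# $15.00 storage + $1.00 transactions + $1.00 ingress + $7.50 egress
print(monthly_bill(100, 1_000_000, 10, 50))  # 24.5
```

Notice that in this example the egress bandwidth costs half as much as the storage itself, which is why heavily downloaded public content deserves attention when you estimate costs.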
Though this pricing model applies equally across all of Windows
Azure’s storage services, blobs
have some interesting properties. Most importantly, blobs are the only
service for which anonymous requests over public HTTP are allowed (if
you choose to make your container public). Both queues and tables (which
you’ll learn more about in the next few chapters) must have requests
authenticated at all times. Of course, without anonymous requests, a lot
of the things that people use blobs for—hosting images, static websites,
and arbitrary files, and exposing them to any HTTP client—won’t work
anymore.
However, given that anonymous requests get billed (both for
transaction cost and for bandwidth cost), a malicious user (or a
sudden surge in visitors) could leave you with a huge storage
bill. There is really no good way to protect against
this, and this issue generally exists with almost all cloud
services.
3. Data Model
The data model for the Windows Azure blob service is quite
simple, and a lot of the flexibility stems from this simplicity. There
are essentially three kinds of “things” in the system:
Blob
Container
Storage account
Figure 1 shows
the relationship between the three.
3.1. Blob
In Windows Azure, a blob is any piece of data. Importantly, blobs
have a key or a name with which they are referred to. You might think
of blobs as files. Despite the fact that there are places where that
analogy breaks down, it is still a useful rule of thumb. Blobs can
have metadata associated with them in the form of <name,value> pairs,
up to 8 KB in size.
Blobs come in two flavors: block blobs and
page blobs. Let’s look at block blobs
first.
Block blobs can be split into chunks known as
blocks, which can then be uploaded separately.
The typical usage for blocks is for streaming and resuming uploads.
If a multigigabyte transfer is interrupted, you don’t have to restart
it from the beginning; you can just resume from the next block. You can
also upload blocks in parallel, and have the server assemble a blob out
of them. Block blobs are perfect for streaming upload
scenarios.
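The idea of resuming from the next block can be sketched without touching the real service. In the following minimal simulation, FakeBlobServer, put_block, commit, and upload are hypothetical names invented for illustration; they are not the Windows Azure API, and the block size is arbitrary.

```python
def split_into_blocks(data, block_size):
    """Split data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class FakeBlobServer:
    """Hypothetical in-memory stand-in for the blob service."""

    def __init__(self):
        self.uploaded = {}  # block index -> block bytes

    def put_block(self, index, block):
        self.uploaded[index] = block

    def commit(self, block_count):
        """Assemble the blob from the uploaded blocks, in index order."""
        return b"".join(self.uploaded[i] for i in range(block_count))


def upload(server, blocks, fail_at=None):
    """Upload only the blocks the server lacks; fail_at simulates a dropped connection."""
    for index, block in enumerate(blocks):
        if index in server.uploaded:
            continue  # already sent during an earlier attempt
        if index == fail_at:
            raise ConnectionError(f"connection dropped at block {index}")
        server.put_block(index, block)


data = b"0123456789" * 10            # 100 bytes of sample data
blocks = split_into_blocks(data, 32)  # four blocks: 32 + 32 + 32 + 4 bytes
server = FakeBlobServer()

try:
    upload(server, blocks, fail_at=2)  # first attempt dies after two blocks
except ConnectionError:
    pass

upload(server, blocks)                 # second attempt skips blocks 0 and 1
assert server.commit(len(blocks)) == data
```

Because each block is identified independently of the others, nothing in this scheme requires blocks to arrive in order, which is also what makes parallel uploads possible.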
The second flavor is page blobs. Page blobs are split into an array of pages. Each page
can be addressed individually, not unlike sectors on a hard disk. Page
blobs are targeted at random read/write scenarios and provide the
backing store for Windows Azure XDrive. You’ll see more about them
later.
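To see why the sector analogy fits, consider this minimal in-memory sketch of an array of individually addressable pages. The PageBlob class and the pages_touched helper are hypothetical, and the 512-byte page size is an assumption made for illustration; the text above doesn’t specify one.

```python
PAGE_SIZE = 512  # assumed page size in bytes; the text does not give one


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return (first_page, last_page) covering bytes [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return first, last


class PageBlob:
    """Hypothetical in-memory page blob: a fixed-size array of pages."""

    def __init__(self, page_count, page_size=PAGE_SIZE):
        self.page_count = page_count
        self.page_size = page_size
        self.data = bytearray(page_count * page_size)

    def _start(self, index):
        if not 0 <= index < self.page_count:
            raise IndexError("page index out of range")
        return index * self.page_size

    def write_page(self, index, page):
        """Overwrite exactly one page in place, leaving all others untouched."""
        if len(page) != self.page_size:
            raise ValueError("a write must cover exactly one page")
        start = self._start(index)
        self.data[start:start + self.page_size] = page

    def read_page(self, index):
        start = self._start(index)
        return bytes(self.data[start:start + self.page_size])


blob = PageBlob(page_count=4)
blob.write_page(2, b"\x01" * PAGE_SIZE)  # random write to the third page
print(pages_touched(1000, 100))          # (1, 2)
```

A 100-byte write at offset 1000 spans bytes 1000 through 1099, so it touches pages 1 and 2 only; the rest of the blob is never read or rewritten.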
3.2. Container
Blobs are stored in things called containers, which you can think of as
partitions or root directories. Containers exist only to store a
collection of blobs, and can have only a little metadata (8 KB)
associated with them.
Apart from containing blobs, containers serve one other important purpose: they
control sharing policy. You can make containers either
public or private, and all
blobs underneath the container will inherit that setting. When a
container is public, anyone can read data from that container over
public HTTP. When a container is private, only authenticated API
requests can read from the container. Regardless of whether a
container is public or private, any creation, modification, or
deletion operations must be authenticated.
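These sharing rules boil down to a small decision function. The following is a minimal sketch, not the service’s actual implementation: the function name and operation strings are hypothetical, and an “authenticated” request is assumed to be one signed with the storage account’s key.

```python
def is_allowed(operation, container_is_public, authenticated):
    """Decide whether a request may proceed under the rules described above.

    operation is one of "read", "create", "modify", or "delete".
    """
    if authenticated:
        return True  # assumed: a request signed with the account key may do anything
    # Anonymous requests may only read, and only from a public container.
    return operation == "read" and container_is_public


print(is_allowed("read", container_is_public=True, authenticated=False))    # True
print(is_allowed("read", container_is_public=False, authenticated=False))   # False
print(is_allowed("delete", container_is_public=True, authenticated=False))  # False
```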
One authentication option that isn’t covered in this chapter is
preauthorized URIs or “signed URLs.” This refers to the ability to
create a special URL that allows users to perform a particular
operation on a blob or a container for a specified period of time.
This is useful in scenarios where you can’t put the storage access key
in your code—for example, client-side web apps. See http://msdn.microsoft.com/en-us/library/ee395415.aspx
for details on how to use this.
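The general idea behind a signed URL can be illustrated with a keyed hash. The sketch below is a generic HMAC scheme written for illustration only; it is not the signature format that Windows Azure uses, and the function names, query parameter names, and example URL are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(secret, resource, permission, expires_at):
    """HMAC-SHA256 over the resource, the permission, and the expiry time."""
    message = f"{resource}\n{permission}\n{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(resource, secret, permission, expires_at):
    """Return a URL granting `permission` on `resource` until `expires_at`."""
    query = urlencode({
        "perm": permission,
        "expires": expires_at,
        "sig": _signature(secret, resource, permission, expires_at),
    })
    return f"{resource}?{query}"


def verify_url(url, secret, now):
    """Check that the URL is unexpired and its signature matches its contents."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        permission = params["perm"][0]
        expires_at = int(params["expires"][0])
        supplied = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    resource = f"{parts.scheme}://{parts.netloc}{parts.path}"
    expected = _signature(secret, resource, permission, expires_at)
    return hmac.compare_digest(expected, supplied)


secret = b"storage-access-key"  # stays on the server, never in client code
url = sign_url("http://example.invalid/photos/cat.jpg", secret, "r", expires_at=2000)
assert verify_url(url, secret, now=1000)      # valid before the expiry time
assert not verify_url(url, secret, now=3000)  # rejected once expired
```

The client receives only the finished URL. Since the signature covers the permission and the expiry time, changing either one in the query string invalidates the URL.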
Note: For users familiar with Amazon’s S3, containers are not the
same as S3’s buckets. Containers are not a global resource; they
“belong” to a single account. This means you can create containers
in your code’s critical path, and the storage system will be able to
keep up. Deletion of a container is also instant. One “gotcha” to
keep in mind is that re-creation of a container just deleted might
be delayed. The reason is that the storage service must delete all
the blobs that were in the container (or, more accurately, put the
garbage-collection process in motion), and this takes a small amount
of processing time.
3.3. Storage account
You can think of storage accounts as the
“drives” where you place containers. A storage account can have any
number of containers—as of this writing, there is no limit on the
number of containers any storage account can have.
The containers also inherit the parent storage account’s
geolocation setting. If you specify that the storage account should be
in the South Central United States, all containers under the account
will show up in the same location, and by the same transitive
relationship, all blobs under those containers will show up in the
South Central United States as well.
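The three-level hierarchy and the way settings flow down it can be modeled in a few lines. This is a minimal sketch of the data model as described in this section; the class and method names are hypothetical and do not correspond to any real client library.

```python
class StorageAccount:
    """Top of the hierarchy: holds containers and the geolocation setting."""

    def __init__(self, name, location):
        self.name = name
        self.location = location
        self.containers = {}

    def create_container(self, name, public=False):
        container = Container(name, account=self, public=public)
        self.containers[name] = container
        return container


class Container:
    """Holds blobs; controls sharing policy; inherits location from its account."""

    def __init__(self, name, account, public=False):
        self.name = name
        self.account = account
        self.public = public
        self.blobs = {}

    @property
    def location(self):
        return self.account.location

    def put_blob(self, name, data, metadata=None):
        blob = Blob(name, container=self, data=data, metadata=metadata or {})
        self.blobs[name] = blob
        return blob


class Blob:
    """A named piece of data; inherits sharing and location from its container."""

    def __init__(self, name, container, data, metadata):
        self.name = name
        self.container = container
        self.data = data
        self.metadata = metadata

    @property
    def public(self):
        return self.container.public

    @property
    def location(self):
        return self.container.location


account = StorageAccount("myaccount", location="South Central United States")
photos = account.create_container("photos", public=True)
cat = photos.put_blob("cat.jpg", b"...", metadata={"author": "me"})
print(cat.location)  # South Central United States
print(cat.public)    # True
```

Neither the blob nor the container stores a location of its own. Both look it up through their parent, which is the transitive relationship described above.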