SQL Azure: Building a Shard (part 3) - Reading Using the Shard & Caching

1/31/2011 5:26:29 PM

3. Reading Using the Shard

Now that you've reviewed how connections strings are handled in the application and the library, you need to know how the shard handles a SELECT operation against multiple databases. In its simplest form, the library executes the SELECT operation against the list of connections defined previously.

The client application calls ExecuteShardQuery, which in turn loops over the list of SqlConnection objects (see Figure 4 ). If you look at the code, you see that a copy of each connection object is made first; this is to avoid any potential collisions if the client code makes a call to this method multiple times (a connection can only make one call at a time). Then, for each connection, the code calls ExecuteSingleQuery, which is the method in the shard library that makes the call to the database.

Figure 4. Reading from the shard

The ExecuteShardQuery method is designed to call the ExecuteSingleQuery method multiple times in parallel using the TPL. The TPL provides many useful methods to easily handle parallel processing without having to manage threads. The shard library uses Parallel.ForEach, which is a construct that allows the code to execute an inline method concurrently, and automatically adjusts the number of threads depending on your computer's hardware configuration. So, the more CPUs you have, the faster the following code executes, if you have enough connections to loop through. Note, however, that you need to lock the data object (line 5), which is a DataTable, because it could be accessed by other executing threads. Finally, the Merge method on the DataTable object concatenates resultsets from the various connections. After the loop has executed, the resulting data object has the list of records retrieved from the shard, in no guaranteed order:

1) Parallel.ForEach(connections,
2)    delegate(SqlConnection c)

3)    {
4)    DataTable dt = ExecuteSingleQuery(command, c, exceptions);
5)    lock (data)
6)       data.Merge(dt, true, MissingSchemaAction.Add);
7)    }
8) );

The following code is a simplified version of the actual sample application. (For clarity, some code that calculates execution time and performs exception handling has been removed.) Line 4 sets the command text to be executed, such as a SELECT statement, and line 5 executes it against the shard. Instead of calling ExecuteReader, the code calls ExecuteShardQuery to use the shard. Line 7 binds the resulting DataTable and displays the records returned by the shard:

1) SqlCommand cmd = new SqlCommand();
2) DataTable dataRes = new DataTable();
3)
4) cmd.CommandText = this.textBox1.Text;
5) dataRes = cmd.ExecuteShardQuery();
6)
7) dataGridView2.DataSource = dataRes;

Figure 5 shows the outcome of this code. The SELECT statement is designed to return database object names and types. Executing the statement against the shard performs as expected. However, notice that an extra column has been added to the display: __guidDB__. This is the name of the GUID column introduced previously. This column doesn't help much for reading, but it enables updates and deletes, as you see later.

Figure 5. Showing records returned by the shard

The GUID provided is unique for each database connection, as long as one of the key parameters is different in the connection string of each connection. It's added by the ExecuteSingleQuery method described previously. Within this method, a column is added in front of all the others, which carries the GUID. In the following code extract, line 3 creates the data column of type string, and line 4 sets its default value to the connection's GUID. Line 7 fills the data table with the query's result, along with the added GUID column. The following is the logic used to add this GUID:

1) // Add the connection GUID to this set of records
2) // This helps us identify which row came from which connection
3) DataColumn col = dt.Columns.Add(_GUID_, typeof(string));
4) col.DefaultValue = connection.ConnectionGuid();
5)
6) // Get the data
7) da.Fill(dt);

4. Caching

To minimize roundtrips to the source databases, the shard library provides an optional caching mechanism. The caching technique used in this library offers basic capabilities and can be extended to address more complex scenarios. The objective of this library is to cache the entire DataTable of each database backend whenever requested. Figure 6 shows the logical decision tree of the caching approach. It's important to note that this library calculates a cache key based on each parameter, the parameter value, each SQL statement, and the database's GUID.

Figure 6. Caching logic

The effect of the cache is visible when you connect to SQL Azure. Considering that connecting to a SQL Azure database takes up to 250 milliseconds the first time, memory access is significantly faster. The importance of the cache increases as the number of records increases and the number of databases increases in the shard.

The cache provided by this library also provides a time to live (TTL) mechanism that implements an absolute expiration or a sliding expiration scheme. An absolute expiration resets the cache automatically at a specific time in the future, whereas the sliding setting moves the expiration time if the cache items are used before expiring. The following code shows how the caching is implemented. Line 1 creates a CacheItemPolicy used to define the behavior of the cache. Line 3 implements the sliding window cache, and line 5 implements the absolute approach:

1) CacheItemPolicy cip = new CacheItemPolicy();
2) if (UseSlidingWindow)
3)    cip.SlidingExpiration = defaultTTL;
4) else
5)    cip.AbsoluteExpiration =
   new DateTimeOffset(System.DateTime.Now.Add(defaultTTL));
6) MemoryCache.Default.Add(cacheKey, dt, cip);

You can enhance this caching technique in multiple ways. For example, the DataTable object stored in the cache can be compressed when it contains many rows. Compression algorithms tend to increase latency, but the overall performance benefits may be worth a slight delay.

Another way to enhance this caching storage is to create different cache containers, so you can control at a more granular level which container holds which kind of data. Doing so lets you control a different setting per container, for example; or you may decide to always compress one cache container but not another.

Finally, the cache provided in this library isn't distributed; it's local to the machine running the library. If you need to develop a more robust cache, consider looking into the Windows Server AppFabric; its caching technology provides enterprise-level capabilities.

NOTE

For more information about the Windows Server AppFabric, visit http://msdn.microsoft.com/appfabric.