Building a highly available and scalable system that
performs well is no trivial task.
Achieving performance objectives
There are many factors
that determine how long a particular operation takes. The choice of the
algorithm and data structures that are used to implement it will be a
major factor, so choosing the most appropriate ones for the problem at
hand is important.
However, when
building a distributed system, another important factor we need to
consider is network latency. The duration of every operation is the sum
of the time it takes to perform the operation, and the time it takes for
the request to reach the application and for the response to reach the
client.
In some environments, latency is
so low that it can often be ignored. For example, accessing properties
of an object within the same process is performed at in-memory speed
(nanoseconds), and therefore the latency is not a concern. However, as
soon as you start making calls across machine boundaries, the laws of
physics come into the picture.
Dealing with latency
Very often developers write
applications as if there is no latency. To make things even worse, they
test them in an environment where latency is minimal, such as their
local machine or a high-speed network in the same building.
When they deploy the
application in a remote datacenter, they are often surprised that
the application is much slower than they expected. They
shouldn't be: they should have counted on the fact that the latency was
going to increase, and should have taken measures to minimize its impact
on application performance early on.
To illustrate the effect
latency can have on performance, let's assume that we have an operation
whose actual execution time is 20 milliseconds. The following table
shows the impact of latency on such an operation, depending on where the
server performing it is located. All the measurements in the table were
taken from my house in Tampa, Florida.
| Location | Execution time (ms) | Average latency (ms) | Total time (ms) | Latency (% of total time) |
|---|---|---|---|---|
| Local host | 20 | 0.067 | 20.067 | 0.3% |
| VM running on the local host | 20 | 0.335 | 20.335 | 1.6% |
| Server on the same LAN | 20 | 0.924 | 20.924 | 4.4% |
| Server in Tampa, FL, US | 20 | 21.378 | 41.378 | 51.7% |
| Server in Sunnyvale, CA, US | 20 | 53.130 | 73.130 | 72.7% |
| Server in London, UK | 20 | 126.005 | 146.005 | 86.3% |
| Server in Moscow, Russia | 20 | 181.855 | 201.855 | 90.1% |
| Server in Tokyo, Japan | 20 | 225.684 | 245.684 | 91.9% |
| Server in Sydney, Australia | 20 | 264.869 | 284.869 | 93.0% |
As you can see from the
previous table, the impact of latency is minimal on the local host, or
even when accessing another host on the same network. However, as soon
as you move the server out of the building, it becomes significant. When
the server is halfway around the globe, it is the latency that pretty
much determines how long an operation will take.
Of course, as the execution
time of the operation itself increases, latency as a percentage of the
total time will decrease. However, I have intentionally chosen 20
milliseconds for this example, because many operations that web
applications typically perform complete in 20 milliseconds or less. For
example, on my development box, retrieving a single row from a MySQL
database using EclipseLink and rendering the retrieved object using a FreeMarker template takes 18 milliseconds on average, according to the YourKit Profiler.
On the other hand, even if your
page takes 700 milliseconds to render and your server is in Sydney,
your users in Florida could still have a sub-second response time, as
long as they are able to retrieve the page in a single request.
Unfortunately, it is highly unlikely that one request will be enough.
Even the extremely simple Google front page requires four HTTP requests,
and most non-trivial pages require 15 to 20, or even more. Each image,
external CSS style sheet, or JavaScript file that your page references,
will add latency and turn your sub-second response time into 5 seconds
or more.
Each database query,
each call to a remote service, and each Coherence cache access will
incur some latency. Although it might be only a millisecond or less for
each individual call, it quickly gets compounded by the sheer number of
calls.
With Coherence for example,
the actual time it takes to insert 1,000 small objects into the cache is
less than 50 milliseconds. However, the elapsed wall-clock time from the
client's perspective is more than a second. Guess where that extra
millisecond per insert is spent.
This is the reason why you
will often hear advice such as "make your remote services coarse
grained" or "batch multiple operations together". As a matter of fact,
batching 1,000 objects from the previous example, and inserting them all
into the cache in one call brings total operation duration, as measured
from the client, down to 90 milliseconds!
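To put that in code, here is a minimal sketch of both approaches using the Coherence NamedCache API (the cache name "test" and the string values are just placeholders):

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import java.util.HashMap;
import java.util.Map;

public class BatchPutExample {
    public static void main(String[] args) {
        NamedCache cache = CacheFactory.getCache("test"); // placeholder cache name

        // Naive approach: 1,000 network round trips, so latency is paid 1,000 times
        for (int i = 0; i < 1000; i++) {
            cache.put(Integer.valueOf(i), "value-" + i);
        }

        // Batched approach: a single putAll() call, so latency is paid
        // (roughly) once instead of 1,000 times
        Map batch = new HashMap();
        for (int i = 0; i < 1000; i++) {
            batch.put(Integer.valueOf(i), "value-" + i);
        }
        cache.putAll(batch);
    }
}
```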
Minimizing bandwidth usage
In general, bandwidth is
less of an issue than latency, because it is subject to Moore's Law.
While the speed of light, the determining factor of latency, has
remained constant over the years and will likely remain constant for the
foreseeable future, network bandwidth has increased significantly and
continues to do so.
However, that doesn't mean
that we can ignore it. As anyone who has ever tried to browse the Web
over a slow dial-up link can confirm, whether the images on the web page
are 72 or 600 DPI makes a big difference in the overall user
experience.
So, if we learned to
optimize the images in order to improve the bandwidth utilization in
front of the web server, why do we so casually waste the bandwidth
behind it? There are two things that I see time and time again:
- The application retrieving a lot of data from a database, performing some simple processing on it, and storing it back in the database.
- The application retrieving significantly more data than it really needs. For example, I've seen large object graphs loaded from the database using multiple queries in order to populate a simple drop-down box.
The first scenario above is an
example of a situation where moving the processing instead of the data
makes much more sense, whether your data is in a database or in
Coherence (although in the former case doing so might have a negative
impact on scalability, and you might actually decide to sacrifice
performance in order to allow the application to scale).
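In Coherence, the idiomatic way to move processing to the data is an entry processor, which executes on the cluster member that owns each entry, so only the processor itself crosses the wire. The following is a minimal sketch; the Product class, its price accessors, and the "products" cache name are hypothetical:

```java
import com.tangosol.util.InvocableMap;
import com.tangosol.util.processor.AbstractProcessor;

// Applies a discount to each matching entry on the node that owns it,
// so the data never has to travel to the client and back.
public class ApplyDiscount extends AbstractProcessor {
    private final double percentage;

    public ApplyDiscount(double percentage) {
        this.percentage = percentage;
    }

    public Object process(InvocableMap.Entry entry) {
        Product product = (Product) entry.getValue();  // hypothetical class
        product.setPrice(product.getPrice() * (1 - percentage));
        entry.setValue(product);
        return null;
    }
}
```

Given a NamedCache called products, invoking it across the grid is then a single call: `products.invokeAll(new EqualsFilter("getCategory", "books"), new ApplyDiscount(0.1));`.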
The second scenario is
typically a consequence of the fact that we try to reuse the same
objects we use elsewhere in the application, even when it makes no sense
to do so. If all you need is an identifier and a description, it
probably makes sense to load only those two attributes from the data
store and move them across the wire.
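For example, assuming a JPA setup like the EclipseLink one mentioned earlier, a constructor expression can fetch just those two columns into a purpose-built DTO (the Product entity, the ProductLabel class, and the com.example package are hypothetical):

```java
package com.example;

import java.util.List;
import javax.persistence.EntityManager;

// Hypothetical two-field DTO: exactly what the drop-down needs, nothing more
public class ProductLabel {
    private final Long id;
    private final String description;

    public ProductLabel(Long id, String description) {
        this.id = id;
        this.description = description;
    }

    public Long getId() { return id; }
    public String getDescription() { return description; }
}

// A constructor expression selects only two columns, instead of loading
// the whole Product entity (and whatever it would lazily drag along)
class ProductLookupDao {
    private final EntityManager em;

    ProductLookupDao(EntityManager em) {
        this.em = em;
    }

    List findLabels() {
        return em.createQuery(
            "select new com.example.ProductLabel(p.id, p.description) " +
            "from Product p order by p.description")
            .getResultList();
    }
}
```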
In any case, keeping an eye
on how network bandwidth is used both on the frontend and on the
backend is another thing that you, as an architect, should be doing
habitually if you care about performance.
Coherence and performance
Coherence has powerful features that directly address the problems of latency and bandwidth.
First of all, by caching data
in the application tier, Coherence allows you to avoid disk I/O on the
database server and transformation of retrieved tabular data into
objects. In addition to that, Coherence also allows you to cache
recently used data in-process using its near caching feature, thus
eliminating the latency associated with a network call that would be
required to retrieve a piece of data from a distributed cache.
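For example, assuming the "products" cache is backed by a near scheme in the cache configuration file, repeated reads of the same entry never leave the process:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class NearCacheExample {
    public static void main(String[] args) {
        // Assumes "products" maps to a near-scheme in the cache configuration
        NamedCache products = CacheFactory.getCache("products");

        Object key = Integer.valueOf(42);

        // First read: a network call to the storage-enabled member
        Object first = products.get(key);

        // Subsequent reads: served from the in-process front map at
        // in-memory speed, with no network latency
        Object second = products.get(key);
    }
}
```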
Another Coherence
feature that can significantly improve performance is its ability to
execute tasks in parallel, across the data grid, and to move processing
where the data is, which will not only decrease latency, but preserve
network bandwidth as well.
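As a sketch of what that looks like, a built-in parallel aggregator such as DoubleSum executes on each storage member against its local data, and only the partial results travel back over the network (the "products" cache and the getPrice accessor are hypothetical):

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.aggregator.DoubleSum;
import com.tangosol.util.filter.AlwaysFilter;

public class GridAggregationExample {
    public static void main(String[] args) {
        NamedCache products = CacheFactory.getCache("products");

        // Each storage member sums the prices of the entries it owns, in
        // parallel; only the partial sums cross the network.
        Double total = (Double) products.aggregate(
            AlwaysFilter.INSTANCE, new DoubleSum("getPrice"));

        System.out.println("Total inventory value: " + total);
    }
}
```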
Leveraging these
features is important. It will be much easier to scale the application
if it performs well; you simply won't have to scale as much.
Achieving high availability
The last thing we need to talk about is availability.
At the most basic level, in order to make the application highly
available we need to remove all single points of failure from the
architecture. In order to do that, we need to treat every single
component as unreliable and assume that it will fail sooner or later.
It is important to realize
that the availability of the system as a whole can be defined as the
product of the availability of its tightly coupled components:
As = A1 * A2 * ... * An
For example, if we have a web
server, an application server, and a database server, each of which is
available 99% of the time, the expected availability of the system as a
whole is only 97%:
0.99 * 0.99 * 0.99 = 0.970299 = 97%
This reflects the fact that if
any of the three components should fail, the system as a whole will fail
as well. By the way, if you think that 97% availability is not too bad,
consider this: 97% availability implies that the system will be out of
commission 11 days every year, or almost one day a month!
We can do two things to improve the situation:

- Increase the availability of each individual component
- Decouple the components, so that a failure of one doesn't bring the others down

The latter is typically achieved by introducing asynchrony into the
system. For example, you can use messaging to decouple a credit card
processing component from the main application flow: this will allow you
to accept new orders even if the credit card processor is temporarily
unavailable.
Coherence
is able to queue updates for a failed database and write them
asynchronously when the database becomes available. This is another good
example of using asynchrony to provide high availability.
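Under the hood, this is Coherence's write-behind caching: the cache is updated synchronously, while a CacheStore implementation persists the changes asynchronously after a configured delay. A minimal sketch, assuming a hypothetical ProductDao for the actual database work (the write-behind delay itself is set in the cache configuration, not in code):

```java
import com.tangosol.net.cache.AbstractCacheStore;

// With write-behind configured, Coherence calls store() asynchronously
// after the configured write-delay, and requeues updates for a database
// that is temporarily down.
public class ProductCacheStore extends AbstractCacheStore {
    private final ProductDao dao = new ProductDao(); // hypothetical DAO

    public Object load(Object key) {
        return dao.findById(key);
    }

    public void store(Object key, Object value) {
        dao.save((Product) value); // hypothetical Product class
    }

    public void erase(Object key) {
        dao.delete(key);
    }
}
```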
Although the asynchronous
operations are a great way to improve both availability and scalability
of the application, as well as perceived performance, there is a limit
to the number of tasks that can be performed asynchronously in a typical
application. If the customer wants to see product information, you will
have to retrieve the product from the data store, render the page, and
send the response to the client synchronously.
To make synchronous operations highly available our only option is to make each component redundant.
Adding redundancy to the system
In order to explain how
redundancy helps improve availability of a single component, we need to
introduce another obligatory formula or two:
F = F1 * F2 * ... * Fn
Where F is the
likelihood of failure of a redundant set of components as a whole, and
F1 through Fn are the likelihoods of failure of individual components,
which can be expressed as:
Fc = 1 - Ac
Going back to our previous example, if the availability of a single server is 99%, the likelihood it will fail is 1%:
Fc = 1 - 0.99 = 0.01
If we make each layer in our
architecture redundant by adding another server to it, we can calculate
new availability for each component and the system as a whole:
Ac = 1 - (0.01 * 0.01) = 1 - 0.0001 = 0.9999 = 99.99%
As = 0.9999 * 0.9999 * 0.9999 = 0.9997 = 99.97%
Basically, by adding
redundancy to each layer, we have reduced the application's downtime
from 11 days to approximately two and a half hours per year, which is
not nearly as bad.
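The arithmetic is easy to reproduce; here is a small sketch using the figures from this example:

```java
public class AvailabilityMath {
    // Availability of n components in series (all must be up)
    static double series(double availability, int n) {
        return Math.pow(availability, n);
    }

    // Availability of n redundant components (at least one must be up)
    static double redundant(double availability, int n) {
        return 1 - Math.pow(1 - availability, n);
    }

    public static void main(String[] args) {
        double single = 0.99;

        // Three tiers, one server per tier: ~97%, or ~11 days of downtime a year
        double noRedundancy = series(single, 3);

        // Same three tiers, each doubled: ~99.97%, or ~2.5 hours a year
        double withRedundancy = series(redundant(single, 2), 3);

        System.out.printf("No redundancy:   %.4f (%.1f days down/year)%n",
                noRedundancy, (1 - noRedundancy) * 365);
        System.out.printf("With redundancy: %.4f (%.1f hours down/year)%n",
                withRedundancy, (1 - withRedundancy) * 365 * 24);
    }
}
```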
Redundancy is not enough
Making components redundant
is only the first step on the road to high availability. To get to the
finish line, we also need to ensure that the system has enough capacity
to handle the failure under the peak load.
Developers often assume that if
an application uses scale-out architecture for the application tier and
a clustered database for persistence, it is automatically highly
available. Unfortunately, this is not the case.
If you determine during
load testing that you need N servers to handle the peak load, and you
would like the system to remain operational even if X servers fail at
the same time, you need to provision the system with N+X servers.
Otherwise, if the failure occurs during the peak period, the remaining
servers will not be able to handle the incoming requests and either or
both of the following will happen:
- The response time will increase significantly, making performance unacceptable
- Some users will receive "503 Service Unavailable" errors from the web server
In either case, the application is essentially not available to the users.
To illustrate this, let's
assume that we need five servers to handle the peak load. If we
provision the system with only five servers and one of them fails, the
system as a whole will fail. Essentially, by not provisioning excess
capacity to allow for failure, we are turning "application will fail if
all 5 servers fail" into "application will fail if any of the 5 servers
fail". The difference is huge-in the former scenario, assuming 99%
availability of individual servers, system availability is almost 100%.
However, in the latter it is only 95%, which translates to more than 18
days of downtime per year.
Coherence and availability
Oracle Coherence provides
an excellent foundation for highly available architecture. It is
designed for availability; it assumes that any node can fail at any
point in time and guards against the negative impact of such failures.
This is achieved by
data redundancy within the cluster. Every object stored in the Coherence
cache is automatically backed up to another node in the cluster. If a
node fails, the backup copy is simply promoted to the primary and a new
backup copy is created on another node.
This implies that updating an
object in the cluster has some overhead in order to guarantee data
consistency. The cache update is not considered successful until the
backup copy is safely stored. However, unlike clustered databases that
essentially lock the whole cluster to perform write operations, the cost
of write operations with Coherence is constant regardless of the
cluster size. This allows for exceptional scalability of both read and
write operations and very high throughput.
However, as we
discussed in the previous section, sizing Coherence clusters properly is
extremely important. If the system is running at full capacity, failure
of a single node could have a ripple effect. It would cause other nodes
to run out of memory as they tried to fail over the data managed by the
failed node, which would eventually bring the whole cluster down.
It is also important to
understand that, although your Coherence cluster is highly available,
that doesn't automatically make your application as a whole highly
available. You need to identify and remove the remaining single
points of failure by making sure that your hardware devices such as load
balancers, routers, and network switches are redundant, and that your
database server is redundant as well. The good news is that if you use
Coherence to scale the data tier and reduce the load on the database,
making the database redundant will likely be much easier and cheaper.
As a side note, while there
are many stories that can be used as a testament to Coherence's
availability, including one in which the database server went down over
the weekend without the application users noticing anything, my favorite
is an anecdote about a recent exchange between the Coherence support
team and a long-time Coherence user.
This particular customer has
been using Coherence for almost 5 years. When a bug was discovered that
affected the particular release they were using, the support team sent
the patch and the installation instructions to the customer. They
received a polite reply: