4. State-directed workers
Sometimes the code that a
worker role runs is large and complex, and this can lead to a long and
risky processing time. In this section, we’ll look at a strategy you can
use to break this large piece down into manageable pieces, and a way to
gain flexibility in your processing.
As we’ve said time and time
again, worker roles tend to be message-centric. The best way to scale
them is by having a group of instances take turns consuming messages
from a queue. As the load on the queue increases, you can easily add
more instances of the worker role. As the queue cools off, you can
destroy some instances.
In this section, we’ll
look at why large worker roles can be problematic, how we can fix this
problem, and what the inevitable drawbacks are. Let’s start by looking
at the pitfalls of using a few, very large workers.
The Problem
Sometimes the work that’s
needed on a message is large and complicated, which leads to a heavy,
bloated worker. This heaviness also leads to a brittle codebase that’s
difficult to work with and maintain because of the various code paths
and routing logic.
A worker that takes a long time
to process a single request is harder to scale and can’t process as many
messages as a group of smaller workers. A long-running unit of work
also exposes your system to more risk. The longer an item takes to be
processed, the more likely it is that the work will fail and have to be
started over. This is no big deal if the processing takes 3 seconds, but
if it takes 20 minutes or 20 hours, you have a significant cost of failure.
This problem can be
caused by one message being very complex to process, or by a batch of
messages being processed as a group. In either case, the unit of work
being performed is large, and this raises risk. This problem is often
called the “pig in a python” problem (as shown in figure 3), because you end up with one large chunk of work moving through your systems.
We need a way to digest this work a little more gracefully.
The Solution
The best way to digest this
large pig is to break the large unit of work into a set of smaller
processes. This will give you the most flexibility when it comes to
scaling and managing your system. But you want to be careful that you
don’t break the processes down into pieces that are too small. At that level, the latency of chatty communication with the queue and other storage mechanisms may introduce more overhead than the processing saves.
When you analyze the stages
of processing on the message, you’ll likely conceive of several stages
to the work. You can figure this out by drawing a flow diagram of the
current bloated worker code. For example, when processing an order from
an e-commerce site, you might have the following stages:
1. Validate the data in the order.
2. Validate the pricing and discount codes.
3. Enrich the order with all of the relevant customer data.
4. Validate the shipping address.
5. Validate the payment information.
6. Verify that the products are in stock and able to be shipped.
7. Enter the shipping orders into the logistics system for the distribution center.
8. Record the transaction in the ERP system.
9. Send a notification email to the customer.
You can think of each state the message goes through as a separate worker role, with a queue connecting each state to the next. Instead of one worker doing all of the work for a single order, each worker processes only one of the states for each order. The different queues represent the different states the message can be in. Figure 4 compares a big worker that performs all of the work to a series of smaller workers that break the work out (validating, shipping, and notifying workers).
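To make this concrete, here’s a minimal sketch of one of those small state workers, written against the Azure service runtime and storage client libraries. The queue names (validate-orders and shipping-orders), the StorageConnectionString setting, and the ValidateOrder helper are illustrative assumptions, not code from any particular order system.

```csharp
// A hypothetical state worker: it owns a single queue, does only its own
// state's share of the work, and hands the ticket to the next state's queue.
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public class ValidationWorker : RoleEntryPoint
{
    public override void Run()
    {
        var account = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
        var queues = account.CreateCloudQueueClient();
        var inQueue = queues.GetQueueReference("validate-orders");   // this state's queue
        var outQueue = queues.GetQueueReference("shipping-orders");  // the next state's queue

        while (true)
        {
            var message = inQueue.GetMessage();
            if (message == null) { Thread.Sleep(1000); continue; }   // nothing to do yet

            ValidateOrder(message.AsString);                         // this state's work only

            outQueue.AddMessage(new CloudQueueMessage(message.AsString)); // pass the ticket along
            inQueue.DeleteMessage(message);                          // retire it from this queue
        }
    }

    private void ValidateOrder(string ticket)
    {
        // validation logic for this state goes here
    }
}
```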
There might also be some other
processing states you want to plan for. Perhaps one for really bad
orders that need to be looked at by a real human, or perhaps you have
platinum-level customers who get their orders processed and shipped
before normal run-of-the-mill customers. The platinum orders would go
into a queue that’s processed by a dedicated pool of instances.
You could even have a bad order
routed to an Azure table. A customer service representative could then
access that data with a CRM application or a simple InfoPath form, fix the order, and resubmit it to the proper queue to continue processing. This process is called repair and resubmit, and it’s an important element to have in any enterprise processing engine.
You won’t be able to put the
full order details into the queue message—there won’t be enough room.
The message should contain a complete work ticket, representing where
the order data can be found (perhaps via an order ID), as well as some
state information, and any information that would be useful in routing
the message through the state machine. This might include the service
class of the customer, for example—platinum versus silver.
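As a rough illustration, the work ticket might be a small class like the one below; the property names are assumptions chosen to match the routing information just described.

```csharp
// A hypothetical work ticket. The queue message carries only a pointer to the
// order plus routing state, never the full order details.
public class OrderWorkTicket
{
    public string OrderId { get; set; }       // where the full order data lives (for example, a table or database key)
    public string CurrentState { get; set; }  // the state this order is in, such as "ValidatePayment"
    public string ServiceClass { get; set; }  // "Platinum" versus "Silver", useful for routing
    public int AttemptCount { get; set; }     // how many times processing has been tried, handy for repair and resubmit
}
```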
As the business changes
over time, and it will, making changes to how the order is processed is
much easier than trying to perform heart surgery on your older, super
complicated, and bloated worker role code. They don’t say spaghetti code
for nothing. For example, you might need to add a new step between steps
8 and 9 in our previous list. You could simply create a new queue and a
new worker role to process that queue. Then the worker role for the
state right before the new one would need to be updated to point to the
new queue. Hopefully the changes to the existing parts of the system can
be limited to configuration changes.
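One way to keep those changes in configuration is to have each state worker read the name of its downstream queue from the service configuration, along the lines of the sketch below. The NextStateQueue and StorageConnectionString setting names are assumptions made for the example.

```csharp
// Hypothetical helper: forward a ticket to whatever queue the configuration
// names as the next state. Inserting a new step upstream then becomes a
// configuration change rather than a code change.
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public static class NextStateRouter
{
    public static void Forward(string ticketJson)
    {
        var account = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
        var nextQueueName = RoleEnvironment.GetConfigurationSettingValue("NextStateQueue");

        var nextQueue = account.CreateCloudQueueClient().GetQueueReference(nextQueueName);
        nextQueue.AddMessage(new CloudQueueMessage(ticketJson));
    }
}
```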
Even Cooler—Make the State Worker Role its Own Azure Service
How you want to manage your
application in the cloud should be a primary consideration in how you
structure the Visual Studio solution. Each solution becomes a single
management point. If you want to manage different pieces without
affecting the whole system, those should be split out into separate
solutions.
In this scenario, it
would make sense to separate each state worker role to its own service
in Azure, which would further decouple them from each other. This way,
when you need to restart one worker role and its queue, you won’t affect
the other roles.
In a more dynamic organization,
you might need to route a message through these states based on some
information that’s only available at runtime. The routing information
could be stored in a table, with rules for how the flow works, or by
simply storing the states and their relationships in the cloud service
configuration file. Both of these approaches would let you update how
orders were processed at runtime without having to change code. We’ve
done this when orders needed different stages depending on what was in
the order, or where it was going. In one case, if a controlled substance
was in the order, the processing engine had to execute a series of
additional steps to complete the order.
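Here’s a small sketch of that kind of runtime routing, with an in-memory map standing in for the routing table or configuration file; all of the state names are made up for the example.

```csharp
// Hypothetical routing rules: the state graph lives in data, so an order that
// contains a controlled substance picks up an extra step without a code change.
using System.Collections.Generic;

public static class OrderRouter
{
    private static readonly Dictionary<string, string> DefaultNextState =
        new Dictionary<string, string>
        {
            { "ValidateOrder",             "ValidatePayment" },
            { "VerifyControlledSubstance", "ValidatePayment" }, // the detour rejoins the normal flow
            { "ValidatePayment",           "EnterShipping" },
            { "EnterShipping",             "NotifyCustomer" }
        };

    public static string NextState(string currentState, bool containsControlledSubstance)
    {
        // Orders with a controlled substance take a detour for extra verification.
        if (containsControlledSubstance && currentState == "ValidateOrder")
            return "VerifyControlledSubstance";

        return DefaultNextState[currentState];
    }
}
```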
This approach is often called a poor man’s service bus
because it uses a simple way of connecting the states together, and those connections are fairly fixed at runtime. If you require a greater degree of
flexibility in the workflow, you would want to look at the Itinerary
pattern. This lets the system build up a schedule of processing stops
based on the information present at runtime. These systems can get a
little more complicated, but they result in a system that’s more easily
maintained when there’s a complex business process.
Oops, it’s Not Nirvana
As
you build this out, you’ll discover a drawback. You now have many more
running worker roles to manage. This can increase costs, and you still have to plan for the day you eventually swallow a pig. If your system is tuned for a slow work day, with one role instance per state, and you suddenly receive a flood of orders, that large batch of orders will move down the state diagram like a pig being digested by a python. This forces you to scale up the number of worker instances at each state.
Although this flexibility is
great, it can get expensive. With this model, you have several pools of
instances instead of one general-purpose pool, and each pool has to grow and then shrink as the pig (the large flood of work) moves through the pipeline. This can lead to a stall in the state machine, because each state has to wait for more instances to be added to its pool before it can handle the pig. Adding those instances is easy with the service management APIs, but spinning instances up and down takes time, perhaps 20 minutes.
The next step to take, to
avoid the pig in a python problem, is to build your worker roles so that
they’re generic processors, all able to process any state in the
system. You would still keep the separate queues, which makes it easier
to know how many messages are in each state.
You could also condense the
queues down to one, with each message declaring what state the order is
in as part of its data, but we don’t like this approach because it leads
to favoritism for the most recent orders placed in the processors, and
it requires you to restart all of your generic workers when you change
the state graph. You can avoid this particular pitfall by driving the
routing logic with configuration and dependency injection. Then you
would only need to update the configuration of the system and deploy a
new assembly to change the behavior of the system.
The trick to gaining both
flexibility and simplicity in your architecture is to encapsulate the
logic for each state in the worker, separating it so it’s easily
maintainable, while pulling them all together so there’s only one pool
of workers. The worker, in essence, becomes a router. You can see how
this might work in figure 5. Each message is routed, based on its state and other runtime data, to the necessary state processor.
This functions much like a factory. Each state would have a class that
knows how to process that state. Each state class would implement the
same interface, perhaps IOrderProcessStage.
This would make it easy for the worker to instantiate the correct class
based on the state, and then process it. Most of these classes would
then send the message back to the generic queue, with a new state, and
the cycle would start again.
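Here’s a rough sketch of that router, reusing the OrderWorkTicket shape from earlier. The IOrderProcessStage interface follows the idea described above, but the stage registry and class names are assumptions.

```csharp
// Hypothetical generic worker acting as a router/factory: it looks up the class
// that knows how to handle the ticket's current state, runs it, and re-enqueues
// the ticket with its new state.
using System.Collections.Generic;

public interface IOrderProcessStage
{
    // Processes the order for this state and returns the next state,
    // or null when the order is finished.
    string Process(OrderWorkTicket ticket);
}

public class ValidateOrderStage : IOrderProcessStage
{
    public string Process(OrderWorkTicket ticket)
    {
        // validation logic for this state goes here
        return "ValidatePayment";
    }
}

public class GenericOrderWorker
{
    // One entry per state; this map could just as easily be built from
    // configuration and dependency injection.
    private readonly Dictionary<string, IOrderProcessStage> stages =
        new Dictionary<string, IOrderProcessStage>
        {
            { "ValidateOrder", new ValidateOrderStage() }
        };

    public void Handle(OrderWorkTicket ticket)
    {
        var stage = stages[ticket.CurrentState];
        var nextState = stage.Process(ticket);

        if (nextState != null)
        {
            ticket.CurrentState = nextState;
            // re-enqueue the ticket on the shared queue so the cycle starts again
        }
    }
}
```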
There are going to be times when you’re working with both web and worker roles and you’re either importing legacy code that needs access to a local drive, or doing work that otherwise requires one. That’s why we’ll discuss local storage next.