Windows Azure : Common uses for worker roles (part 2) - State-directed workers

2/15/2011 9:09:51 AM

4. State-directed workers

Sometimes the code that a worker role runs is large and complex, and this can lead to a long and risky processing time. In this section, we’ll look at a strategy you can use to break this large piece down into manageable pieces, and a way to gain flexibility in your processing.

As we’ve said time and time again, worker roles tend to be message-centric. The best way to scale them is by having a group of instances take turns consuming messages from a queue. As the load on the queue increases, you can easily add more instances of the worker role. As the queue cools off, you can destroy some instances.
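To make the competing-consumers idea concrete, here is a minimal, language-agnostic sketch in Python of the core worker loop. The `InMemoryQueue` class and its method names are stand-ins invented for illustration; the real Azure storage client has a different API, but the poll/process/delete shape is the same.

```python
import collections
import time

class InMemoryQueue:
    """Toy stand-in for a cloud queue (hypothetical API; the real Azure
    storage client differs): get_message returns None when empty."""
    def __init__(self):
        self._items = collections.deque()

    def add_message(self, msg):
        self._items.append(msg)

    def get_message(self):
        return self._items.popleft() if self._items else None

    def delete_message(self, msg):
        pass  # popleft already removed it here; real queues need an explicit delete

def run_worker(queue, process, stop_when_empty=False, poll_interval=1.0):
    """Core worker-role loop: poll, process, delete. Several instances
    running this loop against one shared queue scale out naturally,
    because they compete for messages."""
    while True:
        msg = queue.get_message()
        if msg is None:
            if stop_when_empty:
                return
            time.sleep(poll_interval)  # back off while the queue is cool
            continue
        process(msg)                   # the unit of work
        queue.delete_message(msg)      # remove only after successful processing
```

Because each instance runs the same loop against the same queue, adding or removing instances changes throughput without any coordination between workers.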

In this section, we’ll look at why large worker roles can be problematic, how we can fix this problem, and what the inevitable drawbacks are. Let’s start by looking at the pitfalls of using a few, very large workers.

The Problem

Sometimes the work that’s needed on a message is large and complicated, which leads to a heavy, bloated worker. This heaviness also leads to a brittle codebase that’s difficult to work with and maintain because of the various code paths and routing logic.

A worker that takes a long time to process a single request is harder to scale and can’t process as many messages as a group of smaller workers. A long-running unit of work also exposes your system to more risk. The longer an item takes to be processed, the more likely it is that the work will fail and have to be started over. This is no big deal if the processing takes 3 seconds, but if it takes 20 minutes or 20 hours, you have a significant cost to failure.

This problem can be caused by one message being very complex to process, or by a batch of messages being processed as a group. In either case, the unit of work being performed is large, and this raises risk. This problem is often called the “pig in a python” problem (as shown in figure 3), because you end up with one large chunk of work moving through your systems.

Figure 3. The “pig in a python” problem can often be seen in technology and business. It’s when a unit of work takes a long time to complete, like when a python eats a pig. It can take months for the snake to digest the pig, and it can’t do much of anything else during that timeframe.

We need a way to digest this work a little more gracefully.

The Solution

The best way to digest this large pig is to break the large unit of work into a set of smaller processes. This gives you the most flexibility when it comes to scaling and managing your system. But be careful not to break the processes down too far: at too fine a granularity, the latency of chatty communication with the queue and other storage mechanisms may introduce more overhead than the decomposition saves.

When you analyze the stages of processing on the message, you’ll likely conceive of several stages to the work. You can figure this out by drawing a flow diagram of the current bloated worker code. For example, when processing an order from an e-commerce site, you might have the following stages:

  1. Validate the data in the order.

  2. Validate the pricing and discount codes.

  3. Enrich the order with all of the relevant customer data.

  4. Validate the shipping address.

  5. Validate the payment information.

  6. Charge the credit card.

  7. Verify that the products are in stock and able to be shipped.

  8. Enter the shipping orders into the logistics system for the distribution center.

  9. Record the transaction in the ERP system.

  10. Send a notification email to the customer.

  11. Sit back and profit.

You can think of each state the message goes through as a separate worker role, connected together with a queue for each state. Instead of one worker doing all of the work for a single order, it only processes one of the states for each order. The different queues represent the different states the message could have. Figure 4 compares a big worker that performs all of the work, to a series of smaller workers that break the work out (validating, shipping, and notifying workers).
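The one-worker-per-state idea can be sketched as a small factory that wires an input queue, a stage handler, and the next stage's queue together. The queue class and names here are illustrative assumptions, not the real Azure API; the point is the shape of a single-stage worker.

```python
import collections

class SimpleQueue:
    """Minimal queue stand-in (hypothetical API, not the real Azure client)."""
    def __init__(self):
        self._items = collections.deque()
    def add_message(self, msg):
        self._items.append(msg)
    def get_message(self):
        return self._items.popleft() if self._items else None

def make_stage_worker(in_queue, out_queue, handler):
    """Build a worker for one state: consume from this stage's queue,
    run the stage's work, and forward the ticket to the next stage's
    queue (out_queue is None for the last stage). Returns a callable
    that reports whether a message was handled."""
    def run_once():
        msg = in_queue.get_message()
        if msg is None:
            return False
        result = handler(msg)          # e.g. validate the order data
        if out_queue is not None:
            out_queue.add_message(result)
        return True
    return run_once
```

Each stage can then be scaled independently: the validating queue and the shipping queue each get their own pool of instances running their own `run_once` loop.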

Figure 4. A monolithic worker role compared to a state-driven worker role. The big worker completes all the work in one step, leading to the “pig in a python” problem of being harder to maintain and extend as needed. Instead, we can break the process into a series of queues and workers, each dedicated to servicing a specific state or stage of the work to be done.

There might also be some other processing states you want to plan for. Perhaps one for really bad orders that need to be looked at by a real human, or perhaps you have platinum-level customers who get their orders processed and shipped before normal run-of-the-mill customers. The platinum orders would go into a queue that’s processed by a dedicated pool of instances.

You could even have a bad order routed to an Azure table. A customer service representative could then access that data with a CRM application or a simple InfoPath form, fix the order, and resubmit it back into the proper queue to continue being processed. This process is called repair and resubmit, and it’s an important element to have in any enterprise processing engine.
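A repair-and-resubmit path can be sketched as a thin wrapper around a stage handler. Here `dead_letters` is a plain list standing in for the Azure table the text mentions, and the function names are assumptions for illustration.

```python
import collections

class SimpleQueue:
    """Minimal queue stand-in (hypothetical API, not the real Azure client)."""
    def __init__(self):
        self._items = collections.deque()
    def add_message(self, msg):
        self._items.append(msg)
    def get_message(self):
        return self._items.popleft() if self._items else None

def process_with_repair(queue, dead_letters, handler):
    """Wrap a stage handler so unprocessable orders are parked for
    human repair (dead_letters stands in for an Azure table) instead
    of poisoning the queue with endless retries."""
    msg = queue.get_message()
    if msg is None:
        return
    try:
        handler(msg)
    except ValueError as err:                    # unrecoverably bad order data
        dead_letters.append({"message": msg, "reason": str(err)})

def resubmit(entry, queue):
    """After a rep fixes the order, put it back on the proper queue
    so it continues through the state machine."""
    queue.add_message(entry["message"])
```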

You won’t be able to put the full order details into the queue message—there won’t be enough room. The message should contain a complete work ticket, representing where the order data can be found (perhaps via an order ID), as well as some state information, and any information that would be useful in routing the message through the state machine. This might include the service class of the customer, for example—platinum versus silver.
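A work ticket like the one described might look like the following sketch. The field names (`order_id`, `state`, `service_class`) are assumptions chosen to match the examples in the text; the key idea is that the message carries a pointer and routing hints, not the order itself.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WorkTicket:
    """Illustrative work ticket (field names are assumptions): the full
    order lives in table or blob storage; the queue message carries only
    a pointer plus routing hints, staying well under the message size limit."""
    order_id: str        # where the full order data can be found
    state: str           # current stage, e.g. "validate-address"
    service_class: str   # routing hint: "platinum" vs "silver"

    def to_message(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_message(body: str) -> "WorkTicket":
        return WorkTicket(**json.loads(body))
```

Serializing to a compact JSON string keeps the ticket small and lets any stage in the pipeline reconstruct it without a shared binary dependency.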

As the business changes over time (and it will), modifying how the order is processed is much easier than performing heart surgery on old, overly complicated, bloated worker role code. They don't call it spaghetti code for nothing. For example, if you need to add a new step between steps 8 and 9 in the previous list, you can simply create a new queue and a new worker role to process that queue. Then only the worker role for the state immediately before the new one needs to be updated to point to the new queue. Ideally, changes to the existing parts of the system can be limited to configuration changes.

Even Cooler—Make the State Worker Role its Own Azure Service

How you want to manage your application in the cloud should be a primary consideration in how you structure the Visual Studio solution. Each solution becomes a single management point. If you want to manage different pieces without affecting the whole system, those should be split out into separate solutions.

In this scenario, it would make sense to separate each state worker role to its own service in Azure, which would further decouple them from each other. This way, when you need to restart one worker role and its queue, you won’t affect the other roles.

In a more dynamic organization, you might need to route a message through these states based on some information that’s only available at runtime. The routing information could be stored in a table, with rules for how the flow works, or by simply storing the states and their relationships in the cloud service configuration file. Both of these approaches would let you update how orders were processed at runtime without having to change code. We’ve done this when orders needed different stages depending on what was in the order, or where it was going. In one case, if a controlled substance was in the order, the processing engine had to execute a series of additional steps to complete the order.
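A configuration-driven routing table might be sketched like this. The state names, flags, and the `CONDITIONAL_ROUTES` override table are invented for illustration (including the controlled-substance example from the text); in practice this data would live in a table or the cloud service configuration file.

```python
# Hypothetical routing table, as it might live in the service
# configuration file: state -> next state, changeable without a code change.
ROUTES = {
    "validate-order": "charge-card",
    "charge-card":    "ship",
    "ship":           "notify",
    "notify":         None,          # terminal state
}

# Runtime overrides: e.g. orders containing a controlled substance
# need an extra compliance step after the charge.
CONDITIONAL_ROUTES = {
    ("charge-card", "controlled-substance"): "compliance-check",
}

def next_state(current, order_flags=()):
    """Pick the next state from configuration, honoring any
    flag-specific override that applies at runtime."""
    for flag in order_flags:
        override = CONDITIONAL_ROUTES.get((current, flag))
        if override is not None:
            return override
    return ROUTES.get(current)
```

Because the flow lives in data rather than code, changing how orders move through the system becomes a configuration update instead of a redeployment of worker logic.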

This approach is often called a poor man’s service bus because it uses a simple way of connecting the states together, and they’re fairly firm at runtime. If you require a greater degree of flexibility in the workflow, you would want to look at the Itinerary[1] pattern. This lets the system build up a schedule of processing stops based on the information present at runtime. These systems can get a little more complicated, but they result in a system that’s more easily maintained when there’s a complex business process.

[1] For more information on the Itinerary pattern, see the Microsoft Application Architecture Guide from Patterns & Practices at Microsoft.

Oops, it’s Not Nirvana

As you build this out, you'll discover a drawback: you now have many more running worker roles to manage. This can increase costs, and you still have to plan for the day you eventually swallow a pig. If your system is tuned for a slow work day, with one role instance per state, and you suddenly receive a flood of orders, that flood will move down the state diagram the way a pig moves through a python, forcing you to scale up the number of worker instances at each state.

Although this flexibility is great, it can get expensive. With this model, you have several pools of instances instead of one general-purpose pool, and each pool has to grow and then shrink as the pig (the large flood of work) moves through the pipeline. This can stall the state machine, because each state has to wait for more instances to be added to its pool before it can handle the surge. Scaling can be automated with the service management APIs, but spinning instances up and down takes time, perhaps 20 minutes.

The next step to take, to avoid the pig in a python problem, is to build your worker roles so that they’re generic processors, all able to process any state in the system. You would still keep the separate queues, which makes it easier to know how many messages are in each state.

You could also condense the queues down to one, with each message declaring its state as part of its data, but we don't like this approach: it leads to favoritism for the most recently placed orders in the processors, and it requires you to restart all of your generic workers whenever you change the state graph. You can avoid this particular drawback by driving the routing logic with configuration and dependency injection; then you only need to update the system's configuration and deploy a new assembly to change its behavior.

The trick to gaining both flexibility and simplicity in your architecture is to encapsulate the logic for each state in the worker, keeping each state separate and easily maintainable, while pulling all of the states together so there's only one pool of workers. The worker, in essence, becomes a router. You can see how this might work in figure 5. Each message is routed, based on its state and other runtime data, to the appropriate state processor. This functions much like a factory: each state has a class that knows how to process that state, and each state class implements the same interface, perhaps IOrderProcessStage. This makes it easy for the worker to instantiate the correct class based on the state and then process the message. Most of these classes would then send the message back to the generic queue with a new state, and the cycle would start again.
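The router-plus-stage-classes pattern can be sketched as follows. This is a Python analogue of the single-interface idea the text describes (the text names a C# interface for it); the stage names and the `STAGES` factory table are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class OrderProcessStage(ABC):
    """Common interface for all stages, in the spirit of the
    IOrderProcessStage interface the text describes: each stage
    processes a ticket and names the next state (None when finished)."""
    @abstractmethod
    def process(self, ticket: dict):
        ...

class ValidateStage(OrderProcessStage):
    def process(self, ticket):
        ticket["validated"] = True
        return "charge"               # next state

class ChargeStage(OrderProcessStage):
    def process(self, ticket):
        ticket["charged"] = True
        return None                   # terminal state

# Factory table: the router looks up the right stage object from the
# message's current state.
STAGES = {"validate": ValidateStage(), "charge": ChargeStage()}

def route(ticket):
    """Generic worker body: dispatch the ticket to its stage, record
    the new state, and hand it back (in the real system it would be
    re-enqueued on the single shared queue until the state is None)."""
    stage = STAGES[ticket["state"]]
    ticket["state"] = stage.process(ticket)
    return ticket
```

Adding a new stage then means writing one new class and one new table entry; the router itself never changes.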

Figure 5. By moving to a consolidated state-directed worker, we’ll have one queue and one worker. The worker will act as a router, sending each inbound message to the appropriate module based on the message’s state and related itinerary. This allows us to have one large pool of workers, but makes it easier to manage and decompose our bulky process.

There are going to be times when you’re working with both web and worker roles and you’re either importing legacy code that needs access to a local drive, or what you’re doing requires it. That’s why we’ll discuss local storage next.
