Caching in a Service Oriented Architecture (SOA) – Part 2

This is the second part of an article on designing a cache system for a service oriented architecture. The first part dealt with the design considerations and potential approaches. This part looks at an implementation of a cache system.


System Overview

Before designing any system, it’s a good idea to fully understand the needs the system should satisfy. For the specific project I was working on, some of the requirements that drove the design decisions were:

  • Low latency for returning data

Access to the data, no matter where it was located, had to be fast. The system could not tolerate large amounts of latency in retrieving data, so every item type put in the cache had to perform well under representative usage patterns.

  • Reduce load on the database

The environment we were working in was a typical SQL-based RDBMS. We had an existing codebase and little ability to leverage newer technologies such as NoSQL – many providers of which support features such as sharding – to help in scaling out the data layer.

  • Consuming layers should not need to know about the implementation

We decided that users of the system – in our case, business logic developers – should not know or care whether something resides in the cache. In fact, the developers of this layer shouldn’t even know we have a cache. We were striving to put in place a set of design patterns that would completely abstract away where data was located.

  • Must work in a multi-tenant application layer (i.e., shared web servers) and a web farm

Initially our approach was to leverage the .NET runtime cache on the application server, but we quickly abandoned this once the requirement for web farms and multi-tenant application servers was added. A local cache on the server meant that a user’s request could add something to the cache on server ‘A’, and a subsequent request from that user, served by server ‘B’, would generate cache misses.

Two possible solutions remained: move to a distributed, off-server cache, or synchronize changes to local caches across the set of servers serving requests. Of these two approaches we chose the less complex one, which was to move to a distributed cache server.

Cache Selection

With a web-server-resident cache out of the running due to the need to support web farms, for our specific implementation we chose the Windows Server AppFabric distributed cache.

We chose it because it had some nice features we would like to use in the future – such as locking and versioning of cache items. Also, it being a Microsoft product helped to simplify our setup, deployment and supportability concerns. There are a lot of good alternatives in this space, notably memcached, which we had considered and may look at leveraging in the future.

Another driver of our decision was the availability of features such as notification-based expiration for local caches, and the fact that the cache could be configured to use a portion of the web servers’ available memory, so smaller deployments would not need a dedicated cache server.


Implementation Details

Factory pattern and Interfaces over the Cache Layer

One of the early choices made in the design of the system was to be as defensive as possible. For that reason, we chose to implement a factory pattern around cache creation. We felt that the factory pattern, paired with interface-based programming, would provide a nice layer of abstraction hiding the cache details from our development team.

This allows us to swap out providers without needing to change all of the touch points in the business logic code, since we aren’t coupled directly to the cache implementation.

Some thought needs to be given to designing the proper cache interface for your system. You probably do not want to directly mirror the operations available on any given cache implementation. Remember, the goal is to abstract away the details, so think about how you will actually be using the cache and what additional requirements you have around it. In our case we wanted to store some ‘metadata’ with each cache item (explained below), so we made sure to build that into our abstraction layer interfaces.
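As a rough sketch of what this abstraction might look like – the names here are illustrative, not our actual production types:

using System;
using System.Configuration;

// Illustrative cache abstraction; not our exact production interface.
public interface ICacheProvider
{
    T Get<T>(string region, string key);
    void Put(string region, string key, object value, CacheItemMetadata metadata);
    void Remove(string region, string key);
}

// Placeholder; a fuller sketch appears under "Store Cache Metadata" below.
public class CacheItemMetadata { }

// The factory is the only place that knows which concrete provider is in use.
public static class CacheProviderFactory
{
    public static ICacheProvider Create()
    {
        // Resolve the concrete provider (e.g. an AppFabric-backed one) from config,
        // so swapping providers is a configuration change, not a code change.
        string typeName = ConfigurationManager.AppSettings["cacheProviderType"];
        return (ICacheProvider)Activator.CreateInstance(Type.GetType(typeName, true));
    }
}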

Interacting with the cache

Figuring out how the business logic code you write will interact with the caching layer is likely the most important piece of the design. The decisions here will drive much of the overall design of the system and will either allow for some flexibility or conversely limit your ability to perform certain operations.

There are a few techniques you can choose, and I encourage you to explore them before settling on a choice.

We chose the Cache Aside pattern. This is where the code that is about to request data checks for the availability of the data in the cache. If the cache has the data, the data is returned. If the cache does not have the data, the data is loaded from the data store, added to the cache and then returned.
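In code, cache-aside reduces to a simple get-or-load helper. A minimal sketch, assuming the ICacheProvider abstraction from the previous section (the loader delegate stands in for a database call):

using System;

public static class CacheAside
{
    // Return the cached value if present; otherwise load it, cache it, and return it.
    public static T GetOrLoad<T>(ICacheProvider cache, string region, string key,
                                 Func<T> loadFromStore) where T : class
    {
        T value = cache.Get<T>(region, key);
        if (value != null)
            return value; // cache hit

        value = loadFromStore(); // cache miss: go to the data store
        cache.Put(region, key, value, new CacheItemMetadata());
        return value;
    }
}

// Usage (the repository and key naming are hypothetical):
// Customer c = CacheAside.GetOrLoad(cache, tenantRegion, "Customer:42",
//     () => repository.LoadCustomer(42));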

Our system had in place a Domain Model based design, which allowed us to fit the cache-aside pattern nicely into the foundational mechanisms from which each domain object was built. This let us keep the details of looking items up in the cache and updating them compartmentalized in a few key classes that most developers never directly interacted with.

For handling staleness of data, we were in luck due to the single path we had to all data, which was through our domain objects. Because of this we were able to tap into updates of objects and determine, in the base implementation, whether that object had data (direct or related) in the cache and invalidate/remove it if needed. This was a big win for us, allowing us to keep the complexity of dealing with the cache contained to a few classes and out of the minds of most of the business logic developers.

Store Cache Metadata

For each item in the cache we set up a custom system for storing a set of metadata alongside the cached value. This metadata contains key system information such as related domain objects, creation times, and other keys to aid in cross-item or aggregate lookups in the cache.
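A sketch of what such a metadata envelope could look like – the fields shown are examples, not our exact set:

using System;
using System.Collections.Generic;

// Stored alongside each cached value.
public class CacheItemMetadata
{
    public DateTime CreatedUtc { get; set; }              // creation time of the cache entry
    public string SourceDomainType { get; set; }          // e.g. "Customer"
    public List<string> RelatedItemKeys { get; set; }     // keys of dependent cache items
    public List<string> AggregateLookupKeys { get; set; } // extra keys for cross-item lookups
}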

Regions per customer

Our system can be multi-tenant, so we needed a way to distinguish the data of one customer from another. This can be accomplished a few ways in AppFabric:

  • Part of the cache key
  • Separate region per tenant
  • Separate cache per tenant

We decided against identifying the tenant in the cache key, as this would not allow us to administer one tenant without affecting others. We chose a separate region per tenant over a separate cache per tenant because of the overhead of maintaining each cache, which we viewed as much heavier weight than a region. It is also possible to programmatically administer regions, while the same is not possible at the cache level.

The only downside to storing a tenant per region is that, currently, AppFabric will not distribute a region across multiple servers, as it does a separate cache. Right now this serves our needs, but we may re-evaluate it going forward.
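With AppFabric’s DataCache API, a region per tenant maps fairly directly onto the region overloads. A minimal sketch – the cache name, region naming scheme and error handling are simplified:

using Microsoft.ApplicationServer.Caching;

public static class TenantCache
{
    private static readonly DataCacheFactory Factory = new DataCacheFactory(); // reads client config
    private static readonly DataCache Cache = Factory.GetCache("ApplicationCache");

    private static string RegionFor(string tenantId)
    {
        return "Tenant_" + tenantId;
    }

    public static void Put(string tenantId, string key, object value)
    {
        string region = RegionFor(tenantId);
        Cache.CreateRegion(region); // returns false if the region already exists
        Cache.Put(key, value, region);
    }

    public static object Get(string tenantId, string key)
    {
        string region = RegionFor(tenantId);
        Cache.CreateRegion(region); // ensure the region exists before reading
        return Cache.Get(key, region);
    }

    public static void EvictTenant(string tenantId)
    {
        // Administer a single tenant without touching the others.
        Cache.RemoveRegion(RegionFor(tenantId));
    }
}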

Administrative cache

We envision a need for our operations staff to be able to administer the cache and troubleshoot issues. To that end we built a mechanism into the cache access layer that gives us visibility into which regions are active at any time. Surprisingly, Microsoft did not include a programmatic way to retrieve a list of named regions from the cache.

We came up with a simple approach: create an administrative cache area, separate from the main cache. Each time a region is added or removed, we maintain a key in the administrative region with that region’s information. This allows us to provide a real-time view of which regions are active and then query the regions for the contained cache items if needed.
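A sketch of that bookkeeping, assuming a separate ‘admin’ cache where each known region is recorded under its own key:

using System;
using System.Collections.Generic;
using Microsoft.ApplicationServer.Caching;

public static class CacheAdministration
{
    private const string AdminRegion = "ActiveRegions";
    private static readonly DataCache AdminCache =
        new DataCacheFactory().GetCache("AdminCache"); // separate from the main cache

    static CacheAdministration()
    {
        AdminCache.CreateRegion(AdminRegion);
    }

    // Call whenever a region is created in the main cache.
    public static void RecordRegionCreated(string regionName)
    {
        AdminCache.Put(regionName, DateTime.UtcNow, AdminRegion);
    }

    // Call whenever a region is removed from the main cache.
    public static void RecordRegionRemoved(string regionName)
    {
        AdminCache.Remove(regionName, AdminRegion);
    }

    // Real-time view of which regions are active.
    public static IEnumerable<string> GetActiveRegions()
    {
        foreach (KeyValuePair<string, object> entry in AdminCache.GetObjectsInRegion(AdminRegion))
            yield return entry.Key;
    }
}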

Lessons Learned

Lack of programmatic administrative capabilities stinks

I’m not sure what the reason is behind some of the gaps in the AppFabric API, but I must say that the lack of a full-featured administrative API available from code stinks. Most of the administrative functions are only available via PowerShell. I can understand that this is likely targeted at the maintainers of the system, but quite honestly, how is someone supposed to create an application that can administer the cache without needing PowerShell?

Local Cache with Notification based expiration – Not good enough

One of the features we were hoping to leverage with AppFabric was the local cache option. This allows a local cache to be maintained where the client accesses the cache (i.e., the web server in our case). If the main cache gets updated, the local cache gets invalidated. This seemed like a great way to boost performance while still maintaining our support for web farm operation.

Unfortunately, the design of this feature doesn’t work quite the way we expected. The local cache option requires polling the main cache to find out if its items are invalid. This would allow for too much latency in the system (or too much network traffic if the polling interval were decreased), so we had to nix it. Too bad – if only the implementers had used some type of event-driven system so the latency was lower, that would have been great.

Your assumptions will be wrong – Test, Test, Test.

There is absolutely no way to know ahead of time whether a particular strategy will work without empirical testing.

Put some data in the database that represents your usage and test the scenarios – direct to database, and data from cache. Make sure that your data access patterns return timings that are favorable to using the cache; otherwise, don’t use it.

This was especially true in the system I worked on, where we were storing a complete set of an entity type in the cache, but clients only ever needed a subset of the data. We had assumed that this would be a poor candidate for caching in its entirety and were considering caching the subset variations instead. The tests we ran refuted this assumption: de-serializing the entire list of entities and then pruning them in memory with LINQ was fairly performant, and it avoided the burden of having to manage subset pieces of the entity type in the cache.
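A rough sketch of the kind of side-by-side timing test described above – nothing fancy, just a stopwatch around each access path:

using System;
using System.Diagnostics;

public static class CacheBenchmark
{
    // Time a data-access path over many iterations. Run it once with a delegate
    // that hits the database directly and once with one that goes through the
    // cache, then compare the results under representative data volumes.
    public static TimeSpan Measure(Action fetch, int iterations)
    {
        fetch(); // warm up (JIT, connection pools, cold cache)

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            fetch();
        watch.Stop();

        return watch.Elapsed;
    }
}

// Usage (the fetch methods are hypothetical):
// TimeSpan dbTime    = CacheBenchmark.Measure(() => LoadCustomersFromDatabase(), 1000);
// TimeSpan cacheTime = CacheBenchmark.Measure(() => LoadCustomersViaCache(), 1000);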


References

Domain Models

http://msdn.microsoft.com/en-us/magazine/ee236415.aspx  (Employing the Domain Model Pattern)

http://martinfowler.com/eaaCatalog/domainModel.html  (P of EAA: Domain Model)

Windows Server AppFabric

http://msdn.microsoft.com/en-us/windowsserver/ee695849

Caching Patterns

http://www.alachisoft.com/resources/articles/domain-objects-caching-pattern.html  (Distributed Caching and Domain Objects Caching Pattern for .NET)

http://www.ibm.com/developerworks/webservices/library/ws-soa-cachemed/  (Cache mediation pattern specification: an overview)

http://ljs.academicdirect.org/A08/61_76.htm  (Caching Patterns and Implementation)

Caching in a Service Oriented Architecture (SOA) – Part 1

This will be a two-part post on some of my thoughts on designing a caching system for a service oriented architecture, and some of the results from a series of prototypes done to flesh out the design.

Part 1 – Overview, use and potential approaches

Part 2 – Prototype designs, results and lessons learned

When designing a service oriented architecture (SOA) that is expected to see high volumes of traffic, one of the potential architectural components you may be looking at is a caching system. In a high-traffic system, a cache can be essential to increasing performance and enabling the scalability of the overall system.

Why use a Caching system?

Caching systems inarguably add another layer of complexity, as well as another potential point of failure, to a system’s architecture, so their use should be carefully weighed against the benefits you anticipate.

There are usually two main reasons for employing a caching system:

  1. Offload Database work
  2. Drive down response times

Offloading database work is essentially a way of enabling a pseudo-scaling of the database tier, especially in cases where the database platform doesn’t inherently allow scaling out. By performing more of the work of retrieving data without involving the database, we effectively scale out the capacity of that tier.

The requirement to drive down response times, or keep response times stable as the system grows, is another common reason to employ a caching system. Retrieving data from a cache held in memory is orders of magnitude faster than retrieving it from the database tier in most circumstances, especially if the database system does not employ internal caching to keep the needed data in memory. If the database system has to read the data from disk, the cache fetch will seem like a Maserati compared to a Model T.

Where should you use caching?

Caching should likely be considered at all levels of the system – client tier, web tier and services tier. At each tier of your architecture the needs of the application and the type of data in use will dictate what gets cached and the strategy employed.

The goal of caching within a SOA system that is expected to scale is to keep the cached information at the layer that makes the most sense from a use and manageability standpoint.

Client Tier

At the client tier, data should be cached to avoid round trips back to the server when possible. This is likely one of the most expensive calls that can be made in a system, as the network traversed is likely a good distance from the web application or services tier. The best approach to performance here is to not incur the overhead of the call at all, if possible.

Thick clients have long used local caching strategies to hold onto data as long as possible. After a database call a thick client would keep the set of data retrieved in memory between user actions and screen changes.

Browser-based clients have had a more difficult time caching data due to the stateless nature of the web. One approach has been to store data in the page itself. This can lead to page-size bloat and slower response times in a typical scenario, such as ASP.NET, where the “stored data” is round-tripped with the page. With the rising popularity of AJAX-style programming and partial page refreshes, the browser is becoming a more intelligent presentation layer compared to the typical post-back or complete page refresh model.

Web Application Tier

The web application tier has a role to play in the caching strategy as well. Since, in a proper N-tier system, the web application tier is responsible for serving resources (pages, images, etc.), it should primarily employ its caching strategy around these object types. A key focus of caching at this tier should be how long a page can be served from cache before being regenerated.

While it is tempting to cache data at the web application tier, this should be avoided, as several problems can arise from it in a scaled, load-balanced environment, such as data only being available to certain web servers, or the distributed maintenance of a cache from multiple web servers.

Services Tier

Caching at the services tier should target “data”, since this tier is the single point of access to data within a SOA-based system. As such, it makes sense to control the population, refreshing and invalidation of a data cache from this tier. The services tier lends itself particularly well to the caching of data, as its primary purpose is to act as the facade that serves all requests to retrieve or update data.

Where it retrieves this data from is of no concern to the caller, other than that the data is correct and accurate. Since employing a cache at this tier is transparent to the caller, offloads work from the database, and is more manageable from the standpoint of trapping changes that require updating the cache, it makes the most sense to cache data at this tier.

What types of data should you cache?

The data you decide to cache should ultimately provide an increase in performance without dramatically increasing the complexity of the system. Certain types or classes of data make more sense to cache than others; in order of priority:

  • Data that changes infrequently
  • Expensive queries
  • Data that is accessed frequently

Some thought should also be given to the dependencies between cached object types. I cover this more in the considerations section, but a high number of dependencies between objects may be a factor in determining whether you cache these object types.

Data that changes infrequently

Data items that are fairly static in nature make ideal candidates for caching. The benefit here is the low overhead of managing this type of data in a cache: updates to the data are infrequent, so there is less need to clear items from the cache and/or refresh them.

An example of data that changes infrequently could be policy data that drives certain actions within the application.

Expensive Queries

Queries that are expensive in either time or resource usage are another ideal candidate for caching. Caching this type of data provides the aforementioned “scaling” increase at the data tier, since the underlying database is freed from running the majority of these queries, allowing it to run other queries – which, in effect, provides the same benefit as scaling the database system.

Examples of an expensive query might be a query that aggregates several pieces of data together or performs some level of trending, along the lines of something you may see in a dashboard style view.

Data that is accessed frequently

Data that is accessed frequently also makes a nice candidate for caching, since this type of data – even if cheap to query and return – places a constant load on the underlying system. Being able to take this constant load off of the database and move it to the cache can yield significant performance improvements.

So, in general, two factors drive the cost/benefit analysis of what should be cached: the cost of the data and its frequency of change:

Cost of Data | Frequency of Change | Benefit of Caching
-------------|---------------------|-------------------
High         | High                | Low*
High         | Low                 | High
Low          | High                | Best not to cache
Low          | Low                 | Moderate

* While the cost of the data is very high to execute and retrieve, the benefits of caching are reduced by the frequency of change since the frequent changes will lead to high cache turnover, frequent refreshing/re-querying of data and an overall higher level of data management for this item in the cache.

Strategies

Two approaches we designed and tested in a prototype were the “Write Through” and “Data Event Driven” approaches. Both have advantages and disadvantages, and each should be carefully considered in the context of how the system is used and the way in which data is interacted with.

Write Through

The write-through cache is a cache implementation where the cache is updated during the operation that updates a piece of data, followed by a subsequent update of the data in the underlying data store. In essence, this is a cache-first, data-store-second model.

This model is most effective in a system where there is a well-defined set of interfaces for interacting with the data that all areas of the system use (i.e., a single source for all data interactions). A single source of updates allows for a more manageable point from which to maintain the data in the cache, whereby a single component or code path is responsible for the update or refresh of the cache for a given operation.

Another consideration in utilizing this type of design is how concurrent updates occur in the system. There is the possibility of two distinct operations updating a piece of data, both of which attempt to first change the data in the cache and then in the data store. Most typical data stores provide a mechanism for handling concurrent updates, usually through locking. This may or may not be the case with your cache provider.

Since the first update occurring in the system will be to the cache (not the data store), it is essential that there be some mechanism to handle concurrent attempts to update data. As mentioned, some cache implementations provide the ability to lock a cache item while it is being updated, effectively reproducing the same behavior as the data store.
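A minimal sketch of the cache-first flow using AppFabric’s pessimistic locking calls (it assumes the item is already cached; real code would handle misses and lock timeouts):

using System;
using Microsoft.ApplicationServer.Caching;

public static class WriteThroughCache
{
    // Update the cache first, under a lock so concurrent writers cannot
    // interleave, then update the underlying data store.
    public static void Update<T>(DataCache cache, string key, T newValue, Action<T> saveToStore)
    {
        DataCacheLockHandle lockHandle;
        cache.GetAndLock(key, TimeSpan.FromSeconds(5), out lockHandle);

        cache.PutAndUnlock(key, newValue, lockHandle); // cache first...
        saveToStore(newValue);                         // ...data store second
    }
}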

Advantages
  • Easy to implement given a single point of interaction to data within the system
Disadvantages
  • May need to implement concurrency handling for updates to the cache
  • Calls that directly modify the database or bypass the “single point of interaction for data” can lead to stale cache items

Data Event Driven

The data event driven model is a cache implementation where the signal for a change to the cache comes from a change to the underlying data in the data store. Basically, the data “signals” that it has changed, and the cache is updated based on this.

There are two ways to handle a data driven design – the push or pull method.

In the pull method, there would exist a way to actively monitor the underlying data to detect changes and then react to those changes by updating the cache. An example of this would be using something like the SqlDependency feature in Microsoft ADO.NET. Typically you set up a query to watch some set of data, and when the results of that query change you are signaled and can react to the change. In my opinion, and through testing, this does not scale well to larger systems, since the number of items that need to be set up and then polled – consuming system resources – can grow large. For simpler systems, or those that do not need to cache a large number of different data types, this may be appropriate.
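A sketch of the pull method using SqlDependency – the connection string and query are illustrative, and note that SqlDependency has strict rules about the queries it supports (two-part table names, an explicit column list, etc.):

using System.Data.SqlClient;

public static class PolicyDataWatcher
{
    public static void Watch(string connectionString)
    {
        // Must be called once per connection string before subscriptions are created.
        SqlDependency.Start(connectionString);

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(
            "SELECT PolicyId, PolicyValue FROM dbo.Policy", connection))
        {
            SqlDependency dependency = new SqlDependency(command);
            dependency.OnChange += (sender, e) =>
            {
                // Invalidate or refresh the corresponding cache entries here.
                // The subscription is one-shot and must be re-registered after it fires.
            };

            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read()) { /* populate the cache */ }
            }
        }
    }
}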

In the push method, the data store itself has a mechanism to watch a set of data for changes and signal an interested party when the data has changed. Our prototype used a combination of triggers and SQL CLR code to detect when changes to a set of data we were interested in occurred, and then raise a signal to the cache implementation to refresh the associated data. The benefit of this method was the low overhead of tracking changes to the data versus polling. One of the disadvantages is the maintenance of the triggers: as the need to track more and more data items grows, the number and complexity of the triggers to detect and deal with changes also increases.

A common disadvantage to both flavors of the data event driven design is that the rollup from data in a normalized database to something that is typically stored in a cache – such as an aggregated data object – was exceptionally difficult. Translating which data should be watched for a given object such as a Customer, which might span three normalized tables in the data store, was a chore. In our prototype it meant a minimum of three triggers – one for each table – to watch a portion of the data for a Customer, plus the ability to translate a change detected by any of those three triggers into the ‘Customer’ object held by the cache.

The obvious advantage and appeal of this type of system is that any and all changes to the data can be caught and propagated to the cache layer for updates as needed. There would be no stale data in the cache if someone wrote directly to the database or skirted a centralized update control point.

Advantages
  • All changes to the data are accounted for
  • Reactive – low to no resource usage for monitoring changes
Disadvantages
  • Hard to implement
  • Translation from physical to logical data can be tricky

Considerations

Dependencies between cache objects

One of the more challenging aspects of a cache system implementation is how you define and manage dependencies between data items within the cache. This should be considered in the overall design of the system, with special attention given to how granular or coarse the items you put into the cache are. You should shy away from an implementation where you store very discrete data items that cannot stand on their own as having business value. Data items that are only useful when aggregated together may not make the best candidates for caching. After all, joining items together to return data with business value is the purview of the database, not necessarily that of the cache.

Cache Refresh Strategy

There are two approaches to consider when a piece of data held in the cache becomes invalid or stale: actively refresh the data by fetching and replacing it, or remove the data from the cache and let the next request fetch and cache it.

In considering these approaches, one factor that might drive the decision is the overall expected volume of traffic. If the expected volume is very high, the “remove the cache item and let the next requestor fetch the data” strategy could result in severe spikes in utilization at the data tier, as ‘N’ requestors, all looking for the same piece of data that was just removed from the cache, go and request it from the database. This can be mitigated if the cache supports locking on a cache item key that isn’t yet in the cache while the data is being fetched.
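A sketch of that mitigation using AppFabric’s Add, which succeeds for exactly one caller when several race to reload the same key (the sentinel-key scheme and timings are made up for illustration):

using System;
using Microsoft.ApplicationServer.Caching;

public static class StampedeGuard
{
    // One requestor wins the right to reload; the others back off briefly and
    // retry, instead of all hitting the database at once.
    public static T GetOrLoad<T>(DataCache cache, string key, Func<T> loadFromStore) where T : class
    {
        T value = (T)cache.Get(key);
        if (value != null)
            return value;

        try
        {
            // Add throws if the sentinel already exists, i.e. someone else is loading.
            cache.Add(key + ":loading", true, TimeSpan.FromSeconds(10));
        }
        catch (DataCacheException)
        {
            System.Threading.Thread.Sleep(100);          // crude backoff
            return GetOrLoad(cache, key, loadFromStore); // then retry
        }

        value = loadFromStore();
        cache.Put(key, value);
        cache.Remove(key + ":loading");
        return value;
    }
}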

Single Server Cache vs. Distributed Cache

Some thought should be given to whether a single cache server, and the resources available to it, would be sufficient for your implementation. If high availability of the cache is a requirement, you may need to consider a distributed cache implementation, which provides for storing multiple copies of your data within the cache for failover and high-availability purposes.

I’d be interested in hearing any feedback on the ideas in this article, or learnings you may have from implementing a similar large-scale caching implementation in support of a SOA-based system.

Error returning a DataTable from a WCF service call

I’m not debating the merits of whether it’s appropriate or sensible to return a DataTable as a response from a web service, but if you receive an error like I did, make sure to check the following:

1. The DataTable needs to be named.

    Before (causes an error):

public DataTable ExecuteDataTable()
{
    return new DataTable();
}

    After (no error):

public DataTable ExecuteDataTable()
{
    return new DataTable("Test");
}


2. Ensure that the message size limits configured for WCF are large enough to accommodate a serialized DataTable with the data content you have in it. Serializing a DataTable to XML results in some fairly large XML documents, which can easily surpass the size limits in the default WCF configuration.
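For example, the relevant quotas can be raised when building the binding in code – the values here are arbitrary, and the same settings exist in configuration:

using System.ServiceModel;

public static class ServiceBindings
{
    public static BasicHttpBinding CreateLargeMessageBinding()
    {
        BasicHttpBinding binding = new BasicHttpBinding();
        binding.MaxReceivedMessageSize = 10 * 1024 * 1024;              // default is 64 KB
        binding.ReaderQuotas.MaxArrayLength = 10 * 1024 * 1024;         // raise reader quotas too
        binding.ReaderQuotas.MaxStringContentLength = 10 * 1024 * 1024;
        return binding;
    }
}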