This is the second part of an article on designing a cache system for a service-oriented architecture. The first part dealt with the design considerations and potential approaches. This part looks at an implementation of a cache system.
System Overview
Before designing any system, it’s a good idea to fully understand what needs the system should satisfy. For the specific project I was working on, some of the requirements that drove the design decisions were:
- Low latency for returning data
Access to the data, no matter where it was located, must be fast. The system could not tolerate large amounts of latency in retrieving data. Every item type put in the cache must perform well under representative usage patterns.
- Remove load on the database
The environment we were working in was a typical SQL-based RDBMS. We had an existing codebase and little ability to leverage newer technologies such as NoSQL stores, many of which support features such as sharding, to help scale out the data layer. That made a cache our practical option for taking load off the database.
- Consuming layers should not need to know about the implementation
We decided that users of the system (in our case, business logic developers) should not know or care whether something resides in the cache. In fact, the developers of this layer shouldn’t even know whether we have a cache. We were striving to put in place a set of design patterns that would completely abstract away where data was located.
- Must work in a multi-tenant application layer (i.e. shared web servers) and in a web farm
Initially our approach was to leverage the .NET runtime cache on the application server, but we quickly abandoned this once the requirement for web farms and multi-tenant application servers was added. With a local cache on each server, a user’s request that added something to the cache on server ‘A’ would be followed by a cache miss when a subsequent request from that user was served by server ‘B’.
Two possible solutions remained: move to a distributed off-server cache, or synchronize changes to local caches across the set of servers serving requests. We chose the less complex of the two, which was to move to a distributed cache server.
Cache Selection
With a web-server-resident cache out of the running due to the need to support web farms, we chose the Windows Server AppFabric distributed cache for our implementation.
We chose this because it had some nice features we would like to use in the future, such as locking and versioning of cache items. Also, given that it was a Microsoft product, it helped to simplify our setup, deployment and supportability concerns. There are a lot of good alternatives in this space, especially Memcached, which we had considered and may look at leveraging in the future.
Another driver of our decision was the availability of a few features, such as notification-based expiration for local caches, and the fact that the cache could be configured to use portions of the web servers’ available memory, so that smaller deployments would not need a dedicated cache server.
Implementation Details
Factory pattern and Interfaces over the Cache Layer
One of the early choices made in the design of the system was to be as defensive as possible. For that reason, we chose to implement a factory pattern around cache creation. We felt that the factory pattern, paired with some interface-based programming, would provide a nice layer of abstraction that hides the cache details from our development team.
This allows us to swap out providers without needing to change all of the touch points in the business logic code, since we aren’t coupled directly to the cache implementation.
Some thought needs to be given to designing the proper cache interface for your system. You probably do not want to directly mirror the operations available on any given cache implementation. Remember, the goal is to abstract away from the details, so think about how you will actually be using the cache and what additional requirements you have around it. In our case we wanted to store some ‘metadata’ with each cache item (explained below), so we made sure to build that into our abstraction layer interfaces.
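To make the idea concrete, here is a minimal sketch of a provider interface behind a factory. It is written in Python purely for illustration; the names CacheProvider, CacheEntry, InMemoryCacheProvider and CacheFactory are hypothetical and are not taken from our codebase or from any cache product.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class CacheEntry:
    """A cached value plus the metadata stored with it (hypothetical shape)."""
    value: Any
    metadata: Dict[str, Any] = field(default_factory=dict)


class CacheProvider(ABC):
    """What the business layer needs from a cache, not what a product exposes."""

    @abstractmethod
    def get(self, key: str) -> Optional[CacheEntry]: ...

    @abstractmethod
    def put(self, key: str, value: Any,
            metadata: Optional[Dict[str, Any]] = None) -> None: ...

    @abstractmethod
    def remove(self, key: str) -> None: ...


class InMemoryCacheProvider(CacheProvider):
    """Stand-in provider; a distributed cache client would implement the same interface."""

    def __init__(self) -> None:
        self._items: Dict[str, CacheEntry] = {}

    def get(self, key: str) -> Optional[CacheEntry]:
        return self._items.get(key)

    def put(self, key: str, value: Any,
            metadata: Optional[Dict[str, Any]] = None) -> None:
        self._items[key] = CacheEntry(value, dict(metadata or {}))

    def remove(self, key: str) -> None:
        self._items.pop(key, None)


class CacheFactory:
    """Maps a configured provider name to a constructor."""
    _providers: Dict[str, Callable[[], CacheProvider]] = {
        "memory": InMemoryCacheProvider,
    }

    @classmethod
    def register(cls, name: str, constructor: Callable[[], CacheProvider]) -> None:
        cls._providers[name] = constructor

    @classmethod
    def create(cls, name: str) -> CacheProvider:
        try:
            return cls._providers[name]()
        except KeyError:
            raise ValueError(f"unknown cache provider: {name}") from None
```

Calling code asks the factory for a provider by name and only ever sees the interface, so swapping providers becomes a matter of registering a different constructor.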
Interacting with the cache
Figuring out how the business logic code you write will interact with the caching layer is likely the most important piece of the design. The decisions here will drive much of the overall design of the system and will either allow for some flexibility or conversely limit your ability to perform certain operations.
There are a few techniques you can choose, and I encourage you to explore them before settling on a choice.
We chose the Cache Aside pattern. This is where the code that is about to request data checks for the availability of the data in the cache. If the cache has the data, the data is returned. If the cache does not have the data, the data is loaded from the data store, added to the cache and then returned.
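The steps above can be sketched in a few lines. This is an illustration in Python, with a plain dictionary standing in for the cache and a hypothetical loader function standing in for the data store.

```python
from typing import Any, Callable, Dict

_MISSING = object()  # sentinel so that a cached None still counts as a hit


def get_or_load(cache: Dict[str, Any], key: str,
                loader: Callable[[str], Any]) -> Any:
    """Cache-aside read: check the cache, fall back to the store on a miss."""
    value = cache.get(key, _MISSING)
    if value is not _MISSING:
        return value          # hit: return the cached data
    value = loader(key)       # miss: load from the data store
    cache[key] = value        # add to the cache for the next caller
    return value
```

The second request for the same key never reaches the loader, which is where the load on the database is removed.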
Our system had a Domain Model based design in place, which allowed us to fit the cache-aside pattern nicely into the foundational mechanisms on which each domain object was built. This let us keep the details of looking up and updating items in the cache compartmentalized to a few key classes that most developers never directly interacted with.
For handling staleness of data, we were in luck because there was a single path for getting at all data, which was through our domain objects. Because of this, we were able to tap into the updates of objects and determine in the base implementation whether that object had data (direct or related) in the cache, and invalidate or remove it if needed. This was a big win for us, allowing us to keep the complexity of dealing with the cache contained to a few classes and out of the minds of most of the business logic developers.
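As a rough sketch of that idea, assume a hypothetical DomainObject base class through which every save passes. The class names, key format and the dictionaries standing in for the cache and the database are all illustrative assumptions.

```python
from typing import Any, Dict, List


class DomainObject:
    """Hypothetical base class; every update goes through save()."""
    cache: Dict[str, Any] = {}   # stand-in for the shared cache
    store: Dict[str, Any] = {}   # stand-in for the database

    def __init__(self, object_id: int, data: Dict[str, Any]) -> None:
        self.object_id = object_id
        self.data = data

    @property
    def cache_key(self) -> str:
        return f"{type(self).__name__}:{self.object_id}"

    def related_cache_keys(self) -> List[str]:
        """Keys of cached data that depend on this object; none by default."""
        return []

    def save(self) -> None:
        # Persist first, then drop the direct and related cache entries so
        # that the next read repopulates them from the store.
        self.store[self.cache_key] = dict(self.data)
        for key in [self.cache_key, *self.related_cache_keys()]:
            self.cache.pop(key, None)


class Order(DomainObject):
    def related_cache_keys(self) -> List[str]:
        # A cached list of the customer's orders is stale once an order changes.
        return [f"OrderList:{self.data['customer_id']}"]
```

Business logic code only calls save(); the invalidation happens in the base class without the caller knowing a cache exists.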
Store Cache Metadata
For each item in the cache we set up a custom system for storing a set of metadata alongside the cached value. This metadata contains key system information such as related domain objects, creation times, and other keys to aid in cross-item or aggregate lookups in the cache.
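One possible shape for such an item, sketched in Python: the field names (related, created_at) and the lookup helper are assumptions made for illustration, not our actual metadata format.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class CacheItem:
    """Cached value wrapped with metadata (hypothetical field names)."""
    value: object
    related: FrozenSet[str] = frozenset()   # keys of related domain objects
    created_at: float = field(default_factory=time.time)


def keys_related_to(cache: Dict[str, CacheItem], domain_key: str) -> List[str]:
    """Cross-item lookup: every cache key whose metadata names domain_key."""
    return sorted(key for key, item in cache.items()
                  if domain_key in item.related)
```

With the metadata in place, a change to one domain object can be traced to every cached item that depends on it.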
Regions per customer
Our system can be multi-tenant, so we needed a way to distinguish the data of one customer from another. This can be accomplished in a few ways in AppFabric:
- Part of the cache key
- Separate region per tenant
- Separate cache per tenant
We decided against identifying the tenant in the cache key, as this would not allow us to administer one tenant’s data without affecting others. We chose a separate region per tenant over a separate cache per tenant because of the overhead of maintaining each cache, which we viewed as much heavier weight than a region. It is also possible to administer regions programmatically, which is not possible at the cache level.
The only downside of using a region per tenant is that AppFabric currently will not distribute a region across multiple servers, as it does with a separate cache. Right now this serves our needs, but we may re-evaluate this going forward.
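The administrative benefit of regions can be illustrated with a small in-memory stand-in. RegionedCache below is a hypothetical class written for this sketch; it is not the AppFabric client API.

```python
from typing import Any, Dict, Optional


class RegionedCache:
    """In-memory stand-in for a cache that groups items into named regions."""

    def __init__(self) -> None:
        self._regions: Dict[str, Dict[str, Any]] = {}

    def put(self, region: str, key: str, value: Any) -> None:
        # The region is created on first use in this sketch.
        self._regions.setdefault(region, {})[key] = value

    def get(self, region: str, key: str) -> Optional[Any]:
        return self._regions.get(region, {}).get(key)

    def clear_region(self, region: str) -> None:
        # One tenant's data is dropped without touching any other region.
        self._regions.get(region, {}).clear()
```

With the tenant embedded in the key instead, clearing one tenant would mean finding and removing every key that carries that tenant's prefix.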
Administrative cache
We envision a need for our operations staff to be able to administer the cache and troubleshoot issues. To that end, we built a mechanism into the cache access layer that gives us visibility into which regions have been created and are active at any time. Ironically, Microsoft did not include a programmatic way to retrieve a list of named regions from the cache.
We came up with a simple approach of creating an administrative cache area, separate from the main cache. Each time a region is added or removed, we update a key in the administrative region holding that region’s information. This allows us to provide a real-time view of what regions are active and then query the regions for the contained cache items if needed.
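A minimal sketch of this bookkeeping, again with an in-memory stand-in: the class name, the "__admin__" region name and the recorded fields are assumptions for illustration only.

```python
import time
from typing import Any, Dict, List

ADMIN_REGION = "__admin__"  # hypothetical name for the administrative region


class TrackedRegionCache:
    """Records every region create/remove in a separate administrative region."""

    def __init__(self) -> None:
        self._regions: Dict[str, Dict[str, Any]] = {ADMIN_REGION: {}}

    def create_region(self, name: str) -> None:
        if name == ADMIN_REGION:
            raise ValueError("the administrative region is reserved")
        if name not in self._regions:
            self._regions[name] = {}
            # One key per region, holding that region's information.
            self._regions[ADMIN_REGION][name] = {"created_at": time.time()}

    def remove_region(self, name: str) -> None:
        if name == ADMIN_REGION:
            raise ValueError("the administrative region is reserved")
        self._regions.pop(name, None)
        self._regions[ADMIN_REGION].pop(name, None)

    def put(self, region: str, key: str, value: Any) -> None:
        self._regions[region][key] = value

    def active_regions(self) -> List[str]:
        return sorted(self._regions[ADMIN_REGION])

    def keys_in(self, region: str) -> List[str]:
        return sorted(self._regions[region])
```

An operations tool can list active_regions() and then drill into a single region with keys_in(), without the cache itself having to offer a way to enumerate regions.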
Lessons Learned
Lack of programmatic administrative capabilities stinks
I’m not sure what the reason behind some of the gaps in the AppFabric API is, but I must say that the lack of a full-featured administrative API available from code stinks. Most of the administrative functions are only available via PowerShell. I can understand that this is likely targeted at the maintainers of the system, but quite honestly, how is someone supposed to create an application that can administer the cache without the need for PowerShell?
Local Cache with Notification based expiration – Not good enough
One of the features we were hoping to leverage with AppFabric was the local cache option. This allows a local cache to be kept where the client is accessing the cache (i.e. the web server in our case). If the main cache gets updated, the local cache gets invalidated. This seemed like a great way to boost performance while still maintaining our support for web-farm operation.
Unfortunately, the design of this feature doesn’t work quite the way we expected. The local cache has to poll the main cache to find out whether its items are invalid. This would allow for too much latency in the system (or too much network traffic if the polling interval were decreased), so we had to nix this. Too bad: if only the implementers had used some type of event-driven system so that the latency was lower, that would have been great.
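The trade-off can be seen in a toy model of a polling local cache. This is not how AppFabric is implemented internally; it is a simplified sketch with hypothetical names, using version numbers and an injectable clock so the staleness window is easy to observe.

```python
from typing import Any, Callable, Dict, Tuple

Versioned = Tuple[int, Any]  # (version, value)


class PollingLocalCache:
    """Local copy of a main cache that only checks for changes when it polls."""

    def __init__(self, main: Dict[str, Versioned], poll_interval: float,
                 clock: Callable[[], float]) -> None:
        self._main = main
        self._local: Dict[str, Versioned] = {}
        self._interval = poll_interval
        self._clock = clock
        self._last_poll = clock()

    def get(self, key: str) -> Any:
        now = self._clock()
        if now - self._last_poll >= self._interval:
            # Poll: keep only local entries whose version still matches.
            self._local = {
                k: entry for k, entry in self._local.items()
                if self._main.get(k, (None, None))[0] == entry[0]
            }
            self._last_poll = now
        if key not in self._local:
            self._local[key] = self._main[key]
        return self._local[key][1]
```

Between polls, an update to the main cache is invisible to the local copy. Shortening the interval narrows that window but means more round trips to the main cache.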
Your assumptions will be wrong – Test, Test, Test.
There is absolutely no way to know ahead of time whether a particular strategy will work without empirical testing.
Put some data in the database that represents your usage and test the scenarios – direct to database and data from cache. Make sure that your data access patterns return timings that are favorable to using the cache, otherwise don’t use it.
This was especially true in the system I worked on, where we were storing a complete set of an entity type in the cache, but clients only ever needed a subset of the data. We had assumed that this would be a poor candidate for caching in its entirety and were considering caching the subset variations. The tests we ran refuted this assumption and showed that de-serializing the entire list of entities and then pruning it in memory with LINQ was fairly performant, and spared us the burden of managing subset pieces of the entity type in the cache.
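A test of this kind does not need much code. The sketch below is an analogue in Python, with pickle standing in for the cache's serialization and a list comprehension standing in for LINQ; the entity shape and sizes are made up, and the database side would need to be timed the same way against representative data.

```python
import pickle
import timeit

# Made-up data set: the complete set of one entity type, cached as one blob.
entities = [{"id": i, "category": i % 10, "name": f"item-{i}"}
            for i in range(5000)]
cached_blob = pickle.dumps(entities)


def subset_from_cache(category: int) -> list:
    """De-serialize the whole list, then prune it in memory."""
    everything = pickle.loads(cached_blob)
    return [e for e in everything if e["category"] == category]


def measure(fn, *args, repeat: int = 5, number: int = 20) -> float:
    """Best observed time per call in seconds."""
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    seconds = measure(subset_from_cache, 3)
    print(f"subset from cached blob: {seconds * 1000:.3f} ms per call")
```

Running the same measure() helper over the direct-to-database path gives the two numbers that decide whether the item type belongs in the cache.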
References
Domain Models
http://msdn.microsoft.com/en-us/magazine/ee236415.aspx (Employing the Domain Model Pattern)
http://martinfowler.com/eaaCatalog/domainModel.html (P of EAA: Domain Model)
Windows Server AppFabric
Caching Patterns
http://www.alachisoft.com/resources/articles/domain-objects-caching-pattern.html (Distributed Caching and Domain Objects Caching Pattern for .NET)
http://www.ibm.com/developerworks/webservices/library/ws-soa-cachemed/ (Cache mediation pattern specification: an overview)
http://ljs.academicdirect.org/A08/61_76.htm (Caching Patterns and Implementation)