Wednesday, 12 March 2014

Caching - part 3

In the last part of this Caching series, I would like to discuss Virtual DB, the technique we use to build effective caching for database.

Caching on Database Tier 

Caching on Database tier share most of common attributes with Caching on business logic tier. In deed, if you treat DAO as one service, most of the techniques from earlier article can be applied.

However, due to one unique characteristic of relational database, caching on database tier has one special technique that only can be applied on this tier. This is Virtual DB.

What is Virtual DB?

To understand Virtual DB, let take a look at the sample below:

Imagine if we have a reporting table that contains the lowest and highest temperature for every day from beginning of this century. The system we build must allow user to enter start date, end date and provide lowest and highest temperature for that period.

With the requirement above, active caching is unrealistic due to huge volume of possible queries. Passive caching is also ineffective unless due to some reasons, users keep querying the same period.

The table size is small as we only have 14 years from beginning of this century and for each day, we only need to store 2 temperatures. It is the combination of data that explode the caching size. Moreover, the data store in the cache is highly repetitive. For example, if we already store lowest temperature of year 2012 and year 2013, the lowest temperature of from 2012 to 2013 should be known automatically without accessing DB.

Due to this, it is wise to build the data model inside caching engine and apply business logic to minimize the cache size. I simply call it Virtual DB as it function as a minimized database in memory. It is not realistic to store the whole DB in memory because database is built for scalability, caching do not need this, we only need performance. Because of different goals, we only need to store most frequent accessed records in Virtual DB. It is the developer responsibility to decide which query can be run on Virtual DB and which one is not.

Go back to earlier example, daily temperature do not change after being recorded; hence, it is safe to store these data to Virtual DB without worrying of frequently refreshing data. Still, developer can do polling to update the Virtual DB if there is new record. The polling should be smart enough to only fetch additional records rather than fully load the whole table.

It is not necessary to build the full data model. The real DB can contain humidity, whether there was storm or rain in that day but we can strip off not interested data. It is also not necessary to store all the data. We can choose to include only the daily temperatures for the last 3 years if observation confirm that most of queries fall within this period. The goal is to replace majority of queries not all possible queries. As can be shown the from diagram below, there are queries to both normal DB and Virtual DB.



How to build Virtual DB?

To build Virtual DB, choose a database system that can use memory as storage. For Java, the most well-known one is HSQLDB. If the systems crash and restarts, simply rebuild the DB as part of webapp starting process. It is a bit complex as we need to maintain 2 data sources in DAO and selectively choose which data source to run the query.

Kindly notice that Virtual DB is different from DB Replication. DB replication builds mirrors of DB. To simplify implementation, most of DB replication use master/slave model where the data is one-way synchronized from master to slave nodes. DB replication offer scalability but in term of performance, it is far slower than Virtual DB. Virtual DB even do not need to send request to Database, responses are generated within system memory.

The queries to Virtual DB can be hundreds of times faster than querying real DB. Therefore, if you manage to replace 80% of DB queries with Virtual DB queries, it will provide system a huge boost of performance. We have once improved the server load from few hundred requests per second to above 2000 requests per second using this technique.

Virtual DB can be used together with other techniques to improve performance. Of the system that we built that suffered the worst DDOS attack in 2011, both DB replica and Virtual DB are used. Generally, we do not like to mix the frequently write with frequently read data. No locking is better than both pessimistic and optimistic locking.

Conclusions

So far, we have walked through several well-known techniques to build caching. It is easy to learn the concept but not so easy to implement it properly in real life situation. Developer may need some practices and good understanding of business requirement to build effective caching.

Caching is better to be done with some investigation and analysis. In the future, I will provides some more articles about scalability and Load Analysis. We will need to use them to aid Caching design.