Saturday, 24 May 2014

Testing effectively

Recently, there is a heaty debate regarding TDD which started by DHH when he claimed that TDD is dead.
This ongoing debate managed to capture the attention of developers world, including us.

Some mini debates have happened in our office regarding the right practices to do testing.

In this article, I will represent my own view.

How many kinds of tests have you seen?

From the time I joined industry, here are the kinds of tests that I have worked on:

Unit Test
System/Integration/Functional Test
Regression Test
Test Harness/Load Test
Smoke Test/Spider Test

The above test categories are not necessarily mutually exclusive. For example, you can crate a set of automated functional tests or Smoke tests to be used as regression test. For the benefit of newbie, let do a quick review for these old concepts.

Unit Test

Unit Test aim to test the functional of a unit of code/component. For Java world, unit of code is the class and each Java class suppose to have an unit test. The philosophy of Unit Test is simple. When all the components are working, the system as a whole should work.

A component rarely work alone. Rather, it normally interacts with other components. Therefore, in order to write Unit Test, developers need to mock other components. This is the problem that DHH and James O Coplien criticize Unit Test for huge effort that gain little benefit.

System/Integration/Functional Test

There is no concrete naming as people often use different terms to describe similar things. Contradict to Unit Test, for functional test, developers aim to test a system function as a whole, which may involve multiple components.

Normally, for functional test, the data is retrieved and store to the test database. Of course, there should be a pre-step to set-up test data before running. DHH likes this kind of test. It helps developers test all the functions of the system without huge effort to set-up mock object.

Functional test may involve asserting web output. In the past, it is mostly done with htmlUnit but with recent improvement of Selenium Grid, Selenium became the preferred choice.

Regression Test

In this industry, you may end up spend more time maintaining system than developing new one. Software changes all the time and it is hard to avoid risk whenever making changes. Regression Test supposes to capture any defect that caused by changes.

In the past, software house did have one army of testers but the current trend is automated testing. It means that developers will deliver software with full set of tests that suppose to be broken whenever a function is spoiled.

Whenever a bug is detected, a new test case should be added to cover new bug. Developers create the test, let it fail, and fix the bug to make it pass. This practice is called Test Driven Development.

Test Harness/Load Test

Normal test case does not capture system performance. Therefore, we need to develop another set of tests for this purpose. In the simplest form, we can set the time out for the functional test that run in continuous integration server. The tricky part is this kind of test is very system dependant and may fail if the system is overloaded.

The more popular solution is to run load test manually by using profiling tool like JMeter or create our own load test app.

Smoke Test/Spider Test

Smoke Test and Spider Test are two special kinds of tests that may be more relevant to us. WDS provides KAAS (Knowledge as a Service) for wireless industry. Therefore, our applications are refreshed everyday with data changes rather than business logic changes. It is specific to us that system failure may come from data change rather than business logic.

Smoke Test are set of pre-defined test cases run on integration server with production data. It helps us to find out any potential issues for the daily LIVE deployment.

Similar to Smoke Test, Spider Test runs with real data but it work like a crawler that randomly click on any link or button available. One of our system contains so many combination of inputs that it is not possible to be tested by human (closed to 100.000 combinations of inputs).

Our Smoke Test randomly choose some combination of data to test. If it manage to run for a few hours without any defect, we will proceed with our daily/weekly deployment.

The Test Culture in our environment

To make it short, WDS is a TDD temple. If you create the implementation before writing test cases, better be quiet about it. If you look at WDS self introduction, TDD is mentioned only after Agile and XP

"We are:- agile & XP, TDD & pairing, Java & JavaScript, git & continuous deployment, Linux & AWS, Jeans & T-shirts, Tea & cake"

Many high level executives in WDS start their career as developers. That helps to fostering our culture as an engineering-oriented company. Requesting resources to improve test coverage or infrastructure are common here.

We do not have QA. In worst case, Product Owner or customers detect bugs. In best case, we detect bugs by test cases or by team mates during peer review stage.

Regarding Singapore office, most of our team members grow up absorbing Ken Beck and Martin Fowler books and philosophy. That why most of them are hardcore TDD worshipers.

The focus of testing in our working environment did bear fruits. WDS production defects rate is relatively low.

My own experience and personal view with testing

That is enough about self appraisal. Now, let me share my experience about testing.

Generally, Automated Testing works better than QA

Comparing the output of traditional software house that packed with an army of QA with modern Agile team that deliver fully test coverage products, the latter normally outperform in term of quality and even cost effectiveness. Should QA jobs be extinct soon?

Over monitoring may hint lack of quality

It sounds strange but over the years, I developed insecure feeling whenever I saw a project that have too many layer of monitoring. Over monitoring may hint lack of confidence and in deed, these systems crash very often with unknown reasons.

Writing test cases takes more time that developing features

DDH is definitely right on this. Writing Test Cases mean that you need to mock input and assert lots of things. Unless you keep writing spaghetti code, developing features take much less times compare to writing tests.

UI Testing with javascript is painful

You know it when you did it. Life is much better if you only need to test Restful API or static html pages. Unfortunately, the trend of modern web application development involve lots of javascripts on client side. For UI Testing, Asynchronous is evil.

Whether you want to go with full control testing framework like htmlUnit or using a more practical, generic one like Selenium, it will be a great surprise for me if you never encounter random failures.

I guess every developer know the feeling of failing to get the build pass at the end of the week due to random failure test cases.

Developers always over-estimate their software quality

It is applicable to me as well because I am an optimistic person. We tend to think that our implementation is perfect until the tests failed or someone help to point out a bug.

Sometimes, we change our code to make writing test cases easier

Want it or not, we must agree with DHH on this point. Pertaining to Java world, I have seen people exposing internal variable, creating dummy wrapper for framework object (like HttpSession, HttpRequest,...) so that it is easier to write Unit Test. DHH find it so uncomfortable that he chose to walk way from Unit Test.

On this part, I half agree and half disagree with him. From my own view, altering design, implementation for the sake of testing is not favourable. It is better if developers can write the code without any concern of mocking input.

However, aborting Unit Testing for the sake of having a simple and convenient life is too extreme. The right solution should be designing the system is such a way that business logic is not so tight-coupling with framework or infrastructure.

This is what called Domain Driven Design.

Domain Driven Design

For newbie, Domain Driven Design give us a system with following layers.

If you notice, the above diagram has more abstract layers than Rails or the Java adoption of Rails, Play framework. I understand that creating more abstract layers can cause bloated system but for DDD, it is a reasonable compromise.

Let elaborate further on the content of each layer:

Infrastructure

This layer is where you store your repository implementation or any other environment specific concerns. For infrastructure, keep the API as simple, dummy as possible and avoid having any business logic implemented here.

For this layer, Unit Test is a joke. If there is any thing to write, it should be integration test, which working with real database.

Domain

Domain layer is the most important layer. It contains all system business logics without any framework, infrastructure, environment concern. Your implementation should look like a direct translation of user requirements. Any input, output, parameter are POJO only.

Domain layer should be the first layer to be implemented. To fully complete the logic, you may need interface/API of the infrastructure layer. It is best practice to keep the API in Domain Layer and concrete implementation in Infrastructure layer.

The best kind of test cases for Domain layer is Unit Test as your concern is not the system UI or environment. Therefore, it helps developers to avoid doing dirty works of mocking framework object.

For mocking internal state of object, my preferred choice is using Reflection utility to setup object rather than exposing internal variables through setters.

Application Layer/User Interface

Application Layer is where you start thinking about how to represent your business logic to customer. If the logic is complex or involving many consecutive requests, it is possible to create Facades.

Reaching this point, developers should think more about clients than the system. The major concerns should be customer's devices, UI responsiveness, load balance, stateless or stateful session, Restful API. This is the place for developers to showcase framework talent and knowledge.

For this layer, the better kind of test cases is functional/integration test.

Similar as above, try your best to avoid having any business logic in Application Layer.

Why it is hard to write Unit Test in Rails?

Now, if you look back to Rails or Play framework, there is no clear separation of layers like above. The Controllers render inputs, outputs and may contains business logic as well. Similar behaviours applied if you use the ServletAPI without adding any additional layer.

The Domain object in Rails is an active record and has a tight-coupling with database schema.

Hence, for whatever unit of code that developers want to write test cases, the inputs and output are nots POJO. This make writing Unit Test tough.

We should not blame DHH for this design as he follow another philosophy of software development with many benefits like simple design, low development effort and quick feedback. However, I myself do not follow and adopt all of his ideas for developing enterprise applications.

Some of his ideas like convention over configuration are great and did cause a major mindset change in developers world but other ideas end up as trade off. Being able to quickly bring up a website may later turn to troubles implementing features that Rails/Play do not support.

Conclusion

Unit Test is hard to write if you business logic is tight-coupling to framework.
Focusing and developing business logic first may help you create better design.
Each kinds of components suit different kinds of test cases.

This is my own view of Testing. If you have any other opinions, please feedback.

Friday, 23 May 2014

Software Development and Newton's Laws of Motion

Intro

I have no idea since when the word velocity found a new home in software development, it is nevertheless popular these days. However I am pretty sure that Mr Isaac Newton would not be happy if you talk about motion without mentioning his laws.

First Law

When viewed in an inertial reference frame, an object either remains at rest or continues to move at a constant velocity, unless acted upon by an external force.

There are a lot of external forces

developers are fixing bugs
developers are adding new features
developers are introducing more bugs (lol)
business requests to cut down the operation cost
third party competition is changing the market
users are changing
this list goes on and on

However a team/product is either dead (therefore remains at rest) or is moving at a constant velocity (let's say generating certain amount of revenue or eating certain amount of buget per day).

Now I declare, it is against the law to talk about team velocity, because what should you do to maintain the team's velocity? Nothing, you should do nothing!

Well, that will upset most of the managers, "I'd rather my developers do something".

So we need another law.

Second Law

F = ma. The vector sum of the forces F on an object is equal to the mass m of that object multiplied by the acceleration vector a of the object.

Acceleration is the ability to change the velocity. The F is treated as a constant here, because, come on, let's be honest, your team is pretty much fix sized, unless you are Google. Your time is pretty much fixed to 24 hours per day unless you live on Mars which is slightly longer, 24.622962 hours to be exact. Now we are screwed ... there is only one variable left to play. According to second law, for a given force F, the acceleration is inversely proportional to the mass. Mass is the burden, it is going against acceloration.

Here is a short list of how to gain some mass

too many good-to-have features
too much technical debt
too many abstractions, layers upon layers, ORM, DAO, service, controller, view. We need all of them to get some trivial {"user_id": 123} out of that database. oh forget to mention, there is SQL, and NoSQL ...
too many processes
too many patterns, EnterprisyStrategyFactoryBuilderAdapterListenerInterceptor
too many communication delegations, business -> project manager -> business analyst -> team leader -> developer (add more roles at your own will)
too many frameworks. JavaEE, Spring, Hibernate, Struts, Bootstrap, jQuery, Angular.js, Ember.js. Dare to lookup JavaEE? There are 39 JSRs listed under JavaEE7!
too many servers. Web servers, relational database servers, NoSQL servers, cache servers, message queue servers, third party integration servers ...

Yet, in the end you do want to make a change, do you? If your answser is NO, grats, you can stop reading here. Even the answer is yes, you can only say so after you read the third law.

Third Law

To every action there is always opposed an equal reaction: or the mutual actions of two bodies upon each other are always equal, and directed to contrary parts.

A: "Can we remove feature XYZ? so that the codes can be greatly simplified"
R: "Please no, that is Shareholder ABC's favorite"
A: "Ooookie, nvm"

A: "Can we change to git?"
R: "Nah, zip and email is our best friend"
A: "Maybe next time"

A: "Can we upgrade java 1.4?"
R: "There are too many servers in production"
A: "Fine, let's stick to manual casting"

Aaaaah, I still want to type some more words but there is an equal reaction preventing me from doing that ... So let's call this a day.

Thanks for wasting your time reading my rants.

Happy Coding ...

Reference

http://en.wikipedia.org/wiki/Velocity_(software_development)
http://en.wikipedia.org/wiki/Newton's_laws_of_motion

Wednesday, 21 May 2014

MySQL Transaction Isolation Levels and Locks

Recently, an application that my team was working on encountered problems with a MySQL deadlock situation and it took us some time to figure out the reasons behind it. This application that we deployed was running on a 2-node cluster and they both are connected to an AWS MySQL database. The MySQL db tables are mostly based on InnoDB which supports transaction (meaning all the usual commit and rollback semantics) as well as row-level locking that MyISAM engine does not provide. So the problem arose when our users, due to some poorly designed user interface, was able to execute the same long running operation twice on the database.

As it turned out, due to the fact that we have a dual node cluster, each of the user operation originated from a different web application (which in turn meant 2 different transaction running the same queries). The deadlock query happened to be a “INSERT INTO T… SELECT FROM S WHERE” query that introduced shared locks on the records that were used in the SELECT query. It didn’t help that both T and S in this case happened to be the same table. In effect, both the shared locks and exclusive locks were applied on the same table. An attempt to explain the possible cause of the deadlock on the queries could be explained by the following table. This is based on the assumption that we are using a default REPEATABLE_READ transaction isolation level (I will explain the concept of transaction isolation later)

Assuming that we have a table as such

RowId	Value
1	Collection 1
2	Collection 2
…	Collection N
450000	Collection 450000

The following is a sample sequence that could possibly cause a deadlock based on the 2 transactions running an SQL query like “INSERT INTO T SELECT FROM T WHERE … “ :

Time	Transaction 1	Transaction 2	Comment
T1	Statement executed		Statement executed. A shared lock is applied to records that are read by selection
T2	Read lock s1 on Row 10-20		The lock on the index across a range. InnoDB has a concept of gap locks.
T3		Statement executed	Transaction 2 statement executed. Similar shared lock to s1 applied by selection
T4		Read lock s2 on Row 10-20	Shared read locks allow both transaction to read the records only
T5	Insert lock x1 into Row 13 in index wanted		Transaction 1 attempts to get exclusive lock on Row 13 for insertion but Transaction 2 is holding a shared lock
T6		Insert lock x2 into Row 13 in index wanted	Transaction 2 attempts to get exclusive lock on Row 13 for insertion but Transaction 1 is holding a shared lock
T7			Deadlock!

The above scenario occurs only when we use REPEATABLE_READ (which introduces shared read locks). If we were to lower the transation isolation level to READ_COMMITTED, we would reduce the chances of a deadlock happening. Of course, this would mean relaxing the consistency of the database records. In the case of our data requirements, we do not have such strict requirements for strong consistency. Thus, it is acceptable for one transaction to read records that are committed by other transactions.

So, to delve deeper into the idea of Transaction Isolation, this concept has been defined by ANSI/ISO SQL as the following from highest isolation levels to lowest

Serializable

This is the highest isolation level and usually requires the use of shared read locks and exclusive write locks (as in the case of MySQL).
What this means in essence that any query made will require access to a shared read lock on the records which prevents another transaction’s query to modify these records. Every update statement will require access to an exclusive write lock
Also, range-locks must be acquired when a select statement with a WHERE condition is used. This is implemented as a gap lock in MySQL.
Repeatable Reads

This is the default level used in MySQL. This is mainly similar to Serializable beside the fact that a range lock is not used. However, the way that MySQL implements this level seemed to me a little different. Based on Wikipedia’s article on Transaction Isolation, a range lock is not implemented and so phantom reads can still occur. Phantom reads refer to a possibility that select queries will have additional records when the same query is made within a transaction. However, what I understand from MySQL’s document is that range locks are still used and the same select queries made in the same transaction will always return the same records. Maybe I’m mistaken in my understanding and if there’s any mistakes in my intepretations, I stand ready to be corrected.
Read Committed

This is an isolation level that will maintain a write lock until the end of the transaction but read locks will be released at the end of the SELECT statement. It does not promise that a SELECT statement will find the same data if it is re-run again in the same transaction. It will, however, guarantee that the data that is read are not “dirty” and has been committed.
Read Uncommitted

This is an isolation level that I doubt would be useful for most use cases. Basically, it allows a transaction to see all data that has been modified, including “dirty” or uncommitted data. This is the lowest isolation level

Having gone through the different transaction isolation levels, we could see how the selection of the Transaction Isolation level determines the kind of database locking mechanism. From a practical standpoint, the default MySQL isolation level (REPEATABLE_READ) might not always be a good choice when you are dealing with a scenario like ours where there is really no need for such strong consistency in the data reads. I believe that by lowering the isolation level, it is likely to reduce chances that your database queries meet with a deadlock. Also, it might even allow a higher concurrent access to your database which improve the performance level of your queries. Of course, this comes with the caveat that you need to understand how important consistent reads are for your application. If you are dealing with data where precision is paramount (e.g. your bank accounts), then it is definitely necessary to impose as much isolation as possible so that you would not read inconsistent information within your transaction.

Monday, 12 May 2014

How to build Java based cloud application

Recently, we were tasked to develop a SAAS application for big data analysis. To do data mining, the system need to store multi billion public posts in the database and run the classification process on them.

Classification in our context is a slow, resource intensive and painful process to assign a topic or sentiment to any record in the database. The process can last up to 24 hours with our testing data.

To cope with these requirements, our obvious choice is to build a cloud application on Amazon Web Services. After working on the project for a while, I want to share my own thought, understanding and approach to build Java based cloud application.

What is Cloud Computing

Let start with Wikipedia first:

"Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time."

The definition may be a bit ambiguous but it is understandable as In The Cloud itself is more of a marketing term rather than technical term. For a newbie, it is easier to understand if we define it with a more practical way:

The only difference between traditional web application with the cloud web application is the ability to scale perfectly. Cloud application should be able to cope with unlimited amount of works given unlimited hardware.

Cloud application is getting popular nowadays because of higher requirement for modern application. In the past, Google is famous for building high scale application that contains almost all available information in the internet. However, for now, many other corporates need to build applications that serve similar scale of data and computation (Facebook, Youtube, LinkedIn, Twitter,.. and also the people who crawl and process their data like us).

This amount of data and processing cannot be achieved with the traditional way of developing application. That lead us to an entirely different approach to build application that can scale very well. This is cloud application.

Why traditional approach of developing web application does not scale well enough

Traditional Approach of developing web application

Let take a look on why traditional application cannot serve that scale of data.

If you have developed one traditional web application, it should be pretty much similar to the diagram above. There are some other minor variations as merging of application server and web server or multiple enterprise servers. However, most of the time, the database is relational. Web servers are normally stateful while enterprise servers can serve both stateless and stateful services.

There are some crucial weaknesses that cause this architect does not scale well enough. Let start our analysis with defining perfect scalability first.

Perfect scalability can be achieved if a system can always provide identical response time for double amount of work given double amount of bandwidth and double amount of hardware.

Perfect scalability cannot be achieved in real life. Rather, developers only aim to achieve near perfect scalability. For example, DNS servers are out of our control. Hence, theoretically, we cannot serve higher amount of requests than the DNS servers. This is the upper bound for any system, even Google.

SQL

Come back to the diagram above, the biggest weakness is the database scalability. When the amount of requests and size of data are small enough, developers should not notice any performance impact when increasing load. Continue to increase the load higher, the impact can be very obvious, if the CPU is 100% utilized or memory fully occupied. At this point, the most realistic option is to pump more memory and CPU to the database system. After this, the system may perform well again.

Unfortunately, this approach cannot be repeated forever whenever problems arise. There will be a limit where no matter how much ram and CPU you have, performance will slowly getting worse. This is expectable because you will have some certain records that need to be create, read, update, delete (CRUD) by many requests. No matter whether you choose to cache them, store them on memory or do whatever trick, they are unique records, persisting in a single machine and there is a limit on amount of access requests that can be sent to a single memory address.

This is the unavoidable limit as SQL is built for integrity. To ensure integrity, it is necessary that any information in SQL server should be unique. This characteristic still applicable even after data segregation or replication are done (at least for the primary instance).

In contrast, NoSQL does not attempt to normalize data. Instead, it chooses to store the aggregate objects, which may contain duplicated information. Therefore, NoSQL is only applicable if data integrity is not compulsory.

Above example (from couchbase.com) shows how data is stored in a document database versus relational database. If a family contains many members, relational database only store a single address for all of them while NoSQL database simply replicate the housing address. When a family relocate, the housing addresses of all members may not be updated in a single transaction, which cause data integrity violation.

However, for our application and many others, this temporary violation is acceptable. For example, you may not need the amount of page views on your social page or amount of public posts in a social website to be 100% accurate.

Data duplication effectively removes the concurrent access to a single memory address that we mentioned above and give developers the option to store data anywhere they want, as long as the changes in one node can be slowly synced up to other nodes. This architect is much more scalable.

Stateful

The next problem is stateful service. Stateful service requires the same set of hardware to serve requests from the same client. When the amount of clients increase, the best possible move is to deploy more application servers and web servers into the system. However, the resource allocation cannot be fully optimized with stateful services.

For traditional applications, load balancer does not have any information of system load and normally spread the requests to different servers using Round Robin technique. The problem here is not all requests are equals and not all clients are sending identical amount of requests. That cause some servers are heavily overloaded while others are still idle.

Mixing of data retrieval and processing

For traditional applications, the server that retrieve data from database ends up processing it. There is no clear separation of processing data and retrieving data. Both of the two tasks can cause bottle neck to the system. If the bottle neck come from data retrieval, data processing is under-utilized and vice versa.

Rethinking best approaches to build scalable application

Look at what have been adopted in our IT fields recently, I hardly found them as new inventions. Rather, they are adoption of the practices that have been used succesfully in real life to solve scalability issue. To illustrate this, let imagine a real life situation of tackling scalability issue.

Hospital

Assume that we have a small hospital. For our hospital, we mostly serve loyal customers. Each loyal customer have a personal doctor, who keeps track of his/her medical record. Because of this, customers only need to show the ICs to be served by the preferred doctors.

To make things challenging, our hospital is functioning before the internet era.

Stateless versus stateful

Is the description above look similar enough to stateful service? Now, your hospital is getting famous and the amount of customers suddenly surges. Provide that you have enough infrastructure, the obvious option is to hire more doctors and nurses. However, customers are not willing to try out new doctors. That cause the new staffs are free while old staffs are busy.

To ensure optimization, you choose to change the hospital policy so that the customers must keep their medical records and the hospital will assign them to any available doctors. This new practice helps to resolve all of your headache and give you the option to deploy more seasonal staffs to cope with sudden surge of clients.

Well, this policy may not make the customers happy but for IT fields, stateless and stateful services provide identical results.

Data Duplication

Let say the amount of customers constantly surge and you start to consider opening more branches. At the same time, there is a new rising problem that customers constantly complain about the need of bringing medical records while visiting hospital.

To solve this problem, you come back to the original policy of storing the medical records at the hospital. However, as you are having more than one branch, each branch need to store a copy of user medical records. At the end of the day or the week, any record change need to be synced to every branch.

Separation of Services

After running the hospital for a few months, you recognize that the resources allocation are not very optimized. For example, you have blood test and X-ray faculty in both branch A and B. However, there are many customer doing blood test in branch A and many people taking X-ray in branch B.

It cause the customers keep waiting in one branch, while no one visit the other branch. To optimize resource, you shutdown the under-utilized faculties and setup unique blood test centre and X-ray centre. Customers will be sent from the branches to the specialized centres for special services.

Adhoc Resource

It is hard to do resource planning for hospital. There are seasonal diseases that only happens at a certain time of the year. Moreover, catastrophe may happen any time. They cause sudden surge of warded patients for a short period. To cope with this, you may want to sign agreement with the city council to temporarily rent facilities when needed and hire more part-time staffs.

Apply these ideas to build cloud application

Now, after looking at the example above, you may feel that most of the ideas make sense. It only take a short while before developers start to apply these ideas into building web application.

Then, we move to the cloud application era.

How to build cloud application

To build a cloud application, we need to find way to apply the mentioned ideas into our application. Here is my suggest approach

Infrastructure

If you start to think about building cloud application, infrastructure is the first concern. If your platform does not support adhoc resource (dynamically bursting of existing server spec or spawning new instance), it is very hard to build cloud application.

At the moment, we choose AWS because it is the most matured platform in the market. We have moved from internal hosting to AWS hosting one year ago due to some major benefits

Mutiple Locations: Our customers are coming from all 5 continents, using Amazon Region, we can deploy the instance closer to customer location, through that, reduce the response time.
Monitoring & Auto Scaling: Amazon offers quite a decent monitoring service for their platform. Due to server load, it is possible to do Auto Scaling.
Content Delivery Network: Amazon CloudFront give us the options to offload static contents from our main deployment, which will improve page load time. Similar to normal instances, static contents can be served from the nearest instances to customer.
Synchronized & Distributed Caching: MemCache has been our preferred caching solution over the years. However, one major concern is the lack of support for synchronization among the nodes. Amazon Elastic Cache give us the option to use MemCache without worrying about node synchronization
Management API: This is one major advantage. Recently, we start to make use of Management API to spawn up instance for a short while to run integration test.

Database

Provide that you have select the platform for developing cloud application, the next step should be selecting the right database for your system. The first decision you need to make is whether SQL or NoSQL is the right choice for your system. If the system is not data intensive, SQL should be fine, if the reverse is true, you should consider NoSQL.

Sometimes, multiple databases can be used together. For example, if we want to implement a Social Network application like Facebook, it is possible to store system settings or even user profiles in SQL database. In contrast, user posts must be stored in the NoSQL database due to huge volume of data. Moreover, we can choose SOLR to store public posts due to strong searching capability and Mongo DB for storing of user activities.

If possible, please choose the database system that support clustering, data segregation and load balancing. If not, you may end up implement all of these features yourself. For example, SOLR should be the better choice compare to Lucene unless we want to do our own data segregation.

Computing Intensive or Data Intensive

It is better if we know that the system is data intensive or computing intensive. For example, Social Network like Facebook is pretty much data intensive while our big data analysis are both data intensive and computing intensive.

For data intensive system, we can let any node in the cloud retrieve data and do processing as well. For computing intensive node, it is better to split out data retrieval and data processing.

Data intensive system normally serve real-time data while computing intensive system run the background jobs to process data. Mixing these two heavy tasks in the same environment may end up reducing system effectiveness.

For computing cloud, it is better to have a framework to monitor load, distribute tasks and collect results at the end of computing process. If you do not need the processing to be real time, Hadoop is the best choice in the market. If real time computation is required, please consider Apache Storm.

Design Pattern for Cloud Application

To build a successful Cloud Application, there are something that we should keep in mind.

1. Stateless

It is a must to make all your services and server stateless. If the service need user data, include them as parameter in the API.

It is worth noticed that to implement Stateless Session on Web Server, we have a few choices to consider:

Cookie based session
Distributed Cache session
Database Session

The solutions above are sorted from up to down with lower scalability but easier management.

2. Idempotence

For Cloud Application, most of the API call will happen through the network rather than internal method calls. Therefore, it is better if we can make the method calls safe. If you stick to the Stateless principle above, it is likely that the services you implement are already idempotent.

3. Remote Facade

Remote Facade is different with Facade pattern. They may look similar in term of practice but aim to fix different problems. As most of your API calls happen over the network, the network latency contribute a great part to the response time. With Remote Facade pattern, developers should build a coarse-grained API so that the amount of calls can be reduced.

In layman's terms, it is better to go to supermarket and buy 10 things in one shot rather than visit 10 times, each time buy 1 thing.

4. Data Access Object

As you may transfer the data around, be careful with the amount of data you transfer. It is best to only give the minimum data as required.

5. Play Safe

This is not a design pattern but you will thanks yourself for playing safe in the future. Due to the nature of distributed computing, when something go wrong, it is very difficult to find out which part is wrong. If possible, implement health check, ping, thoroughly logging, debug mode to every component in the system.

Conclusion

I hope this approach to build Cloud Application can bring some benefit to everyone. If you have other opinions or experience, kindly feedback and share with us.

In the next article, I will share the design of our Social Monitoring Tool.

Sunday, 4 May 2014

From Scrum to Kanban

This month marks one year from the time we switched from Scrum to Kanban. I find it is a good time for us to review the impact of this change.

Our Scrum

I have experienced two working environment that practice Scrum and still they are quite different. That why it may be more valuable if we start with sharing of our Scrum practice.

Iteration

Our iteration is 2 weeks long. I am quite satisfied with the duration as one week is a bit too short to develop any meaningful story and 1 month is a bit too long to plan or to do retrospective.

Our iteration start with the first Monday morning retrospective. In the same day after noon, there is iteration planning. For the rest of the iteration, we do coding as much we want.

Our product owner request us to do two rounds of demo, soft demo on the last Wednesday of iteration, where we can show the newly developed features on development machine or Stage environment. On the last day of iteration, we suppose to do final demo on UAT environment in order to get the stories accepted.

Agile emphasize on adapting to change, but we still do T+2 planning (two iterations ahead). With this practice, we know quite well what is going to be delivered or to be worked on for at least one month ahead. If there is urgent work, the iteration will be re-planned and some stories will be pushed back to next iteration.

Daily Life

Our daily life starts with a morning alarm. Some ancient coders in the past set the rule of using alarm for office starting hour. Anyone come to office after the alarm ring will have the honour to donate 1 dollar to the team fund. This fund can be used to host retrospective outdoor or to buy coffee. To be honest, I like this idea, even it effectively cut 22 dollars to my monthly income.

15 minutes later, we have another alarm for the daily stand-up. This short period supposed to be used to read email and catchup with what happen overnight. Our team bases in Asia but is actively working with project stakeholders in Europe and US. That why we need this short email checking session to have a meaningful stand-up.

It is not really Scrum practice, but like most of other corporate environments, we need to fill up time-sheet at the end of the day. Using time-sheet, we keep track of the time spent versus the estimated effort and use that to calculate velocity.

Roles

As specified by Scrum , we have development team and product owner. In our company, product owner are called Capability Manager. At the moment, our management are discussing whether they should split Capability Manager to two roles, one focus on technical aspect of product and the other solely focus on business aspect.

We do not have Scrum master, instead, we have Release Manager. This role is a bit confusing because it does not appear in any practice. In our environment, Release Manager work more like the traditional Project Manager. Not all the projects we have Release Manager but for some bigger scale projects, Release Manager can be quite useful and quite busy as well. Most of our products are SAAS applications, and some successful products can have more than 100 customers worldwide. Capability Manager can focus on product features and let the Release Manager deal with story planning, customer deadline and minor customization.

There is also one more discussion on whether Release Manager job requires technical background as they need to do iteration planning and some stories are technically related.

Tools

We use mixture of Excel spreadsheet, Jira and Rally in our daily life.

Jira is the leftover tool of the past, before we move to Rally. Now, we only use Jira to track support tickets and defects.

Rally is the online platform for Agile practice with built-in support for iteration, story, defect, release, backlog,..

Even with these tool, we cannot avoid using the old day spreadsheet to keep track of team resources (team resource pipeline) and do resource planning (resource matrix) as well.

Due to resource scarcity, we still have multi-tasks team that deal with few projects and few product owners at the same time. Periodically, the release managers need to sit together and bargain for their resource next few iterations.

Spirit

As one of my friend always say, Scrum is more about spirit rather than practices. I can't agree more with this. Applying Scrum is more about doing things with Scrum mindset rather than strictly following written practices. Personally, I feel we are applying Scrum quite well.

At first, in the team standup, we try our best to avoid making it look like a progress report but information sharing and collaborating session. Once in a while, the standup last more than default 15 minutes because developers spend time elaborating ideas and discussing on the spot. Release Manager or Product Owner do not join our daily standup.

Our retrospective is a close door activity, which only involve team member. Both Release Manager and Product Owner will not join us unless we call them in to ask for information. Each team member takes turn to be the facilitator. The format of retrospective is not fixed. It is up to the facilitator imagination to decide what will we do for the retrospective. The rest just sit down, relax and wait to see what will happen next.

The planning sessions includes tasking and Poker Style estimation game. It is up to the team to re-estimate (we estimate one time when the story still in backlog), verify the assumption and later arrange and commit the story to fit team resource for this iteration nicely. Sometimes, we have a mini debate if there is big gap between team member estimations.

Why we moved to Kanban

You may wonder if our Scrum work so well, why did we move to Kanban. Well, it was not our team decision. Kaban was initiated at UK headquarters and spread to other regions. However, working with Scrum is not all perfect, let I share with you some of problems that we are facing.

Resource Utilization at the end of iteration

This problem may not be very severe in our office but it is a big concern in other regions. Due to technical difficulties, sometimes, estimation is very far from spent effort. This leave a big gap at the end of iteration. It may be good if the gap is big enough to schedule another story but most of the time, it does not. This creates the low productivity issue that management want to fix. They hope removing iteration will remove this virtual gap and let the developers focus on delivering work.

The pressure from iteration commitment

By committing to the planned stories in the iteration, we are under the pressure to deliver it. The stories were estimated with 2 weeks duration for development but we normally need to deliver them faster to match the soft demo on Wednesday and final demo on Friday.

To make thing worse, our Web Service team is in other region and we need to raise the deployment ticket one day in advance to get things done. If the deployment ticket failed, we need one more day to redeploy. The consequence is whether we develop too fast to meet the deadline or we follow the estimation, then miss the commitment.

Another concern is the pressure to estimate and commit to something developers don't know so well and still be punished for missing the commitment. This creates the defensive mindset where developers will try to include a safety buffer on any estimation they make.

Then, our Kanban

Life is not so much different when we move to Kanban. For the good, we have the budget to buy a big screen. For the bad, we do not do iteration planning any more. However, we still keep our retrospective on first Monday morning.

Kanban board

Now, we open the Kanban board in Rally to track our development progress.

We create our Kanban board with 7 columns, which reflect our working process

None (equals to backlog)
Tasking
Building
Peer Review (only after Stage deployment)
Deploy to UAT
Acceptance (story is signed of by Product Owner)
Deploy to LIVE

The product owner creates stories in backlog, which will be pulled to Tasking column by Release Manager. After that, it is development team responsibility to move this story to Deploy to UAT column. After that, it is product owner responsibility to verify and accept it. If there is any feedback, the story will be put back to Building column. Otherwise, it is signed off and ready to be deploy to Production. It is up to the Release Managers when they want to deploy the accepted feature to Live environment.

As Kanban practice, we want to limit multitasking and set the threshold of the capacity for each column. As we do pairing, with 8 developers in our team, the threshold for each column suppose to be no more than 4. However, this is easier to say than do as stories are often blocked by external factor and we need to work on something else.

Planning

There is no iteration planning any more. Rather, we do planning whenever there is new story in Tasking column. The story is both tasked and estimated by one pair rather than collecting inputs from the whole team.

What is a bit unnatural is due to our multi-tasking nature, one pair do not follow one story from Tasking until Deploy to UAT. To deal with this, we often need to come back to the pair that do tasking to ask for explanation.

Demo

We still need to estimate but there is not fixed time for demo. In the regular meeting between team and Release Manager, the most asked question is "Do you have anything to demo today?" and the most popular answer is "No".

Estimation

When aborting Scrum, we also abort Story Point Estimation. We still count the spent effort versus estimated effort but it only for reference. From last year, we moved back to estimation by pair day.

Our feeling

So, how do we feel after one year practising Kanban?

I think it is a mixture feeling. On the good side, there are less thing to worry about, less commitment to keep and better focus on development. Plus, we have the big screen to look at it every morning.

However, things are not all rosy. I do not know whether we do Kanban the wrong way or it is just the natural of Kanban, developers do not follow one story from beginning until the end. One guy may task the story this way following his skills set and someone else will end up delivering the work.

Moreover, I feel Kanban treating every developer equal, which is not so true. If there is one story available in Building Column and you are free, you must take the story, no matter you have the skill or not. It hamper the productivity of the team. However, it also can be positively viewed as Kanban fostering skills and knowledge sharing among developers.

Moving to Kanban also causes developers spending more time on story development. There is no pressure to cut corner to deliver but there is also a tendency to over-deliver good to have features, which are not included in the Acceptance Criterias.

That is for us, for Release Manager, they seem to be not so happy with the transition. Lack of iteration only make their planning more ambiguous and difficult.