Developers Corner, by Tony Nguyen<br />
<h2>
High volume, low latency system</h2>
We are currently in the process of optimizing the bidding platform for PocketMath, one of the largest supplies of mobile programmatic inventory in the world. With a fleet of 15 to 35 bidders around the world, this platform helps to serve 40 to 70 billion requests per day. The latency of the system is pretty good, with the 95th percentile of response time falling below 2 milliseconds.<br />
<br />
The optimization process gives us a precious opportunity to think about which factors are crucial to building a high performance, low latency system.<br />
<br />
In this article, let us share the principles that guide our development process.<br />
<br />
<h3>
Background</h3>
<br />
As some of you may already know, <b><a href="https://en.wikipedia.org/wiki/Real-time_bidding">Real-time bidding</a></b> (RTB) is a means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction. In this ecosystem, PocketMath is a <a href="https://en.wikipedia.org/wiki/Demand-side_platform">Demand-side Platform</a> (DSP), which helps buyers purchase impressions from Ad Exchanges.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVuuGWtKfXoDfcg94kim33UIhETjRhUp5vISXYlCs-PZIimy_0cI6mBHxoFKIRcEpBPZENjzeY3Qg_GrBv3YIFyPSpJUeTBlCTLgNsJ8i3T1TNUu59m0Q9cl_Y5dHbBl9VofeTKw1V5CtV/s1600/dsp.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVuuGWtKfXoDfcg94kim33UIhETjRhUp5vISXYlCs-PZIimy_0cI6mBHxoFKIRcEpBPZENjzeY3Qg_GrBv3YIFyPSpJUeTBlCTLgNsJ8i3T1TNUu59m0Q9cl_Y5dHbBl9VofeTKw1V5CtV/s400/dsp.jpg" width="400" /></a></div>
<br />
<br />
Because all of the buying and selling happens per impression, the latency requirement of RTB is very strict. Most Ad Exchanges will not accept any response from a DSP after 100 milliseconds. This time constraint is quite tight once we take the network round-trip into consideration. Normally, network transfer contributes more to the total response time than bid-request processing does. Therefore, to reduce the timeout risk, a DSP will normally self-impose a much lower limit.<br />
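To make the idea of a self-imposed limit concrete, here is a minimal Java sketch (the class and method names are ours, purely illustrative, and the budget value is a placeholder) that runs a bid handler under a hard deadline and falls back to a no-bid answer when the deadline is exceeded:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BidDeadline {
    // Daemon threads so a lingering handler cannot keep the process alive.
    private static final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Runs the handler with a self-imposed budget well below the exchange's
    // 100 ms limit, leaving headroom for the network round-trip.
    static String bidWithBudget(Callable<String> handler, long budgetMillis) {
        Future<String> future = pool.submit(handler);
        try {
            return future.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // skip this request: opportunity lost, not a wasted timeout
            return "no-bid";
        } catch (Exception e) {
            return "no-bid";
        }
    }
}
```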
<div>
<br /></div>
<div>
Besides fast response time, another requirement for a DSP is stability. It is a common practice for an Ad Exchange to throttle the traffic to a DSP if timeouts happen too often.</div>
<div>
<br />
<h2>
Architecture Guideline</h2>
<br />
As a design mistake can be very costly in the long run, it is better to get it right from the start.<br />
<h3>
<br />Knowing the limit</h3>
<br />
Compared to other domains like FinTech, the non-functional requirements of RTB are quite special: the maximum response time is enforced very strictly, but processing every request is not mandatory. However, the server should still send a no-bid response when it intends to skip a request. In RTB, skipping processing only causes lost opportunity, which is not too bad compared to what may happen in a mission-critical system. Failing to process a request in time is much worse, because it causes not only lost opportunity but also wasted resources.<br />
<br />
Therefore, we designed our system to always operate at its optimal throughput regardless of the traffic volume. To ensure each component in the system functions at its optimal level, surge protectors are added on the client side of each component so that additional load can be automatically discarded as early as possible.<br />
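A surge protector of this kind can be as simple as a semaphore that bounds the number of in-flight requests; anything beyond the bound is shed without queueing. The sketch below is a hypothetical illustration (the names are ours, not the production code):

```java
import java.util.concurrent.Semaphore;

public class SurgeProtector {
    private final Semaphore inFlight;

    public SurgeProtector(int maxInFlight) {
        this.inFlight = new Semaphore(maxInFlight);
    }

    // Returns true if the request was processed, false if it was shed.
    // tryAcquire() never blocks, so excess load is discarded immediately.
    public boolean tryProcess(Runnable handler) {
        if (!inFlight.tryAcquire()) {
            return false;          // over capacity: drop as early as possible
        }
        try {
            handler.run();
            return true;
        } finally {
            inFlight.release();
        }
    }
}
```

Because the rejection path does no work at all, the component keeps its designed throughput no matter how much extra load arrives.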
<br />
Every quarter, while reviewing the volume of inventory, we also adjust hardware and calibrate all the rate limiters to keep the system running at the best value for money. In case the load suddenly surges due to a traffic spike or a campaign configuration change, the system should still continue to process traffic at its designed capacity and skip the load it cannot handle. The response time stays consistent regardless of the load.<br />
<h3>
<br />Knowing when to apply microservices architecture</h3>
<br />
Microservices architecture gives us a lot of flexibility to develop and maintain the system. However, it also adds network latency to time-critical tasks. Therefore, we need to think twice before applying microservices architecture to our system. For time-critical requests, it is better to minimize the number of network hops the information needs to travel before a response can be generated. We keep the component that faces the exchanges as a near-monolithic application. It is a huge component, with lots of logic and information embedded, that processes the majority of requests on its own. Only when the required information is too big to cache or needs to be real-time does this component make a network call to other components in the system. Moreover, this component can operate partially, discarding the requests it cannot handle, if some of the external components are unavailable.<br />
<br />
<h3>
Stream Processing</h3>
<br />
It is a simple fact that a centralized architecture won't scale. Therefore, to be scalable, the system architecture should resemble a graphics card design more than a CPU design, where information can be processed in parallel and independently. To achieve that, it is necessary to keep all of the components stateless and each information package self-sufficient for processing.<br />
<br />
Try your best to avoid any processing that requires a shared resource like a physical database. For example, if the data is immutable, we can clone it to many read-only databases or caches to avoid centralized processing. It is even more crucial to avoid scenarios where the information cannot be processed independently, such as locking by unique indexes.<br />
<br />
Eventually, if these conditions cannot be fulfilled, we should try to reduce the impact by minimizing the shared part, applying MapReduce processing or in-memory computing. We should also add some redundancy to the components that handle the bottleneck to minimize the risk.<br />
<br />
<h3>
Auto Recovery</h3>
<br />
Even after applying all of these good practices, maintaining system stability is still a very challenging task because there are too many unknown factors that can affect system throughput. For example, average processing time can be highly variable, and system performance can be temporarily degraded due to backups, hardware upgrades or intermittent network issues.<br />
<br />
The easy way to improve system stability is to increase redundancy. However, redundancy works best when the system is so critical that efficiency is not a concern at all. Otherwise, developers should resort to smarter methods to cope with this challenge.<br />
<br />
Fortunately, the approach in this use case is pretty straightforward. When overloaded, you are left with two choices: upgrade the hardware or reduce the load. If possible, do both. Autoscaling can be regarded as a DevOps responsibility, but reducing load is a challenge that should be tackled at the architecture level. To do that, we need to build a feedback mechanism so that the front-end components can slow down or stop accepting new requests when the backend components are overloaded, letting the system return to a balanced state.<br />
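One simple form of such a feedback mechanism is a bounded hand-off queue between the front end and the backend: when the backend falls behind, the queue fills up and the front end starts rejecting new requests instead of piling them up. The sketch below uses our own naming and is only one way to wire this up:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackpressureGate {
    private final BlockingQueue<String> pending;

    public BackpressureGate(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Front end: offer() is non-blocking; a false return is the feedback
    // signal meaning "backend overloaded, skip this request".
    public boolean submit(String request) {
        return pending.offer(request);
    }

    // Backend: draining the queue is what releases the pressure.
    public String next() throws InterruptedException {
        return pending.take();
    }
}
```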
<br />
<h2>
Implementations Guideline</h2>
<br />
In a high-performance system, crappy code will be punished as soon as it is exposed. Therefore, it is never redundant to optimize your implementation twice before rolling it out to production. Here is some of our experience with developing high-performance systems.<br />
<br />
<h3>
Monitoring</h3>
<br />
For whatever purpose, it is always good practice to implement health checks on the system. However, for a high-performance system, the monitoring requirement is even more sophisticated, with the need to collect insight into system operation. This information can be crucial for detecting anomalies, preventing crashes and tuning the system. We should care not only whether the implementation works but also how well it executes.<br />
<br />
There are some well-known APMs on the market, like New Relic and Datadog, that can help us collect operational metrics and provide alerts when bad things happen. An APM license may not be cheap, but it is highly recommended to afford one because of the benefits it brings in the long term.<br />
<br />
In addition to an APM, it is also a good idea to embed a debug API into the health check so that developers can do in-depth investigations in the production environment whenever they need to. This practice has proven very useful in resolving outages and troubleshooting user inquiries.<br />
<br />
<h3>
Testing on Production</h3>
<div>
<br /></div>
This practice will surely trigger some concern, as it was considered a taboo in the IT world until recently. Simply speaking, the landscape of the software industry has changed. In the past, software development was normally a side function of big corporates, with the mission of building domain-specific applications. Nowadays, software development tends to play a much bigger role, transforming life in various startups around the world. In this new role, for an IT project to be successful, the pace of change may be more important than maintaining the stability of the system. Therefore, we need to make a case-by-case judgment on the balance between time to market and quality control.<br />
<br />
For high-performance systems, testing in Staging is less effective because most of the performance issues only appear under heavy load. It is also difficult to load test them using simulated data because there are too many possible combinations of inputs that may expose hidden bugs. Therefore, similar to what Facebook and Google have done, it is not necessarily harmful to sacrifice a small part of traffic for testing new features. The key requirement for this practice is the ability to identify and contain the damage when it happens.<br />
<br />
<h3>
Understanding Machine</h3>
At the beginning of this century, there were many initiatives to make programming easier by isolating the business logic from the underlying machine execution. With additional layers of abstraction, developing complicated applications became a lot easier. However, as a side effect, many developers manage to go through project after project without picking up fundamental knowledge about the underlying execution.<br />
<br />
However, if you are lucky enough to work on a high-performance system, this knowledge becomes important and relevant again. We have seen the tremendous benefit of well-optimized code that makes good use of the hardware to get the work done. This benefit sometimes comes as higher throughput or a smaller memory footprint, and sometimes as more critical outcomes like lower latency and more stable performance. It is easy to see that the latter outcomes are hard to achieve with more hardware; they require better implementation.<br />
<br />
At the basic level, developers should understand the underlying implementation of common constructs in their programming language. At the advanced level, it is important to pick up knowledge about the operating system, the network and the hardware infrastructure as well.<br />
<br />
<h3>
Understandable Code</h3>
<br />
The biggest contributor to performance and functional bugs in our system is code complexity. It is difficult to add new features if the existing codebase is too difficult to understand. A patching mentality keeps increasing the technical debt until it is almost impossible to avoid making mistakes.<br />
<br />
Moreover, complicated code has a negative impact on CPU utilization, as the biggest performance blunders come from CPU cache misses rather than raw processing speed. Hence, a high-performance implementation is also a clean, easy-to-understand and straightforward implementation.<br />
<br />
<h3>
In-memory computing</h3>
<br />
In-memory computing is a hot trend recently because memory is getting much cheaper and bigger. In the past, when we needed more performance, the most common trick was to increase the concurrency level. Most web servers in the past had a high number of CPU cores but a relatively low amount of memory per core. That fact implies that the web server's role was processing web requests rather than data; most data processing happened in the data warehouse rather than in the application. However, for a low latency system, retrieving and processing data remotely is too expensive. Therefore, if an application cannot churn out higher throughput or lower latency when given more memory, it may not be well optimized to utilize all the available hardware. In the perfect scenario, the system should reach maximum CPU and memory utilization at the same time. Under load, if only one of them is the bottleneck while the other still has lots of spare capacity, that is a good indicator of a wrong hardware configuration or bad optimization.<br />
<br />
In a high-performance system, both CPU and memory should be treated as precious resources. We should be careful to conserve both with well-known techniques like object pools, primitive types, suitable data structures and efficient implementations.<br />
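As an illustration of the object-pool technique mentioned above, here is a minimal generic pool sketch (illustrative only; production pools usually also cap their size and reset pooled objects before reuse):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

public class ObjectPool<T> {
    private final ConcurrentLinkedQueue<T> free = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    public ObjectPool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse an idle instance if one exists; otherwise allocate a new one.
    public T borrow() {
        T instance = free.poll();
        return instance != null ? instance : factory.get();
    }

    // Returning instances keeps allocation rate (and GC pressure) low,
    // which matters for stable tail latency.
    public void release(T instance) {
        free.offer(instance);
    }
}
```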
<br />
<h2>
Conclusion</h2>
<br />
Developing a low latency, high throughput application requires special skill sets that are not easy to find in the market. A common perception is that good developers write performant code, and this is true most of the time. However, many experienced developers who shine at building other applications still struggle when dealing with high-performance systems because of old habits and a lack of performance considerations. The keys to success in this area are self-reliant analysis, fundamental understanding and logical thinking.<br />
<br />
It is also worth highlighting that it is not always better to follow the trends in the development world, because many new methods are designed for purposes other than performance. It is good to keep learning new things, but you should always understand the cost versus the benefit of each of them based on your priorities.<br />
<br /></div>
<h2>
MySQL Partition Pruning</h2>
Recently, we learned an expensive lesson about MySQL partition pruning. Therefore, it is better to share it here so that others will not repeat our mistake.<br />
<br />
<h3>
Background</h3>
<br />
In our system, there is a big stats table that has no primary key and no indexes. This table is partitioned, but the lack of indexes often causes a full partition scan or even a full table scan when querying. To make things worse, the system continues writing to this table, making it slower every day.<br />
<br />
To fix the performance issue, we wanted to clean up the legacy data and add new indexes. However, this was not easy because the table was too big. Therefore, we chose the long approach of migrating only the wanted data from the old table to a new table with a proper schema.<br />
<br />
<h3>
Partition by hash</h3>
<br />
It would have been fine if we had only done what we originally intended to do. However, we changed the partition type for convenience, and that made the new table slower.<br />
<br />
In the original table, the partition is based on a timestamp column that represents the time as a number of hours from epoch. For example, the first second of the year 2017 in GMT is 1483228800 seconds from epoch. To get the number of hours, we divide that number by 3600: 1483228800 div 3600 = 412008.<br />
<br />
Because of the partition-by-range type, we needed a maintenance script that creates the monthly partitions for the next year. This way of partitioning is not ideal because the partitions are big and uneven. Hence, we converted from monthly to weekly partitions but, being too lazy to define each range explicitly, switched from partition by range to partition by hash.<br />
<br />
This is a short version of how the partition definition looks if we partition by range:<br />
<pre class="programlisting" style="background: rgb(238, 238, 238); border: 1px solid rgb(217, 217, 217); box-sizing: inherit; font-family: "Courier New", Courier, fixed, monospace; font-size: 12.6814px; line-height: 1.5; margin-bottom: 20px; margin-left: 6px; margin-top: 20px; outline: 0px; overflow: auto; padding: 3px 8px; vertical-align: baseline;">PARTITION BY RANGE (hour_epoch)
(PARTITION pOct2017 VALUES LESS THAN (419304),
PARTITION pNov2017 VALUES LESS THAN (420024) ENGINE = InnoDB,
PARTITION pDec2017 VALUES LESS THAN (420768) ENGINE = InnoDB,
PARTITION pMax VALUES LESS THAN MAXVALUE ENGINE = InnoDB)</pre>
And this is how the partition definition looks if we partition by hash:<br />
<pre class="programlisting" style="background-color: #eeeeee; border: 1px solid rgb(217, 217, 217); box-sizing: inherit; font-family: 'courier new', courier, fixed, monospace; font-size: 12.6814px; line-height: 1.5; margin-bottom: 20px; margin-left: 6px; margin-top: 20px; outline: 0px; overflow: auto; padding: 3px 8px; vertical-align: baseline;">partition by hash (hour_epoch div 168) partitions 157;</pre>
The partition-by-hash type did more than just shorten the syntax. MySQL tries to split records evenly by applying a modulo function to select a partition. To make each partition span one week, we divide the hour_epoch number by 168, effectively getting a week_epoch.<br />
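The arithmetic above can be checked with a few lines of Java (the numbers are the ones from the example; the class name is ours):

```java
public class PartitionMath {
    // Seconds since epoch -> hours since epoch (the table's partition column).
    static long hourEpoch(long epochSeconds) {
        return epochSeconds / 3600;
    }

    // 168 hours per week turns hour_epoch into an effective week_epoch.
    static long weekEpoch(long hourEpoch) {
        return hourEpoch / 168;
    }

    // PARTITION BY HASH applies modulo over the declared partition count.
    static long partitionIndex(long hourEpoch, int partitionCount) {
        return weekEpoch(hourEpoch) % partitionCount;
    }
}
```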
<br />
With the new table schema, we were happy with smaller partitions, a shorter definition, and more indexes.<br />
<br />
<h3>
Performance issue</h3>
<br />
Because of the huge volume of data, we could not fully migrate the data to the new schema to verify performance. We only did a preliminary performance test with two weeks of data and did not detect any issue. However, in the final testing, we were surprised to observe mixed results. Most of the queries were faster as expected, but some were slower.<br />
<br />
After investigating, we realized that instead of scanning only a few partitions, MySQL does a full table scan for time-range queries. Even stranger, this behavior only happens when the date range is larger than about 3 weeks. Totally surprised by this result, we overcame our procrastination, read the MySQL documentation carefully and realized why.<br />
<br />
"<i>For tables that are partitioned by HASH or [LINEAR] KEY, partition pruning is also possible in cases in which the WHERE clause uses a simple = relation against a column used in the partitioning expression</i>"<br />
<div>
<br /></div>
<div>
As the document clearly explains, partition pruning only works with equality conditions for the partition-by-hash type. We did not detect this issue earlier because the query optimizer auto-converts a range condition into equality conditions if the number of distinct values within the range is small enough. Unfortunately, in our early test, two weeks of data was short enough for the query optimizer to hide the problem from us.<br />
<br />
<h3>
Solution</h3>
<br />
After learning about the issue, we struggled to find a way to fix the performance problem. There were two proposed solutions:<br />
<br />
<ul>
<li>Trick the query optimizer into doing the work by splitting a big range into multiple small ranges, each fitting in one partition. This way, the query optimizer works on each individual small range.</li>
<li>Rebuild the schema again with the proper partition type. </li>
</ul>
<div>
The first solution is quick but dirty, while the second is time-consuming. We had almost decided to launch the new table with the first solution when we found a quick way to implement the second.</div>
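To sketch the first (quick-but-dirty) solution: split the queried time range at week boundaries so that every sub-range covers hours belonging to a single hash partition, then query each sub-range separately. The helper below is our own illustrative naming, not the code we actually shipped:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {
    private static final long HOURS_PER_WEEK = 168;

    // Splits [fromHour, toHour) at week boundaries so that every
    // sub-range falls entirely inside one hash partition.
    static List<long[]> splitByWeek(long fromHour, long toHour) {
        List<long[]> ranges = new ArrayList<>();
        long start = fromHour;
        while (start < toHour) {
            long weekEnd = (start / HOURS_PER_WEEK + 1) * HOURS_PER_WEEK;
            long end = Math.min(weekEnd, toHour);
            ranges.add(new long[]{start, end});
            start = end;
        }
        return ranges;
    }
}
```

Each narrow sub-range is small enough for the optimizer to convert into equality conditions, so partition pruning applies again.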
<div>
<br /></div>
<div>
We dug through the MySQL documentation and learned that repartitioning is basically a copy-and-paste operation. However, MySQL also has another command that lets us make certain partition changes without too much effort.</div>
<div>
<pre class="programlisting" style="background: rgb(238, 238, 238); border: 1px solid rgb(217, 217, 217); box-sizing: inherit; font-family: "Courier New", Courier, fixed, monospace; font-size: 12.6814px; line-height: 1.5; margin-bottom: 20px; margin-left: 6px; margin-top: 20px; outline: 0px; overflow: auto; padding: 3px 8px; vertical-align: baseline;">ALTER TABLE <em class="replaceable">pt</em>
EXCHANGE PARTITION <em class="replaceable">p</em>
WITH TABLE <em class="replaceable">nt</em>;</pre>
</div>
<br />
In this command, MySQL allows us to exchange a partition between a table and a partition of another table. Even though this is not a direct exchange between two partitions of two tables, it is only a minor inconvenience to do one more intermediate swap through a temp table.<br />
<br />
This is how our partition swapping looks:<br />
<br />
<pre class="programlisting" style="background: rgb(238, 238, 238); border: 1px solid rgb(217, 217, 217); box-sizing: inherit; font-family: "Courier New", Courier, fixed, monospace; font-size: 12.6814px; line-height: 1.5; margin-bottom: 20px; margin-left: 6px; margin-top: 20px; outline: 0px; overflow: auto; padding: 3px 8px; vertical-align: baseline;">ALTER TABLE origin_table EXCHANGE PARTITION p1 WITH TABLE temp_table;
ALTER TABLE final_table EXCHANGE PARTITION p1 WITH TABLE temp_table;</pre>
<br />
This is not as fast as you might guess, because MySQL does a row-by-row validation to ensure every record of the temp table is eligible for storage in the final table's partition. On MySQL 5.7, this validation can be turned off by appending "<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , "fixed" , monospace; font-size: 12.6814px;">WITHOUT VALIDATION</span>" to the end of the second command.<br />
<br />
Because we use Aurora, which only supports MySQL 5.6, it still took us two days to fully update the partition type. However, this would have taken one month if we had not used partition exchange.<br />
<br />
Fortunately, we managed to recover from the mistake this time. We hope that you learn from it and remember to read the documentation carefully before using any fancy method.<br />
<br />
<br />
<br /></div>
<h2>
Retrospective</h2>
Some folks have asked me which Agile practice is the most important, and my immediate answer is the Retrospective. From my own experience, the Retrospective plays the biggest role in the success of practicing Agile. Unfortunately, it is not necessarily a popular practice. This is a bit sad because, after trying Agile in different organizations, I have seen no practice that shows value as early and obviously as the Retrospective. Moreover, it is one of the easiest practices to adopt because it does not require discipline to practice regularly. It can be practiced as little as once a year and still make a difference.<br />
<h3>
Why retrospective is so important</h3>
<b>Stay true to "agile" spirit</b><br />
<br />
Unless you are hiding under a rock, it is hard to ignore the debate about "Agile" versus "agile". Lots of developers are upset that agile is being seen as a ruleset rather than a mindset. Unfortunately, trying to adopt agile by following a ruleset may lead to a rigid mindset, which is contrary to the Agile manifesto.<br />
<br />
The Retrospective is not vulnerable to this problem because it is the most flexible practice in Agile. It stays true to the agile spirit by specifying not the method but only the purpose and benefit of the activity. Therefore, it leaves the team free to conduct the activity in whatever way fits. Rule followers can still have it their way, with many techniques available, but in general this practice is very personal. While Planning, User Stories, Backlog and Iteration practices may look pretty much the same everywhere, the Retrospective is always unique. Because each team has its own problems and members, following the same format still leads to different outcomes.<br />
<b><br /></b> <b>First step toward improvement</b><br />
<br />
It is quite obvious that in order to improve, we need to see our weaknesses and limits. This logic applies not only to software development but to any other aspect of life as well. Therefore, one of the first things to do before introducing any change is to spend time learning about the characteristics of each individual and the dynamics of the team.<br />
<br />
The traditional method of understanding a team through psychology tests is overrated. It tends to make teams fall into common stereotypes. It is not that psychology tests are a waste of time, but in reality they work better for individuals, especially when the subject of the test is willing to collaborate. Therefore, psychology tests are better used as a method of collecting feedback and measuring improvement.<br />
<br />
For collecting insights about team dynamics, the Retrospective is a more effective method because it is less intrusive. People are normally more comfortable when we ask less and let them talk more about what they are concerned about. Fortunately, that is exactly what the Retrospective is about.<br />
<br />
<b>Keep a close look at the team well-being</b><br />
<br />
The days when developers needed to pray to get a decent job have passed. Nowadays, the demand for good developers is so high that most companies turn to headhunters to recruit talent. Hence, it is challenging not only to attract talent but also to retain it.<br />
<br />
We may not be able to do much if it is a paycheck competition. However, job changing is rarely purely paycheck-driven. It can be emotionally difficult to leave a job you love and a caring environment. Therefore, if the leader keeps a close eye on the team and each individual, there is a much greater chance of shielding the team from lucrative offers.<br />
<h3>
How to run retrospective</h3>
As mentioned above, a good Retrospective is one that lets people voice their inner concerns and thinking. Therefore, anything that resembles form-filling or an interview is counterproductive. A flexible format is a better suggestion. The Retrospective itself needs to be interesting and intimate enough to put people in a comfort zone. Our ultimate goal is to let people share more so that the team can improve.<br />
<br />
An effective facilitator needs to know how to stir up the conversation when it goes quiet and to stay silent when people are in deep reflection. Context switching helps as well: for example, a retrospective session can be held out of the office, far from the boss, and enjoyed over coffee.<br />
<br />
The last thing to remember about the Retrospective is to never, ever take disciplinary action based on what you have learned in a retrospective. Otherwise, it will rightfully be viewed as a betrayal of trust. Honestly, that would be the worst thing that could happen to the team.<br />
<br />
So, if your team has not had an out-of-the-box, open-minded retrospective session for some time, please find an opportunity to bring the team to a nice place. I believe you and your team will have a good time.<br />
<h2>
Let's Implement a "Login with Github" Button</h2>
Recently we delivered a simple workshop at <a href="http://www.meetup.com/singasug/events/234315161/">Spring User Group Singapore</a> about implementing a "Login with Github" button using Spring Boot, Spring Security, OAuth2 and Angular:<br />
<br />
<a href="https://github.com/verydapeng/springular">https://github.com/verydapeng/springular</a><br />
<br />
<br />Posted by Anonymous.<br />
<h2>
Spring OAuth 2 with JWT Sample</h2>
<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>
Some time ago, we published an article sharing a custom approach to implementing stateless sessions in a cloud environment. Today, let us explore another popular use case: setting up OAuth 2 authentication for a Spring Boot application. In this example, we will use JSON Web Token (JWT) as the format of the OAuth 2 token.<br />
<br />
This sample was developed partly based on the <a href="https://github.com/spring-projects/spring-security-oauth/tree/master/samples/oauth2">official sample</a> of Spring Security OAuth 2. However, we will focus on understanding the principles of the OAuth 2 request flow. The source code is available at <a href="https://github.com/tuanngda/spring-boot-oauth2-demo.git">https://github.com/tuanngda/spring-boot-oauth2-demo.git</a><br />
<br />
<h2>
Background</h2>
<h3>
</h3>
<h3>
OAuth 2 and JWT </h3>
<br />
We will not go into detail on when you may want to use OAuth 2 and JWT. In general, OAuth 2 is useful when you need to allow other people to build front-end apps for your services. We focus on OAuth 2 and JWT because they are the most popular authentication framework and token format on the market.<br />
<br />
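A JWT consists of two base64url-encoded JSON segments (header and payload) plus a signature, so its claims can be inspected without any library. A minimal sketch in plain Java; the token here is hand-built for illustration, not one issued by this sample, and in a real application the signature must still be verified before the claims are trusted:<br />

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Peek inside a JWT: the first two dot-separated segments are just
// base64url-encoded JSON. (Readable does not mean trustworthy: the
// signature must be verified before acting on the claims.)
public class JwtPeek {

    static String payloadOf(String jwt) {
        String[] parts = jwt.split("\\.");
        return new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hand-built token with a fake signature, purely for illustration
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        String header = enc.encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = enc.encodeToString("{\"user_name\":\"user\",\"scope\":[\"read\"]}".getBytes(StandardCharsets.UTF_8));
        String jwt = header + "." + payload + ".fake-signature";
        System.out.println(payloadOf(jwt)); // prints the payload JSON back
    }
}
```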
<h3>
Spring Security OAuth 2</h3>
<br />
Spring Security OAuth 2 is an implementation of OAuth 2 that is built on top of Spring Security, which is itself a very extensible authentication framework.<br />
<br />
Overall, Spring Security authentication involves two steps: creating an authentication object for each request, then applying authorization checks based on that authentication.<br />
<br />
The first step is done in a multi-layered chain of security filters. Depending on the configuration, each layer can create the authentication object for a web request using basic authentication, digest authentication, form authentication or any custom method of authentication. The client-side session we built in the previous article is effectively a layer of custom authentication, and Spring Security OAuth 2 is built on the same mechanism.<br />
<br />
Because our application both provides and consumes tokens in this example, Spring Security OAuth 2 should not be the sole authentication layer for the application. We need another authentication mechanism to protect the token provider endpoint so that the resource owner can authenticate before obtaining a JWT token.<br />
<br />
In a cluster environment, the token, or the secret used to sign the token (for JWT), is supposed to be persisted so that the token can be recognized by any resource server, but we skip this step to simplify the example. Similarly, the user authentication and client identities are all hard-coded.<br />
<br />
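If we did want to deploy to a cluster, the JWT converter could be given an explicit symmetric key instead of the random default, so that every node signs and verifies with the same secret. This is a sketch only: the key value is illustrative, and the actual sample keeps the randomly generated key.<br />

```java
// Sketch: replace the random default with a shared symmetric signing key.
// The literal key is illustrative; in practice it would come from configuration.
@Bean
public JwtAccessTokenConverter accessTokenConverter() {
    JwtAccessTokenConverter converter = new JwtAccessTokenConverter();
    converter.setSigningKey("shared-signing-secret"); // same value on every node
    return converter;
}
```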
<h2>
<b>System Design</b></h2>
<h3>
</h3>
<h3>
Overview</h3>
<br />
In our application, we need to set up three components:<br />
<br />
<ul>
<li>An Authorization Endpoint and a Token Endpoint to provide OAuth 2 tokens.</li>
<li>A WebSecurityConfigurerAdapter, which is an authentication layer with a hard-coded order of 3 (according to <a href="https://spring.io/team/dsyer">Dave Syer</a>). This authentication layer sets up the authentication and principal for any request that contains an OAuth 2 token.</li>
<li>Another authentication mechanism to protect the token endpoint and other resources when the token is missing. In this sample, we choose basic authentication for its simplicity when writing tests. As we do not specify an order, it takes the default value of 100. With Spring Security, the lower the order, the higher the priority, so we should expect OAuth 2 to come before basic authentication in the FilterChainProxy. Inspecting the chain in the IDE proves that our setup is correct.</li>
</ul>
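Since basic authentication guards the token endpoint, every token request must carry an Authorization header built from the caller's credentials. The header value is just the base64-encoded "username:password" pair, sketched here with the hard-coded demo user from this sample:<br />

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Build the value of the HTTP Authorization header for basic authentication.
public class BasicAuthHeader {

    static String basic(String username, String password) {
        String token = Base64.getEncoder()
                .encodeToString((username + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // The hard-coded demo user from this sample
        System.out.println(basic("user", "password")); // prints "Basic dXNlcjpwYXNzd29yZA=="
    }
}
```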
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxUcq5v0eTWAr62M9YX3toX10MifQAQDNmE0IYb9XtKTK-RoTOfgni_0OVh27nmtpK4f44TFfuypL2lqrQVrPnuwtL4y7XRM08R480Hp7b8BwzOmJYhK1-FwD3Zh8CA6k8fHUq5AHVtzW8/s1600/FilterChain.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxUcq5v0eTWAr62M9YX3toX10MifQAQDNmE0IYb9XtKTK-RoTOfgni_0OVh27nmtpK4f44TFfuypL2lqrQVrPnuwtL4y7XRM08R480Hp7b8BwzOmJYhK1-FwD3Zh8CA6k8fHUq5AHVtzW8/s400/FilterChain.png" width="400" /></a></div>
<div>
<br /></div>
<br />
<br />
In the picture above, OAuth2AuthenticationProcessingFilter appears in front of BasicAuthenticationFilter.<br />
<br />
<h3>
Authorization Server Configuration</h3>
<br />
Here is our configuration for the Authorization and Token endpoints:<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">@Configuration
@EnableAuthorizationServer
public class AuthorizationServerConfiguration extends AuthorizationServerConfigurerAdapter {

    @Value("${resource.id:spring-boot-application}")
    private String resourceId;

    @Value("${access_token.validity_period:3600}")
    int accessTokenValiditySeconds = 3600;

    @Autowired
    private AuthenticationManager authenticationManager;

    @Bean
    public JwtAccessTokenConverter accessTokenConverter() {
        return new JwtAccessTokenConverter();
    }

    @Override
    public void configure(AuthorizationServerEndpointsConfigurer endpoints) throws Exception {
        endpoints
            .authenticationManager(this.authenticationManager)
            .accessTokenConverter(accessTokenConverter());
    }

    @Override
    public void configure(AuthorizationServerSecurityConfigurer oauthServer) throws Exception {
        oauthServer.tokenKeyAccess("isAnonymous() || hasAuthority('ROLE_TRUSTED_CLIENT')")
            .checkTokenAccess("hasAuthority('ROLE_TRUSTED_CLIENT')");
    }

    @Override
    public void configure(ClientDetailsServiceConfigurer clients) throws Exception {
        clients.inMemory()
            .withClient("normal-app")
                .authorizedGrantTypes("authorization_code", "implicit")
                .authorities("ROLE_CLIENT")
                .scopes("read", "write")
                .resourceIds(resourceId)
                .accessTokenValiditySeconds(accessTokenValiditySeconds)
            .and()
            .withClient("trusted-app")
                .authorizedGrantTypes("client_credentials", "password")
                .authorities("ROLE_TRUSTED_CLIENT")
                .scopes("read", "write")
                .resourceIds(resourceId)
                .accessTokenValiditySeconds(accessTokenValiditySeconds)
                .secret("secret");
    }
}
</pre>
<br />
There are a few things worth noticing about this implementation.<br />
<br />
<ul>
<li>Setting up a JWT token is as simple as using JwtAccessTokenConverter. Because we never specify the signing key, it is randomly generated. If we intended to deploy the application to a cloud environment, it would be a must to sync the signing key across all authorization servers.</li>
<li>Instead of creating an authentication manager, we inject the existing authentication manager from the Spring container. With this step, we can share the authentication manager with the basic authentication filter.</li>
<li>It is possible to have both trusted and non-trusted applications. A trusted application can have its own secret, which is necessary for the client credentials grant. Apart from client credentials, the three other grant types all require the resource owner's credentials.</li>
<li>We allow anonymous access to the check-token endpoint. With this configuration, the endpoint is accessible without basic authentication or an OAuth 2 token.</li>
</ul>
<h3>
</h3>
<h3>
Resource Server Configuration</h3>
<br />
Here is our configuration for the Resource Server:<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">@Configuration
@EnableResourceServer
public class ResourceServerConfiguration extends ResourceServerConfigurerAdapter {

    @Value("${resource.id:spring-boot-application}")
    private String resourceId;

    @Override
    public void configure(ResourceServerSecurityConfigurer resources) {
        resources.resourceId(resourceId);
    }

    @Override
    public void configure(HttpSecurity http) throws Exception {
        http.requestMatcher(new OAuthRequestedMatcher())
            .authorizeRequests()
                .antMatchers(HttpMethod.OPTIONS).permitAll()
                .anyRequest().authenticated();
    }

    private static class OAuthRequestedMatcher implements RequestMatcher {
        public boolean matches(HttpServletRequest request) {
            String auth = request.getHeader("Authorization");
            // Determine if the client request contained an OAuth Authorization
            boolean haveOauth2Token = (auth != null) &amp;&amp; auth.startsWith("Bearer");
            boolean haveAccessToken = request.getParameter("access_token") != null;
            return haveOauth2Token || haveAccessToken;
        }
    }
}
</pre>
<br />
Here are a few things to take note of:<br />
<ul>
<li><span style="line-height: 1.15;">The OAuthRequestedMatcher is added so that the OAuth filter only processes OAuth 2 requests. With it, an unauthorized request is denied at the basic authentication layer instead of the OAuth 2 layer. This makes no difference in terms of functionality, but it improves usability: the client receives a 401 HTTP status with the first header below rather than the second:</span></li>
<ul>
<li><span style="line-height: 18.4px;"><i>WWW-Authenticate:Basic realm="Realm"</i></span></li>
<li><span style="line-height: 18.4px;"><i>WWW-Authenticate:Bearer realm="spring-boot-application", error="unauthorized", error_description="Full authentication is required to access this resource"</i></span></li>
</ul>
<li><span style="line-height: 18.4px;">With the Basic response header, a browser will automatically prompt the user for a username and password. If you do not want the resource to be accessible by any other authentication mechanism, this step is not necessary.</span></li>
<li><span style="line-height: 18.4px;">Some browsers, like Chrome, send an OPTIONS request to check for CORS support before making an AJAX call. Therefore, it is better to always allow OPTIONS requests.</span></li>
</ul>
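The matcher in the resource server configuration reduces to a simple check on the Authorization header and the access_token parameter. Extracted as a plain function for clarity (the class and method names here are illustrative, not from the sample):<br />

```java
// Mirrors the logic of OAuthRequestedMatcher: a request is treated as an
// OAuth 2 request if it carries a Bearer Authorization header or an
// access_token request parameter.
public class OAuthRequestCheck {

    static boolean isOAuthRequest(String authorizationHeader, String accessTokenParam) {
        boolean hasBearerToken = authorizationHeader != null && authorizationHeader.startsWith("Bearer");
        boolean hasAccessToken = accessTokenParam != null;
        return hasBearerToken || hasAccessToken;
    }

    public static void main(String[] args) {
        System.out.println(isOAuthRequest("Bearer abc.def.ghi", null));          // true
        System.out.println(isOAuthRequest("Basic dXNlcjpwYXNzd29yZA==", null));  // false
    }
}
```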
<h3>
</h3>
<h3>
Basic Authentication Security Configuration</h3>
<br />
As mentioned earlier, we need another authentication mechanism to protect the token provider endpoint:<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">@Configuration
@EnableGlobalMethodSecurity(prePostEnabled = true)
@EnableWebSecurity
public class SecurityConfiguration extends WebSecurityConfigurerAdapter {

    @Autowired
    public void globalUserDetails(AuthenticationManagerBuilder auth) throws Exception {
        auth.inMemoryAuthentication()
            .withUser("user").password("password").roles("USER")
            .and()
            .withUser("admin").password("password").roles("USER", "ADMIN");
    }

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http
            .authorizeRequests()
                .antMatchers(HttpMethod.OPTIONS).permitAll()
                .anyRequest().authenticated()
            .and().httpBasic()
            .and().csrf().disable();
    }

    @Override
    @Bean
    public AuthenticationManager authenticationManagerBean() throws Exception {
        return super.authenticationManagerBean();
    }
}
</pre>
<br />
There are a few things to take note of:<br />
<br />
<ul>
<li>We expose the AuthenticationManager bean so that our two authentication security adapters can share a single authentication manager.</li>
<li>Spring Security CSRF protection works seamlessly with JSP but is a hassle for REST APIs. Because we want this sample app to be usable as a base for your own applications, we turned CSRF off and added a CORS filter so that it can be used right away.</li>
</ul>
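The CORS filter mentioned above amounts to attaching a few response headers to every response. The exact header set in the sample repository may differ; a permissive version can be sketched as a plain function over a header map:<br />

```java
import java.util.HashMap;
import java.util.Map;

// The essence of a permissive CORS filter: these headers are added to every
// response. (Sketch only: values are illustrative, and a production setup
// would normally restrict the allowed origin.)
public class CorsHeaders {

    static Map<String, String> corsHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Access-Control-Allow-Origin", "*");
        headers.put("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS");
        headers.put("Access-Control-Allow-Headers", "Authorization, Content-Type");
        return headers;
    }

    public static void main(String[] args) {
        corsHeaders().forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```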
<div>
<h2>
</h2>
<h2>
Testing</h2>
</div>
<div>
<br /></div>
<div>
We wrote one test scenario for each authorization grant type, following the OAuth 2 specification exactly. Because Spring Security OAuth 2 is an implementation based on the Spring Security framework, our interest veers toward seeing how the underlying authentication and principal are constructed.</div>
<div>
<br /></div>
Before summarizing the outcome of the experiment, let's take a quick look at a few notes.<br />
<br />
<ul>
<li>Most of the requests to the token provider endpoint are sent as POST requests, but they include the user credentials as parameters. Even though we put these credentials in the URL here for convenience, never do this in your own OAuth 2 client.</li>
<li>We created two endpoints, <i>/resources/principal</i> and <i>/resources/roles</i>, to capture the principal and authorities for OAuth 2 authentication.</li>
</ul>
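For the Resource Owner Password Credentials scenario, the token request body carries the grant type and the resource owner's credentials as form parameters, while the client id and secret normally travel separately in a basic-auth Authorization header. A sketch of assembling (not sending) such a body; the class and method names are illustrative:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Assemble the form body of a Resource Owner Password Credentials token
// request. Only the body is built here; sending it would additionally
// require the client's basic-auth Authorization header.
public class TokenRequestBody {

    static String passwordGrantBody(String username, String password) {
        return "grant_type=password"
                + "&username=" + URLEncoder.encode(username, StandardCharsets.UTF_8)
                + "&password=" + URLEncoder.encode(password, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Demo user hard-coded in this sample
        System.out.println(passwordGrantBody("user", "password"));
    }
}
```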
<div>
Here is our setup:</div>
<div>
<br /></div>
<div>
<table border="1" cellpadding="4" cellspacing="0">
<tbody>
<tr><th>User</th><th>Type</th><th>Authorities</th><th>Credential</th></tr>
<tr><td>user</td><td>resource owner</td><td>ROLE_USER</td><td>Y</td></tr>
<tr><td>admin</td><td>resource owner</td><td>ROLE_ADMIN</td><td>Y</td></tr>
<tr><td>normal-app</td><td>client</td><td>ROLE_CLIENT</td><td>N</td></tr>
<tr><td>trusted-app</td><td>client</td><td>ROLE_TRUSTED_CLIENT</td><td>Y</td></tr>
</tbody></table>
<br />
<br />
<div class="MsoNormal">
<br /></div>
</div>
<div>
Here is what we find out<br />
<br />
<table border="1" cellpadding="4" cellspacing="0">
<tbody>
<tr><th>Grant Type</th><th>User</th><th>Client</th><th>Principal</th><th>Authorities</th></tr>
<tr><td>Authorization Code</td><td>user</td><td>normal-app</td><td>user</td><td>ROLE_USER</td></tr>
<tr><td>Client Credentials</td><td>NA</td><td>trusted-app</td><td>trusted-app</td><td>No Authority</td></tr>
<tr><td>Implicit</td><td>user</td><td>normal-app</td><td>user</td><td>ROLE_USER</td></tr>
<tr><td>Resource Owner Password Credentials</td><td>user</td><td>trusted-app</td><td>user</td><td>ROLE_USER</td></tr>
</tbody></table>
</div>
<div>
<br />
This result is pretty much as expected, except for Client Credentials. Interestingly, even though the client retrieves the OAuth 2 token with its client credentials, the approved request still does not carry any of the client's authorities, only the client credential. I think this makes sense because the token from the Implicit grant type cannot be reused.<br />
<br /></div>
Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com25tag:blogger.com,1999:blog-2481600383537713152.post-73723779461945620182016-01-11T20:23:00.000-08:002016-01-11T20:23:01.984-08:00Should you mind your own business?In a recent Lean Coffee retrospective, each member of our team was asked to raise one question or concern about the working environment. My burning question was how much we should mind other people's business. After voicing my concern, I got a subtle response from the team, as people did not feel very comfortable expressing their thoughts on this controversial topic. Even setting aside the possibility of people hiding their real opinions to be politically correct, it is quite likely that many of us still do not know which attitude is more desirable in our working environment. Speaking of personal preference, I have met people who truly believe in opposite sides of this question. Apparently, if a workplace mixes people with opposing mindsets, conflicts tend to happen more often.<br />
<br />
However, this topic is rarely discussed in the workplace. Therefore, there is often a lack of consensus in the corporate environment regarding how much we should care about other people's work. As a consequence, some people may silently suffer while others may feel frustrated with the lack of communication and cooperation.<br />
<br />
So, is there a right mindset that we should adopt in our working environment, or should we just let each person have it their own way and hope that they will slowly adapt to each other? Let's discuss whether we have any viable solution.<br />
<h3>
<b>Mind your own business</b></h3>
<b>Micro Management</b><br />
<br />
Usually, the culprits of meddling with other people's work are the ones who do the managing. It is obvious that no one likes to be micro-managed, even the micro-manager himself. Being told what to do in detail gives us limited space for innovation and self-learning, plus a bitter taste of not being trusted.<br />
<br />
In real life, micro-management is rarely the optimal solution. Even in the best scenario, micro-management can only help to deliver a project of average quality and create a bunch of unhappy developers. Following instructions rather than depending on one's own thinking rarely creates excellent work. To be fair, it can help to rescue a project if progress falls below expectations, but it will not help much in achieving excellence. Even worse, a manager who accomplishes the job through micro-management can become addicted to it and find it harder to place trust in his subordinates in the future.<br />
<br />
<b>To avoid misunderstanding</b><br />
<br />
It is even harder to interfere with someone else's work if you do not happen to be a senior or the supervisor. In reality, an act of goodwill can be interpreted as an act of arrogance unless one has managed to earn a reputation in the workplace. Even if the contribution brings an obvious outcome, not every working environment encourages heroism or rock-star programmers. Moreover, some introverted folks may not feel comfortable with the attention that comes as a side effect.<br />
<br />
Similarly, this heroic act is normally more welcome in a time of crisis but may not be very well received when the project has already stabilized.<br />
<h3>
<b>Or mind the other people business as well</b></h3>
<b>Reaching team goal</b><br />
<br />
To be fair, no one bothers to care about someone else's work if it is not for the sake of the project. Of course, there are many poisonous managers who want to look busy by creating artificial pressure, but in this article, let's focus on the people who want to see the project succeed. These people sometimes step outside their role profiles and just do whatever is necessary to get the job done.<br />
<br />
Ultimately, because project work is still team work, it may be more beneficial for each member of the team to think about and focus on the common goal rather than their individual mission. It is pretty hard to keep information flowing and components integrating smoothly if each member only focuses on fulfilling their own role. No matter how good the plan is, there will likely be some missing pieces, and those missing pieces need to be addressed as soon as possible to keep the project moving forward.<br />
<br />
<b>The necessary evil to get job done</b><br />
<br />
Many successful entrepreneurs claim that the secret of their success is delegating tasks to capable staff and placing trust in them. It is definitely the best solution when you have capable staff. However, in reality, most of us are not entrepreneurs, and the people working on a project are chosen from the available resources rather than the best resources. There may be a time when a project is prioritized and granted the best resources available: when it is in a deep crisis. However, we would not want our project to go through that.<br />
<br />
Therefore, from my point of view, the necessary evil here is a task-oriented attitude over a people-oriented attitude. We should value people and give everyone a chance for self-improvement, but still, task completion is the first priority. If the result of the work is not the first priority, it is hard to measure and to improve performance. I feel it is even more awkward to hamper performance for the sake of human well-being when the project is failing.<br />
<h3>
<b>So, what is the best compromise?</b></h3>
I think the best answer is balance. Knowing that meddling with other people's work is risky but that project success is the ultimate priority, the best choice is to define a minimal acceptable performance and be ready to step in if the project is failing. Ultimately, we do not work to fulfill our individual tasks; we work together to make the project succeed. From the corporate point of view, personal success provides very little benefit other than your own well-being. However, don't let this consume you or become addicted to the feeling of being people's hero. Also, don't raise the bar too high, otherwise the environment will become stressful and people will feel less encouraged.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com61tag:blogger.com,1999:blog-2481600383537713152.post-12022588934753714562015-07-06T02:10:00.004-07:002015-07-07T00:41:43.278-07:00Designing databaseDatabase design has evolved greatly over the last 10 years. In the past, it was the database analyst's job to fine-tune SQL queries and database indexes to ensure performance. Nowadays, developers play a much more crucial role in ensuring the scalability of data.<br />
<br />
The database design task, which used to be the sole preserve of database analysts, has become an exciting task that requires a lot of creativity. In this short article, let's walk through an example of a real-life problem to see how database design has changed.<br />
<br />
<b><span style="font-size: large;">Requirements</span></b><br />
<br />
In this example, our business requirement is to build a database to store property information for the whole country. At first, we need to store every property and its landlords. If a property is being leased, the system needs to store tenant information as well. The system should also record property activities, including buying, selling and renting.<br />
<br />
As in a typical database system, the user should be able to query properties by any information, like address, owner name, district, age,... The system needs to serve data for both real-time queries and reporting purposes.<br />
<br />
<b><span style="font-size: large;">Analysis</span></b><br />
<br />
It is pretty obvious that there are a few entities in this domain, like landlord, tenant, transaction and property. Landlord and tenant can be further analysed as people acting in different roles. For example, one person can rent out his house and rent another house to live in, which means he can be the landlord of one property and the tenant of another. That leaves us with 3 major entities: person, property and transaction. The person and property entities have a many-to-many relationship with each other. A transaction links to one property and at least one person.<br />
<br />
If we group some common attributes like occupation, district and building, it is possible to introduce some sub-entities that may help to reduce redundancy in the information.<br />
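To make the analysis concrete, here is a minimal sketch of the three main entities and their links in plain Java (all class and field names here are hypothetical, not taken from any real schema):

```java
import java.util.ArrayList;
import java.util.List;

// Person and Property reference each other: a many-to-many relationship.
class Person {
    final String name;
    final List<Property> properties = new ArrayList<>();
    Person(String name) { this.name = name; }
}

class Property {
    final String address;
    final List<Person> owners = new ArrayList<>();
    Property(String address) { this.address = address; }
}

// A transaction links exactly one property with at least one person.
class Transaction {
    final String type; // "buy", "sell" or "rent"
    final Property property;
    final List<Person> parties;
    Transaction(String type, Property property, List<Person> parties) {
        this.type = type;
        this.property = property;
        this.parties = parties;
    }
}
```

Note that the same Person object can appear as an owner of one property and as a renting party in a transaction for another, which is exactly why landlord and tenant collapse into a single person entity.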
<br />
<b><span style="font-size: large;">The era of relational database</span></b><br />
<br />
If you are a developer trapped in the relational database era, the only viable choice for persistence is a relational database. Naturally, each entity is stored in a table. If there is a relationship between two entities, they are likely to refer to each other by foreign keys.<br />
<br />
With this setup, there is zero redundancy and every piece of information has a single source of truth. Obviously, it is the most efficient way in terms of storage.<br />
<br />
There may be an issue here, as it is not easy to implement text search. Whether 10 years ago or today, text search has never been well supported by relational databases. The SQL language provides some wildcard matching, but it is still very far from full text search.<br />
<br />
Assuming that you have completed the task of defining the database schema, the fine-tuning task is normally the job of database analysts; they will look into every individual query, view and index to increase performance as much as possible.<br />
<br />
However, if the reader has spent years working on relational databases, it is quite easy to see the limits of this approach. A typical query may involve joining several tables. While it works well for small numbers of records, the solution seems less feasible when the number of tables or the number of records in each table increases. All kinds of tweaks like data sharding, scaling vertically or adding indexes only help to increase performance up to a certain level. No magic can help if we are going to deal with hundreds of millions of records or join more than 10 tables.<br />
<br />
<b><span style="font-size: large;">Extension of relational database</span></b><br />
<br />
To solve the issue, developers have tried several techniques that may scale better than a traditional relational database. Here are some of them:<br />
<br />
<b>Database explosion</b><br />
<br />
This technique reverses the process of normalizing data in a relational database. For example, instead of joining the property table with the building, district or country tables, we can simply copy all the columns of the relevant records into the main table. As a consequence, duplication and redundancy happen. There is no single source of truth for sub-entities like building, district and country. In exchange, the joining part of the SQL query is simplified.<br />
<br />
Explosion is an expensive process that may take hours or even days to run. It sacrifices space and data freshness in order to increase real-time query performance.<br />
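As an illustration of the idea (the field names are invented for this sketch), explosion takes a property record plus the related records it would normally join with and produces one flat, redundant record:

```java
import java.util.HashMap;
import java.util.Map;

class Explosion {
    // Copy every column of the related district and country records into the
    // property record, prefixing the keys to avoid column-name clashes.
    static Map<String, Object> explode(Map<String, Object> property,
                                       Map<String, Object> district,
                                       Map<String, Object> country) {
        Map<String, Object> flat = new HashMap<>(property);
        district.forEach((column, value) -> flat.put("district_" + column, value));
        country.forEach((column, value) -> flat.put("country_" + column, value));
        return flat; // redundant, but queryable without any join
    }
}
```

The flat record answers queries like "properties in District 1" with a single-table scan, at the cost of re-running the explosion whenever the district data changes.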
<br />
<b>Adding document database</b><br />
<br />
In this technique, the relational database is still the source of truth. However, to provide text search, important fields are extracted and stored in a document database. For example, knowing that users will search for people by age, gender and name, we can create a document that contains this information plus the record id and store it in a Solr or Elasticsearch server.<br />
<br />
Real-time queries to the system will first be answered by searching the document database. The answer, which includes a bunch of record ids, will then be used by the relational database to load the records. In this approach, the document database acts like an external index that provides text search capability.<br />
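The two-step lookup can be sketched with in-memory maps standing in for the two stores (this is an illustration of the flow only, not real Solr or Elasticsearch client code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExternalIndexSearch {
    // Stand-in for the relational database: record id -> full row.
    static final Map<Long, Map<String, String>> relational = new HashMap<>();
    // Stand-in for the document index: searched value -> matching record ids.
    static final Map<String, List<Long>> nameIndex = new HashMap<>();

    static List<Map<String, String>> searchByName(String name) {
        // Step 1: the document database answers the search with record ids.
        List<Long> ids = nameIndex.getOrDefault(name, List.of());
        // Step 2: the relational database loads the full records by id.
        List<Map<String, String>> rows = new ArrayList<>();
        for (Long id : ids) {
            rows.add(relational.get(id));
        }
        return rows;
    }
}
```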
<br />
<b>Storing the whole data to a noSQL database</b><br />
<br />
The final choice is storing the data in an object-oriented or document database. However, this approach may add a lot of complexity to data maintenance.<br />
<br />
To visualize, we can store the whole property or person in the database. The property object contains its owner objects. In reverse, an owner object may include several property objects. In this case, it is quite a hassle to maintain the set of related objects when the data changes.<br />
<br />
For example, if a person purchases a property, we need to go to the property object to update the owner information and go to that person's object to update the property information.<br />
<br />
<b><span style="font-size: large;">Combining relational database and noSQL database</span></b><br />
<br />
<b>The limits of existing methods</b><br />
<br />
After scanning through the approaches mentioned above, let's try to find the limits of each one.<br />
<ul>
<li>A relational database normalizes data before storing it to avoid duplication and redundancy. However, by optimizing storage, it incurs additional effort on retrieving the data. Taking into consideration that a database is normally limited by querying time, not storage, this doesn't seem to be a good trade-off.</li>
</ul>
<ul>
<li>Explosion reverses the normalizing process, but it cannot offer fresh data, as explosion normally takes a long time to run. Comparing running an explosion with storing the whole entity object in an object-oriented database, the latter option may be easier to maintain.</li>
</ul>
<ul>
<li>Adding a document database offers text search, but I feel the flow should be reversed to improve scalability. A document database is faster for retrieval, while a relational database is better at describing relationships. Hence, it doesn't make sense to send the record ids from the document database back to the relational database just to retrieve records. What happens if there are millions of record ids to be found? Retrieving those records from a noSQL database is typically faster than from a relational database.</li>
</ul>
<ul>
<li>As mentioned above, when these entities are inter-linked, there is no easy way to separate them out for storage in an object-oriented database.</li>
</ul>
<div>
<b>Proposing combination of relational and noSQL database to store data</b></div>
<div>
<br /></div>
<div>
Thinking about these limits, I feel that the best way to store the data is to combine a relational and a document database. Both of them will act as sources of truth, storing what they each do best. Here is how we should split the data.</div>
<div>
<br /></div>
<div>
We store the data similarly to a traditional relational database but split the columns into two types:</div>
<div>
<ul>
<li>Columns that store id or foreign keys to other entity ids ("property_id", "owner_id",..) or unique fields</li>
</ul>
<ul>
<li>Columns that store data ("name", "age",...)</li>
</ul>
<div>
With this split, we can remove the data columns from the relational database schema. It is possible to keep some simple fields like "name" and "gender" if they give us some clues when looking at records. We then store the full entities in a document database, trying to avoid cross-references in the stored documents.</div>
</div>
<div>
<br /></div>
<div>
<b>Explain the approach by example</b></div>
<div>
<br /></div>
<div>
Let's try to visualize the approach by describing how we should implement some simple tasks:</div>
<div>
<ul>
<li>Storing a new property owned by a user</li>
<ul>
<li>Configure JPA to only store name and id for each main entity like person, property. Ignore data fields or sub-entities like building, district, country.</li>
<li>Store the property object to the relational database. As a result of the earlier step, only the id and name of the property are persisted.</li>
<li>Update property object with persisted ids.</li>
<li>Store property object to document database.</li>
<li>Store owner object to document database.</li>
</ul>
</ul>
<ul>
<li>Querying property directly</li>
<ul>
<li>Sending a query to the document database and retrieving the records directly.</li>
</ul>
</ul>
<div>
<ul>
<li>Querying property based on owner information</li>
<ul>
<li>Sending a query to the relational database to find all properties that belong to the owner.</li>
<li>Sending a query to the document database to find these properties by id.</li>
</ul>
</ul>
</div>
<div>
</div>
<div>
In the above steps, we store records to the relational database first because of its auto id generation. With this approach, we have a very thin relational database that only captures relationships among entities rather than the entities themselves.</div>
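The write and read paths above can be sketched end to end with in-memory maps standing in for the two databases (all identifiers here are hypothetical, chosen only for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class HybridStore {
    // Stand-in for relational auto id generation.
    static final AtomicLong sequence = new AtomicLong();
    // Thin relational side: id -> name, plus a link table of property id -> owner id.
    static final Map<Long, String> relationalNames = new HashMap<>();
    static final Map<Long, Long> ownerOfProperty = new HashMap<>();
    // Document side: id -> full entity.
    static final Map<Long, Map<String, Object>> documents = new HashMap<>();

    static long saveProperty(Map<String, Object> property, long ownerId) {
        long id = sequence.incrementAndGet();              // 1. relational insert yields the id
        relationalNames.put(id, (String) property.get("name"));
        ownerOfProperty.put(id, ownerId);
        property.put("id", id);                            // 2. update the entity with the persisted id
        documents.put(id, property);                       // 3. store the full entity as a document
        return id;
    }

    static List<Map<String, Object>> propertiesOf(long ownerId) {
        List<Map<String, Object>> result = new ArrayList<>();
        // The relational side resolves the relationship...
        ownerOfProperty.forEach((propertyId, owner) -> {
            if (owner == ownerId) {
                // ...and the document side returns the full entities.
                result.add(documents.get(propertyId));
            }
        });
        return result;
    }
}
```

A real implementation would put a JPA-backed database behind the "relational" maps and a document store behind `documents`, but the division of labour stays the same: the relational side owns ids and links, the document side owns the entities.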
<div>
<br /></div>
<div>
<b>Summary of the approach</b></div>
<div>
<br /></div>
<div>
Finally, let summarize the new approach</div>
<div>
<ul>
<li>Treating main entities as independent records.</li>
<li>Treating sub-entities as complex properties, to be included as part of main entities.</li>
<li>Storing the ids, names and foreign keys of main entities inside the relational database. The relational database serves as a bridge, linking independent objects in the noSQL database.</li>
<li>Storing main entities with updated ids to noSQL database.</li>
<li>Any CRUD operation will require committing to 2 databases at the same time.</li>
</ul>
<div>
Pros:</div>
</div>
<div>
<ul>
<li>Offloads data storage from the relational database and lets it do what it does best: storing relationships.</li>
<li>Best of both worlds: the text search and scalability of a noSQL database and the relation traversal of a relational database.</li>
</ul>
<div>
Cons:</div>
</div>
<div>
<ul>
<li>Maintaining 2 databases.</li>
<li>No single source of truth. Corruption in either of the two databases will cause data loss.</li>
<li>Code complexity.</li>
</ul>
<div>
Possible alternative</div>
<div>
<ul>
<li>Storing data to a graph database that offer text search capability. This is quite promising as well but I have not done any benchmark to prove feasibility.</li>
</ul>
</div>
<div>
<br /></div>
<div>
<span style="font-size: large;"><b>Conclusion</b></span></div>
</div>
<div>
<br /></div>
<div>
The solution is pretty complex, but I find it interesting because the scalability issue is solved at the code level rather than the database level. By splitting the data out, we may tackle the root cause of the issue and find some balance between performance and maintenance effort.</div>
<div>
<br /></div>
<div>
The complexity of this implementation is very high but there is no simple implementation for big data.</div>
</div>
Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com126tag:blogger.com,1999:blog-2481600383537713152.post-40539919289173095282015-06-11T23:53:00.001-07:002015-06-11T23:56:28.451-07:00Can java optimize empty array allocation?Yesterday I came across a simple optimization case; here is the original method:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">public String[] getSomeArray() {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if (nothing) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return new String[0];</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> // normal processing ignored for brevity </span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
At first sight the allocation looks quite wasteful, and I was tempted to carry out a micro-optimization like:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">private static final String[] EMPTY = new String[0];</span><br />
<span style="font-family: Courier New, Courier, monospace;">public String[] getSomeArray() {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if (nothing) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return EMPTY;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> // normal processing ignored for brevity </span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
However, another developer <a href="https://github.com/gandola">Pedro</a> rang the bell: maybe Java can JIT away the allocation altogether, and this does look like a very reasonable JIT target.<br />
<br />
Let's find out!<br />
<br />
<a href="http://openjdk.java.net/projects/code-tools/jmh/">jmh</a> to the rescue<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">@Benchmark</span><br />
<span style="font-family: Courier New, Courier, monospace;">public void test1() {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> for (int i = 0; i < 10000; i++) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> get1();</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;">public String[] get1() {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return new String[0];</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">private static final String[] CONST = {};</span><br />
<span style="font-family: Courier New, Courier, monospace;">@Benchmark</span><br />
<span style="font-family: Courier New, Courier, monospace;">public void test2() {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> for (int i = 0; i < 10000; i++) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> get2();</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;">public String[] get2() {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return CONST;</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
A benchmark run gave the following result, which showed that the two methods ran at pretty much the same speed; therefore, the actual allocation could indeed be optimized away.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">test$ java -jar target/benchmarks.jar -f 1</span><br />
<span style="font-family: Courier New, Courier, monospace;"># JMH 1.9.3 (released 28 days ago)</span><br />
<span style="font-family: Courier New, Courier, monospace;"># VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/bin/java</span><br />
<span style="font-family: Courier New, Courier, monospace;"># VM options: <none></span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup: 20 iterations, 1 s each</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Measurement: 20 iterations, 1 s each</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Timeout: 10 min per iteration</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Threads: 1 thread, will synchronize iterations</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Benchmark mode: Throughput, ops/time</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Benchmark: org.sample.MyBenchmark.test1</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># Run progress: 0.00% complete, ETA 00:01:20</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Fork: 1 of 1</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 1: 3177146862.839 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 2: 2969126090.532 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 19: 3904120378.974 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 20: 3368973982.889 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 1: 3273016452.646 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 2: 3720653112.375 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 19: 2940755393.888 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 20: 3490675218.425 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Result "test1":</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 3150112425.866 ±(99.9%) 346620443.427 ops/s [Average]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> (min, avg, max) = (2526859466.365, 3150112425.866, 3790445537.196), stdev = 399168618.122</span><br />
<span style="font-family: Courier New, Courier, monospace;"> CI (99.9%): [2803491982.439, 3496732869.293] (assumes normal distribution)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># JMH 1.9.3 (released 28 days ago)</span><br />
<span style="font-family: Courier New, Courier, monospace;"># VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/bin/java</span><br />
<span style="font-family: Courier New, Courier, monospace;"># VM options: <none></span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup: 20 iterations, 1 s each</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Measurement: 20 iterations, 1 s each</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Timeout: 10 min per iteration</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Threads: 1 thread, will synchronize iterations</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Benchmark mode: Throughput, ops/time</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Benchmark: org.sample.MyBenchmark.test2</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># Run progress: 50.00% complete, ETA 00:00:40</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Fork: 1 of 1</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 1: 2646209214.510 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 2: 3014719359.164 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 19: 3639571958.173 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Warmup Iteration 20: 3127621392.815 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 1: 3464961418.737 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 2: 2827541432.787 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">....</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 19: 2888880315.543 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">Iteration 20: 3109114933.979 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Result "test2":</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 3048325924.714 ±(99.9%) 269904767.209 ops/s [Average]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> (min, avg, max) = (2523324876.886, 3048325924.714, 3573386254.596), stdev = 310822731.303</span><br />
<span style="font-family: Courier New, Courier, monospace;"> CI (99.9%): [2778421157.505, 3318230691.923] (assumes normal distribution)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># Run complete. Total time: 00:01:20</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Benchmark Mode Cnt Score Error Units</span><br />
<span style="font-family: Courier New, Courier, monospace;">MyBenchmark.test1 thrpt 20 <b>3150112425.866</b> ± 346620443.427 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">MyBenchmark.test2 thrpt 20 <b>3048325924.714</b> ± 269904767.209 ops/s</span><br />
<span style="font-family: Courier New, Courier, monospace;">test$ </span><br />
<div>
<br /></div>
To be super conservative, let's check the assembly:<br />
<span style="font-family: Courier New, Courier, monospace;"><br class="Apple-interchange-newline" /> 0x00000001051ffa99: movabs $0x11e65a2c8,%rbx ; {metadata({method} {0x000000011e65a2c8} '<b>get1</b>' '()[Ljava/lang/String;' in 'org/sample/MyBenchmark')}</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 0x00000001051ffaa3: and $0x7ffff8,%edx</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 0x00000001051ffaa9: cmp $0x0,%edx</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> 0x0000000109ae1f51: movabs $0x122f3d440,%rbx ; {metadata({method} {0x0000000122f3d440} '<b>get2</b>' '()[Ljava/lang/String;' in 'org/sample/MyBenchmark')}</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 0x0000000109ae1f5b: and $0x7ffff8,%eax</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 0x0000000109ae1f61: cmp $0x0,%eax</span><br />
<br />
Now we see that the exact same native code was generated.<br />
<br />
Case closed.<br />
<br />
Java <b>does</b> optimize empty array allocation.<br />
<br />
<br />
Happy Coding!<br />
<br />
<div style="text-align: left;">
<span style="font-size: x-small;">by <a href="http://verydapeng.com/">Dapeng</a></span></div>
<br />Anonymoushttp://www.blogger.com/profile/17772064492702767078noreply@blogger.com4tag:blogger.com,1999:blog-2481600383537713152.post-79548139695468332922015-05-25T19:44:00.001-07:002015-05-25T20:02:14.611-07:00Rethinking database schema with RDF and OntologyWhen I joined the industry 10 years ago, my first project used a relational database. After that, my next project also used a relational database. And as you may guess, my next next projects also used relational databases. This went on for so long that I almost forgot that a table is just one format for storing data.<br />
<br />
I only became interested in other kinds of databases 4 years ago, when my company slowly moved into Big Data analysis and knowledge management. Over these years, my exposure to RDF and Ontology has given me an urge to revisit and rethink the approaches and principles of building databases.<br />
<br />
In the scope of this article, I will focus solely on the role of the database schema. Please do not worry if the terms RDF and Ontology sound unfamiliar to you; I will try my best to introduce these concepts in this article.<br />
<br />
<b><span style="font-size: large;">Background</span></b><br />
<br />
<b>Graph as knowledge database</b><br />
<br />
As everyone knows, creating a relational database starts with defining the database schema. It doesn't matter which database we choose, we still need to define tables, columns and any foreign keys before inserting data. Apparently, we need to decide what the data will look like before creating it.<br />
<br />
However, this is not a suitable approach for building a knowledge database. Because the system is supposed to store future knowledge rather than current knowledge, it is impossible to figure out in advance what the data will look like. Therefore, we turned our focus away from relational databases and looked for other solutions.<br />
<br />
There are many NoSQL databases that support schema-less data, but most of them do not fit our requirement because we want to store linked data. Hence, a graph database seems to be the most sensible choice for us.<br />
<br />
<b>Resource Description Framework</b><br />
<br />
A round of shopping in the market gave us an uneasy feeling, as there is no widely adopted standard for the query language. If we chose a graph database, we might end up writing a vendor-specific implementation, which doesn't seem to be a good strategy to start with.<br />
<br />
In an attempt to find some common standards, we found the Resource Description Framework (RDF), a W3C specification for modelling information in web resources. It seems to be the best choice for us because RDF comes with a very simple mechanism to describe resource linkage. The only side effect is that every resource needs to be identified by a URI.<br />
<br />
For example, to state that Apple produces the iPhone 5, which is sold at 600 USD, we need to generate two triples like the ones below.<br />
<br />
<i><http://example.org/Apple> <http://example.org/produce> <http://example.org/iPhone5></i><br />
<i><http://example.org/iPhone5> </i><i><http://example.org/price> '600 USD'@en</i><br />
<br />
We had no interest in using URIs as resource identifiers but still needed to do so in order to comply with the standard. However, we cannot blame RDF for this, because it was invented for the web. Its authors had a nice dream in which we could follow the URI of a resource to retrieve it. Obviously, that idea did not quite come true.<br />
<br />
Leaving the original idea of RDF to one side, the major benefit for us is the query language, SPARQL. For example, to find the price of any Apple phone, the query would be:<br />
<br />
<i>select ?phone ?price where {</i><br />
<i> <http://example.org/Apple> <http://example.org/produce> ?phone .</i><br />
<i> ?phone</i><i> </i><i><http://example.org/price> ?price</i><br />
<i>}</i><br />
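The same pattern matching can be sketched in a few lines of plain JavaScript, with triples held as arrays (the full URIs are shortened to bare names for readability; this is an illustration of the matching logic, not an RDF store):

```javascript
// A tiny in-memory triple store: each triple is [subject, predicate, object].
const triples = [
  ['Apple', 'produce', 'iPhone5'],
  ['iPhone5', 'price', '600 USD'],
];

// Emulate the SPARQL query: find every ?phone produced by Apple and its ?price.
function applePhonePrices(store) {
  return store
    .filter(([s, p]) => s === 'Apple' && p === 'produce')
    .flatMap(([, , phone]) =>
      store
        .filter(([s, p]) => s === phone && p === 'price')
        .map(([, , price]) => ({ phone, price })));
}

console.log(applePhonePrices(triples)); // [{ phone: 'iPhone5', price: '600 USD' }]
```

A real SPARQL engine generalises exactly this idea: each triple pattern in the `where` clause filters the store, and shared variables join the results.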
<br />
However, RDF is just a concept. In order to use RDF and SPARQL, we still needed to find a decent API or implementation, preferably in Java. This led us to two popular choices, Sesame and Apache Jena. Both APIs are widely accepted and have several vendor implementations. Even better, some vendors provide implementations for both of them.<br />
<br />
<b>Ontology</b><br />
<br />
In the example above, it is easy to see that to write a meaningful query, we need to know what the data looks like. Therefore, we still end up needing some kind of data schema. However, this schema should act more like metadata than a data definition. In particular, we do not need the schema to be defined before inserting data.<br />
<br />
To address this issue, there are two strategies. People started by defining common vocabularies for RDF. Because most of the time we already know the relationships among resources but not the resources themselves, a vocabulary is often good enough to help form SPARQL queries. However, given the wide scope of describing the world, no single vocabulary can fully express every domain. Therefore, there are many domain-specific vocabularies.<br />
<br />
While the first approach only tackles describing the possible relationships among resources, the second approach attempts to describe the resources as well. For example, to describe the resources in the previous example, we can define a class named Phone and a class named Manufacturer. After this, we can specify that a manufacturer can produce a phone.<br />
<br />
<i><http://example.org/Apple> <</i><i>http://example.org/type</i><i>> <</i><i>http://example.org/Manufacturer></i><br />
<i><http://example.org/iPhone5> </i><i><</i><i>http://example.org/type</i><i>> </i><i><</i><i>http://example.org/Phone></i><br />
<i><</i><i>http://example.org/Manufacturer> </i><i><http://example.org/produce> </i><i><</i><i>http://example.org/Phone></i><br />
<br />
The triples above form an Ontology. Compared with a vocabulary, we found an Ontology to be a more descriptive way of expressing a data schema, because it can tell us which kind of relationship is applicable to which kind of resource. Therefore, we will not waste time figuring out whether the iPhone 5 produces Apple or Apple produces the iPhone 5.<br />
<br />
Ontology plus RDF is a good combination for building a knowledge database. While the repository can be an RDF store that accepts any triple, we can build an Ontology in a separate space to model the knowledge in the main repository.<br />
<br />
From our point of view, it is better to use the Ontology to form queries rather than for data validation, because this slightly decouples the data from the schema. In practice, it is perfectly acceptable to insert data that the Ontology does not cover, as long as it does not contradict the Ontology.<br />
<br />
For example, with the Ontology defined earlier, we should allow inserting knowledge like<br />
<br />
<i><http://example.org/Apple> <http://example.org/produce> <http://example.org/iPod></i><br />
<i><http://example.org/Apple> <http://example.org/produce> <http://example.org/iPad></i><br />
<i><br /></i>
but we can reject any knowledge like<br />
<br />
<i><http://example.org/iPhone5></i><i> <http://example.org/produce> </i><i><http://example.org/Apple></i><br />
<br />
With this approach, we can import data first and model it later.<br />
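This "reject only contradictions" policy can be sketched in plain JavaScript. The helper names below are illustrative, not from any RDF library; a real implementation would lean on an RDF store and a reasoner:

```javascript
// Ontology: the 'produce' relationship holds from a Manufacturer to a Phone.
const ontology = { produce: { domain: 'Manufacturer', range: 'Phone' } };

// Known types of some resources; a resource may be untyped (unknown -> null).
const types = { Apple: 'Manufacturer', iPhone5: 'Phone' };

// Accept a triple unless it contradicts the ontology. Untyped resources are
// allowed through, so data can be imported first and modelled later.
function contradicts(subject, predicate, object) {
  const rule = ontology[predicate];
  if (!rule) return false; // predicate not modelled yet: accept
  const sType = types[subject] || null;
  const oType = types[object] || null;
  if (sType && sType !== rule.domain) return true;
  if (oType && oType !== rule.range) return true;
  return false;
}

console.log(contradicts('Apple', 'produce', 'iPod'));    // false: iPod untyped, accepted
console.log(contradicts('iPhone5', 'produce', 'Apple')); // true: a Phone cannot produce
```

Note how `<Apple> <produce> <iPod>` passes even though iPod is not yet in the Ontology, while `<iPhone5> <produce> <Apple>` is rejected because it contradicts the declared domain of `produce`.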
<br />
<span style="font-size: large;"><b>Database Schema</b></span><br />
<br />
<b>A refreshing thought</b><br />
<br />
Comparing our approach to building a knowledge database with relational databases makes me feel that the requirement to define a data schema before inserting data is driven by implementation concerns. If we temporarily set aside practical concerns like data indexing, it is perfectly possible to insert data first and define the data schema later.<br />
<br />
This also brought me another thought: is it feasible to have more than one data schema for the same piece of data? The idea may sound awkward at first, but it is quite realistic if we consider the daunting task of modelling the world. For example, if we think of Obama as the US president, we may care about when he started serving and when he will leave office; but if we think of Obama as any other US citizen, then we care more about his date of birth, residential area, social security number, and so on. In this way, a schema serves as a perspective through which we inspect and modify a resource.<br />
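The "one record, many perspectives" idea can be sketched by projecting a single flat set of facts through two different schemas. The field names and sample values here are purely illustrative:

```javascript
// All known facts about one resource, stored flat with no fixed schema.
const obama = {
  name: 'Barack Obama',
  termStart: 2009,
  termEnd: 2017,
  dateOfBirth: '1961-08-04',
  residentialArea: 'Washington, D.C.',
};

// Two schemas over the same record: each is just a list of visible fields.
const presidentSchema = ['name', 'termStart', 'termEnd'];
const citizenSchema = ['name', 'dateOfBirth', 'residentialArea'];

// Project a record through a schema, keeping only the fields it defines.
const view = (record, schema) =>
  Object.fromEntries(schema.filter(f => f in record).map(f => [f, record[f]]));

console.log(view(obama, presidentSchema)); // { name, termStart, termEnd }
console.log(view(obama, citizenSchema));   // { name, dateOfBirth, residentialArea }
```

The record itself never changes; only the perspective through which we read it does.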
<br />
So, if I could travel back to the time when people were discussing a common query language for databases, I would suggest adding some features to enrich SQL, or introducing a new, less strict kind of query language:<br />
<br />
<ul>
<li>Allow insertion without a table definition, automatically creating the table definition from the insertion command's parameters.</li>
<li>Make the id of a record unique per database rather than per table. A record can appear in many tables with the same id. An insertion command needs to specify which field is the id of the record.</li>
<li>Make the data definition more generic, without size constraints (varchar and int instead of varchar(255) or int(11)).</li>
<li>The select command must comply with the table definition.</li>
<li>Make it possible to share a field between two tables for the same record.</li>
</ul>
<div>
Before wrapping up this article, let us try a quick exercise: building a database that implements these extended features. The underlying storage system can be any schema-less engine, but we will use an RDF store for simplicity.</div>
<div>
<br /></div>
<div>
<b>Requirements</b></div>
<div>
<ul>
<li>Insert Obama into the US citizen table with name, age and gender. The identifier field is name.</li>
<li>Insert Obama into the US president table with name, age and elected year. The identifier field is name.</li>
<li>Define the US citizen table with fields name and age.</li>
<li>Define the US president table with fields name, age and elected year.</li>
<li>Selecting records from the US citizen table will only show name and age, as gender is not defined in the table definition.</li>
<li>Updating Obama's record in the US president table with a new age will affect the age in the US citizen table, because age is a shared field.</li>
</ul>
<div>
<b>Implementations</b></div>
</div>
<div>
<br /></div>
<div>
Step 1</div>
<div>
<ul>
<li>SQL: <i>insert into citizen(name, gender, age) values ('Barack Obama', 'Male', 53)</i></li>
<li>Triples:</li>
<ul>
<li><i><Barack Obama> <type> <citizen></i></li>
<li><i><Barack Obama> <name> 'Barack Obama'</i></li>
<li><i><Barack Obama> <gender> 'Male'</i></li>
<li><i><Barack Obama> <age> 53</i></li>
</ul>
</ul>
</div>
<div>
Step 2</div>
<div>
<ul>
<li>SQL: <i>insert into president(name, age, elected_year) values ('Barack Obama', 53, 2009)</i></li>
<li>Triples:</li>
<ul>
<li><i><Barack Obama> <type> <president></i></li>
<li><i><Barack Obama> <elected_year> 2009</i></li>
</ul>
</ul>
</div>
<div>
Step 3</div>
<div>
<ul>
<li>SQL: <i>create table citizen ('name' varchar, 'age' int, primary key ('name') )</i></li>
<li>Triples:</li>
<ul>
<li><i><citizen> <field> <name></i></li>
<li><i><citizen> <field> <age></i></li>
<li><i><citizen> <primary_key> <name></i></li>
</ul>
</ul>
<div>
<div>
Step 4</div>
<div>
<ul>
<li>SQL: <i>create table president ('name' varchar, 'age' int, 'elected_year' int, primary key ('name') )</i></li>
<li>Triples:</li>
<ul>
<li><i><president> <field> <name></i></li>
<li><i><president> <field> <age></i></li>
<li><i><president> <field> <elected_year></i></li>
<li><i><president> <primary_key> <name></i></li>
</ul>
</ul>
<div>
Step 5</div>
</div>
</div>
</div>
<div>
<ul>
<li>SQL: <i>select * from citizen</i></li>
<li>SPARQL: <i>select ?record ?field_name ?field_value where {</i></li>
<ul>
<li><i>?record <type> <citizen> .</i></li>
<li><i><citizen> <field> ?field_name .</i></li>
<li><i>?record ?field_name ?field_value</i></li>
</ul>
<li><i>}</i></li>
</ul>
</div>
<div>
Step 6</div>
<div>
<ul>
<li>SQL: <i>update president set age=54 where name='Barack Obama'</i></li>
<li>Overwrite <i><Barack Obama> <age> 53</i> with <i><Barack Obama> <age> 54</i></li>
</ul>
</div>
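The six steps above can be pulled together into one runnable sketch: record ids are unique per database, table definitions can arrive after insertion, selects project only the defined fields, and a shared field updates everywhere. This is a toy sketch with plain JavaScript objects standing in for the RDF store; the function names are invented for illustration:

```javascript
// One shared record per id across all tables; a table only tracks membership
// and its field definition, so a field like 'age' is naturally shared.
const records = {}; // id -> { field: value }
const tables = {};  // table -> { members: Set<id>, fields: string[] | null }

function insert(table, idField, row) {
  const id = row[idField];
  records[id] = { ...(records[id] || {}), ...row }; // merge into the shared record
  tables[table] = tables[table] || { members: new Set(), fields: null };
  tables[table].members.add(id);
  if (!tables[table].fields) tables[table].fields = Object.keys(row); // auto definition
}

function defineTable(table, fields) {
  tables[table] = tables[table] || { members: new Set(), fields: null };
  tables[table].fields = fields; // the definition may arrive after the data
}

function select(table) {
  const { members, fields } = tables[table];
  return [...members].map(id =>
    Object.fromEntries(fields.map(f => [f, records[id][f]])));
}

function update(table, id, changes) {
  Object.assign(records[id], changes); // shared record: visible from every table
}

// Steps 1-2: insert first, with no table definitions yet.
insert('citizen', 'name', { name: 'Barack Obama', gender: 'Male', age: 53 });
insert('president', 'name', { name: 'Barack Obama', age: 53, elected_year: 2009 });
// Steps 3-4: define the tables afterwards.
defineTable('citizen', ['name', 'age']);
defineTable('president', ['name', 'age', 'elected_year']);
// Step 5: gender is hidden because the citizen definition omits it.
console.log(select('citizen'));   // [{ name: 'Barack Obama', age: 53 }]
// Step 6: updating age via the president table also changes the citizen view.
update('president', 'Barack Obama', { age: 54 });
console.log(select('citizen'));   // [{ name: 'Barack Obama', age: 54 }]
```

Swapping the object maps for a real triple store would leave the interface unchanged: `records` corresponds to the data triples and `tables` to the definition triples shown in the steps above.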
<div>
<br /></div>
<div>
<b>Conclusions</b></div>
<div>
<br /></div>
I think some of the ideas above are a bit hard to digest for audiences who are not familiar with RDF or Ontology. Still, I hope they raise some more thoughts on bridging the gap between relational databases and knowledge databases.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com2tag:blogger.com,1999:blog-2481600383537713152.post-58837345220421558142015-05-12T22:15:00.000-07:002015-05-12T22:15:16.570-07:00First Agile impressionLast year, we had a mass recruitment drive for Java developers with various levels of experience. Unfortunately, in this part of the world (Asia), Agile has not been widely adopted. Therefore, we ended up spending extra effort getting new members familiar with Agile and XP.<br />
<br />
Over the team-forming process, which lasted about six months, most of the new team members provided positive feedback about working in an Agile environment. In this article, I would like to share one piece of that feedback with you. Enjoy!<br />
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: white; color: black; font-family: Arial; font-size: 13px; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i>It has been a quickfire six months working in an agile environment. This is my first foray into agile software development in nearly a decade of working in software. Hearing and reading about it could not have prepared me for the actual experience of working in this peculiar environment. However, the overwhelming sentiment for me is - what a breath of fresh air!</i></span></div>
<b id="docs-internal-guid-1d5bc3ef-4baf-819e-1775-219060d25407" style="font-weight: normal;"><i><br /></i></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: white; color: black; font-family: Arial; font-size: 13px; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i>The morning standups were unsettling, to put it mildly. Suddenly, you stopped having excuses for not having done anything the day before :). They take some getting used to, but slowly I’m learning that it is more a sharing session than a status update. Easier said than done, but I’m getting there. Not sure what you're going to do today? Tell everyone - there’s always something to do. </i></span></div>
<b style="font-weight: normal;"><i><br /></i></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: white; color: black; font-family: Arial; font-size: 13px; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i>I like the inherent openness that Agile brings to the table. Seemingly mundane things like outstanding tasks become more explicit and we are all the better for it. Finish a task at hand? No one's stopping you from going to the board and picking a new one. Stuck with a sticky issue? Bring it up in the standup and more often than not, offers of help can be expected. Oftentimes though, a quick holler is all that will be required.</i></span></div>
<b style="font-weight: normal;"><i><br /></i></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: white; color: black; font-family: Arial; font-size: 13px; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i>Whole days dedicated to planning and retrospectives demand concentration and focus and more often than not, creativity. I have found that they mean less disruption during the iteration for the most important work of all - writing good code. Worthwhile, no?</i></span></div>
<b style="font-weight: normal;"><i><br /></i></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: white; color: black; font-family: Arial; font-size: 13px; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i>We practice pair programming. For someone new to this, the intensity is unexpected as you try to align yourself to your pair's thought process in a continuous back-and-forth cycle. Overall, I have found it to be quite draining, but I believe the upside cannot be underestimated. Each pairing session, even with the same colleague, seems to involve a whole new dynamic and the constant adjustment needed can be likened to a skill. Working in such close proximity can be a double-edged sword, though. Some friction is inevitable and I have often experienced a whole range of emotions whilst working through a pairing session. I try to manage these emotions and reflect upon them afterwards to understand why I’d felt the way I’d felt and how I could have done better. It is undeniably rewarding though - I have learnt a great deal about myself and my pairs during those sessions.</i></span></div>
<b style="font-weight: normal;"><i><br /></i></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: white; color: black; font-family: Arial; font-size: 13px; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i>Test-driven development is standard practice here. Mastering, or rather, adhering to the red-green-refactor pattern seemed counter-intuitive at first, but it starts to make sense after a while. In my limited experience, writing tests becomes more fluid with practice. Since there is almost never an excuse for not providing test coverage for any code that will see the light of day (much less production code!), buckling down and writing that test will be a good habit to develop and one which I am convinced will prove to be an invaluable skill and an integral part in my journey to become a better developer.</i></span></div>
<i><br /><span style="background-color: white; font-family: Arial; font-size: 13px; vertical-align: baseline; white-space: pre-wrap;">Building software has never been quite so engaging and dare I say it..fun. Having a close knit team definitely helps. Hopefully, we can keep the good momentum and spirit going as we welcome new members into our fold and the workload ramps up. It will not be easy, but nobody said it would be. Whatever the future holds, I await with bated breath. Onwards.</span></i><br />
<span style="background-color: white; font-family: Arial; font-size: 13px; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="background-color: white; font-family: Arial; font-size: 13px; vertical-align: baseline; white-space: pre-wrap;"><a href="https://plus.google.com/+ZhiliangOng/posts"><i>Zhi Liang</i></a></span>Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com4tag:blogger.com,1999:blog-2481600383537713152.post-56355391726668569492015-04-18T01:37:00.001-07:002015-04-18T02:00:10.815-07:00Migrating Spring Web MVC from JSP to AngularJS<pre><h3>
<span style="font-family: Verdana, sans-serif;">Target Audience</span></h3>
<span style="font-family: Verdana, sans-serif;">This article is written for Spring Web MVC developers who are familiar with JSP development and who would like to understand how to migrate to a client side Javascript framework like AngularJS.</span></pre>
<pre style="text-align: justify;"><h3>
<span style="font-family: Verdana, sans-serif;">Sample Pet Clinic for reference</span></h3>
<span style="font-family: Verdana, sans-serif;">An example of the Spring Pet Clinic application that we revamped as an AngularJS app with an updated design can be found <a href="https://github.com/singularity-sg/spring-petclinic" target="_blank">here</a>. You can refer to this project to see how we introduced AngularJS.</span><h3>
<span style="font-family: Verdana, sans-serif;">Introduction to AngularJS</span></h3>
<span style="font-family: Verdana, sans-serif;">AngularJS is a Javascript framework created at Google that touts itself as a "Superheroic Web MVW Framework" (where the "W" in "MVW" is a tongue-in-cheek reference to "Whatever", covering the various <a href="http://blogs.k10world.com/technology/difference-between-mvc-vs-mvp-vs-mvvm/" target="_blank">MVx architectures</a>). As it is based on an MVx architecture, AngularJS brings structure to Javascript development and thus gives Javascript an elevated status compared to a traditional Spring + JSP application, which probably uses Javascript only to provide a bit of interactivity on the user interface. With AngularJS, your Javascript application also inherits features like dependency injection, HTML vocabulary extension (via custom directives), unit and functional testing integration, as well as DOM selectors ala JQuery (using jqLite, which provides only a subset of JQuery, although you could easily use JQuery itself if you prefer). AngularJS also introduces scopes to your Javascript code, so that variables declared in your code are bound only to the scope that requires them. This prevents the variable pollution that inadvertently arises as your Javascript codebase grows.
When you develop a Spring Web MVC application using JSP, you will likely use the Spring-provided form tags to bind your form inputs to a server-side model. Similarly, </span><span style="font-family: Verdana, sans-serif;">AngularJS provides a way to bind form inputs to models on the client side. In fact, it provides instantaneous two-way data binding between the form inputs and the model in your Javascript application. That means that not only is your view updated with changes to your Javascript model, any changes you make on the UI will also update the Javascript model (and consequently any other view that is bound to that model). It is almost magical to see all the views that are bound to the same JS model update themselves automatically. Moreover, since a model can be set on a particular scope, only views that belong to the same scope will be affected, allowing you to sandbox code that should be local to a particular portion of your view. (This is done via an AngularJS attribute called ng-controller that is set in your HTML templates.) You can see the difference in a later section comparing JSP tags and AngularJS directives.</span></pre>
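Conceptually, this two-way flow can be sketched with a watcher list and a digest pass in plain JavaScript. This is a very loose, simplified stand-in for what AngularJS does under the hood: `makeScope` and its `$watch`/`$digest` methods here are illustrative, not the real AngularJS implementation:

```javascript
// A minimal scope with watchers: each watcher re-runs its listener whenever
// the watched expression's value changes during a digest pass.
function makeScope() {
  const scope = { $$watchers: [] };
  scope.$watch = (getter, listener) =>
    scope.$$watchers.push({ getter, listener, last: undefined });
  scope.$digest = () => {
    let dirty;
    do {
      dirty = false;
      for (const w of scope.$$watchers) {
        const value = w.getter(scope);
        if (value !== w.last) {
          w.listener(value, w.last);
          w.last = value;
          dirty = true; // a listener may change other watched values
        }
      }
    } while (dirty); // keep digesting until the model stabilises
  };
  return scope;
}

// Model -> view: re-render whenever scope.name changes.
const scope = makeScope();
let rendered = '';
scope.$watch(s => s.name, value => { rendered = `Hello ${value}`; });

scope.name = 'Basil'; // as if the user typed into a bound input (view -> model)
scope.$digest();      // as if triggered by the input event
console.log(rendered); // "Hello Basil"
```

In real AngularJS, directives such as ng-model register these watchers for you and trigger digests on DOM events, which is why the binding feels automatic.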
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">
You can see an illustration of the difference here
</span><div class="separator" style="clear: both; text-align: center;">
<a href="https://github.com/singularity-sg/singularity-sg.github.io/raw/master/assets/images/one-way-binding.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Verdana, sans-serif;"><img border="0" height="204" src="https://github.com/singularity-sg/singularity-sg.github.io/raw/master/assets/images/one-way-binding.png" width="320" /></span></a></div>
<span style="font-family: Verdana, sans-serif;">
</span><div class="separator" style="clear: both; text-align: center;">
<a href="https://github.com/singularity-sg/singularity-sg.github.io/raw/master/assets/images/two-way-binding.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Verdana, sans-serif;"><img border="0" height="232" src="https://github.com/singularity-sg/singularity-sg.github.io/raw/master/assets/images/two-way-binding.png" width="320" /></span></a></div>
<span style="font-family: Verdana, sans-serif;">
You can also see this in action with the following embedded plunker code below.
<iframe height="300px" src="http://embed.plnkr.co/yAGCsQRH3M4LtysFy0fF/" width="100%"></iframe>
Using AngularJS, it is possible to write relatively complex user interfaces in an organised and elegant manner, always encapsulating the required logic within your components and never running the risk of errant global Javascript variables polluting your scope. It is also very testable, and there are built-in mechanisms to perform tests at the unit and functional level, ensuring that your user interface codebase goes through the same rigorous testing as your Java/Spring code, ensuring quality even at the user interface.
Another advantage of using AngularJS to write your html templates is that the templates remain essentially valid html even with the various frontend logic baked into the view. It is possible to incorporate AngularJS logic into a template and still pass HTML validation. Try viewing a JSP file with all its template logic in place from a browser, and the browser will most likely give up rendering the page. Here is what a typical AngularJS template looks like:
<pre class="brush:html" title="Sample AngularJS Template"><div class="row thumbnail-wrapper">
<div data-ng-repeat="pet in currentOwner.pets" class="col-md-3">
<div class="thumbnail">
<img data-ng-src="images/pets/pet{{pet.id % 10 + 1}}.jpg"
class="img-circle" alt="My Pet Image">
<div class="caption">
<h3 class="caption-heading" data-ng-bind="pet.name">Basil</h3>
<p class="caption-meta" data-ng-bind="pet.birthdate">08 August 2012</p>
<p class="caption-meta"><span class="caption-label"
data-ng-bind="pet.type.name">Hamster</span></p>
</div>
<div class="action-bar">
<a class="btn btn-default" data-toggle="modal" data-target="#petModal"
data-ng-click="editPet(pet.id)">
<span class="glyphicon glyphicon-edit"></span> Edit Pet
</a>
<a class="btn btn-default">
<span></span> Add Visit
</a>
</div>
</div>
</div>
</div>
</pre>
You can probably spot some non-HTML additions to the template, including attributes like </span><span style="font-family: Courier New, Courier, monospace;">data-ng-click</span><span style="font-family: Verdana, sans-serif;">, which maps a click on a button to a method call. There is also </span><span style="font-family: Courier New, Courier, monospace;">data-ng-repeat</span><span style="font-family: Verdana, sans-serif;">, which loops through a JSON array and generates the necessary html to render the same view for each item in the array. Yet with all this logic in place, we are still able to validate and view the html template from the browser. AngularJS calls all the non-html tags and attributes "directives", and the purpose of these directives is to enhance the capabilities of HTML. AngularJS also supports both HTML 4 and 5, so templates that still rely on HTML 4 DOCTYPEs should work fine, although HTML 4 validators will not recognize the data-ng-x attributes.</span><h3>
<span style="font-family: Verdana, sans-serif;">Scopes in AngularJS</span></h3>
<span style="font-family: Verdana, sans-serif;">One important concept to grasp in AngularJS is that of scopes. In the past, whenever I had to write javascript for my web application, I had to manage variable names and construct specially name-spaced objects in order to store my scoped properties. AngularJS does this for you automatically, based on its MVx concept. Every directive inherits a scope from its controller (or, if you prefer, an isolate scope that does not inherit scope properties), and the properties and variables created in this scope do not pollute the other scopes or the global context.
Scopes are the "glue" of an AngularJS application. Controllers in AngularJS use scopes to interact with the views. Scopes are also used to pass models and properties between directives and controllers. The advantage of this is that we are now forced to design our application in a way in which components are self-contained, and relationships between components have to be considered carefully through the use of a model that can be prototypically inherited from a parent scope.
A scope can be nested in another scope prototypically, in the same way Javascript implements its inheritance model via prototyping. However, one must be careful: any property declared in the child scope with the same name as a parent property will hide the parent property from the child scope thereafter. An example of this is shown in the code below.</span></pre>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">
</span></pre>
<iframe height="300px" src="http://embed.plnkr.co/UI4hwhhy1UHM5UlMTK71/index.html" width="100%"></iframe>
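The shadowing pitfall is ordinary JavaScript prototypal inheritance at work and can be reproduced without AngularJS at all:

```javascript
// Child scopes inherit prototypally from their parent, as in AngularJS.
const parentScope = { name: 'Basil' };
const childScope = Object.create(parentScope);

console.log(childScope.name);  // 'Basil' - read through the prototype chain

// Writing a primitive on the child creates a NEW own property that shadows
// the parent's; the parent is untouched and the two values now diverge.
childScope.name = 'Hamster';
console.log(childScope.name);  // 'Hamster'
console.log(parentScope.name); // 'Basil' - still the old value

// A common rule of thumb in the AngularJS community is to bind to a property
// on an object ("always use a dot"), so the write goes through the shared
// object reference instead of shadowing it.
const parent2 = { pet: { name: 'Basil' } };
const child2 = Object.create(parent2);
child2.pet.name = 'Hamster';
console.log(parent2.pet.name); // 'Hamster' - shared via the object reference
```

This is why a two-way binding like ng-model="name" in a child scope can silently disconnect from the parent, while ng-model="pet.name" keeps both in sync.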
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">At the very top of the hierarchy of scopes is the $rootScope, a scope that is accessible globally and can be used as a last resort to share properties and models across the whole application. Its use should be minimized, as it introduces a sort of "global" variable that can pose the usual problems when over-used.
More information about scopes can be gleaned from the AngularJS documentation found <a href="https://docs.angularjs.org/guide/scope" target="_blank">here</a> .</span><h3>
<span style="font-family: Verdana, sans-serif;">Directives in AngularJS</span></h3>
<span style="font-family: Verdana, sans-serif;">Directives are one of the most important concepts in AngularJS, and they are the foundation of how AngularJS manifests itself in the markup. A directive is essentially any additional customized markup, in the form of elements, attributes, classes or comments, that gives the HTML new functionality.
Consider the following code snippet, which demonstrates a customized directive called </span><span style="font-family: Courier New, Courier, monospace;">wdsCustom</span><span style="font-family: Verdana, sans-serif;"> that replaces the markup element </span><span style="font-family: Courier New, Courier, monospace;"><wds-custom company="wds"></span><span style="font-family: Verdana, sans-serif;"> with markup containing information about a model called "wds" declared in the controller scope that wraps the directive. You can have a look at the files </span><span style="font-family: Courier New, Courier, monospace;">app.js</span><span style="font-family: Verdana, sans-serif;">, </span><span style="font-family: Courier New, Courier, monospace;">index.html</span><span style="font-family: Verdana, sans-serif;"> and the directive template </span><span style="font-family: Courier New, Courier, monospace;">wds-custom-directive.html</span><span style="font-family: Verdana, sans-serif;"> to see how this works in the plunkr snippet below.
<iframe height="300px" src="http://embed.plnkr.co/cP179vrMvavJieCXVe1X/" width="100%"></iframe>
As this article does not attempt to teach you how to write a directive, you can refer to the official documentation <a href="https://docs.angularjs.org/guide/directive" target="_blank">here</a>.</span></pre>
<pre style="text-align: justify;"><h3>
<span style="font-family: Verdana, sans-serif;">Differences in architecture between JSP and AngularJS</span></h3>
<span style="font-family: Verdana, sans-serif;">When one migrates from a server-side templating engine like JSP or <a href="http://www.thymeleaf.org/" target="_blank">Thymeleaf</a> to a Javascript-based templating engine, one experiences a paradigm shift towards a client-server architecture. One has to stop thinking of the view as part of the web application and instead conceive of the web application as two separate applications, client-side and server-side. An illustration of this is in the next diagram, which shows how a Spring application becomes a provider of RESTful web services, servicing various front-end applications including an AngularJS browser-based application, and possibly providing services for mobile clients like tablets or smartphones. These services could include OAuth, authentication and other business logic services which should be hidden from public view. Bear in mind that any data or business logic published in the form of JSON or Javascript files is exposed for the client side to see. Thus, any business-sensitive logic or workflow that should not be exposed should only be performed on the backend.
</span></pre>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">
</span></pre>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">Also, when working in AngularJS, we prefer to encapsulate form submissions in a JSON object which is sent to the backend RESTful service via an AngularJS HTTP POST call (instead of using HTML form submission). Validation can be performed on the front end using AngularJS's built-in input validation (or your own custom validation, which is also easily written) before posting the data to the server. Of course, it is always prudent to validate the same data on the server side as well, to ensure that clients that do not check their data cannot compromise the integrity of the data on the server.
</span><span style="font-family: Verdana, sans-serif;">
</span><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOwjLZMhRqXhC4iOCJ7D2IzRFWmcjY_OUJq7CZLTZwKl2VS9sf_gxwaD0aBOalTyCxrrIPkzCs7X5AUSDizDomfTEmunBNaM7lG3_bx0T4Ho7C-nt9Za04kGbvRe7t_0J6ccsI1nzKBw/s1600/angularjs-spring.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOwjLZMhRqXhC4iOCJ7D2IzRFWmcjY_OUJq7CZLTZwKl2VS9sf_gxwaD0aBOalTyCxrrIPkzCs7X5AUSDizDomfTEmunBNaM7lG3_bx0T4Ho7C-nt9Za04kGbvRe7t_0J6ccsI1nzKBw/s1600/angularjs-spring.png" height="448" width="640" /></a></div>
<h3>
<span style="font-family: Verdana, sans-serif;">
Application structure</span></h3>
<span style="font-family: Verdana, sans-serif;">One question when migrating over to AngularJS is how to organise our AngularJS + Spring application. At my company, WDS, we use Maven as our dependency and package management tool for Java/Spring, and that influenced where we decided to place our AngularJS application. The AngularJS application is created within src/main/webapp and the main files are
<div style="text-align: justify;">
<pre class="brush:xml" title="AngularJS Folder Structure">components/ # the various components are stored here.
js/app.js # where we bootstrap the application
plugins/ # additional external plugins e.g. jquery.
services/ # common services are stored here.
images/
videos/
</pre>
</div>
</span></pre>
<span style="font-family: Verdana, sans-serif;">
You can see an image capture of the folder structure in Eclipse below.</span>
<br />
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">
</span></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUt0EkoKlu9DUiA2hi-GesvBV5-QOgEPyKUOB43cMs5Mbc8IIcfzjJBXn7LQ91MkvVTVbRO38KMtDoJEzu0_P67cqywaSSgHFoejAntfqulOzH_7GmLWwGPRqt1RLnAynEY75KAvs2mw/s1600/Screen+Shot+2015-04-15+at+9.01.31+pm.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUt0EkoKlu9DUiA2hi-GesvBV5-QOgEPyKUOB43cMs5Mbc8IIcfzjJBXn7LQ91MkvVTVbRO38KMtDoJEzu0_P67cqywaSSgHFoejAntfqulOzH_7GmLWwGPRqt1RLnAynEY75KAvs2mw/s1600/Screen+Shot+2015-04-15+at+9.01.31+pm.png" height="400" width="281" /></a></div>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">
</span></pre>
<div class="separator" style="clear: both; text-align: left;">
</div>
<pre style="text-align: justify;"></pre>
<pre style="text-align: justify;"></pre>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">
</span></pre>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;">This way of grouping resources follows the feature-grouping method. You can also group resources by type, e.g. placing all your controllers, services and views into their namesake folders. There are pros and cons to either option, but we prefer feature grouping to contain our resources.</span></pre>
<pre style="text-align: justify;"><pre><h3>
<span style="font-family: Verdana, sans-serif;">Comparisons between AngularJS and JSP custom tags</span></h3>
<div>
<pre><span style="font-family: Verdana, sans-serif;">If you have used Spring's custom form tags in your JSPs for developing your forms, you may be wondering if AngularJS provides the same kind of convenience for mapping form inputs to objects. The answer is yes! As a matter of fact, it is easy to bind any HTML element to a Javascript object. The only difference is that now the binding occurs on the client-side instead of the server-side.</span></pre>
<pre class="brush: html" title="Forms in JSP"><form:form method="POST" commandName="user">
<table>
<tr>
<td>User Name :</td>
<td><form:input path="name" /></td>
</tr>
<tr>
<td>Password :</td>
<td><form:password path="password" /></td>
</tr>
<tr>
<td>Country :</td>
<td>
<form:select path="country">
<form:option value="0" label="Select" />
<form:options items="${countryList}" itemValue="countryId" itemLabel="countryName" />
</form:select>
</td>
</tr>
</table>
</form:form>
</pre>
Here is an example of the same form in AngularJS
<br />
<pre class="brush: html" title="Forms in AngularJS"><form name="UserForm" data-ng-controller="ExampleUserController">
<table>
<tr>
<td>User Name :</td>
<td><input data-ng-model="user.name" /></td>
</tr>
<tr>
<td>Password :</td>
<td><input type="password" data-ng-model="user.password" /></td>
</tr>
<tr>
<td>Country :</td>
<td>
<select data-ng-model="user.country" data-ng-options="country as country.label for country in countries">
<option value="">Select</option>
</select>
</td>
</tr>
</table>
</form>
</pre>
</div>
<span style="font-family: Verdana, sans-serif;">Form inputs in AngularJS are augmented with additional capabilities, such as the ngRequired directive that makes a field mandatory based on certain conditions. There is also built-in validation for checking ranges, dates, patterns, etc. You can find out more in AngularJS's official documentation <a href="https://docs.angularjs.org/api/ng/input" target="_blank">here</a>, which covers all the relevant form input directives.</span></pre>
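Since any client can bypass browser-side checks, the same rules must run again on the server, as noted earlier. A minimal sketch of such a server-side mirror (field rules and the function name are illustrative, not from the sample application):

```javascript
// Mirror of the client-side rules on the server (rules are illustrative):
// AngularJS may already have validated these, but the server cannot trust that.
function validateUser(user) {
  const errors = [];
  if (!user.name || user.name.trim() === '') errors.push('name is required');
  if (!user.password || user.password.length < 8) errors.push('password must be at least 8 characters');
  if (user.country === undefined || user.country === '') errors.push('country is required');
  return errors; // an empty array means the submission is acceptable
}
```

The backend rejects the POST with the collected errors whenever the array is non-empty.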
<h3>
<span style="font-family: Verdana, sans-serif;">Considerations when moving from JSP to AngularJS</span></h3>
<span style="font-family: Verdana, sans-serif;">If you are considering migrating from JSP to AngularJS, there are a few factors to consider for a successful migration. </span><h4>
<span style="font-family: Verdana, sans-serif;">Converting your Spring controllers to RESTful services</span></h4>
<span style="font-family: Verdana, sans-serif;">You will need to transform your controllers so that, instead of forwarding the response to a templating engine to render a view, they provide services whose return values are serialised into JSON. The following example shows how a standard Spring MVC controller RequestMapping uses a ModelAndView object to render a view with the Owner described by the URL mapping.
<pre class="brush:java">@RequestMapping("/api/owners/{ownerId}")
public ModelAndView showOwner(@PathVariable("ownerId") int ownerId) {
ModelAndView mav = new ModelAndView("owners/ownerDetails");
mav.addObject(this.clinicService.findOwnerById(ownerId));
return mav;
}
</pre>
A controller RequestMapping like this can be converted into an equivalent RESTful service that returns the owner for the given ownerId. Your template can then be moved into AngularJS, which binds the owner object to the AngularJS template.
<pre class="brush:java">@RequestMapping(value = "/api/owners/{id}", method = RequestMethod.GET)
public @ResponseBody Owner find(@PathVariable Integer id) {
return this.clinicService.findOwnerById(id);
}
</pre>
In order for Spring MVC to convert your return object to JSON, you need to configure your Spring context to include a message converter that uses Jackson 2 for JSON serialisation. A snippet that performs this Spring context configuration is shown below.
<pre class="brush:xml"><bean id="objectMapper" class="org.springframework.http.converter.json.Jackson2ObjectMapperFactoryBean" p:indentOutput="true" p:simpleDateFormat="yyyy-MM-dd'T'HH:mm:ss.SSSZ"></bean>
<mvc:annotation-driven conversion-service="conversionService" >
<mvc:message-converters>
<bean class="org.springframework.http.converter.json.MappingJackson2HttpMessageConverter" >
<property name="objectMapper" ref="objectMapper" />
</bean>
</mvc:message-converters>
</mvc:annotation-driven>
</pre>
Once you have converted your controllers into RESTful services, you can then access these resources from your AngularJS application.
One nice way to access RESTful services in AngularJS is the $resource service from the ngResource module, which lets you access your RESTful endpoints in an elegant and concise manner. The following JavaScript illustrates how to consume RESTful services this way:
<pre class="brush:javascript">var Owner = ['$resource','context', function($resource, context) {
return $resource(context + '/api/owners/:id');
}];
app.factory('Owner', Owner);
var OwnerController = ['$scope','$state','Owner',function($scope,$state,Owner) {
$scope.$on('$viewContentLoaded', function(event){
$('html, body').animate({
scrollTop: $("#owners").offset().top
}, 1000);
});
$scope.owners = Owner.query();
}];
</pre>
The above snippet shows how a "resource" can be created by declaring an Owner resource and then registering it as an Owner factory. The controller can then use this service to query for Owners from the RESTful endpoint. In this way, you can easily create the resources your application requires and map them onto your business domain model. This declaration is done only once, in the app.js file. You can take a look at this file in action here</span></pre>
<pre style="text-align: justify;"><span style="font-family: Verdana, sans-serif;"> </span></pre>
<pre style="text-align: justify;"><h4>
<span style="font-family: Verdana, sans-serif;">Synchronizing states between the backend and your AngularJS application</span></h4>
<div>
<span style="font-family: Verdana, sans-serif;">Synchronizing states is something that needs to be managed when you are developing a client-server architecture. You will need to give some thought to how your application updates its state from the backend or refresh its view whenever some state changes.</span></div>
<div>
<h4>
<span style="font-family: Verdana, sans-serif;">Authentication</span></h4>
<span style="font-family: Verdana, sans-serif;">Having your client-side code exposed to the public makes it even more important to think through how you would like to authenticate your users and maintain session with your application. (Stateless or Stateful ?) . </span></div>
<div>
<h4>
<span style="font-family: Verdana, sans-serif;">Testing</span></h4>
<span style="font-family: Verdana, sans-serif;">AngularJS comes with the necessary tools to help you perform testing at all layers of your Javascript development from unit to functional testing. Planning how you test and perform builds incorporating those tests will determine the quality of your front end client. We use a maven plugin called </span><a href="https://github.com/eirslett/frontend-maven-plugin" style="font-family: Verdana, sans-serif;" target="_blank">frontend-maven-plugin</a><span style="font-family: Verdana, sans-serif;"> to assist us in our build tests.</span></div>
<div>
<h3>
<span style="font-family: Verdana, sans-serif;">Conclusion</span></h3>
</div>
<span style="font-family: Verdana, sans-serif;">Migrating to AngularJS from JSP may seem daunting but it can be very rewarding in the long run as it makes for a more maintainable and testable user interface. The trend towards client side rendered views also encourages building more responsive web applications that were previously hampered by the design in server side rendering. The advent of HTML 5 and CSS3 has ushered us to a new era in View rendering technologies. The future of View layer development is on the client side and the future is now.</span></pre>
Lim Hanhttp://www.blogger.com/profile/00238107374789601761noreply@blogger.com38tag:blogger.com,1999:blog-2481600383537713152.post-49316813624745760762015-03-12T03:38:00.001-07:002015-03-12T06:29:00.694-07:00Authentication Mechanisms for Web ApplicationsAuthentication is a basic requirement for most websites. However, there are many mechanisms for implementing authentication, and they are not very interchangeable. Depending on business requirements, developers need to choose the most appropriate authentication method for their application. That may not be an easy task unless one understands the differences among the mechanisms well.<br />
<br />
In this short article, I would like to recap four popular mechanisms for implementing authentication, compare them, and discuss how to choose the most appropriate method.<br />
<br />
<b><span style="font-size: large;">Background</span></b><br />
<br />
In a standard Java web application, there are two kinds of metadata normally included in a request: session and authentication. The session information helps make the HTTP protocol stateful, while the authentication information helps identify the user.<br />
<br />
Usually, one header contains the authentication information and another contains the session information, or a single header contains both. It is also possible for a request to carry only one of the two.<br />
<br />
When HTTP was first introduced as a stateless protocol, the official authentication mechanism was basic access authentication. With this method, every request must include a header containing the Base64-encoded username and password. Apparently, basic authentication is only appropriate over a secured channel.<br />
<br />
Still, many people feel that basic authentication is too insecure. Surprisingly, basic authentication asks the client to encode the password rather than hash it, and the encoding is trivially reversible. Even worse, the encoded password is included in every request. Therefore, if hackers manage to intercept a single request, the user's password is revealed.<br />
<br />
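To see why interception is fatal, note that the Authorization header can be reversed by anyone, without any key. A sketch in Node.js:

```javascript
// Basic access authentication: the credentials are Base64-encoded, not
// encrypted, so decoding them requires no secret at all.
function basicAuthHeader(username, password) {
  const encoded = Buffer.from(`${username}:${password}`).toString('base64');
  return `Basic ${encoded}`;
}

// What an eavesdropper can do with a captured header:
function decodeBasicAuth(header) {
  return Buffer.from(header.replace('Basic ', ''), 'base64').toString('utf8');
}
```

One intercepted request is enough to recover the plain `username:password` pair.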
To strengthen the authentication mechanism, digest access authentication was introduced. With this method, the client sends a hash computed from the username, password, realm and server nonce. While the realm is fixed, the nonce is a randomly generated value issued by the server. A later revision of digest authentication allows the client to add its own nonce as well. Kindly notice that digest authentication sends the plain values of the username, nonces and realm in the request. Here is one example:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">Authorization: <i>Digest username="some_user", realm="Some Realm", nonce="MTQyNjE1MDE5Njc0MjoxZjdkYjIzZjI0YjZjNDExMzU2OTk3MWIyNWQzYmYwNg==", uri="/some_website/index.html", response="fcc3a3cd69c93c76e65c845263c3d4f4", qop=auth, nc=00000001, cnonce="c61b667053c03c31"</i></span><br />
<i><br /></i>
In the above request, the response value is a hash computed from the username, password, realm, server nonce and client nonce combined. The server calculates its own response and authenticates the user if the two responses match.<br />
<br />
Digest authentication makes it almost impossible to figure out the password by intercepting requests, and it also helps prevent replay attacks. For that to work, the server nonce should be changed regularly, so that re-sending a request after the server nonce expires will fail.<br />
<br />
Most readers may notice that these two methods are not that popular anymore. HTTP-based authentication mechanisms are handled by the browser and do not give the user a beautiful login form. Worse, they provide no easy way to log out unless the user closes the browser or the tab. Therefore, session-based login is the preferred authentication mechanism nowadays. Basic and digest authentication are still used, but mostly for B2B communication where maintaining a session is not necessary.<br />
<br />
Session-based login requires HTTP to be made stateful. For this to happen, a session cookie is embedded in every request (normally named JSESSIONID). The server creates a session storage for each session. When a request arrives, the server tries to locate the user profile in the session. If the user profile is not available, a login form is shown to the user. Upon successful authentication, the server stores the user profile in the session storage.<br />
<br />
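The flow above can be reduced to a toy in-memory store (all names are illustrative): the cookie carries only an opaque id, while the profile lives in server memory.

```javascript
// Toy server-side session storage: a JSESSIONID-style lookup table.
const sessions = new Map();

function handleRequest(sessionId) {
  const profile = sessions.get(sessionId);
  if (!profile) return { action: 'show-login-form' }; // no profile for this session yet
  return { action: 'serve-page', user: profile };
}

function onSuccessfulLogin(sessionId, profile) {
  sessions.set(sessionId, profile); // profile stored on the server, not in the cookie
}
```

This is exactly why clustering needs sticky sessions: the Map exists only in one server's memory.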
This method works well, but it requires a sticky load balancer if the web application is deployed in a clustered environment. Otherwise, a server will not recognize a session that was generated by another server.<br />
<br />
For cloud applications, it is better to build a client-side session with a cookie that contains the user information. This session cookie is secured with a signature. To avoid reading the database frequently, we can use a common key to sign all the sessions instead of using each user's password.<br />
<br />
<b><span style="font-size: large;">Comparison</span></b><br />
<br />
Now that we know each mechanism, let us compare them:<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; width: 650px;">
<colgroup><col style="mso-width-alt: 5339; mso-width-source: userset; width: 110pt;" width="146"></col>
<col style="mso-width-alt: 5229; mso-width-source: userset; width: 107pt;" width="143"></col>
<col style="mso-width-alt: 6363; mso-width-source: userset; width: 131pt;" width="174"></col>
<col style="mso-width-alt: 5339; mso-width-source: userset; width: 110pt;" width="146"></col>
<col style="mso-width-alt: 6180; mso-width-source: userset; width: 127pt;" width="169"></col>
</colgroup><tbody>
<tr height="21" style="height: 15.75pt;">
<td class="xl68" height="21" style="height: 15.75pt; width: 110pt;" width="146">Features</td>
<td class="xl68" style="border-left: none; width: 107pt;" width="143">Basic
Authentication</td>
<td class="xl68" style="border-left: none; width: 131pt;" width="174">Digest
Authentication</td>
<td class="xl68" style="border-left: none; width: 110pt;" width="146">Server-side
Session</td>
<td class="xl67" style="width: 127pt;" width="169">Client-Side Session</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Support Session</td>
<td class="xl69" style="border-left: none;">No</td>
<td class="xl69" style="border-left: none;">No</td>
<td class="xl69" style="border-left: none;">Yes</td>
<td class="xl65">Yes</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Logout</td>
<td class="xl69" style="border-left: none;">Close Browser</td>
<td class="xl69" style="border-left: none;">Close Browser</td>
<td class="xl69" style="border-left: none;">Kill session</td>
<td class="xl65">Kill session</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Support Cluster</td>
<td class="xl69" style="border-left: none;">Yes</td>
<td class="xl69" style="border-left: none;">Yes</td>
<td class="xl69" style="border-left: none;">Require stickiness LB</td>
<td class="xl65">Yes</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Prevent Replay Attack</td>
<td class="xl69" style="border-left: none;">No</td>
<td class="xl69" style="border-left: none;">Yes</td>
<td class="xl69" style="border-left: none;">Optional (session timeout)</td>
<td class="xl65">Optional (include timestamp in session signature)</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Client Computation</td>
<td class="xl69" style="border-left: none;">Cheap</td>
<td class="xl69" style="border-left: none;">Cheap</td>
<td class="xl69" style="border-left: none;">Zero</td>
<td class="xl65">Zero</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Server Computation</td>
<td class="xl69" style="border-left: none;">Cheap</td>
<td class="xl69" style="border-left: none;">Cheap</td>
<td class="xl69" style="border-left: none;">Zero</td>
<td class="xl65">Cheap</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Client Storage</td>
<td class="xl69" style="border-left: none;">username & password</td>
<td class="xl69" style="border-left: none;">username & password</td>
<td class="xl69" style="border-left: none;">session cookie</td>
<td class="xl65">session cookie</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Server Storage</td>
<td class="xl69" style="border-left: none;">zero</td>
<td class="xl69" style="border-left: none;">zero</td>
<td class="xl69" style="border-left: none;">session storage</td>
<td class="xl65">zero</td>
</tr>
<tr height="20" style="height: 15.0pt;">
<td class="xl69" height="20" style="height: 15.0pt;">Handshake</td>
<td class="xl69" style="border-left: none;">Not required</td>
<td class="xl69" style="border-left: none;">at least 2 requests</td>
<td class="xl69" style="border-left: none;">at least 2 requests</td>
<td class="xl65">at least 2 requests</td>
</tr>
<tr height="21" style="height: 15.75pt;">
<td class="xl70" height="21" style="height: 15.75pt;">Network overhead</td>
<td class="xl70" style="border-left: none;">short header</td>
<td class="xl70" style="border-left: none;">long header</td>
<td class="xl70" style="border-left: none;">short header</td>
<td class="xl66">long header</td>
</tr>
</tbody></table>
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
<br />
You can study the above comparison and make the choice yourself. For me, Basic Authentication and Digest Authentication seem to be the better choices for B2B authentication. However, they do not fit customer-facing applications well.<br />
<br />
Choosing between Basic and Digest Authentication depends merely on how secure your channel is. Most of the time, it is worth safeguarding your application with Digest Authentication, as the network, handshake and computation overhead is not very high.<br />
<br />
For a customer-facing UI, I prefer client-side sessions over server-side sessions because they scale better. You can always throw in new servers to share the workload. With server-side sessions, an additional server can only serve new sessions, not existing ones, unless you can replicate the session storage.<br />
<br />
The problem with client-side sessions is the exposure of user information in the session cookie. Because of that, you can only put primitive, non-confidential values in the session. If you can live with that limit, then a client-side session is surely the better choice.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com10tag:blogger.com,1999:blog-2481600383537713152.post-52974182673014887072015-02-24T16:38:00.000-08:002015-02-24T16:40:40.338-08:00AngularJs as the alternative choice for building web interfaceRecently, we were invited to conduct a joint talk with a UX designer, <a href="https://twitter.com/andrewabogado">Andrew</a> from Viki, on <b>Improving communications between Design and Development</b> for <a href="http://singasug.com/">Singasug</a>. The main focus of the talk was to introduce AngularJS and elaborate on how it helps improve collaboration between developers and designers. To illustrate the topic, we worked together on revamping the Pet Clinic Spring sample application.<br />
<br />
Over one month of working remotely, mostly in our spare time, we managed to refresh the Pet Clinic application with a new interface built on AngularJS and Bootstrap. Based on that experience, we delivered the talk to share what had been done with the Spring community.<br />
<br />
In this article, we want to re-share the story with a different focus: using AngularJS as an alternative choice for building web applications.<br />
<br />
<b><span style="font-size: large;">Project kick starting</span></b><br />
<br />
The project was initiated last year by <a href="http://www.meetup.com/singasug/members/83546622/">Michael Isvy</a> and <a href="http://www.meetup.com/singasug/members/140001702/">Sergiu Bodiu</a>, the two organizers of the <a href="http://singasug.com/">Singapore Spring User Group</a>. Michael asked for our help to deliver a talk about AngularJS for the web night. He went further by introducing us to a UX designer named Andrew and asking whether we could collaborate on revamping the Pet Clinic application. Interested in the idea, we agreed to work on the project and deliver the talk together.<br />
<br />
We set the initial scope of the project to one month and aimed to show as much functionality as possible rather than fully complete the application. Also, because of Andrew's involvement, we took the opportunity to revamp the information architecture of the website and give it a new layout.<br />
<br />
<b><span style="font-size: large;">Project Delivery</span></b><br />
<br />
The biggest problem we had to solve was geographic distance. All of us were working in our spare time and could afford only limited face-to-face communication. Due to personal schedules, it was difficult to set up a common working place or time, so for discussing and planning the project we only scheduled a weekly meeting during the weekend.<br />
<br />
To replace direct communication, we contacted each other through WhatsApp. We hosted the project on <a href="https://github.com/singularity-sg/spring-petclinic">Github</a> and turned on the notification feature so that every member was informed of each check-in. Even though there was a bit of concern at the beginning, each member played their part and the project progressed well.<br />
<br />
To prepare in advance, we converted the <i>SpringMVC</i> controllers with <i>Jsp</i> views into <i>Restful</i> controllers. After that, Andrew showed us the wireframe flow, which laid the common understanding of how the application should behave. Based on that, we built the respective controllers and integrated them with the HTML templates Andrew provided.<br />
<br />
Halfway through the project, Andrew picked up some understanding of Angular directives and was able to work directly on the project source code. That helped boost our development speed.<br />
<br />
At about the same point in the timeline, Michael came back and asked if it was possible to make the website functional without the Restful API. He thought that would let designers develop HTML templates without deploying them to an application server. We bought into this idea and added the feature: by turning on a flag, the AngularJS application no longer makes any HTTP calls to the server. Instead, a mock HTTP service returns pre-defined static <i>Json</i> files.<br />
<br />
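The mechanism can be sketched as a flag-switched service (the names and data here are illustrative, not the actual project code):

```javascript
// When useMocks is true, no network call is made: requests resolve against
// a table of pre-defined static JSON responses.
const mockResponses = {
  '/api/owners': [{ id: 1, firstName: 'George', lastName: 'Franklin' }]
};

function createHttpGet(useMocks, realGet) {
  return function get(url) {
    if (useMocks) return mockResponses[url] || null; // never touches the network
    return realGet(url);
  };
}
```

Because AngularJS injects services by declaration, swapping the real HTTP-backed implementation for a mock like this is a single configuration change.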
<b><span style="font-size: large;">What we have achieved</span></b><br />
<br />
After one month, we had a new website with a beautiful theme that looks like a commercial website rather than a sample application. The new Pet Clinic is now a single-page application backed by a Restful API.<br />
<br />
We also figured out an effective way of working remotely, even though we did not know each other before the project started.<br />
<br />
Along the way, we found many benefits of building web interfaces with AngularJS, and that is what we want to share with you today.<br />
<br />
<b><span style="font-size: large;">Understanding AngularJS</span></b><br />
<br />
If you are new to AngularJS, think of it as an MVC framework for front-end applications. MVC is a well-known pattern for server-side applications, but it is applicable to front-end applications as well. With AngularJS, developers build a model based on the data received from the server. Angular then binds the model data to the HTML controls, displaying the output on screen.<br />
<br />
The good thing here is that AngularJS provides two-way binding instead of one-way binding. If the user updates a value in an HTML control, the model is automatically updated without any developer effort.<br />
<br />
Here is an illustration of plain one-way binding versus two-way binding:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5lmzm1aNxG9P-qd10wsQfMww8MxCa8c2QFMl0qTvAZ7YDjqnjfC02ygoLaP7xYRBSE06sW9gG1srZ5yBN7nexcTINTqjig8qxCVwv9r4YaABw-wODS7DBao-eexjaut0Y2FdagYvsWsoN/s1600/One_Way_Data_Binding.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5lmzm1aNxG9P-qd10wsQfMww8MxCa8c2QFMl0qTvAZ7YDjqnjfC02ygoLaP7xYRBSE06sW9gG1srZ5yBN7nexcTINTqjig8qxCVwv9r4YaABw-wODS7DBao-eexjaut0Y2FdagYvsWsoN/s1600/One_Way_Data_Binding.png" height="204" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixjStbu7hNGw48f-S6ixFetGMyLgvVbhG5cjb2cChyY4UxtESTvlDQPZjUPVeXCce_Hk23UFmFxPMnDjIHPq1PS0ptLIQZCx9vuPjX_SG-kC6P6Rj3DeV2Wyj3f0NTSCdcB6kCF8mnJJuN/s1600/Two_Way_Data_Binding.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixjStbu7hNGw48f-S6ixFetGMyLgvVbhG5cjb2cChyY4UxtESTvlDQPZjUPVeXCce_Hk23UFmFxPMnDjIHPq1PS0ptLIQZCx9vuPjX_SG-kC6P6Rj3DeV2Wyj3f0NTSCdcB6kCF8mnJJuN/s1600/Two_Way_Data_Binding.png" height="232" width="320" /></a></div>
<br />
<br />
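The two-way flow in the diagrams can be reduced to a toy sketch in plain JavaScript (AngularJS's real implementation uses scopes and a digest cycle; this only shows the idea, and every name is illustrative):

```javascript
// Toy two-way binding: view listeners fire when the model changes, and the
// "view" writes back into the model through the same binding object.
function bind(model, key) {
  const viewListeners = [];
  return {
    // model -> view: notify every rendered control of the new value
    set(value) { model[key] = value; viewListeners.forEach((fn) => fn(value)); },
    // view -> model: an input event handler calls set(), updating the model too
    onModelChange(fn) { viewListeners.push(fn); },
    get: () => model[key]
  };
}
```

Both directions go through one binding object, so model and view can never drift apart.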
Angular also provides a scope for each model, so that the two-way binding is only active within the boundary of a controller. To declare the bindings, AngularJS uses directives, which are embedded directly in the HTML controls. The Angularised HTML will look similar to this:<br />
<br />
<pre><code><div class="col-md-3" data-ng-repeat="owner in owners">
  <div class="thumbnail">
    <div class="caption">
      <h3 class="caption-heading">
        <a href="show.html" data-ng-bind="owner.firstName + ' ' + owner.lastName"></a>
      </h3>
      <p data-ng-bind="owner.city"></p>
      <p data-ng-bind="owner.address"></p>
    </div>
  </div>
</div>
</code></pre>
For the above binding to work, you need to prepare the data for the model "<i>owners</i>" in your controller<br />
<br />
<pre><code>var OwnerController = ['$scope','$state','Owner',function($scope,$state,Owner) {
    $scope.owners = Owner.query();
}];</code></pre>
<br />
If you are curious why the controller code looks so short, it is because we implemented an Owner service with JPA-like methods:<br />
<br />
<pre><code>var Owner = ['$resource','context', function($resource, context) {
    return $resource(context + '/api/owners/:id');
}];</code></pre>
<br />
Looking at the above template, if we take out all the Angular directives, which start with <i>data-ng</i> or <i>ng</i>, we get back the original HTML template. The only exception is the data-ng-repeat directive, which functions like a for loop and helps shorten the HTML code.<br />
<br />
After Angularising the template file and creating the controller, the last thing to do is declare them in the global <i>app.js</i> file:<br />
<br />
<pre><code>....
.state({
    name: "owners",
    url: "/owners",
    templateUrl: "components/owners/owners.html",
    controller: "OwnerController",
    data: { requireLogin: true }
})
....
app.controller('OwnerController', OwnerController);
...
app.factory('Owner', Owner);</code></pre>
<br />
So far, that is the effort required to Angularise one owners page template. The above example mostly shows data, but if we replace the <i><p/></i> tag with <i><input/></i>, users will be able to edit and view the owner at the same time.<br />
<br />
<b><span style="font-size: large;">Pros and cons of AngularJS</span></b><br />
<br />
We evaluated AngularJS before adopting it, and we evaluated it again after using it. To recap our experience, we will share some benefits and issues of AngularJS.<br />
<br />
<b>Facilitate adoption of Restful API</b><br />
<br />
Obviously, one needs to introduce a Restful API to work with AngularJS. Given the changes that have happened in this industry over the last decade, the Restful API is slowly but surely becoming standard practice for future applications.<br />
<br />
In the past, a typical Spring framework developer needed to know <i>JDBC, JPA, Spring and Jsp</i> in order to develop web applications. But these are no longer enough. First, big players like Twitter, Facebook and Google introduced APIs for third-party web applications to integrate with their services. Later came a booming trend of mobile applications that no longer render their UI from HTML.<br />
<br />
Because of that, there is an increasing demand for building applications that serve data instead of HTML. For example, any start-up that wants to play safe will start by building a Restful API before building a front-end application. That can save a lot of effort if there is a need to support another client besides the browser in the future.<br />
<br />
There is another contributing factor from the development point of view. Splitting the back-end and front-end applications makes parallel development easier: it is not necessary to wait until the back-end services are complete before building the web interface. It is also beneficial in terms of utilizing resources. In the past, we needed developers who knew both Java and Javascript to develop a Java web application, and likewise both .NET and Javascript for a .NET application. However, web applications nowadays use a lot more Javascript than before, which makes developers who are good at both languages harder to find. With a Restful API, it is possible to recruit front-end developers with a strong understanding of Javascript and CSS to build the web interface, while back-end developers focus on security, scalability and performance.<br />
<br />
Because we adopted the Restful API for its development benefits, it is important for us to showcase the ability to run an Angular application without a back-end service. Because AngularJS injects modules by declaration, we have a single point of integration to configure the http service for the whole application. This looks very favorable, especially for a Spring developer, because we have better control over the technology usage.<br />
<br />
<b>Faster development</b><br />
<br />
Jsp by nature is only for viewing. Changing data happens through other methods like form submission or Ajax requests. That is why most of us think the MVC pattern only includes one-way binding for displaying the view. That does not have to be the case if you remember desktop applications. For the front end, MVC is not that popular. Some developers still like to manually populate values for html controls after each Ajax request. We do too, but only sometimes. At other times, we prefer automation over control. That is why we love AngularJS.<br />
<br />
Like any other MVC framework, AngularJS provides a model behind the scenes for each html control. But it is a wow factor to have it work the reverse way, when the html control updates the model. Think about it practically: sometimes our html controls are inter-related. For example, updating the value of the country in the first drop down is supposed to display the corresponding states or provinces in the next drop down. AngularJS allows us to do this without coding.<br />
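To make the idea concrete, here is a tiny framework-free sketch of how such a model-driven update could work. The names <i>Scope</i>, <i>$watch</i> and <i>$digest</i> only mimic Angular's vocabulary; this is our own simplified illustration, not Angular's actual implementation.<br />

```javascript
// A minimal model of Angular-style binding: watchers observe the model
// and re-run listeners until the model settles.
function Scope() {
  this.watchers = [];
}
Scope.prototype.$watch = function (getter, listener) {
  this.watchers.push({ getter: getter, listener: listener, last: undefined });
};
Scope.prototype.$digest = function () {
  var dirty;
  do {
    dirty = false;
    for (var i = 0; i < this.watchers.length; i++) {
      var w = this.watchers[i];
      var value = w.getter(this);
      if (value !== w.last) {
        w.listener(value, w.last, this);
        w.last = value;
        dirty = true;
      }
    }
  } while (dirty);
};

// Inter-related controls: when the country model changes,
// the list of states for the second drop down is refreshed.
var statesByCountry = {
  US: ['California', 'Texas'],
  CA: ['Ontario', 'Quebec']
};
var scope = new Scope();
scope.country = 'US';
scope.$watch(function (s) { return s.country; }, function (country) {
  scope.states = statesByCountry[country] || [];
});
scope.$digest();       // initial binding: states follow country
scope.country = 'CA';  // simulate the user picking a new country
scope.$digest();       // scope.states is now the Canadian list
```

In the real framework, the second drop down would simply bind to <i>states</i> in the template and repaint itself; no manual DOM code is needed.<br />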
<br />
<b>Directives are cleaner compared to Jsp or jQuery</b><br />
<br />
I think the biggest benefit that AngularJS offers is the custom directive. We adopted AngularJS because of two-way binding, but custom directives are what made us committed.<br />
<br />
Directives captured our interest at first sight. They are a clean solution to two long-term problems that we face: the continuity problem and the tracking problem.<br />
<br />
Let's start with a short example of a for loop in a jsp file:<br />
<br />
<pre><code>
<% for(int i = 0; i < owners.size(); i+=1) { %>
...
<% } %></code></pre>
<br />
The code between the opening and closing of the for loop helps to render one element. Whenever we write this kind of code, we break the original html template into smaller chunks. Therefore, it is understandable that designers don't like to maintain jsp files. They are meant to be looked at by developers, not designers.<br />
<br />
To avoid inserting the opening and closing of the for loop in different places, we can use a jsp tag. Then we have another issue: the html code goes missing from the jsp file. There is no clean and easy way for us to present these jsp contents to designers. If there is a minor change to be made to the web interface, it normally comes as a UI patch request from the designers.<br />
<br />
Things did not get better when developers moved to jQuery-based UIs. We always have concerns regarding DOM manipulation and event handler registration. In a typical jQuery web application, viewing the source gives us very limited information on what the user would see. Instead of being the source of truth, the HTML becomes the material for Javascript developers to perform magic. That shuns designers away even more than Jsp.<br />
<br />
Even for developers, things do not go well if you are not the author of the code. It may take guesswork to find the code being executed when the user clicks a button, unless you know the convention. The Css selector is too flexible. It allows one developer to define behavior in a non-deterministic way and leaves the other developers searching for it.<br />
<br />
Directives help us get rid of both problems.<br />
<br />
The directive <i>data-ng-repeat="owner in owners"</i> is inserted directly into the html tag. That makes it easier to keep track of where the loop actually ends, both for us and for the designers. The html inside the div remains similar to the original template, which makes minor modifications possible. For tracking purposes, seeing the <i>ng-click</i> directive on an html tag tells us immediately which method to look for. The <i>ng-controller</i> directive or the routing configuration gives us a clue which controller the method should be defined in. That makes reaching the source much easier.<br />
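For comparison with the earlier jsp loop, an Angular template for the same owners list might look roughly like this (a markup sketch for illustration; the element and field names are ours, not taken verbatim from the project):<br />

```html
<div data-ng-repeat="owner in owners">
  <!-- The repeated chunk stays one intact piece of html -->
  <span>{{owner.firstName}} {{owner.lastName}}</span>
  <button ng-click="showDetails(owner.id)">Details</button>
</div>
```

The whole loop lives on a single attribute, so a designer can still read and restyle the markup without touching any server-side code.<br />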
<br />
Finally, the AngularJS team allows us to define our own directives. This is amazing because it lets us create new directives to solve unforeseen circumstances in the future and tap into the existing directives library created by the community.<br />
<br />
Let's get ourselves familiarized with custom directives through this example. In our Pet Clinic app, clicking some links on the banner will scroll the page down to the respective part of the website.<br />
<br />
To achieve this behavior in a jQuery app, the fastest solution is to make it a link and put<br />
<br />
<pre><code>href="javascript:scrollToTarget(destination)"</code></pre>
<br />
For an Angular app, it does not necessarily have to be a link, and the code looks slightly different<br />
<br />
<pre><code>ng-click="scrollToTarget(destination)"</code></pre>
<br />
If we decide to make a directive for this function, the code will look like<br />
<br />
<pre><code>scroll-to-target="destination"</code></pre>
<br />
For us, the last one is the best. The directive is a familiar concept for any designer, as they will have used <i>class</i>, <i>id</i>, <i>name </i>or <i>onclick</i>. A new directive with a meaningful name will not be that hard to pick up, even for a designer. For us, it is fun and cool. We no longer depend on the World Wide Web Consortium to release the feature that we want. If we need something, we can hunt for a directive; we can also modify it or create it ourselves. Slowly, we can build a community-backed directives library that is forever improving.<br />
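To show the mechanics, here is a deliberately simplified model of how a directive could be wired up. This is not the real AngularJS API; it is our own sketch of the idea that a registry maps attribute names to behavior, and a compile step links the behavior to any element carrying the attribute.<br />

```javascript
// A toy directive registry: name -> link function.
var directives = {};

function directive(name, linkFn) {
  directives[name] = linkFn;
}

// Hypothetical scroll-to-target directive: on click, "scroll" to the target.
// (We record the target on the element as a stand-in for window.scrollTo.)
directive('scroll-to-target', function (element, attrValue) {
  element.onclick = function () {
    element.scrolledTo = attrValue;
  };
});

// The "compile" step: for every attribute on an element,
// run the matching directive's link function if one is registered.
function compile(element) {
  Object.keys(element.attrs).forEach(function (name) {
    if (directives[name]) {
      directives[name](element, element.attrs[name]);
    }
  });
}

// A fake DOM element standing in for: <span scroll-to-target="pricing">
var banner = { attrs: { 'scroll-to-target': 'pricing' } };
compile(banner);
banner.onclick();  // simulate the user's click
```

In real AngularJS the registration would go through <i>module.directive(...)</i> and the link function would receive the scope and a jqLite-wrapped element, but the division of labor is the same: markup declares the behavior, the framework wires it up.<br />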
<br />
<b>Automation versus Control</b><br />
<br />
After the AngularJS presentation, I think not everyone will be ready to jump on board. AngularJS delivers its magic with a big trade-off: you need to relinquish control of your application to AngularJS.<br />
<br />
Instead of doing the work yourself, you provide information for AngularJS to do the work. Two-way binding and injection of services all happen behind the scenes. When we need to modify the way it works, it is not so convenient to do so.<br />
<br />
For us, relinquishing control to build a configurable application is tolerable. However, not being able to alter the behavior when we need to is much less tolerable. This is where the Spring framework shines. It does not simply give you the bean; it allows you to do whatever you want with it, like executing bean life-cycle functions, injecting a bean factory or defining scope.<br />
<br />
In contrast, I find it hard to step outside the Angular way when I want to. For example, I would like Angular to help us populate the content of a <i>textarea </i>and later run the CkEditor Javascript to convert this <i>textarea </i>into a html editor. To get this done, the application needs to load the CkEditor Javascript only after the binding is completed. It can be done by altering AngularJS behavior or by converting the <i>textarea </i>into a directive.<br />
<br />
But I am not fully satisfied with either: the former solution does not look clean and the latter seems tedious. It would be cooler if we had life-cycle support to inject additional behavior, as we have with the Spring framework.<br />
<br />
<b><span style="font-size: large;">FAQ about AngularJS</span></b><br />
<br />
Up to this point, I hope you already have some ideas and can decide for yourself whether AngularJS is the right choice. To contribute to your decision making, I will share our answers to the questions we received during the presentation.<br />
<br />
<b>Is it possible to protect confidential data when you build web interface with AngularJS?</b><br />
<br />
For me, this is not an AngularJS issue; this is about how you design your Restful API. If there is something you do not want users to see, don't ever expose it in your Restful API. For a Spring developer, this means avoiding the use of an entity as @<i>ResponseBody </i>if you are not ready to fully expose it. At the very least, put in an annotation like @<i>JsonIgnore </i>to hide the confidential fields.<br />
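The underlying principle is to map the entity to a response object that exposes only whitelisted fields. Here is a small language-neutral sketch of the idea (the field names are invented for illustration; in Spring the same effect comes from a DTO or the annotations above):<br />

```javascript
// Map the entity to a DTO that contains only the fields the API may expose.
function toOwnerDto(ownerEntity) {
  return {
    id: ownerEntity.id,
    name: ownerEntity.name,
    city: ownerEntity.city
    // passwordHash and creditCard are deliberately NOT copied
  };
}

var entity = {
  id: 7,
  name: 'George Franklin',
  city: 'Madison',
  passwordHash: 'x1f9...',
  creditCard: '4111-....'
};
var dto = toOwnerDto(entity);  // safe to serialize into the API response
```

Whatever the stack, the rule is the same: the confidential fields never reach the serializer, so no client-side trick can reveal them.<br />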
<br />
<b>How do you pass objects around between different pages?</b><br />
<br />
There are many ways to do so. Each page has its own controller, but the whole application shares a single <i>$rootScope </i>where you can place the object. However, we should not use <i>$rootScope </i>unless there is no other way. Our preferred solution is using the url.<br />
<br />
For example, our routing configuration specifies:<br />
<br />
<br />
<pre><code>.state({
    name: "owners",
    url: "/owners",
    templateUrl: "components/owners/owners.html",
    controller: "OwnerController",
    data: { requireLogin : true }
}).state({
    name: "ownerDetails",
    url: "/owners/:id",
    templateUrl: "components/owners/owner_details.html",
    controller: "OwnerDetailsController",
    data: { requireLogin : true }
})</code></pre>
<br />
<br />
When the user routes to the owner detail page, the controller loads the owner again from the <i>id </i>in the url. It may look redundant, but it allows browser bookmarking.<br />
<br />
<b>Is it really possible for designers to commit to the project?</b><br />
<br />
Yes, that is what we have tried, and it worked. One day before the presentation, our designer, Andrew, redesigned the landing page without our involvement. He kept all the directives we had put in, and there was no impact on functionality after the design changed.<br />
<br />
<b>Should I adopt AngularJs 1 or wait until AngularJs 2 is available?</b><br />
<br />
I think you should go ahead with AngularJS 1. For us, AngularJS 2 is so different that it can be treated as another technology stack, and support for AngularJS 1 will continue for at least a year after AngularJS 2 is released (slated for 2016). We feel that, thanks to community support, AngularJS 1 will continue to thrive, just like Play framework 1 after the release of Play framework 2.<br />
<br />
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
<br />
So, we have done our part to introduce a new way of building web interfaces for Spring framework users. We hope you are convinced that the Restful API is the way to go and AngularJs is worth trying. If you have any other ideas, please share your feedback.<br />
<br />
If you are keen to see the project or the original talk, here are the references:<br />
<br />
<a href="https://github.com/singularity-sg/spring-petclinic">https://github.com/singularity-sg/spring-petclinic</a><br />
<br />
<a href="http://petclinic.hopto.org/petclinic/slides/#/">http://petclinic.hopto.org/petclinic/slides/#/</a><br />
<br />
<a href="http://petclinic.hopto.org/petclinic/#/">http://petclinic.hopto.org/petclinic/#/</a><br />
<br />
Enjoy!<br />
<br />Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com17tag:blogger.com,1999:blog-2481600383537713152.post-86431733691319573162014-11-04T22:18:00.001-08:002014-12-02T18:34:51.356-08:00Exploration of ideasThere are many professions for an individual to choose from, and I believe one should follow the profession he likes most or hates least. The chances of success and quality of life are both much better that way. So, if you ask me why I chose software development as a career, I can assure you that programming is a fun career. Of course, it is not fun because of sitting in front of the monitor and typing endlessly on the keyboard. It is fun because we control what we write and we can let our innovation run wild.<br />
<br />
In this article, I want to share some ideas that I have tried; please see for yourself if any of them fit your work.<br />
<br />
<b><span style="font-size: large;">TOOLS</span></b><br />
<br />
<b>Using <a href="https://code.google.com/p/openextern/">OpenExtern </a>and <a href="http://marketplace.eclipse.org/content/anyedit-tools#.VEzCR_mUfTQ">AnyEdit</a> plugins for Eclipse</b><br />
<br />
Of all the plugins that I have tried, these two are my favorites. They are small, fast and stable.<br />
<br />
<a href="https://code.google.com/p/openextern/">OpenExtern</a> give you the shortcut to open file explorer or console for any eclipse resource. At the beginning day of using Eclipse, I often found myself opening project properties just to copy project folder to open console or file explorer. OpenExtern plugin make the tedious 5 to 10 seconds process become 1 second mouse click. This simple tool actually helps a lot because many of us running Maven or Git commands from console.<br />
<br />
The other plugin that I find useful is <a href="http://marketplace.eclipse.org/content/anyedit-tools#.VEzCR_mUfTQ">AnyEdit</a>. It adds a handful of converting, comparing and sorting tools to the Eclipse editor. It eliminates the need to use an external editor or comparison tool with Eclipse. I also like to turn on auto-formatting and removal of trailing spaces on save. This works pretty well if all of us have the same formatter configuration (line wrapping, indentation,...). Before we standardized our code formatter, we had a hard time comparing and merging source code after each check-in.<br />
<br />
Other than these two, in the past I often installed the Decompiler and Json Editor plugins. However, as most Maven artifacts nowadays are uploaded with source code, and Json content can be viewed easily using a Chrome Json plugin, I do not find these plugins useful anymore.<br />
<br />
<b>Using <a href="http://www.jarchitect.com/">JArchitect</a> to monitor code quality</b><br />
<br />
In the past, we monitored the code quality of projects by eye. That seems good enough when you have time to take care of all modules. Things get more complicated when the team is growing or multiple teams are working on the same project. Eventually, the code still needs to be reviewed, and we need to be alerted if things go wrong.<br />
<br />
Fortunately, I got an offer from JArchitect to try out their latest product. First and foremost, this is a standalone product rather than the traditional IDE integration. For me, that is a plus point, because you may not want to make your IDE too heavy. The second good thing is that JArchitect can understand Maven, which is a rare feature in the market. The third good thing is that JArchitect creates its own project file in its own workspace, which does no harm to the original Java project. Currently, the product is commercial, but you may want to take a look to see if the benefits justify the cost.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSAoVkdk1mTmxOlyIzVb7tJe73xIvVKuV-XsY8WusHnRyyh6iixl0cSu2TINHrb71B4CUB69LVCk2EfFsiP3GhJ9EBoviCtpqujnusXcE4UlC6kKMS_Y9quJChhYyyGL_JTPLQBXfQxIoS/s1600/JArchitect.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSAoVkdk1mTmxOlyIzVb7tJe73xIvVKuV-XsY8WusHnRyyh6iixl0cSu2TINHrb71B4CUB69LVCk2EfFsiP3GhJ9EBoviCtpqujnusXcE4UlC6kKMS_Y9quJChhYyyGL_JTPLQBXfQxIoS/s1600/JArchitect.png" height="369" width="640" /></a></div>
<br />
<br />
<b><span style="font-size: large;">SETTING UP PROJECTS</span></b><br />
<br />
As we all know, a Java web project has both unit tests and functional tests. For functional tests, if you use a framework like Spring MVC, it is possible to create tests for controllers without bringing up the server. Otherwise, it is quite normal that developers need to start up the server, run the functional tests, then shut down the server. This process is a bit tedious, given that the project may have been created by someone else with whom we have never communicated. Therefore, we try to set up projects in such a way that people can just download and run them without much hassle.<br />
<br />
<span style="font-size: large;">Server</span><br />
<br />
In the past, we set the server up manually for each local development box and for the integration server (Cruise Control or Hudson). Gradually, our common practice shifted toward checking the server into every project. The motivation behind this move is the effort saved in setting up a new project after checking it out. Because the server is embedded inside the project, there is no need to download or set up the server for each box. Moreover, this practice discourages servers shared among projects, which is less error prone.<br />
<br />
Other than the server, there are two other elements inside a project that are server dependent: properties files and the database. For each of these elements, we have slightly different practices, depending on the situation.<br />
<br />
<span style="font-size: large;">Properties file</span><br />
<br />
<b>Checking in properties template rather than properties file</b><br />
<br />
Each developer needs to clone the template file and edit it when necessary. For the continuous integration server, it is a bit trickier. Developers can manually create the file in the workspace, or simply check the build server's properties file into the project.<br />
<br />
The former practice is not used any more because it is too error prone. Any attempt to clean the workspace will delete the properties file, and we cannot trace back modifications of the properties in the file. Moreover, as we now set up Jenkins as a cluster rather than a single node as in the past, it is not applicable any more.<br />
<br />
For the second practice, rather than checking in <i>my-project.properties</i>, I would rather check in <i>my-project.properties-template</i> and <i>my-project.properties-jenkins</i>. The first file can be used as guidance to set up the local properties file, while the second can be renamed and used for Jenkins.<br />
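For illustration, a hypothetical <i>my-project.properties-template</i> might look like the fragment below. The keys and placeholder values are invented; the point is only that every machine-specific value is marked so a developer knows exactly what to fill in after cloning it:<br />

```properties
# my-project.properties-template
# Clone to my-project.properties and replace every CHANGE_ME.
db.url=jdbc:mysql://CHANGE_ME:3306/myproject
db.username=CHANGE_ME
db.password=CHANGE_ME
mail.smtp.host=CHANGE_ME
```

The Jenkins variant would carry the real build-server values and simply be renamed on the build node.<br />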
<br />
<b>Using host name to define external network connection</b><br />
<br />
This practice may work better when we need to set up similar network connections for various projects. Let's say we need to configure the database for around 10 similar projects pointing to the same database. In this case, we can hard-code the database host name in the properties file and manually set up the hosts file on each build node to point the pre-defined domain to the database.<br />
<br />
<b>For non essential properties, provide as much default value as possible</b><br />
<br />
There is nothing much to say about this practice. We only need to remind ourselves to be less lazy, so that other people may enjoy handling our projects.<br />
<br />
<b>Using LandLord Service</b><br />
<br />
This is a pretty new practice that we have only applied since last year. Under the regulations in our office, the Web Service team is the only team that manages and has access to the UAT and PRODUCTION servers. That is why we need to guide them, and they need to do the repetitive setup for at least 2 environments, both normally requiring clustering. It was quite tedious for them until the day we consolidated the properties of all environments and all projects into a single landlord service.<br />
<br />
From that time on, each project starts up and connects to the landlord service with an environment id and an application id. The landlord happily serves it all the information it needs.<br />
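The core of the landlord idea is just a keyed lookup over the consolidated configuration. Here is a minimal sketch (all names, ids and values are invented for illustration; the real service would sit behind an HTTP endpoint):<br />

```javascript
// The landlord's consolidated store: one entry per environment/application pair.
var landlord = {
  'uat/billing':  { 'db.url': 'jdbc:mysql://uat-db/billing',  'cluster.size': 2 },
  'prod/billing': { 'db.url': 'jdbc:mysql://prod-db/billing', 'cluster.size': 4 }
};

// What each project calls at start-up, passing its own ids.
function fetchProperties(environmentId, applicationId) {
  var key = environmentId + '/' + applicationId;
  if (!landlord[key]) {
    throw new Error('Unknown tenant: ' + key);
  }
  return landlord[key];
}

var props = fetchProperties('uat', 'billing');
```

Because the repetitive per-environment setup collapses into one table, adding a new environment becomes a data change in one place rather than a manual setup on every server.<br />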
<br />
<span style="font-size: large;">Database</span><br />
<br />
<b>Using DBUnit to setup database once and let each test case automatically roll back transaction</b><br />
<br />
This is the traditional practice, and it still works quite well nowadays. However, it still requires the developer to create an empty schema for DBUnit to connect to. For it to work, the project must have a transaction management framework that supports auto rollback in the test environment. It also requires that the database operations happen within the test environment. For example, if in a functional test we attempt to send an HTTP request to the server, the database operation happens in the server itself rather than in the test environment, and we cannot do anything to roll it back.<br />
<br />
<b>Running database in memory</b><br />
<br />
This is a built-in feature of the Play framework. Developers work with an in-memory database in development mode and an external database in production mode. This is doable as long as developers only work with JPA entities and have no knowledge of the underlying database system.<br />
<br />
<b>Database evolutions</b><br />
<br />
This is an idea borrowed from RoR. Rather than setting up the database from the beginning, we can simply check the current version and sequentially run the delta scripts so that the database ends up with the wanted schema. As above, it is expensive to do this yourself unless there is native support from a framework like RoR or Play.<br />
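The mechanism itself is simple enough to sketch: keep an ordered list of numbered delta scripts, compare against the schema's current version, and apply only the missing ones. This is our own simplified illustration in the spirit of Play/RoR evolutions, not their actual API; the SQL snippets are placeholders:<br />

```javascript
// Ordered delta scripts; in a real project these would live in files
// named 1.sql, 2.sql, 3.sql, ...
var evolutions = [
  { version: 1, up: 'CREATE TABLE owners (id INT PRIMARY KEY)' },
  { version: 2, up: 'ALTER TABLE owners ADD COLUMN city VARCHAR(80)' },
  { version: 3, up: 'CREATE TABLE pets (id INT PRIMARY KEY)' }
];

// Given the version recorded in the database, return only the deltas
// that still need to be applied, in order.
function pendingScripts(currentVersion) {
  return evolutions
    .filter(function (e) { return e.version > currentVersion; })
    .map(function (e) { return e.up; });
}

// A database already at version 1 only needs the deltas for 2 and 3.
var toRun = pendingScripts(1);
```

The framework then runs the returned scripts inside a transaction and bumps the recorded version, so every box converges on the same schema regardless of where it started.<br />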
<br />
<b><span style="font-size: large;">CODING</span></b><br />
<br />
I have been in the industry for 10 years, and I can tell you that software development is like fashion. There is no clear separation between good and bad practice. Whatever is classified as a bad practice today may come back another day as a new best practice. Let's summarize some of the heated debates we have had.<br />
<br />
<b>Access modifiers</b><b> for class attributes</b><br />
<br />
Most of us were taught that we should hide class attributes from external access. Instead, we are supposed to create a getter and setter for each attribute. I strictly followed this rule in my early days, even before I knew that most IDEs can auto-generate getters and setters.<br />
<br />
However, later I was introduced to another idea: setters are dangerous. They allow other developers to spoil your object and mess up the system. In this case, you should create an immutable object and not provide attribute setters.<br />
<br />
The challenge is how we should write our unit tests if we hide the setters for class attributes. Sometimes the attributes are injected into the object using an IoC framework like Spring. If there is no framework around, we may need to use reflection utilities to insert mock dependencies into the test object.<br />
<br />
I have seen many developers solve the problem this way, but I think it is over-engineering. If we compromise a bit, it is much more convenient to use the package modifier for the attribute. As a best practice, test cases will always be in the same package as the implementation, so we should have no issue injecting mock dependencies. The package is normally controlled and contributed to by the same individual or team; therefore the chance of spoiling the object is minimal. Finally, as the package modifier is the default modifier, it saves a few bytes of your code.<br />
<br />
<b>Monolithic application versus Micro Service architecture</b><br />
<br />
When I joined the industry, the word "Enterprise" meant lots of xml, lots of deployment steps, a huge application, many layers, and very domain-oriented code. As things evolved, we adopted ORM, learned to split business logic from presentation logic, and learned how to simplify and enrich our applications with AOP. It went well until I was told that the way to move forward is the Micro Service architecture, which makes a Java Enterprise application function similarly to a Php application. The biggest benefit we have with Java now may be the performance advantage of the Java Virtual Machine.<br />
<br />
When adopting the Micro Service architecture, it is obvious that our application will be database oriented. By removing the layers, we also minimize the benefit of AOP. The code base will surely shrink, but it will be harder to write unit tests.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com4tag:blogger.com,1999:blog-2481600383537713152.post-46583100775110293152014-09-16T04:04:00.001-07:002014-09-16T04:04:43.272-07:00Agile is a simple topic<a href="http://agilemanifesto.org/">Agile manifesto</a> is probably one of the best manifestos ever written in software development, if not the best. Simple and elegant. Good vs Bad, 1 2 3 4, done. It is so simple that I am constantly disappointed by the amount of stuff floating on the Internet about what is agile and what is not, how to do agile, Scrum, Kanban and who knows what will pop up next year claiming to be another king of agile.<br />
<br />
If I ever told you that we are the purest agile team and we don't have sprints, we don't have stand-up meetings, we don't have story boards, we don't have burn-down charts, we don't have planning poker cards, we don't have any of the buzzwords, most of the so-called IT consultants would hang me on the spot.<br />
<br />
Let's face it, being pure isn't about what you <b>have</b>, it is about what you <b>don't</b>! Pure gold has nothing but gold; that's why it is super valuable. We should build our teams on developers, code and business needs. These are the three pure ingredients of a team; take any one away and the team is no more.<br />
<br />
<blockquote class="tr_bq">
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. </blockquote>
<blockquote class="tr_bq" style="text-align: right;">
Antoine de Saint-Exupery</blockquote>
<div>
<br />
Exactly: the manifesto says "we value processes and tools less", and yet we have seen all kinds of weird superimposed processes and tools everywhere. "Look, we have standups, we have sprints, we have story boards, therefore we are agile". <b>NO, absolutely NOT</b>. You can walk like a duck, quack like a duck, but you are still not a duck.<br />
<br />
But why the hype anyway?<br />
<br />
Partly the consulting companies are to blame; they try to sell the buzzwords to the management so that they can make $$$ by simply asking the developers to do what they already know, writing code, but in a different way.<br />
<br />
The biggest enemies are the developers themselves, especially the team leaders and managers. Because they are too lazy to know the developers (the people), too lazy to learn the code (the working software) and too lazy to analyse the business needs. Because "at the end of the day I need to show my developers that I am doing a manager's work", "what is the shortcut?", "look, I just got this scrum from a random blog post, standups 5 mins, no problem. Poker cards, easy. Story boards, no big deal ... ". "Done, now we are scrum, now we are agile, if things fail, it is the developers' problem". Goodbye, there goes a team.<br />
<br />
So now you may ask me, "you said agile is simple, why does it look so hard now?"<br />
<blockquote class="tr_bq">
Any fool can make something complicated. It takes a genius to make it simple.<br /><div style="text-align: right;">
Woody Guthrie</div>
</blockquote>
People are born equal; a genius doesn't magically pop up, it takes real hard work to reach that level. Let's go back to the origin, the mighty manifesto.<br />
<br />
Get rid of all unnecessary processes and tools, and <b>go</b> talk to people. "What is Jimmy's strength? What can we do to make up for Sam's weakness? Are David and Carl a good pair?".<br />
<br />
Stop typing inside Word or Excel, <b>go</b> read the real code. "What can we do to enhance the clarity of the code, how do we improve the performance without too much sacrifice, what are the alternative ways to extend our software?"<br />
<br />
Stop coming up with imaginary use cases, <b>go</b> meet the customer. "What are your pain points, what are the 3 most important features that need to be enhanced and delivered? Based on our statistics, we believe that if we build feature X in such a way, the business can grow by Y%; do you think we should do this?"<br />
<br />
Stop wasting our lives on keeping a useless backlog, <b>go</b> see the 3 biggest opportunities and threats and work on them, rinse and repeat. In fact, that is exactly how evolution brought us humans to this stage: "eliminate the immediate threat to ensure short-term survival, and seek the opportunities for long-term growth". As we are all descendants of mother nature, we are incapable of outsmarting her, so learn from her.<br />
<br />
A real process/methodology grows from the team; it is not superimposed onto the team.<br />
<br />
A real process/methodology does not have a name, because it is unique to each team.<br />
<br />
Grow your own dream team!<br />
<br />
<br />
<br />
Thanks for wasting your time reading my rant</div>
Anonymoushttp://www.blogger.com/profile/17772064492702767078noreply@blogger.com5tag:blogger.com,1999:blog-2481600383537713152.post-6974240317100475372014-09-07T08:28:00.001-07:002014-09-07T19:49:34.387-07:00Stateless Session for multi-tenant application using Spring Security<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6S3XLiycnAklhGnysmI9HhVdQDEywClVIzrj7aZUPdV4c-I7aEuURkSZPnuYaAl2QcLDUblgDKG_-qIupMi8HONC940RPbFPmSAoecXT-9cJLOshqHlSjLSi4u6kaFLxOdiXRWZU08JvC/s1600/stateless_session.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6S3XLiycnAklhGnysmI9HhVdQDEywClVIzrj7aZUPdV4c-I7aEuURkSZPnuYaAl2QcLDUblgDKG_-qIupMi8HONC940RPbFPmSAoecXT-9cJLOshqHlSjLSi4u6kaFLxOdiXRWZU08JvC/s1600/stateless_session.png" height="316" width="320" /></a>Once upon a time, <a href="http://sgdev-blog.blogspot.sg/2014/02/stateful-and-stateless-application.html">I published one article explaining the principle to build Stateless Session</a>. Coincidentally, we are working on the same task again, but this time, for a multi-tenant application. This time, instead of building the authentication mechanism ourselves, we integrate our solution into Spring Security framework.<br />
<br />
This article will explain our approach and implementation.<br />
<br />
<br />
<br />
<br />
<br />
<b><span style="font-size: x-large;">Business Requirement</span></b><br />
<br />
We need to build an authentication mechanism for a SaaS application. Each customer accesses the application through a dedicated sub-domain. Because the application will be deployed on the cloud, it is pretty obvious that Stateless Session is the preferred choice, because it allows us to deploy additional instances without hassle.<br />
<br />
In the project glossary, each customer is one site, and each application is one app. For example, a site may be Microsoft or Google; an app may be Gmail, GooglePlus or Google Drive. The sub-domain that a user uses to access the application will include both app and site. For example, it may look like <i>microsoft.mail.somedomain.com</i> or <i>google.map.somedomain.com</i><br />
<br />
Once a user logs in to one app, he can access any other app as long as it belongs to the same site. The session will time out after a certain inactive period.<br />
<br />
<b><span style="font-size: x-large;">Background</span></b><br />
<br />
<b><span style="font-size: large;">Stateless Session</span></b><br />
<br />
A stateless application with timeout is nothing new. The Play framework has been stateless since its first release in 2007. We also switched to Stateless Session many years ago. The benefit is pretty clear: your Load Balancer does not need stickiness; hence, it is easier to configure. As the session is on the browser, we can simply bring in new servers to boost capacity immediately. However, the disadvantage is that your session is not so big and not so confidential anymore.<br />
<br />
Compared to a stateful application, where the session is stored on the server, a stateless application stores the session in an HTTP cookie, which cannot grow beyond 4KB. Moreover, as it is a cookie, it is recommended that developers store only text or digits in the session rather than complicated data structures. The session is stored in the browser and transferred to the server in every single request. Therefore, we should keep the session as small as possible and avoid placing any confidential data in it. In short, a stateless session forces developers to change the way the application uses the session: it should be a user identity rather than a convenient store.<br />
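To make this concrete, here is a small sketch of such a cookie-sized session: identity fields only, encoded to a compact, cookie-safe string. The class and field names are illustrative, not the project's actual code.

```java
import java.util.Base64;

// Hypothetical example: a session kept as "userId|appId|siteId|timestamp",
// well under the 4KB cookie limit and holding identity only, no business data.
public class CompactSession {
    public final int userId;
    public final String appId;
    public final int siteId;
    public final long timestamp;

    public CompactSession(int userId, String appId, int siteId, long timestamp) {
        this.userId = userId;
        this.appId = appId;
        this.siteId = siteId;
        this.timestamp = timestamp;
    }

    // Encode to a cookie-safe Base64 string
    public String encode() {
        String raw = userId + "|" + appId + "|" + siteId + "|" + timestamp;
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw.getBytes());
    }

    // Rebuild the session from the cookie value
    public static CompactSession decode(String value) {
        String raw = new String(Base64.getUrlDecoder().decode(value));
        String[] parts = raw.split("\\|");
        return new CompactSession(Integer.parseInt(parts[0]), parts[1],
                Integer.parseInt(parts[2]), Long.parseLong(parts[3]));
    }
}
```

In a real deployment the encoded value would additionally be signed, as discussed later, so that the server can detect tampering.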
<br />
<b><span style="font-size: large;">Security Framework</span></b><br />
<br />
The idea behind a Security Framework is pretty simple: it helps to identify the principal executing the code, checks whether the principal has permission to execute certain services, and throws an exception if not. In terms of implementation, the security framework integrates with your services in an AOP-style architecture. Every check is done by the framework before the method call. The mechanism for implementing the permission check may be a filter or a proxy.<br />
<br />
Normally, the security framework stores principal information in thread storage (ThreadLocal in Java). That is why it can give developers static-method access to the principal at any time. I think this is something developers should know well; otherwise, they may attempt a permission check, or try to get the principal, inside background jobs that run in separate threads. In this situation, the security framework will obviously not be able to find the principal.<br />
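The thread storage mechanism described above can be sketched in a few lines. This is a simplified stand-in for what frameworks like Spring Security do internally, not Spring's actual implementation:

```java
// Simplified illustration of static access to the current principal.
// Spring's SecurityContextHolder is built around this same ThreadLocal idea.
public class PrincipalHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    public static void set(String principal) { CURRENT.set(principal); }

    public static String get() { return CURRENT.get(); }

    public static void clear() { CURRENT.remove(); }
}
```

A background thread spawned from the request handler would see `get()` return null, which is exactly the pitfall described above.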
<br />
<b><span style="font-size: large;">Single Sign On</span></b><br />
<br />
Single Sign-On is mostly implemented using an Authentication Server. It is independent of the mechanism used to implement the session (stateless or stateful). Each application still maintains its own session. On the first access to an application, it contacts the authentication server to authenticate the user, then creates its own session.<br />
<br />
<b><span style="font-size: x-large;">Food for Thought</span></b><br />
<br />
<b><span style="font-size: large;">Framework or build from scratch</span></b><br />
<br />
As a stateless session was a given, the biggest concern for us was whether or not to use a security framework. If we use one, Spring Security is the cheapest and fastest solution because we already use the Spring Framework in our application. On the benefit side, any security framework provides a quick, declarative way to define access rules. However, the rules will not be business-logic aware. For example, we can define that only an Agent can access products, but we cannot define that an agent can only access the products that belong to him.<br />
<br />
In this situation, we had two choices: building our own business-logic permission check from scratch, or building two layers of permission check, one purely role based and one business-logic aware. After comparing the two approaches, we chose the latter because it is cheaper and faster to build. Our application functions like any other Spring Security application: a user accessing protected content without a session is redirected to the login page. If the session exists but the role is insufficient, the user gets status code 403. If the user accesses protected content with a valid role but on unauthorized records, he gets 401 instead.<br />
<br />
<b><span style="font-size: large;">Authentication</span></b><br />
<br />
The next concern is how to integrate our authentication and authorization mechanism with Spring Security. A standard Spring Security application may process a request as below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy5VvCOZWe8t7h8FgRDXgs793-wXaBnXlHmBbX1T43EgjpCg8V386xhGWjkVoJR20BmKmHxXWVL-uh3HsWaM2_fxouX6h4GEIsV5MH2dhY5Dum82sJiIgScYEuo5KPBDL_YhICs83zLK2f/s1600/standard_spring_security.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy5VvCOZWe8t7h8FgRDXgs793-wXaBnXlHmBbX1T43EgjpCg8V386xhGWjkVoJR20BmKmHxXWVL-uh3HsWaM2_fxouX6h4GEIsV5MH2dhY5Dum82sJiIgScYEuo5KPBDL_YhICs83zLK2f/s1600/standard_spring_security.png" height="283" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The diagram is simplified but still gives us a rough idea of how things work. If the request is a login or logout, the top two filters update the server-side session. After that, another filter checks access permission for the request. If the permission check succeeds, yet another filter stores the user session in thread storage. Finally, the controller executes code within a properly set up environment.<br />
<br />
For our case, we prefer to create our own authentication mechanism because the credentials need to contain the website domain. For example, we may have Joe from Xerox and Joe from WDS both accessing the SaaS application. As Spring Security takes control of preparing the authentication token and the authentication provider, we find it cheaper to implement login and logout ourselves at the controller level rather than spending effort on customizing Spring Security.<br />
<br />
As we implement a stateless session, there are two pieces of work we need to do here. First, we need to construct the session from the cookie before any authorization check. Second, we need to update the session time stamp so that the session is refreshed every time the browser sends a request to the server.<br />
<br />
Because of the earlier decision to do authentication in the controller, we face a challenge here. We should not refresh the session before the controller executes, because that is where authentication happens. However, some controller methods are attached to a View Resolver that writes to the output stream immediately, so we have no chance to refresh the cookie after the controller has executed. Finally, we chose a slightly compromised solution using a HandlerInterceptorAdapter, which allows us to do extra processing before and after each controller method. We refresh the cookie after the controller method if the method does authentication, and before the controller method for any other purpose. The new diagram looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTS6Qz5zX7TNPUYSouflahQIccd50fcqoXUH5oPk4n7_-JDhSTGeZ5iY8xaVzUMlUXLwk-1zuw9OuwDVRv2q6EomzxQ7r8aKDmJQR0j-yH86hpuO7AXsV9eu9spxR2e2UJaelNpxg63yOQ/s1600/stateless_spring_security.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTS6Qz5zX7TNPUYSouflahQIccd50fcqoXUH5oPk4n7_-JDhSTGeZ5iY8xaVzUMlUXLwk-1zuw9OuwDVRv2q6EomzxQ7r8aKDmJQR0j-yH86hpuO7AXsV9eu9spxR2e2UJaelNpxg63yOQ/s1600/stateless_spring_security.png" height="314" width="320" /></a></div>
<br />
<br />
<b><span style="font-size: large;">Cookie</span></b><br />
<br />
To be meaningful, the user should have only one session cookie. As the session time stamp changes after each request, we need to update the session cookie in every single response. By the HTTP protocol, this can only be done if the cookies match by name, path and domain.<br />
<br />
Given this business requirement, we preferred to try a new way of implementing SSO by sharing the session cookie. If all applications are under the same parent domain and understand the same session cookie, we effectively have a global session. Therefore, there is no need for an authentication server anymore. To achieve that vision, we must set the cookie domain to the parent domain of all applications.<br />
<br />
To illustrate this global session, let us come back to the earlier example, where we have two applications with the domain names <span style="background-color: white; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 18.4799995422363px;"><i>microsoft.mail.somedomain.com</i> and <i>google.map.somedomain.com</i></span><br />
<span style="background-color: white; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 18.4799995422363px;"><br /></span>
For the session cookie to be global, we set its domain to <i>somedomain.com</i>. Obviously, the session cookie can be seen and maintained by both applications as long as they share the same secret key for signing.<br />
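To illustrate, deriving the shared parent domain from a request host could look like the helper below. This is our own sketch; the real project may simply configure the domain as a property:

```java
public class CookieDomains {
    // Derive the shared parent domain from a sub-domain such as
    // "microsoft.mail.somedomain.com" -> "somedomain.com", so the session
    // cookie set on it is visible to every application under that domain.
    public static String parentDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host; // already a bare domain
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

Note that this naive two-label rule would not handle public suffixes like co.uk; a production version would need a suffix list or explicit configuration.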
<br />
<b><span style="font-size: large;">Performance</span></b><br />
<br />
Theoretically, a stateless session should be slower. Assuming the server implementation stores the session table in memory, passing in the JSESSIONID cookie only triggers a one-time read of the object from the session table and an optional one-time write to update the last access (for calculating session timeout). In contrast, for a stateless session, we need to calculate the hash to validate the session cookie, load the principal from the database, assign a new time stamp and hash again.<br />
<br />
However, with today's server performance, hashing does not add much delay to the response time. The bigger concern is querying data from the database, and for this, we can speed things up with a cache.<br />
<br />
In the best-case scenario, a stateless session can perform close to a stateful one if no DB call is made. Instead of loading from the session table, which is maintained by the container, the session is loaded from an internal cache, which is maintained by the application. In the worst-case scenario, requests are routed to many different servers and the principal object is stored in many instances. This adds the extra effort of loading the principal into the cache once per server. While the cost may be high, it occurs only once in a while.<br />
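Such an internal cache can be as simple as a map with per-entry expiry. The sketch below is our own illustration; the class name and TTL handling are assumptions, not the project's code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal per-server principal cache: the loader (a DB call in practice)
// runs only on a miss or after expiry, so repeated requests hitting the
// same server skip the database entirely.
public class PrincipalCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public PrincipalCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public V get(K key, Function<K, V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> e = cache.get(key);
        if (e == null || now - e.loadedAt > ttlMillis) {
            e = new Entry<>(loader.apply(key), now); // cache miss: load from DB
            cache.put(key, e);
        }
        return e.value;
    }
}
```

The expiry keeps a revoked or modified principal from living in the cache forever, at the cost of an occasional reload.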
<br />
If we apply sticky routing on the load balancer, we should achieve best-case performance. With this, we can see the stateless session cookie as a mechanism similar to JSESSIONID, but with the fallback ability to reconstruct the session object.<br />
<br />
<b><span style="font-size: x-large;">Implementation</span></b><br />
<br />
I have published a sample of this implementation to the https://github.com/tuanngda/sgdev-blog repository. Kindly check the stateless-session project. The project requires a MySQL database to work; hence, kindly set up a schema following build.properties, or modify the properties file to fit your schema.<br />
<br />
The project includes a Maven configuration to start a Tomcat server at port 8686. Therefore, you can simply type mvn cargo:run to start the server.<br />
<br />
Here is the project hierarchy:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4KsrFYU82BxacXUoSL9qrgRezhrejPLfGvNBLv3q4tdY8rFa8DphMXijLa7rlykMWFVhyphenhyphen_7Y_VnoBVhF8aFyFDapb1U2JlS8t6e33RQ_ZLLlFOlx8xCTrX6NM0ahN-Yz4WGWRgdawaoBp/s1600/project.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4KsrFYU82BxacXUoSL9qrgRezhrejPLfGvNBLv3q4tdY8rFa8DphMXijLa7rlykMWFVhyphenhyphen_7Y_VnoBVhF8aFyFDapb1U2JlS8t6e33RQ_ZLLlFOlx8xCTrX6NM0ahN-Yz4WGWRgdawaoBp/s1600/project.png" /></a></div>
<br />
I packed both the Tomcat 7 server and the database configuration so that the project works without any other installation except MySQL. The Tomcat configuration file TOMCAT_HOME/conf/context.xml contains the DataSource declaration and the project properties file.<br />
<br />
Now, let us look closer at the implementation.<br />
<br />
<b><span style="font-size: large;">Session</span></b><br />
<br />
We need two session objects: one represents the session cookie; the other represents the session object that we build internally in the Spring Security framework:<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">public class SessionCookieData {
 private int userId;
 private String appId;
 private int siteId;
 private Date timeStamp;

 public SessionCookieData(int userId, String appId, int siteId) {
  this.userId = userId;
  this.appId = appId;
  this.siteId = siteId;
  this.timeStamp = new Date();
 }
}
</pre>
<br />
and<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">public class UserSession {
private User user;
private Site site;
public SessionCookieData generateSessionCookieData(){
return new SessionCookieData(user.getId(), user.getAppId(), site.getId());
}
}
</pre>
<br />
With this combo, we have objects to store the session in both the cookie and memory. The next step is to implement a method that allows us to build the session object from the cookie data.<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">public interface UserSessionService {
public UserSession getUserSession(SessionCookieData sessionData);
}
</pre>
<br />
Now, one more service to generate the session cookie from the cookie data and to read the cookie data back.<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">public class SessionCookieService {
public Cookie generateSessionCookie(SessionCookieData cookieData, String domain);
public SessionCookieData getSessionCookieData(Cookie sessionCookie);
public Cookie generateSignCookie(Cookie sessionCookie);
}
</pre>
<br />
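As a sketch of how the signing part of SessionCookieService might work, the helper below signs and verifies the cookie value with a shared secret. The class name, method names and the HmacSHA256 choice are our assumptions, not the project's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative HMAC signing for the session cookie value. Any application
// holding the same secret key can verify a cookie produced by another one,
// which is what makes the shared-domain global session possible.
public class CookieSigner {
    private final byte[] secret;

    public CookieSigner(String secret) {
        this.secret = secret.getBytes(StandardCharsets.UTF_8);
    }

    public String sign(String cookieValue) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] sig = mac.doFinal(cookieValue.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
        } catch (Exception e) {
            throw new IllegalStateException("HMAC failure", e);
        }
    }

    public boolean verify(String cookieValue, String signature) {
        return sign(cookieValue).equals(signature);
    }
}
```

In production, a constant-time comparison (e.g. `MessageDigest.isEqual`) would be safer against timing attacks than a plain `equals`.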
Up to this point, we have the services that help us to do the conversions:<br />
<br />
Cookie --> <span style="line-height: 1.15;">SessionCookieData --> UserSession</span><br />
<span style="line-height: 1.15;"><br /></span>
<span style="line-height: 1.15;">and</span><br />
<span style="line-height: 1.15;"><br /></span>
<span style="line-height: 1.15;">Session --> SessionCookieData --> Cookie</span><br />
<br />
Now we should have enough material to integrate the stateless session with the Spring Security framework.<br />
<br />
<b><span style="font-size: large;">Integrate with Spring security</span></b><br />
<br />
First, we need to add a filter to construct the session from the cookie. Because this should happen before any permission check, it is better to use <i>AbstractPreAuthenticatedProcessingFilter</i><br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">@Component(value="cookieSessionFilter")
public class CookieSessionFilter extends AbstractPreAuthenticatedProcessingFilter {
...
@Override
protected Object getPreAuthenticatedPrincipal(HttpServletRequest request) {
SecurityContext securityContext = extractSecurityContext(request);
if (securityContext.getAuthentication()!=null
&& securityContext.getAuthentication().isAuthenticated()){
UserAuthentication userAuthentication = (UserAuthentication) securityContext.getAuthentication();
UserSession session = (UserSession) userAuthentication.getDetails();
SecurityContextHolder.setContext(securityContext);
return session;
}
return new UserSession();
}
...
}
</pre>
<br />
The filter above constructs the principal object from the session cookie. The filter also creates a <i>PreAuthenticatedAuthenticationToken </i>that will be used later for authentication. Obviously, Spring will not understand this principal. Therefore, we need to provide our own <i>AuthenticationProvider </i>that manages to authenticate the user based on this principal.<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">public class UserAuthenticationProvider implements AuthenticationProvider {
@Override
public Authentication authenticate(Authentication authentication) throws AuthenticationException {
PreAuthenticatedAuthenticationToken token = (PreAuthenticatedAuthenticationToken) authentication;
UserSession session = (UserSession)token.getPrincipal();
if (session != null && session.getUser() != null){
SecurityContext securityContext = SecurityContextHolder.getContext();
securityContext.setAuthentication(new UserAuthentication(session));
return new UserAuthentication(session);
}
throw new BadCredentialsException("Unknown user name or password");
}
}
</pre>
<br />
This is the Spring way: the user is authenticated if we manage to provide a valid Authentication object. Practically, we let the user log in by session cookie for every single request.<br />
<br />
However, there are times when we need to alter the user session, and we can do it as usual in a controller method. We simply overwrite the SecurityContext, which was set up earlier in the pre-authentication filter.<br />
<br />
<span style="line-height: 1.15;"></span><br />
<pre class="prettyprint" style="line-height: 1.15;">public ModelAndView login(String login, String password, String siteCode) throws IOException{
if(StringUtils.isEmpty(login) || StringUtils.isEmpty(password)){
throw new HttpServerErrorException(HttpStatus.BAD_REQUEST, "Missing login and password");
}
User user = authService.login(siteCode, login, password);
if(user!=null){
SecurityContext securityContext = SecurityContextHolder.getContext();
UserSession userSession = new UserSession();
userSession.setSite(user.getSite());
userSession.setUser(user);
securityContext.setAuthentication(new UserAuthentication(userSession));
}else{
throw new HttpServerErrorException(HttpStatus.UNAUTHORIZED, "Invalid login or password");
}
return new ModelAndView(new MappingJackson2JsonView());
}
</pre>
<br />
<span style="font-size: large;"><b>Refresh Session</b></span><br />
<br />
Up to now, you may have noticed that we have never mentioned writing the cookie. Provided that we have a valid <i>Authentication </i>object and our <i>SecurityContext </i>contains the <i>UserSession</i>, we still need to send this information back to the browser.<br />
<br />
Before the <i>HttpServletResponse </i>is generated, we must generate and attach the session cookie to it. This new session cookie, which has the same name, domain and path, will replace the older session cookie that the browser is keeping.<br />
<br />
As discussed above, refreshing the session is better done after the controller method, because we implement authentication at that layer. However, there is a challenge caused by the ViewResolver of Spring MVC: sometimes it writes to the OutputStream so soon that any later attempt to add a cookie to the response is useless.<br />
<br />
After consideration, we came up with a compromise solution that refreshes the session before the controller method for normal requests, and after the controller method for authentication requests. To know whether a request is for authentication, we place a newly defined annotation on the authentication methods.<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;"> @Override
public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
if (handler instanceof HandlerMethod){
HandlerMethod handlerMethod = (HandlerMethod) handler;
SessionUpdate sessionUpdateAnnotation = handlerMethod.getMethod().getAnnotation(SessionUpdate.class);
if (sessionUpdateAnnotation == null){
SecurityContext context = SecurityContextHolder.getContext();
if (context.getAuthentication() instanceof UserAuthentication){
UserAuthentication userAuthentication = (UserAuthentication)context.getAuthentication();
UserSession session = (UserSession) userAuthentication.getDetails();
persistSessionCookie(response, session);
}
}
}
return true;
}
@Override
public void postHandle(HttpServletRequest request, HttpServletResponse response, Object handler,
ModelAndView modelAndView) throws Exception {
if (handler instanceof HandlerMethod){
HandlerMethod handlerMethod = (HandlerMethod) handler;
SessionUpdate sessionUpdateAnnotation = handlerMethod.getMethod().getAnnotation(SessionUpdate.class);
if (sessionUpdateAnnotation != null){
SecurityContext context = SecurityContextHolder.getContext();
if (context.getAuthentication() instanceof UserAuthentication){
UserAuthentication userAuthentication = (UserAuthentication)context.getAuthentication();
UserSession session = (UserSession) userAuthentication.getDetails();
persistSessionCookie(response, session);
}
}
}
}
</pre>
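The SessionUpdate annotation referenced in the interceptor is not shown in the snippet above. Reconstructed from its usage (read via reflection on the handler method), it can be a simple runtime-retained marker annotation:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker for controller methods that perform authentication: for these, the
// session cookie is refreshed after the method runs instead of before it.
@Retention(RetentionPolicy.RUNTIME) // must survive to runtime for reflection
@Target(ElementType.METHOD)
public @interface SessionUpdate {
}
```

RUNTIME retention is the essential part; without it, `getAnnotation(SessionUpdate.class)` in the interceptor would always return null.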
<br />
<span style="font-size: x-large;"><b>Conclusion</b></span><br />
<br />
The solution works well for us, but we are not confident that it is the best practice possible. However, it is simple and did not cost us much effort to implement (around 3 days including testing).<br />
<br />
Kindly give feedback if you have any better idea for building a stateless session with Spring.<br />
<br />Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com7tag:blogger.com,1999:blog-2481600383537713152.post-87573055323776820192014-08-28T19:18:00.000-07:002014-08-28T19:18:29.428-07:00Distributed Crawling<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitMRr-huJ19-pGFdIcRNc_KVtBnuX10OGDbMwXGPD-D1-QGjI4iSinGiHUKfDNb0ofyOJgxtDDQ9hBDT7V5WavXp31y2qoFEGBrkUssFd-oIDnKXMLspFtMOZrOjpEbm5DBpMzJW9oGB5C/s1600/black_widow.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitMRr-huJ19-pGFdIcRNc_KVtBnuX10OGDbMwXGPD-D1-QGjI4iSinGiHUKfDNb0ofyOJgxtDDQ9hBDT7V5WavXp31y2qoFEGBrkUssFd-oIDnKXMLspFtMOZrOjpEbm5DBpMzJW9oGB5C/s1600/black_widow.jpeg" height="252" width="320" /></a>Around 3 months ago, I have posted <a href="http://sgdev-blog.blogspot.sg/2014/05/how-to-build-java-based-cloud.html">one article explaining our approach and consideration to build Cloud Application</a>. From this article, I will gradually share our practical design to solve this challenge.<br />
<br />
As mentioned before, our final goal is to build a <a href="http://en.wikipedia.org/wiki/Software_as_a_service">SaaS </a>big data analysis application, which will be deployed on <a href="http://aws.amazon.com/">AWS </a>servers. In order to fulfill this goal, we need to build distributed crawling, indexing and distributed training systems.<br />
<br />
The focus of this article is how to build the distributed crawling system. The fancy name for this system will be <b>Black Widow</b>.<br />
<br />
<b><span style="font-size: large;">Requirements</span></b><br />
<br />
As usual, let us start with the business requirements for the system. Our goal is to build a scalable crawling system that can be deployed on the cloud. The system should be able to function in an unreliable, high-latency network and to recover automatically from partial hardware or network failures.<br />
<br />
For the first release, the system can crawl from 3 kinds of sources: <a href="http://datasift.com/">Datasift</a>, the <a href="https://dev.twitter.com/docs/api/streaming">Twitter API</a> and RSS feeds. The data crawled back is called a <b>Comment</b>. The RSS crawlers are supposed to read public sources like websites or blogs, free of charge. DataSift and Twitter both provide proprietary APIs to access their streaming services. Datasift charges its users by comment count and by the complexity of CSDL (Curated Stream Definition Language, their own query language). Twitter, on the other hand, offers a free Twitter Sampler stream.<br />
<br />
In order to control cost, we need to implement a mechanism to limit the amount of comments crawled from commercial sources like Datasift. As Datasift provides Twitter comments, it is possible to have a single comment coming from different sources. At the moment, we do not try to eliminate this and accept it as data duplication. However, the problem can be avoided manually by user configuration (avoid choosing both Twitter and Datasift Twitter together).<br />
<br />
For future extension, the system should be able to link up related comments to form a conversation.<br />
<br />
<span style="font-size: large;"><b>Food for Thought</b></span><br />
<br />
<b>Centralized Architecture</b><br />
<br />
Our first thought when receiving the requirements was to do the crawling on nodes, which we call <b>Spawns</b>, and let the hub, which we call <b>Black Widow</b>, manage the collaboration among nodes. This idea was quickly accepted by team members, as it allows the system to scale well while the hub does limited work.<br />
<br />
As in any other centralized system, Black Widow suffers from the <a href="http://en.wikipedia.org/wiki/Single_point_of_failure">single point of failure</a> problem. To ease this problem, we allow a node to function independently for a short period after losing its connection to Black Widow. This gives the support team breathing room to bring up a backup server.<br />
<br />
Another bottleneck in the system is data storage. For the volume of data being crawled (easily reaching a few thousand records per second), <i>NoSQL</i> is clearly the choice for storing the crawled comments. We have experience working with <i>Lucene</i> and <i>MongoDB</i>. However, after research and some minor experiments, we chose <a href="http://cassandra.apache.org/"><i>Cassandra</i></a> as the <i>NoSQL</i> database.<br />
<br />
With those few thoughts, we visualized the distributed crawling system to be built following this prototype:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHE3Lj7cR6oUPMqudyF0J-p6EjE_A_Yma-euzhKZlLIdxhQ82mLBSuflsFRc4vJ3Q13h3lgEjs-utmfO7gQiARLZMHaxFV_fbcKq5bRzBZdwY4UKMRxZmbnybldCAuW3xewwkwHv3ad7wX/s1600/black_widow_system.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHE3Lj7cR6oUPMqudyF0J-p6EjE_A_Yma-euzhKZlLIdxhQ82mLBSuflsFRc4vJ3Q13h3lgEjs-utmfO7gQiARLZMHaxFV_fbcKq5bRzBZdwY4UKMRxZmbnybldCAuW3xewwkwHv3ad7wX/s1600/black_widow_system.png" height="587" width="640" /></a></div>
<br />
<br />
In the diagram above, Black Widow, the hub, is the only server that has access to the SQL database, where we store the crawling configuration. Therefore, all the Spawns, or crawling nodes, are fully stateless. A Spawn simply wakes up, registers itself to Black Widow and does the assigned jobs. After getting the comments, the Spawn stores them in the Cassandra cluster and also pushes them to some queues for further processing.<br />
<br />
<b>Brainstorming of possible issues</b><br />
<br />
To explain the design to non-technical people, we like to relate the business requirement to a similar problem in real life so that it is easier to understand. The similar problem we chose is the coordination of effort among volunteers.<br />
<br />
Imagine we need to do a lot of preparation work for the upcoming Olympics and decide to recruit volunteers all around the world to help. We do not know the volunteers, but the volunteers know our email address, so they can contact us to register. Only then do we know their emails and can send tasks to them through email. We would not want to send one task to two volunteers or leave some tasks unattended. We want to distribute the tasks evenly so that no volunteer suffers too much.<br />
<br />
Due to cost, we would not contact them by mobile phone. However, because email is less reliable, when sending out a task to a volunteer we request a confirmation. The task is considered assigned only when the volunteer has replied with a confirmation.<br />
<br />
In the example above, the volunteers represent Spawn nodes, while email communication represents the unreliable, high-latency network. Here are some problems that we need to solve:<br />
<br />
<b>1/ Node failure</b><br />
<br />
For this problem, the best way is to check regularly. If a volunteer stops responding to the regular progress-check emails, the task should be re-assigned to someone else.<br />
<br />
<b>2/ Optimization of tasks assigning</b><br />
<br />
Some tasks are related, so assigning related tasks to the same person helps to reduce total effort. This applies to our crawling as well: some crawling configurations have similar search terms, and grouping them together to share a streaming channel helps to reduce the final bill.<br />
<br />
Another concern is fairness, or the ability to distribute the work evenly among volunteers. The simplest strategy we can think of is Round Robin, with a minor tweak: remember earlier assignments, and if a task is similar to a task assigned before, skip the Round Robin selection and assign it directly to the same volunteer.<br />
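The tweaked Round Robin can be sketched as below. Nodes and the similarity key are simplified to strings; in the real system, the key would be derived from the crawler's search terms:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Round Robin with affinity: a task whose grouping key was seen before goes
// to the node that already handles that group; otherwise the next node in
// rotation takes it and the assignment is remembered.
public class AffinityRoundRobin {
    private final List<String> nodes;
    private final Map<String, String> assignments = new HashMap<>();
    private int next = 0;

    public AffinityRoundRobin(List<String> nodes) { this.nodes = nodes; }

    public String assign(String groupKey) {
        String node = assignments.get(groupKey);
        if (node == null) {                      // unseen group: plain Round Robin
            node = nodes.get(next);
            next = (next + 1) % nodes.size();
            assignments.put(groupKey, node);
        }
        return node;                             // seen group: same node again
    }
}
```

The rotation keeps the load roughly even, while the assignment map provides the "same volunteer for similar tasks" behaviour described above.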
<br />
<b>3/ The hub is not working</b><br />
<br />
If, for some reason, our email server is down and we cannot contact the volunteers anymore, it is better to let the volunteers stop working on their assigned tasks. The main concern here is cost over-run or wasted effort. However, stopping work immediately is too hasty, as a temporary infrastructure issue may cause the communication problem.<br />
<br />
Hence, we need to find a reasonable amount of time for a node to continue functioning after being detached from the hub.<br />
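This grace period needs nothing more than a timestamp of the last contact from the hub, checked on a timer. A minimal sketch (class and method names are ours):

```java
// A node tracks the last time the hub contacted it; once the silence exceeds
// the grace period, the node considers itself detached and stops its crawlers.
public class HubWatchdog {
    private final long gracePeriodMillis;
    private volatile long lastContactMillis;

    public HubWatchdog(long gracePeriodMillis, long nowMillis) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.lastContactMillis = nowMillis;
    }

    // Called whenever a job assignment or status poll arrives from the hub
    public void onHubContact(long nowMillis) {
        lastContactMillis = nowMillis;
    }

    public boolean isDetached(long nowMillis) {
        return nowMillis - lastContactMillis > gracePeriodMillis;
    }
}
```

Passing the clock in explicitly keeps the logic testable; the real node would call it with `System.currentTimeMillis()`.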
<br />
<b>4/ Cost control</b><br />
<br />
Due to business requirements, there are two kinds of cost control that we need to implement: first, the total number of comments crawled per crawler; second, the total number of comments crawled by all crawlers belonging to the same user.<br />
<br />
This is where we had a debate about the best approach to implement cost control. It is very straightforward to implement the limit for each crawler: we simply pass this limit to the Spawn node, and it automatically stops the crawler when the limit is reached.<br />
<br />
However, the limit per user is not so straightforward, and we have two possible approaches. The simpler choice is to send all the crawlers of one user to the same node. Then, similar to the earlier problem, the Spawn node knows the number of comments collected and stops all crawlers when the limit is reached. This approach is simple, but it limits the ability to distribute jobs evenly among nodes. The alternative approach is to let all the nodes retrieve and update a global counter. This approach creates huge internal network traffic and adds considerable delay to comment processing time.<br />
<br />
At this point, we temporarily chose the global counter approach. This can be revisited if performance becomes a huge concern.<br />
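In a single JVM, the global counter idea reduces to an atomic per-user count checked on every crawled comment. In the real system, the counter would live in a shared store rather than one process, but the logic is the same. The sketch below is our own illustration:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Per-user comment counter: every node increments the user's count for each
// crawled comment and stops the crawler once the user-wide limit is hit.
public class UserQuota {
    private final ConcurrentHashMap<Integer, AtomicLong> counts = new ConcurrentHashMap<>();
    private final long limitPerUser;

    public UserQuota(long limitPerUser) { this.limitPerUser = limitPerUser; }

    // Returns true if the comment is within quota, false once the limit is reached
    public boolean tryConsume(int userId) {
        AtomicLong count = counts.computeIfAbsent(userId, id -> new AtomicLong());
        return count.incrementAndGet() <= limitPerUser;
    }
}
```

The increment-then-compare keeps the check race-free across threads; a distributed version would do the same with an atomic counter in the shared store.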
<br />
<b>5/ Deploy on the cloud</b><br />
<br />
As with any other Cloud application, we cannot put too much trust in the network or infrastructure. Here is how we make our application conform to the checklist mentioned in the last article:<br />
<ul>
<li><i>Stateless</i>: Our Spawn nodes are stateless, but the hub is not. Therefore, in our design, the nodes do the actual work and the hub only coordinates efforts.</li>
<li><i>Idempotence</i>: We implement <i>hashCode </i>and <i>equals </i>methods for every crawler configuration and store the configurations in a Map or Set. Therefore, a crawler configuration can be sent multiple times without any side effect. Moreover, our node selection approach ensures that the job will be sent to the same node.</li>
<li><i>Data Access Object</i>: We apply the JsonIgnore filter on every model object to make sure no confidential data flies around in the network.</li>
<li><i>Play Safe</i>: We implement a health-check API for each node and for the hub itself. The first level of support gets notified immediately when anything goes wrong.</li>
</ul>
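The idempotence point relies on value equality of crawler configurations: resending the same configuration is a no-op because the Set already contains it. A minimal sketch of such a configuration class (field names are simplified):

```java
import java.util.Objects;

// Crawler configuration identified by its values, not its object identity:
// two configurations with the same source and search terms are equal, so a
// Set of configurations absorbs duplicate deliveries silently.
public class CrawlerConfig {
    private final String source;       // e.g. "twitter", "rss", "datasift"
    private final String searchTerms;

    public CrawlerConfig(String source, String searchTerms) {
        this.source = source;
        this.searchTerms = searchTerms;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CrawlerConfig)) return false;
        CrawlerConfig that = (CrawlerConfig) o;
        return source.equals(that.source) && searchTerms.equals(that.searchTerms);
    }

    @Override
    public int hashCode() {
        return Objects.hash(source, searchTerms);
    }
}
```

Adding the same configuration twice leaves the Set with a single entry, which is exactly the "no side effect on re-send" property the checklist asks for.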
<div>
<b>6/ Recovery</b></div>
<div>
<br /></div>
<div>
We try our best to make the system heal itself from partial failure. Here are some types of failure that we can recover from:</div>
<div>
<ul>
<li><i>Hub failure</i>: A node registers itself to the hub when it starts up. From then on, the communication is one-way: only the hub sends jobs to the node and polls for status updates. The node considers itself detached if it fails to get any contact from the hub for a pre-defined period. If a node is detached, it clears all its job configurations and starts registering itself to the hub again. If the incident was caused by a hub failure, a new hub will fetch the crawling configurations from the database and start distributing jobs again. All the existing jobs on a Spawn node are cleared when the node goes into detached mode.</li>
<li><i>Node failure</i>: When the hub fails to poll a node, it does a hard reset by removing all of that node's jobs and re-distributing them from the beginning to the working nodes. This re-distribution process helps to ensure optimized distribution.</li>
<li><i>Job failure</i>: Two kinds of failure can happen when the hub sends and polls jobs. If a job fails in the polling process but the Spawn node is still working well, Black Widow can re-assign the job to the same node again. The same can be done if the job sending failed.</li>
</ul>
</div>
<br />
<b><span style="font-size: large;">Implementation</span></b><br />
<br />
<b>Data Source and Subscriber</b><br />
<br />
Our initial thought was that each crawler could open its own channel to retrieve data, but this stopped making sense on closer inspection. For RSS, we can scan all URLs once and find the keywords that may belong to multiple crawlers. Twitter supports up to 200 search terms in a single query, so it is possible for us to open a single channel that serves multiple crawlers. For Datasift, it is quite rare, but due to human mistake or coincidence, it is possible to have crawlers with identical search terms.<br />
<br />
This situation led us to split the crawler into two entities: subscriber and data source. The subscriber is in charge of consuming the comments while the data source is in charge of crawling them. With this design, if there are two crawlers with similar keywords, a single data source is created to serve two subscribers, each processing the comments in its own way.<br />
<br />
A data source is created if and only if no similar data source exists. It starts working when the first subscriber subscribes to it and retires when the last subscriber unsubscribes. With Black Widow sending similar subscribers to the same node, we can minimize the number of data sources created and, indirectly, minimize the crawling cost.<br />
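The subscribe-and-retire life-cycle above can be sketched as a reference-counted registry. This is a minimal illustration in Java (the same language as the routing snippet later in the post); the class and method names are ours, not the actual Black Widow code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one data source is shared by all subscribers with
// the same search terms; it starts on the first subscribe and retires
// when the last subscriber leaves.
public class DataSourceRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Returns true when a new data source had to be started.
    public synchronized boolean subscribe(String searchTerms) {
        int count = refCounts.merge(searchTerms, 1, Integer::sum);
        return count == 1;
    }

    // Returns true when the data source retired (last subscriber left).
    public synchronized boolean unsubscribe(String searchTerms) {
        Integer count = refCounts.computeIfPresent(searchTerms, (k, v) -> v - 1);
        if (count != null && count == 0) {
            refCounts.remove(searchTerms);
            return true;
        }
        return false;
    }

    public synchronized int activeSources() {
        return refCounts.size();
    }
}
```

With this shape, the data source is opened exactly once for the first subscriber and retired exactly once for the last one, no matter how many subscribers share the same search terms.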
<br />
<b>Data Structure</b><br />
<br />
The biggest concern with the data structures is thread safety. In the Spawn node, we must store all running subscribers and data sources in memory. There are a few scenarios in which we need to modify or access this data:<br />
<br />
<ul>
<li>When a subscriber hits its limit, it automatically unsubscribes from the data source, which may lead to the deactivation of that data source.</li>
<li>When Black Widow sends a new subscriber to a Spawn node.</li>
<li>When Black Widow sends a request to unsubscribe an existing subscriber.</li>
<li>The health-check API exposes all running subscribers and data sources.</li>
<li>Black Widow regularly polls the status of each assigned subscriber.</li>
<li>The Spawn node regularly checks for and disables orphan subscribers (subscribers that are no longer polled by Black Widow).</li>
</ul>
<div>
Another concern is the idempotence of operations. Any of the operations above can go missing or be duplicated. To handle this problem, here is our approach:</div>
<div>
<ul>
<li>Implement the <i>hashCode</i> and <i>equals</i> methods for every subscriber and data source.</li>
<li>Choose a <i>Set</i> or <i>Map</i> to store the collection of subscribers and data sources. For records with an identical hash code, a <i>Map</i> replaces the record on a new insertion while a <i>Set</i> skips it. Therefore, if we use a <i>Set</i>, we need to ensure that new records can replace old ones.</li>
<li>Use <i>synchronized</i> in the data access code.</li>
<li>If a Spawn node receives a new subscriber similar to an existing one, it compares the two and prefers updating the existing subscriber over replacing it. This avoids unsubscribing and re-subscribing identical subscribers, which may interrupt data source streaming.</li>
</ul>
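The Map-versus-Set behaviour described above can be demonstrated with a small sketch. All names here are illustrative, not the real Black Widow classes; the point is that with <i>equals</i>/<i>hashCode</i> keyed on the subscriber uuid, re-sending the same configuration is idempotent, and a Map replaces the record while a Set would silently keep the stale one:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of idempotent job storage keyed on the subscriber uuid.
public class SubscriberStore {
    public static final class Subscriber {
        final String uuid;
        final int limit;
        public Subscriber(String uuid, int limit) {
            this.uuid = uuid;
            this.limit = limit;
        }
        @Override public boolean equals(Object o) {
            return o instanceof Subscriber && ((Subscriber) o).uuid.equals(uuid);
        }
        @Override public int hashCode() { return uuid.hashCode(); }
    }

    private final Map<String, Subscriber> byUuid = new HashMap<>();

    // Idempotent upsert: the same subscriber can be sent many times and
    // the latest configuration always wins.
    public synchronized void put(Subscriber s) { byUuid.put(s.uuid, s); }

    public synchronized int limitOf(String uuid) { return byUuid.get(uuid).limit; }

    // For contrast: Set.add skips a record that is "equal" to an existing
    // one, so an updated limit would never replace the stale entry.
    public static boolean setSkips(Subscriber a, Subscriber b) {
        Set<Subscriber> set = new HashSet<>();
        set.add(a);
        return !set.add(b);
    }
}
```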
<div>
<b>Routing</b></div>
</div>
<div>
<br /></div>
<div>
As mentioned before, we need a routing mechanism that serves two purposes:</div>
<div>
<ul>
<li>Distribute the jobs evenly among Spawn nodes.</li>
<li>Route similar jobs to the same nodes.</li>
</ul>
<div>
We solved this problem by generating a unique representation of each query, named <i>uuid</i>. After that, we can use a simple modulo function to find the node to route to:</div>
</div>
<div>
<br /></div>
<br />
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>int size = activeBwsNodes.size();</div>
<div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>int hashCode = uuid.hashCode();</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>int index = Math.floorMod(hashCode, size); // hashCode may be negative</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>assignedNode = activeBwsNodes.get(index);</div>
</div>
<div>
<br /></div>
<div>
With this implementation, subscribers with the same <i>uuid</i> will always be sent to the same node, and each node has an equal chance of being selected to serve a subscriber.</div>
<div>
<br /></div>
<div>
This whole scheme breaks down when the collection of active Spawn nodes changes. Therefore, Black Widow must take back all running jobs and reassign them from scratch whenever there is a node change. However, node changes should be quite rare in a production environment.</div>
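The routing logic can be wrapped into one helper. One caveat worth noting: in Java, <i>hashCode()</i> can be negative and the <i>%</i> operator then returns a negative index, so <i>Math.floorMod</i> is the safe form. A sketch with illustrative names:

```java
import java.util.List;

// Minimal sketch of uuid-based routing: the same uuid always maps to the
// same node, and nodes are picked with equal probability.
public class Router {
    public static <T> T route(String uuid, List<T> activeNodes) {
        if (activeNodes.isEmpty()) {
            throw new IllegalStateException("no active Spawn nodes");
        }
        // floorMod keeps the index non-negative even when hashCode() < 0
        int index = Math.floorMod(uuid.hashCode(), activeNodes.size());
        return activeNodes.get(index);
    }
}
```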
<div>
<br /></div>
<div>
<b>Handshake</b></div>
<div>
<br /></div>
<div>
Below is the sequence diagram of the collaboration between Black Widow and a node.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVKt2xYHT2y020E6IfFOrevQCcMc5spnYfz8ClMiUvr5ZAiNxzaszbEW_hOkoHQneNa_LpFLpN9AeojHvCVwIjk298ajXVSzhdWU8BFm1VT2kZ7rR6mEX4oXWLIyzN4DgWbNYRaJq0K12N/s1600/node_sequence_diagram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVKt2xYHT2y020E6IfFOrevQCcMc5spnYfz8ClMiUvr5ZAiNxzaszbEW_hOkoHQneNa_LpFLpN9AeojHvCVwIjk298ajXVSzhdWU8BFm1VT2kZ7rR6mEX4oXWLIyzN4DgWbNYRaJq0K12N/s1600/node_sequence_diagram.png" height="539" width="640" /></a></div>
<div>
<br /></div>
Black Widow does not know about Spawn nodes in advance. It waits for a Spawn node to register itself with Black Widow. From there, Black Widow has the responsibility of polling the node to maintain connectivity. If Black Widow fails to poll a node, it removes the node from its container. The orphan node eventually goes into detached mode because it is no longer being polled. In this mode, the Spawn node clears its existing jobs and tries to register itself again.<br />
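The detached-mode rule can be sketched as a simple last-contact check on the node side. The names and the time window below are illustrative, not the real configuration:

```java
// Sketch of the node-side detach check: if the hub has not contacted us
// within the allowed window, the node should drop its jobs and re-register.
public class DetachMonitor {
    private final long maxSilenceMillis;
    private long lastContactMillis;

    public DetachMonitor(long maxSilenceMillis, long nowMillis) {
        this.maxSilenceMillis = maxSilenceMillis;
        this.lastContactMillis = nowMillis;
    }

    // Called on every poll or job delivery from the hub.
    public void onHubContact(long nowMillis) {
        lastContactMillis = nowMillis;
    }

    // Called periodically by the node's own scheduler.
    public boolean isDetached(long nowMillis) {
        return nowMillis - lastContactMillis > maxSilenceMillis;
    }
}
```

When <i>isDetached</i> returns true, the node would clear its job configurations and start the registration handshake again, as described above.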
<br />
The next diagram is the subscriber life-cycle<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBJdSFT-CikvPuYHp5lz6_zImgpGLIYQ1h5ajWSnQ460MrDFXDMfO1s6cfYT3Xa3sao_xzet4bMJFqTfEwcb2Z1elJ0CRi3zNfrK0wNnKZylOow_xnSETTvkQAr_3aeMPsqyfsUDTO4FoK/s1600/job_sequence_diagram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBJdSFT-CikvPuYHp5lz6_zImgpGLIYQ1h5ajWSnQ460MrDFXDMfO1s6cfYT3Xa3sao_xzet4bMJFqTfEwcb2Z1elJ0CRi3zNfrK0wNnKZylOow_xnSETTvkQAr_3aeMPsqyfsUDTO4FoK/s1600/job_sequence_diagram.png" height="588" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
Similar to the above, Black Widow has the responsibility of polling the subscribers it sends to Spawn nodes. If a subscriber is no longer being polled by Black Widow, the Spawn node treats it as an orphan and removes it. This practice helps to eliminate the threat of a Spawn node running an obsolete subscriber.<br />
<br />
On Black Widow, when polling a subscriber fails, it tries to get a new node to assign the job to. If the subscriber's Spawn node is still available, it is likely that the same job will go to the same node again due to the routing mechanism we used.<br />
<br />
<b>Monitoring</b><br />
<br />
In a happy scenario, all the subscribers are running, Black Widow is polling and nothing else happens. However, this is not likely in real life. There will be changes in Black Widow and the Spawn nodes from time to time, triggered by various events.<br />
<br />
For Black Widow, there will be changes under following circumstances:<br />
<br />
<ul>
<li>A subscriber hits its limit</li>
<li>A new subscriber is found</li>
<li>An existing subscriber is disabled by the user</li>
<li>Polling of a subscriber fails</li>
<li>Polling of a Spawn node fails</li>
</ul>
<div>
To handle changes, the Black Widow monitoring tool offers two services: hard reload and soft reload. A hard reload happens on a node change while a soft reload happens on a subscriber change. The hard reload process takes back all running jobs and redistributes them from scratch over the available nodes. The soft reload process removes obsolete jobs, assigns new jobs and re-assigns failed jobs.</div>
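The soft reload can be illustrated as a diff between the currently running jobs and the desired jobs; a hard reload would instead take everything back and redistribute from scratch. This is a sketch with our own names, not the real monitoring code:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative "soft reload" plan: keep unchanged jobs, remove obsolete
// ones and assign the new (or failed) ones.
public class ReloadPlanner {
    public final Set<String> toRemove = new HashSet<>();
    public final Set<String> toAssign = new HashSet<>();

    public ReloadPlanner(Set<String> running, Set<String> desired) {
        toRemove.addAll(running);
        toRemove.removeAll(desired);   // obsolete jobs
        toAssign.addAll(desired);
        toAssign.removeAll(running);   // new or re-assigned jobs
    }
}
```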
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGKGhPPy68gedGydNs0NTeAFZnZIa0BMPLxpbF5-e-6_DKknCWEgwb49setdGTP-XAg0fJqbrPY2pGIuitoQ30B4cTKL6n0BbguhH42Jne6owqaCcUMXkPrUKScSnkggIxyJs3_zXQBIxj/s1600/black_widow_monitor.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGKGhPPy68gedGydNs0NTeAFZnZIa0BMPLxpbF5-e-6_DKknCWEgwb49setdGTP-XAg0fJqbrPY2pGIuitoQ30B4cTKL6n0BbguhH42Jne6owqaCcUMXkPrUKScSnkggIxyJs3_zXQBIxj/s1600/black_widow_monitor.png" height="640" width="409" /></a></div>
<div>
<br /></div>
<div>
Compared to Black Widow, the monitoring of a Spawn node is simpler. The two main concerns are maintaining connectivity to Black Widow and removing orphan subscribers.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT8-yqJK7opy2oAsGCrVMJlDctZmN2nmqPpbGMZsZncd5gJ2KpejZ7p6SHuX7epHfN2xzGjjWDDOy3zpI9qbV5TrMnYvTwzbrezqoPjZ_bhPPbYBZm5od-vLYqAePUuLzdjs9jhXAYZvmC/s1600/spawn_monitor.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT8-yqJK7opy2oAsGCrVMJlDctZmN2nmqPpbGMZsZncd5gJ2KpejZ7p6SHuX7epHfN2xzGjjWDDOy3zpI9qbV5TrMnYvTwzbrezqoPjZ_bhPPbYBZm5od-vLYqAePUuLzdjs9jhXAYZvmC/s1600/spawn_monitor.png" height="537" width="640" /></a></div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Deployment Strategy</span></b></div>
<div>
<br /></div>
<div>
The deployment strategy is straightforward. We need to bring up Black Widow and at least one Spawn node; the Spawn node must know the URL of Black Widow. From then on, the health-check API gives us the number of subscribers per node. We can integrate the health check with the AWS API to automatically bring up new Spawn nodes if the existing nodes are overloaded. The Spawn node image needs to have the Spawn application running as a service. Similarly, when the nodes are under-utilized, we can bring down redundant Spawn nodes.</div>
<div>
<br /></div>
<div>
Black Widow needs special treatment due to its importance. If Black Widow fails, we can restart the application. This causes all existing jobs on the Spawn nodes to become orphans and all the Spawn nodes to go into detached mode. Slowly, each node cleans itself up and tries to register again. Under the default configuration, the whole restarting process happens within 15 minutes.</div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Threats and possible improvement</span></b></div>
<div>
<br /></div>
<div>
When choosing a centralized architecture, we knew that Black Widow would be the biggest risk to the system. While a Spawn node failure only causes a minor interruption for the affected subscribers, a Black Widow failure eventually leads to a restart of the Spawn nodes, which takes much longer to recover from.</div>
<div>
<br /></div>
<div>
Moreover, even though the system can recover from partial failure, there is still an interruption of service during the recovery process. Therefore, if polling requests fail too often due to unstable infrastructure, operations will be greatly hampered.</div>
<div>
<br /></div>
<div>
Scalability is another concern for a centralized architecture. We have not established a concrete maximum number of Spawn nodes that Black Widow can handle. Theoretically, this should be very high because Black Widow only does minor processing; most of its effort is spent sending out HTTP requests. It is possible that the network is the main limiting factor for this architecture. Because of this, we let Black Widow poll the nodes rather than have the nodes poll Black Widow (others, like Hadoop, have the workers send heartbeats to the master instead). With this approach, Black Widow works at its own pace, not under pressure from the Spawn nodes.</div>
<div>
<br /></div>
<div>
One of the first questions we got was whether this is a MapReduce problem, and the answer is no. Each subscriber in our Distributed Crawling System processes its own comments and does not report results back to Black Widow. That is why we do not use a MapReduce product like <i>Hadoop</i>. Our monitor is business-logic aware rather than pure infrastructure monitoring, which is why we chose to build our own instead of using tools like <i>ZooKeeper</i> or <i>Akka</i>.</div>
<div>
<br /></div>
<div>
For future improvement, it would be better to walk away from the centralized architecture by having multiple hubs collaborating with each other. This should not be too difficult, given that the only time Black Widow accesses the database is to load subscribers. Therefore, we can slice the data and let each Black Widow load a portion of it.</div>
<div>
<br /></div>
<div>
Another point that leaves me unsatisfied is the checking of the global counter for the user limit. As the check happens for every comment crawled, it greatly increases internal network traffic and limits the scalability of the system. A better strategy would be to divide the quota based on processing speed: Black Widow can regulate and redistribute the quota for each subscriber (on different nodes).</div>
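The proposed improvement can be sketched as a proportional split of the quota across the nodes running a subscriber. All names and numbers below are illustrative; the sketch assumes the observed speeds are positive:

```java
// Sketch: instead of checking a global counter per comment, divide the
// user quota among nodes in proportion to their observed processing speed.
public class QuotaSplitter {
    // speeds[i] = comments per second observed on node i (assumed > 0)
    public static long[] split(long totalQuota, double[] speeds) {
        double sum = 0;
        for (double s : speeds) sum += s;
        long[] shares = new long[speeds.length];
        long assigned = 0;
        for (int i = 0; i < speeds.length; i++) {
            shares[i] = (long) (totalQuota * speeds[i] / sum);
            assigned += shares[i];
        }
        shares[0] += totalQuota - assigned; // hand the rounding remainder to one node
        return shares;
    }
}
```

Each node can then enforce its local share without any network round-trip, and Black Widow only needs to rebalance the shares occasionally.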
Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com2tag:blogger.com,1999:blog-2481600383537713152.post-1278492182710482262014-08-20T02:37:00.002-07:002014-08-20T02:37:19.156-07:00The Emergence of DevOps and the Fall of the Old Order<div class="page-header" style="background-color: white; border-bottom-color: rgb(238, 238, 238); border-bottom-style: solid; border-bottom-width: 1px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin: 20px 0px 30px; padding-bottom: 9px;">
<h2 style="color: inherit; font-family: inherit; font-size: 31.5px; line-height: 40px; margin: 10px 0px; text-rendering: optimizelegibility;">
<span style="font-size: 14px; line-height: 20px;">27 July 2014</span></h2>
</div>
<div class="row-fluid post-full" style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; width: 700px;">
<div class="span12" style="box-sizing: border-box; float: left; margin-left: 0px; min-height: 30px; width: 700px;">
<div class="content">
<div style="margin-bottom: 10px;">
Software Engineering has always been dependent on IT operations to take care of the deployment of software to a production environment. In the various roles that I have been in, IT operations has come under various monikers, from “Data Center” to “Web Services”. An organisation delivering software used to be able to separate these roles cleanly. Software Engineering and IT Operations were able to work in a somewhat isolated manner, with neither really needing the knowledge the other holds in its respective domain. Software Engineering would communicate with IT operations through “Deployment Requests”, usually raised after ensuring that adequate tests had been conducted on the software.</div>
<div style="margin-bottom: 10px;">
However, the traditional way of organising departments in a software delivery organisation is starting to seem obsolete. The reason is that software infrastructure has moved in the direction of being “agile”. The same buzzword that gripped the software development world has started to exert its effect on IT infrastructure. The evidence of this seismic shift is seen in the fastest growing (and most disruptive) companies today. Companies like Netflix, WhatsApp and many other tech companies have moved onto what we would call “cloud” infrastructure, a space dominated by Amazon Web Services.</div>
<div style="margin-bottom: 10px;">
There has been huge progress in the virtualization of hardware resources. This has in turn allowed companies like AWS and Rackspace to convert their server farms into discrete units of computing resources that can be diced, parcelled and redistributed as a service to their customers in an efficient manner. It is inevitable that all these configurable “hardware” resources will eventually become some form of “software” resource that can be maximally utilized by businesses. This has in turn bred a whole new genre of skillset required to manage, control and deploy this Infrastructure as a Service (IaaS). Some of the tools used with these services include provisioning tools like Chef or Puppet. Together with the software APIs provided by the IaaS vendors, infrastructure can be brought up or down as required.</div>
<div style="margin-bottom: 10px;">
The availability of large quantities of computing resources, without all the upfront capital expenditure on hardware, has led to an explosion in the number of startups trying to solve problems of all kinds imaginable and, coupled with the prevalence of powerful mobile devices, has led to a digital renaissance for many industries. However, this renaissance has also created demand for a different kind of software organisation. As someone who has been part of software engineering and development, I am witness to the rapid evolution of the profession.</div>
<div style="margin-bottom: 10px;">
The increasing scale of data and processing needs requires a complete paradigm shift from the old software delivery organisation to a new one that melds software engineering and IT operations together. This is where the role of “DevOps” comes into the picture. Recruiting DevOps engineers and restructuring IT operations around such roles enables businesses to be agile. Businesses whose survival depends on the availability of their software on the Internet will find it imperative to model their software delivery organisation around DevOps. Having the ability to capitalise on software automation to deploy infrastructure within minutes allows a business to scale up quickly. Being able to practise continuous delivery of software allows features to reach the market quickly and creates a feedback loop through which a business can improve itself.</div>
<div style="margin-bottom: 10px;">
We are witness to a new world order, and software delivery organisations that cannot successfully transition to this Brave New World will find themselves falling behind quickly, especially when a competitor is able to scale and deliver software faster, more reliably and with fewer personnel.</div>
</div>
</div>
</div>
Lim Hanhttp://www.blogger.com/profile/00238107374789601761noreply@blogger.com23tag:blogger.com,1999:blog-2481600383537713152.post-21096283309912636232014-08-03T05:03:00.003-07:002014-08-03T05:03:55.774-07:00Information is money<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHcMXr638klBGCg-Wm78ASq4meQc3NZ1UypmchZfeOPWokLh8cyQONf_FEbkB3LLk45_KH2IN9J-KGjJYwZX4L7MKowe8bU3ou6YK7qOVS9q8ZuASslUnmHnKxdvn1WXoQ3RQFDIrLhPiO/s1600/1401867703915.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHcMXr638klBGCg-Wm78ASq4meQc3NZ1UypmchZfeOPWokLh8cyQONf_FEbkB3LLk45_KH2IN9J-KGjJYwZX4L7MKowe8bU3ou6YK7qOVS9q8ZuASslUnmHnKxdvn1WXoQ3RQFDIrLhPiO/s1600/1401867703915.jpg" height="213" width="320" /></a>When people ask me what I am doing, my immediate response is IT. Even though the answer is not very specific, it is the easiest to understand and it still describes what we do. In fact, it does not matter which programming languages we use; our responsibility is to build information systems, which deliver information to end-users. Therefore, we should value information more than anyone else. However, in reality, I feel there is so much wasted information in modern information systems.<br />
<br />
In this article, I would like to discuss the opportunity to collect user behaviour and measure user happiness when building an information system. I also want to share my ideas on how to improve user experience based on the data collected.<br />
<br />
<b><span style="font-size: large;">How important is user behaviour information?</span></b><br />
<br />
Let me begin with a story from early in my career. We needed to implement an online betting system for a customer, which functioned similarly to a stock market. In this system, there are no traditional bookmakers like William Hill. Instead, each user can offer and accept bets from one another. Because it is a mass market with a big pool of users, the rates offered are quite accurate and the commission is pretty small. However, the betting system is not our focus today. What captured my attention most was not the technical aspect of the project, even though it was quite challenging. Instead, I was interested in the way the system silently but legally made a huge amount of profit based on the information it collected.<br />
<br />
The system captured the bet history of every user and, through that, identified the top winners and top losers of each month. Based on that information, the system automatically placed bets following the winners and against the losers. Can you imagine being the only person in the world who knows Warren Buffett's activities in real time? It would then be quite simple to replicate his performance, even without his knowledge. Needless to say, this hidden feature generated profit at the level of hundreds of thousands of dollars every single day.<br />
<br />
In the open market, information is everything, and we can see why the law punishes insider trading, or any other attempt to gain an information advantage, so strictly. However, there is no such law for online gambling activity yet, and this practice is still legal. That early experience gave me a deep impression of how important information is.<br />
<br />
Later, I took an interest in applying psychology when dealing with customers. To persuade a person or make a sale happen, one needs to observe and understand the client. Relating what I have learnt to the information systems I built before, I feel it is a shame to implement a system that serves only as an information provider or selling tool. We actually have the chance to do much better if we really want to.<br />
<br />
Website authors know the importance of user experience and try their best to collect user information using online surveys. However, personally, I feel this approach will never work. I have never answered any survey myself. Any time I see a popup, no matter how polite the wording or how beautiful the design, I just click the close button.<br />
<br />
We should not forget that no matter how important user feedback is, it is not in the user's interest to answer our survey. In fact, no salesperson approaches a client to ask them to do a customer experience survey unless there is an incentive to do it.<br />
<br />
Hence, the information still needs to be collected, but in a way that the user does not notice (remember how Google silently monitors anyone using their services?).<br />
<br />
<b><span style="font-size: large;">How should we use the information?</span></b><br />
<br />
We should not waste effort collecting information if we don't even know what to do with it. However, this is nothing new. Whenever I go to a professional selling site like Amazon, I find it quite cool that they manage to use every single piece of information they have to push sales. One time I went there searching for a helmet; the next time, I saw all the items for a rider like me. They remember every single item that users have viewed or bought and regularly offer new things based on the data they collect.<br />
<br />
Google and Facebook do similar things. They try to guess what you like or care about before delivering any ads to you. The million-dollar question is: can we do any better than this?<br />
<br />
I vote yes. That does not mean I do not appreciate the talent and professionalism of the product teams at Amazon, Google or Facebook. However, I feel there is still a gap between these products and an experienced salesperson. Imagine a real person sharing the desktop view with the customer, seeing every mouse click, movement and key entered. Given that this person could pause the user for a while, so that he can think, analyse and decide what the user sees next, what would he do?<br />
<br />
The information we collect from the user's screen apparently cannot compare to the information from face-to-face communication, but we have not used up even this information yet. Most systems automatically guess that any product the customer clicked on is something he likes. A person can do better than that. If a user opens a phone page for 3 seconds and immediately moves on to other phones, he may have clicked on the phone accidentally rather than intentionally. On the other hand, if he spends more time on a phone, keeps coming back to it and even opens the specification, we can be very sure this is what he is looking for.<br />
<br />
<b><span style="font-size: large;">How should we collect the information?</span></b><br />
<br />
As mentioned above, it will never work if we interrupt users to ask questions. The right mechanism for collecting information is observation. Among the available solutions in the market, I think what is missing is the ability to record the timestamps of events and to connect individual events to form a user journey. Without connecting the dots, there will be no line; without connecting events, there will be no user journey. Without the timestamps, it is very hard to measure user satisfaction and concern.<br />
<br />
Capturing user actions is not very challenging, provided we are the owner of the website. Google Analytics can help to capture user actions, but it is a bit hard to use in our case because of the limited information it carries (an HTTP GET request). We should understand that this is the only choice the Google Analytics team has, because any other kind of request would be blocked by cross-site scripting prevention.<br />
<br />
A better way to carry this information is through an HTTP POST request, which can carry the full event object, serialized in JSON format. This is perfectly eligible as the events are sent back to the same domain. To link the events together, it is best to assign a unique but temporary id to the user. We do not need to remember or identify the user; therefore, this information need not be stored as a persisted cookie in the browser. With a temporary id, two separate visits to the website by the same user are logged as two different journeys. While this is not optimal, it still offers some benefits over the normal kind of tracking.<br />
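To make the idea concrete, here is a sketch of how such POSTed events, each carrying the temporary id, an event name and a timestamp, could be linked into journeys on the server side, with the dwell time at each step derived from the gaps between timestamps. Class and field names are ours, purely illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of linking tracked events into journeys: group events by the
// temporary id, then measure time spent at each step.
public class JourneyBuilder {
    public static final class Event {
        final String tempId;   // temporary, non-persistent user id
        final String name;     // e.g. "view-product", "open-specs"
        final long timestamp;  // client-side epoch millis
        public Event(String tempId, String name, long timestamp) {
            this.tempId = tempId;
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    // One journey per temporary id, in arrival order.
    public static Map<String, List<Event>> groupBySession(List<Event> events) {
        Map<String, List<Event>> journeys = new LinkedHashMap<>();
        for (Event e : events) {
            journeys.computeIfAbsent(e.tempId, k -> new ArrayList<>()).add(e);
        }
        return journeys;
    }

    // Dwell time per step: the gap between an event and the next one.
    public static Map<String, Long> dwellTimes(List<Event> journey) {
        Map<String, Long> dwell = new LinkedHashMap<>();
        for (int i = 0; i + 1 < journey.size(); i++) {
            dwell.merge(journey.get(i).name,
                        journey.get(i + 1).timestamp - journey.get(i).timestamp,
                        Long::sum);
        }
        return dwell;
    }
}
```

A three-second dwell on one phone followed by a long stay on another is exactly the signal described above: the first click was probably accidental, the second product is the real interest.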
<br />
If you can persist the cookie in the browser, or if the user logs in, things get more interesting, as we can link the individual journeys into one.<br />
<br />
After this comes the biggest and most challenging part of the system, where you need to figure out a mechanism to optimize the customer experience based on his journey. Unfortunately, this part is too specific to each system, so our experience and methods may not be very useful for you. However, in general, we can measure user satisfaction and happiness based on the time users spend at each step. We can also figure out user interest by measuring the time spent on each product. From there, please build and optimize your own analysis tool. This is a very challenging but interesting task.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com2tag:blogger.com,1999:blog-2481600383537713152.post-51495750203909411342014-07-14T09:03:00.003-07:002014-07-22T09:55:22.502-07:00From framework to platformWhen I started my career as a Java developer close to 10 years ago, the industry was going through a revolutionary change. The Spring framework, which was released in 2003, was quickly gaining ground and became a serious challenger to the bulky J2EE platform. Having gone through that transition, I quickly found myself in favour of the Spring framework over the J2EE platform, even though the earlier versions of Spring were very tedious for declaring beans.<br />
<br />
What happened next was the revamping of the J2EE standard, which was later renamed Java EE. Still, what dominated this era was the use of opensource frameworks over the platform proposed by Sun. This practice gives developers full control over the technologies they use but inflates the deployment size. Slowly, as cloud applications became the norm, I observed the trend of moving infrastructure services from framework to platform again. However, this time, it is not motivated by cloud applications.<br />
<br />
<b><span style="font-size: large;">Framework vs Platform</span></b><br />
<br />
I had never heard of, or had to use, any framework in school. However, after joining the industry, it is tough to build scalable and configurable software without the help of frameworks.<br />
<br />
From my understanding, any application consists of code that implements business logic and other code that provides helpers, utilities or infrastructure setup. The code that is not related to business logic, being used repetitively across many projects, can be generalised and extracted for reuse. The output of this extraction process is a framework.<br />
<br />
To put it shortly, a framework is any code that is not related to business logic but helps to address common concerns in applications and is fit to be reused.<br />
<br />
Following this definition, MVC, Dependency Injection, Caching, JDBC Template and ORM are all considered frameworks.<br />
<br />
A platform is similar to a framework as it also helps to address common concerns in applications, but in contrast to a framework, the service is provided outside the application. Therefore, a common service endpoint can serve multiple applications at the same time. The services provided by a JEE application server or by Amazon Web Services are examples of platforms.<br />
<br />
Comparing the two approaches, a platform is more scalable and easier to use than a framework, but it also offers less control. Because of these advantages, a platform seems to be the better approach when we build a <a href="http://sgdev-blog.blogspot.sg/2014/05/how-to-build-java-based-cloud.html">Cloud Application</a>.<br />
<br />
<span style="font-size: large;"><b>When should we use platform over framework</b></span><br />
<br />
Moving toward platforms does not mean that developers will get rid of frameworks. Rather, platforms only complement frameworks in building applications. However, on some occasions we have a choice between using a platform or a framework to achieve the final goal. In my personal opinion, a platform is greater than a framework when the following conditions are met:<br />
<ul>
<li>The framework is tedious to use and maintain.</li>
<li>The service has some common information to be shared among instances.</li>
<li>The service can utilize additional hardware to improve performance.</li>
</ul>
<div>
In the office, we still use the Spring framework, Play framework or RoR in our applications, and this will not change any time soon. However, to move to the Cloud era, we migrated some of our existing products from internal hosting to Amazon EC2 servers. In order to make the best use of Amazon infrastructure and improve software quality, we have done some major refactoring of our software architecture. </div>
<div>
<br /></div>
<div>
Here are some of the platforms with which we are integrating our products:</div>
<div>
<br /></div>
<div>
<b>Amazon Simple Storage Service (Amazon S3) & <a href="http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html">Amazon Cloud Front</a></b></div>
<br />
We found that Amazon CloudFront is pretty useful for boosting the average response time of our applications. Previously, we hosted most of the applications in our internal server farms, located in the UK and US. This led to a noticeable increase in response time for customers in other continents. Fortunately, Amazon has a much larger infrastructure, with server farms built all around the world. That helps to guarantee a consistent delivery time, no matter where the customer is located.<br />
<br />
Currently, due to the manual effort required to set up new instances for applications, we feel that the best use of Amazon CloudFront is with static content, which we host separately from the applications in Amazon S3. This practice gives us a double benefit in performance: the more consistent delivery time offered by the <a href="http://en.wikipedia.org/wiki/Content_delivery_network">CDN</a>, plus the <a href="http://sgdev-blog.blogspot.sg/2014/01/maximum-concurrent-connection-to-same.html">separate connection count in the browser for the static content</a>.<br />
<br />
<b>Amazon ElastiCache</b><br />
<br />
Caching has never been easy in a cluster environment. The word "cluster" means that your object will not be stored in and retrieved from local memory. Rather, it is sent and retrieved over the network. This was quite tricky in the past because developers needed to sync the records from one node to another, and not all caching frameworks support this feature automatically. Our preferred framework for distributed caching was <a href="http://www.infoq.com/articles/open-terracotta-intro">Terracotta</a>. <br />
<br />
Now, we have turned to Amazon ElastiCache because it is cheap and reliable and saves us the huge effort of setting up and maintaining a distributed cache. It is worth highlighting that distributed caching is never meant to replace local caching. The difference in performance suggests that we should only prefer distributed caching over local caching when users need to access real-time temporary data.<br />
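The local-first strategy above can be sketched in a few lines of plain Java. This is an illustration only, not the ElastiCache client API: both cache levels are simulated with in-memory maps, and all class and key names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of a two-level cache: serve from local memory when
// possible, fall back to the distributed cache on a miss and populate
// the local copy. DistributedCache is simulated by a plain map here.
public class TwoLevelCache {
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final Map<String, String> distributedCache; // stand-in for the remote cache client

    public TwoLevelCache(Map<String, String> distributedCache) {
        this.distributedCache = distributedCache;
    }

    public String get(String key) {
        String value = localCache.get(key);
        if (value == null) {
            value = distributedCache.get(key); // network round trip in a real setup
            if (value != null) {
                localCache.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> remote = new ConcurrentHashMap<>();
        remote.put("session:42", "user-profile-data");
        TwoLevelCache cache = new TwoLevelCache(remote);
        System.out.println(cache.get("session:42"));
    }
}
```

The trade-off is staleness: a locally cached entry will not see updates made by other nodes, which is why truly real-time data should go straight to the distributed cache.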
<br />
<b>Event Logging for Data Analytics</b><br />
<br />
In the past, we used Google Analytics for analysing user behaviour but later decided to build an internal data warehouse. One of the motivations was the ability to track events from both browsers and servers. The event tracking system uses MongoDB as its database, as it allows us to quickly store huge amounts of events.<br />
<br />
To simplify the creation and retrieval of events, we chose JSON as the format for events. We cannot simply send these events directly from the browser to the event tracking server because the browser blocks cross-domain requests. For this reason, Google Analytics sends its events to the server in the form of a GET request for a static resource. As we have full control over how our application is built, we chose to let the events be sent back to the application server first and routed to the event tracking server later. This approach is much more convenient and powerful.<br />
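As a sketch, an event might be a small POJO serialised to a JSON string before being posted to the application server. The field names are invented, and a real system would use a JSON library rather than hand-built strings.

```java
// Hypothetical tracking event; toJson() builds the payload by hand so the
// sketch needs no JSON library. The application server would forward this
// payload to the event tracking server.
public class TrackingEvent {
    private final String name;
    private final String source;    // e.g. "browser" or "server"
    private final long timestamp;   // epoch millis

    public TrackingEvent(String name, String source, long timestamp) {
        this.name = name;
        this.source = source;
        this.timestamp = timestamp;
    }

    public String toJson() {
        return String.format("{\"name\":\"%s\",\"source\":\"%s\",\"timestamp\":%d}",
                name, source, timestamp);
    }

    public static void main(String[] args) {
        TrackingEvent event = new TrackingEvent("page_view", "browser", 1404600000000L);
        System.out.println(event.toJson());
    }
}
```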
<br />
<b>Knowledge Portal</b><br />
<br />
In the past, applications accessed data from a database or an internal file repository. However, to scale better, we gathered all our knowledge into a knowledge portal. We also built a query language to retrieve knowledge from this portal. This approach adds one additional layer to the knowledge retrieval process, but fortunately for us, our system does not need to serve real-time data. Therefore, we can utilize caching to improve performance.<br />
<br />
<b><span style="font-size: large;">Conclusion</span></b><br />
<br />
Above is some of our experience on transforming software architecture when moving to the Cloud. Please share with us your experience and opinion.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com3tag:blogger.com,1999:blog-2481600383537713152.post-10114641308580232642014-07-05T21:47:00.002-07:002014-07-05T21:58:07.912-07:00Common mistakes when using Spring MVC<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmnj1aJBQrzygsT0YpDboCKva3dIuwlDtj5tafqN3QlxeogkIt6wnJZn09bN4jVM0bBTGD2ul8-xix5Nk3nc3mm-1tin907HxAK8Zb7mSBRmNZmlPq0q1_GpYCpbm7Omg6P_tzkTF1pVL7/s1600/spring_framework.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmnj1aJBQrzygsT0YpDboCKva3dIuwlDtj5tafqN3QlxeogkIt6wnJZn09bN4jVM0bBTGD2ul8-xix5Nk3nc3mm-1tin907HxAK8Zb7mSBRmNZmlPq0q1_GpYCpbm7Omg6P_tzkTF1pVL7/s1600/spring_framework.png" style="cursor: move;" /></a>
When I started my career around 10 years ago, Struts MVC was the norm in the market. However, over the years, I have observed Spring MVC slowly gaining popularity. This is not a surprise to me, given the seamless integration of Spring MVC with the Spring container and the flexibility and extensibility that it offers.<br />
<br />
In my journey with Spring so far, I have often seen people make common mistakes when configuring the Spring framework, more often than in the days when people still used the Struts framework. I guess it is the trade-off between flexibility and usability. Plus, the Spring documentation is full of samples but lacks explanation. To help fill this gap, this article will elaborate on and explain three common issues that I often see people encounter.<br />
<br />
<b><span style="font-size: large;">Declare beans in Servlet context definition file</span></b><br />
<br />
All of us know that Spring uses <a href="http://docs.spring.io/spring/docs/3.0.x/api/org/springframework/web/context/ContextLoaderListener.html"><i>ContextLoaderListener</i> </a>to load the Spring application context. Still, when declaring the <i><a href="http://docs.spring.io/spring/docs/4.0.0.RELEASE/javadoc-api/org/springframework/web/servlet/DispatcherServlet.html">DispatcherServlet</a>,</i> we need to create the servlet context definition file named "${servlet.name}-context.xml". Ever wonder why?<br />
<br />
<b>Application Context Hierarchy</b><br />
<br />
Not all developers know that the Spring application context has a hierarchy. Let us look at this method:<br />
<br />
<i>org.springframework.context.ApplicationContext.getParent()</i><br />
<br />
It tells us that a Spring application context can have a parent. So, what is this parent for?<br />
<br />
If you download the source code and do a quick reference search, you will find that the Spring application context treats its parent as an extension of itself. If you do not mind reading code, let me show you one example of the usage, in the method <i>BeanFactoryUtils.beansOfTypeIncludingAncestors()</i>:<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">if (lbf instanceof HierarchicalBeanFactory) {
    HierarchicalBeanFactory hbf = (HierarchicalBeanFactory) lbf;
    if (hbf.getParentBeanFactory() instanceof ListableBeanFactory) {
        Map&lt;String, T&gt; parentResult =
                beansOfTypeIncludingAncestors((ListableBeanFactory) hbf.getParentBeanFactory(), type);
        ...
    }
}
return result;
}</pre>
<br />
If you go through the whole method, you will find that the Spring application context scans for beans in its internal context before searching the parent context. With this strategy, effectively, the Spring application context does a reverse breadth-first search to look for beans.<br />
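The lookup order can be illustrated with a toy model (not Spring's actual implementation): each context resolves its own bean definitions first and only falls back to its parent for names it does not define.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// A toy model of the context hierarchy: local definitions win, and
// anything missing locally is merged in from the parent. This mirrors
// the shape of beansOfTypeIncludingAncestors without depending on Spring.
public class ToyContext {
    private final ToyContext parent;
    private final Map<String, Object> beans = new HashMap<>();

    public ToyContext(ToyContext parent) {
        this.parent = parent;
    }

    public void register(String name, Object bean) {
        beans.put(name, bean);
    }

    public Map<String, Object> beansIncludingAncestors() {
        Map<String, Object> result = new LinkedHashMap<>(beans);
        if (parent != null) {
            // Only copy parent beans whose names are not already defined locally.
            for (Map.Entry<String, Object> e : parent.beansIncludingAncestors().entrySet()) {
                result.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

In Spring terms, the root context loaded by <i>ContextLoaderListener</i> plays the parent role, and each <i>DispatcherServlet</i> context is a child that can shadow the parent's definitions.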
<br />
<b><i>ContextLoaderListener</i></b><br />
<br />
This is a well-known class that every developer should know. It helps to load the Spring application context from a pre-defined context definition file. As it implements <i><a href="http://docs.oracle.com/javaee/6/api/javax/servlet/ServletContextListener.html">ServletContextListener</a>, </i>the Spring application context will be loaded as soon as the web application is loaded. This brings an indisputable benefit when the Spring container contains beans with the <i>@PostConstruct</i> annotation or batch jobs.<br />
<br />
In contrast, any bean defined in the servlet context definition file will not be constructed until the servlet is initialized. When is the servlet initialized? It is indeterminate. In the worst case, you may need to wait until a user makes the first hit to the servlet mapping URL to get the Spring context loaded.<br />
<br />
With the above information, where should you declare all your precious beans? I feel the best place to do so is the context definition file loaded by <i>ContextLoaderListener </i>and nowhere else. The trick here is the storage of the ApplicationContext as a servlet attribute under the key<br />
<br />
<i>org.springframework.web.context.WebApplicationContext.ROOT_WEB_APPLICATION_CONTEXT_ATTRIBUTE </i><br />
<i><br /></i>
Later, <i>DispatcherServlet </i>will<i> </i>load this context from the <i>ServletContext </i>and assign it as its parent application context.<br />
<br />
<pre class="prettyprint" style="line-height: 1.15;">protected WebApplicationContext initWebApplicationContext() {
    WebApplicationContext rootContext =
            WebApplicationContextUtils.getWebApplicationContext(getServletContext());
    ...
}
</pre>
<br />
Because of this behaviour, it is highly recommended to create an empty servlet application context definition file and define your beans in the parent context. This helps to avoid duplicated bean creation when the web application is loaded and guarantees that batch jobs are executed immediately.<br />
<br />
Theoretically, defining a bean in the servlet application context definition file makes the bean unique and visible to that servlet only. However, in my 8 years of using Spring, I have hardly found any use for this feature except defining Web Service endpoints.<br />
<br />
<b><span style="font-size: large;">Declare <i>Log4jConfigListener </i>after <i>ContextLoaderListener</i></span></b><br />
<br />
This is a minor bug, but it catches you when you do not pay attention to it. <i>Log4jConfigListener </i>is my preferred solution over <i>-Dlog4j.configuration </i>as we can control the log4j loading without altering the server bootstrap process. <br />
<br />
Obviously, this should be the first listener declared in your web.xml. Otherwise, all of your effort to declare a proper logging configuration will be wasted.<br />
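A minimal web.xml fragment showing the intended ordering might look like the following; the log4j.properties path is illustrative:

```xml
<!-- Log4jConfigListener must be registered before ContextLoaderListener so
     that logging is configured before the root application context loads. -->
<context-param>
    <param-name>log4jConfigLocation</param-name>
    <param-value>/WEB-INF/log4j.properties</param-value>
</context-param>

<listener>
    <listener-class>org.springframework.web.util.Log4jConfigListener</listener-class>
</listener>

<listener>
    <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>
```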
<br />
<b><span style="font-size: large;">Duplicated Beans due to mismanagement of bean exploration</span></b><br />
<br />
In the early days of Spring, developers spent more time typing XML files than Java classes. For every new bean, we needed to declare it and wire the dependencies ourselves, which was clean and neat but very painful. No surprise that later versions of the Spring framework evolved toward greater usability. Nowadays, developers may only need to declare the transaction manager, data source, property source and web service endpoints, and leave the rest to component scanning and auto-wiring. <br />
<br />
I like these new features, but this great power needs to come with great responsibility; otherwise, things get messy quickly. Component scanning and bean declaration in XML files are totally independent. Therefore, it is perfectly possible to have identical beans of the same class in the bean container if a bean is annotated for component scanning and also declared manually. Fortunately, this kind of mistake should only happen with beginners.<br />
<br />
The situation gets more complicated when we need to integrate embedded components into the final product. Then we really need a strategy to avoid duplicated bean declarations.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh00Ezb6zRFk1J9NduiFkPkvFKZe14-xnRtoLq11pU2SYVZckpXL63e2UlxO8mTNjUCoay0yGtlYmREVWGrlHq4PWLDwvgaaHxTZLBT7B3hn6pTQA_Sg-PIlBo9eNOygVwJUDpVXUWsOmAo/s1600/spring_component.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh00Ezb6zRFk1J9NduiFkPkvFKZe14-xnRtoLq11pU2SYVZckpXL63e2UlxO8mTNjUCoay0yGtlYmREVWGrlHq4PWLDwvgaaHxTZLBT7B3hn6pTQA_Sg-PIlBo9eNOygVwJUDpVXUWsOmAo/s1600/spring_component.png" height="263" width="400" /></a></div>
<br />
<br />
The above diagram shows a realistic example of the kind of problem we face in daily life. Most of the time, a system is composed of multiple components, and often one component serves multiple products. Each application and component has its own beans. In this case, what is the best way to avoid duplicated bean declarations?<br />
<br />
Here is my proposed strategy:<br />
<br />
<ul>
<li>Ensure that each component starts with a dedicated package name. It makes our life easier when we need to do component scanning.</li>
<li>Do not dictate to the team that develops the component how to declare beans within the component itself (annotations versus XML declarations). It is the responsibility of the developer who packs the components into the final product to ensure there are no duplicated bean declarations.</li>
<li>If there is a context definition file packed within the component, give it a package rather than placing it in the root of the classpath. It is even better to give it a specific name. For example, <i>src/main/resources/spring-core/spring-core-context.xml</i> is way better than <i>src/main/resources/application-context.xml.</i> Imagine what would happen if we packed a few components that each contain the same file <i>application-context.xml</i> in the identical package!</li>
<li>Do not add any annotation for component scanning (<i>@Component</i>, <i>@Service</i> or <i>@Repository</i>) if you already declare the bean in a context file.</li>
<li>Split environment-specific beans like <i>data-source</i> and <i>property-source</i> into a separate file and reuse them.</li>
<li>Do not do component scanning on a general package. For example, instead of scanning the <i>org.springframework</i> package, it is easier to manage if we scan several sub-packages like <i>org.springframework.core</i>, <i>org.springframework.context</i>, <i>org.springframework.ui</i>,...</li>
</ul>
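As an illustration of the last two points, a context definition file might scan only dedicated sub-packages and import a reusable environment-specific file; all package names here are hypothetical:

```xml
<!-- Illustrative only: scan each component's dedicated package instead of a
     broad parent package, so two components never pull in each other's beans. -->
<context:component-scan base-package="com.example.billing"/>
<context:component-scan base-package="com.example.reporting"/>

<!-- Environment-specific beans such as the data source live in a separate,
     reusable file rather than being repeated in every component. -->
<import resource="classpath:spring-core/datasource-context.xml"/>
```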
<br />
<br />
<b><span style="font-size: large;">Conclusions</span></b><br />
<br />
I hope you found the above tips useful in your daily work. If you have any doubts or other ideas, please share your feedback.Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com1tag:blogger.com,1999:blog-2481600383537713152.post-38943473465590583172014-06-18T10:42:00.002-07:002014-06-18T10:42:18.720-07:00How to increase productivity<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRVyTURUxoH5dHK35n6orbMwcTO-EQqzN_FBhZQAjGqXnC-rxG3cv-Bm_TiSjULrZjd4Mb1iCATfGNWQgtw-vaMdrXoD1o-Rv57rgUxKIQeFTm7LdwZhqpRYqf8QVX9Y5PVTFSZ2aoJVhF/s1600/increase_productivity.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRVyTURUxoH5dHK35n6orbMwcTO-EQqzN_FBhZQAjGqXnC-rxG3cv-Bm_TiSjULrZjd4Mb1iCATfGNWQgtw-vaMdrXoD1o-Rv57rgUxKIQeFTm7LdwZhqpRYqf8QVX9Y5PVTFSZ2aoJVhF/s1600/increase_productivity.jpg" /></a>Unlocking productivity is one of the biggest concerns for anyone taking a management role. However, people rarely agree on the best approaches to improving performance. Over the years, I have observed different managers using opposite practices to churn out the best performance from the teams they are managing. Unfortunately, some work and others don't. To be more accurate, some practices that do not increase performance actually reduce it.<br />
<br />
In this article, I would like to review what I have seen and learnt over the years and share my personal view on the best approaches to unlocking productivity.<br />
<br />
<b><span style="font-size: large;">What factors define team performance?</span></b><br />
<br />
Let us start by analysing what composes a team. Obviously, a team is composed of team members, each with their own expertise, strengths and weaknesses. However, the total productivity of the team is not necessarily the sum of individual productivity. Other factors like teamwork, process and environment also have a major impact on total performance, which can be either positive or negative.<br />
<br />
To sum up, the three major factors discussed in this article are technical skills, working process and culture.<br />
<br />
<b>Technical Skills</b><br />
<br />
In a factory, we can count the total productivity as the sum of the individual productivity of each worker, but this simplicity does not apply to the IT field. The difference lies in the nature of the work. Programming is still, to this day, innovative work that cannot be automated. In the IT industry, nothing is more valuable than innovation and vision. That explains why Japan may be well known for producing high-quality cars while the US is much more famous for producing well-known IT companies.<br />
<br />
In contrast to a factory environment, developers in a software team are not necessarily good at the same things. Even if they graduated from the same school and took the same job, personal preference and self-study quickly make developers' skills diverge again. For the sake of increasing total productivity, this may be a good thing. There is no use in all members being competent at the same kinds of tasks. As it is too difficult to be good at everything, life is much easier if members of the team can compensate for each other's weaknesses.<br />
<br />
It is not easy to improve the technical skills of a team, as it takes many years for a developer to build up a skill set. The fastest way to pump up the team's skill set is to recruit new talent that offers what the team lacks. That is why the popular practice in the industry is to let the team recruit new members themselves. Because of this, a team that is slowly built over the years normally offers a more balanced skill set.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0eqWEVCYzD1yfdIWQJUX4YrhO5fn5hGGnz_6eYVFTCJhPkx3hjRJZbUs1s8zO-GNTrHkSUDmxbTTWttpRob2JdlnfMlKA9s_cokjYPTTfeMls8EjM5rZfAUqRjye_OUirGH21IhSlmAPh/s1600/upgrade.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0eqWEVCYzD1yfdIWQJUX4YrhO5fn5hGGnz_6eYVFTCJhPkx3hjRJZbUs1s8zO-GNTrHkSUDmxbTTWttpRob2JdlnfMlKA9s_cokjYPTTfeMls8EjM5rZfAUqRjye_OUirGH21IhSlmAPh/s1600/upgrade.jpg" height="156" width="320" /></a>While recruitment is a quick, short-term solution, the long-term solution is to keep the team up to date with the latest trends in technology. In this field, if you do not go forward, you go backward. No skill set stays useful forever. One of my colleagues even emphasizes that upgrading developers' skills is beneficial to the company in the long run. Even if we do not count inflation, it is quite common for a company to offer a pay rise after each annual review to retain staff. If the staff do not acquire new skills, effectively, the company is paying a higher price every year for a depreciating asset. It may be a good idea for the company to use monetary prizes, like KPI bonuses, to motivate self-study and upgrading.<br />
<br />
There are a lot of training courses in the industry, but they are not necessarily the best method for upgrading skills. Personally, I feel most coursework offers more branding value than real-life usage. If a developer is keen to learn, there is sufficient knowledge on the internet to pick up almost anything. Therefore, except for commercial APIs or products, spending money on monetary prizes is more worthwhile than on training courses. <br />
<br />
Another well-known challenge for self-study is natural human laziness. There is nothing surprising about it. However, the best way to fight laziness is to find fun in learning new things. This can only be achieved if a developer treats programming as a hobby rather than merely a profession. Even if not, it is quite reasonable that one should re-invest effort in one's bread-and-butter tools. One of my friends even argues that if singers and musicians take responsibility for their own training, programmers should do the same.<br />
<br />
Sometimes, we may feel lost due to the huge number of technologies exposed to us every year. I feel that too. My approach to self-study is to add a delay in absorbing concepts and ideas. I try to understand them but do not invest too much until the new concepts and ideas are reasonably accepted by the market.<br />
<br />
<b>Working Process</b><br />
<br />
The working process can contribute greatly to team performance, positively or negatively. A great developer writes great code, but he will not be able to do so if he wastes too much effort on something non-essential. Obviously, when the process is wrong, developers may feel uncomfortable in their daily life. An unhappy developer may not perform at his best.<br />
<br />
There is no clear guideline for judging whether a working process is well defined, but people in the environment will feel it right away if something is wrong. However, it is not as easy to get it right, as the people who have the power to make decisions are not necessarily the ones who suffer from a bad process. We need an environment with effective feedback channels to improve the working process.<br />
<br />
The common pitfall of a working process is the lack of a results-oriented nature. The process is less effective if it is too reporting-oriented, attitude-oriented or based on unreal assumptions. To define the process, it may be good if the executive can decide whether he wants to build an innovative company or an operations-oriented company. Examples of the former are Google, Facebook and Twitter, while the latter may be GM, Ford and Toyota. It is not that an operations-oriented company cannot innovate, but its process was not built with innovation as the first priority. Therefore, the metrics for measuring performance may be slightly different, which causes different results in the long term. Not all companies in the IT field are innovative companies. One counter-example is the outsourcing companies and software houses in Asia. To encourage innovation, the working process needs to focus on people, minimize hassle, and maximize collaboration and sharing.<br />
<br />
Through my years in the industry with Waterfall, not-so-Agile and Agile companies, I feel that Agile works quite well for the IT field. It was built on the right assumption that software development is innovative work and less predictable compared to other kinds of engineering.<br />
<br />
<b>Company Culture</b><br />
<br />
When Steve Jobs passed away in 2011, I bought his authorized biography by Walter Isaacson. The book clearly explains how Sony failed to keep its competitive edge because of internal competition amongst its departments. Microsoft suffered a similar problem due to the controversial <a href="http://www.theverge.com/2013/11/12/5094864/microsoft-kills-stack-ranking-internal-structure">stack ranking system</a> that enforced internal competition. I think the IT field is getting more complicated, and we need more collaboration than in the past to implement new ideas.<br />
<br />
It is tough to maintain collaboration when your company grows into a multi-cultural <a href="http://en.wikipedia.org/wiki/Multinational_corporation">MNC</a>. However, it can still be done if management has the right mindset and continuously communicates its vision to the team. As above, management needs to be clear about whether it wants to build an innovative company, as that requires a distinct culture, one that is more open and highly motivated.<br />
<br />
In Silicon Valley, office life ends quite late, as most developers are geeks and they love nothing more than coding. However, this is not necessarily a good practice, as all of us have families to take care of. It is up to the individual to define his or her own work-life balance, but the requirement is that the employee is fully charged and feels excited whenever he comes to the office. He must feel that his work is appreciated and that he has support when he needs it.<br />
<br />
<b>Conclusions</b><br />
<br />
To make it short, here are the kinds of things that management can do to increase the productivity of the team:<br />
<br />
<ul>
<li>Let the team be involved in recruitment. Recruit people who take programming as a hobby.</li>
<li>Offer monetary prizes or other kinds of encouragement for self-study and self-upgrading.</li>
<li>Save the money for company-sponsored courses unless they are for commercial products.</li>
<li>Make sure that the working process is results-oriented.</li>
<li>Apply Agile practices.</li>
<li>Encourage collaboration and eliminate internal competition.</li>
<li>Encourage sharing.</li>
<li>Encourage feedback.</li>
<li>Maintain employee work-life balance and motivation.</li>
<li>Make sure employees can find support when they need it.</li>
</ul>
Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com5tag:blogger.com,1999:blog-2481600383537713152.post-26561330868095265922014-05-24T02:58:00.003-07:002014-05-28T19:27:37.966-07:00Testing effectivelyRecently, there has been <a href="http://martinfowler.com/articles/is-tdd-dead/">a heated debate regarding TDD</a>, started by DHH when <a href="http://david.heinemeierhansson.com/2014/tdd-is-dead-long-live-testing.html">he claimed that TDD is dead</a>.<br />
This ongoing debate has managed to capture the attention of the developer world, including us.<br />
<br />
Some mini debates have happened in our office regarding the right testing practices.<br />
<br />
In this article, I will present my own view.<br />
<br />
<b><span style="font-size: large;">How many kinds of tests have you seen?</span></b><br />
<br />
From the time I joined industry, here are the kinds of tests that I have worked on:<br />
<br />
<ul>
<li>Unit Test</li>
<li>System/Integration/Functional Test</li>
<li>Regression Test</li>
<li>Test Harness/Load Test</li>
<li>Smoke Test/Spider Test</li>
</ul>
<div>
The above test categories are not necessarily mutually exclusive. For example, you can create a set of automated functional tests or Smoke Tests to be used as regression tests. For the benefit of newbies, let us do a quick review of these old concepts. </div>
<div>
<br /></div>
<div>
<b>Unit Test</b></div>
<div>
<br /></div>
<div>
Unit Tests aim to test the functionality of a unit of code or component. In the Java world, the unit of code is the class, and each Java class is supposed to have a unit test. The philosophy of Unit Testing is simple: when all the components are working, the system as a whole should work.</div>
<div>
<br /></div>
<div>
A component rarely works alone. Rather, it normally interacts with other components. Therefore, in order to write Unit Tests, developers need to mock the other components. This is the problem for which DHH and <a href="http://www.rbcs-us.com/documents/Why-Most-Unit-Testing-is-Waste.pdf">James O Coplien</a> criticize Unit Testing: huge effort for little benefit. </div>
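To make the mocking cost concrete, here is a minimal, hand-rolled example in plain Java (real projects would typically use JUnit and a mocking library such as Mockito; the service and interface names here are invented):

```java
// Hypothetical component under test: a price service that depends on a
// currency converter. The test replaces the converter with a hand-written
// fake, which is exactly the mocking effort the critique above refers to.
interface CurrencyConverter {
    double toUsd(double amount, String currency);
}

class PriceService {
    private final CurrencyConverter converter;

    PriceService(CurrencyConverter converter) {
        this.converter = converter;
    }

    double priceInUsd(double amount, String currency) {
        return converter.toUsd(amount, currency);
    }
}

public class PriceServiceTest {
    public static void main(String[] args) {
        // Fake collaborator with a fixed exchange rate; no real service is called.
        CurrencyConverter fake = (amount, currency) -> amount * 2.0;
        PriceService service = new PriceService(fake);

        double result = service.priceInUsd(10.0, "SGD");
        if (result != 20.0) {
            throw new AssertionError("expected 20.0 but got " + result);
        }
        System.out.println("test passed");
    }
}
```

Writing one such fake is trivial; the criticism is about doing this for every collaborator of every class in a large codebase.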
<div>
<br /></div>
<div>
<b>System/Integration/Functional Test</b></div>
<div>
<br /></div>
<div>
There is no concrete naming, as people often use different terms to describe similar things. In contrast to Unit Tests, for functional tests developers aim to test a system function as a whole, which may involve multiple components. </div>
<div>
<br /></div>
<div>
Normally, for functional tests, the data is retrieved from and stored to a test database. Of course, there should be a pre-step to set up test data before running. DHH likes this kind of test. It helps developers test all the functions of the system without the huge effort of setting up mock objects.</div>
<div>
<br /></div>
<div>
Functional tests may involve asserting web output. In the past, this was mostly done with HtmlUnit, but with the recent improvement of Selenium Grid, Selenium became the preferred choice.</div>
<div>
<br /></div>
<div>
<b>Regression Test</b></div>
<div>
<br /></div>
<div>
In this industry, you may end up spending more time maintaining systems than developing new ones. Software changes all the time, and it is hard to avoid risk whenever making changes. Regression Tests are supposed to capture any defect caused by changes. </div>
<div>
<br /></div>
<div>
In the past, software houses had armies of testers, but the current trend is automated testing. It means that developers deliver software with a full set of tests that are supposed to break whenever a function is spoiled. </div>
<div>
<br /></div>
<div>
Whenever a bug is detected, a new test case should be added to cover it. Developers create the test, let it fail, and fix the bug to make it pass. This practice is called Test Driven Development.</div>
<div>
<br /></div>
<div>
<b>Test Harness/Load Test</b></div>
<div>
<br /></div>
<div>
A normal test case does not capture system performance. Therefore, we need to develop another set of tests for this purpose. In the simplest form, we can set a timeout for the functional tests that run on the continuous integration server. The tricky part is that this kind of test is very system-dependent and may fail if the system is overloaded. </div>
<div>
<br /></div>
<div>
The more popular solution is to run load tests manually using a load testing tool like <a href="http://jmeter.apache.org/">JMeter</a> or to create our own load test app. </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIevY2GGPFxrHnBeYUs50_qnTW94paiHjZzc4DG74xgVnyM2Gf_EIBDh33xujeBLLa9Xtx_q-GEaL2ulA7mcqpBs03H2PrjW4tjffDvQtbJvHjmI7kkFAK4HGcMZzykGEr1gFuh1KgASeN/s1600/dilbert-quality.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIevY2GGPFxrHnBeYUs50_qnTW94paiHjZzc4DG74xgVnyM2Gf_EIBDh33xujeBLLa9Xtx_q-GEaL2ulA7mcqpBs03H2PrjW4tjffDvQtbJvHjmI7kkFAK4HGcMZzykGEr1gFuh1KgASeN/s1600/dilbert-quality.gif" height="123" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>Smoke Test/Spider Test</b></div>
<div>
<br /></div>
<div>
Smoke Tests and Spider Tests are two special kinds of tests that may be more relevant to us. WDS provides KAAS (<a href="http://www.wds.co/services/knowledge-as-a-service/">Knowledge as a Service</a>) for the wireless industry. Therefore, our applications are refreshed every day with data changes rather than business logic changes. It is specific to us that system failure may come from data changes rather than business logic. </div>
<div>
<br /></div>
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiF1Ldhd_VMSDQg1gXLgSov2LKNBK28Jgfn_3e2sgREvF3Io_Th_hA5KTGhJCB_JNhvPvi12UDAzto5hn917PbqdDn57_IGi4_ibOTAzyuR3GIViM4sDnN1kDimt1lhnWhfMFVp_SD5bTB/s1600/smoke_test.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiF1Ldhd_VMSDQg1gXLgSov2LKNBK28Jgfn_3e2sgREvF3Io_Th_hA5KTGhJCB_JNhvPvi12UDAzto5hn917PbqdDn57_IGi4_ibOTAzyuR3GIViM4sDnN1kDimt1lhnWhfMFVp_SD5bTB/s1600/smoke_test.jpg" /></a>Smoke Tests are a set of pre-defined test cases run on the integration server with production data. They help us to find any potential issues before the daily LIVE deployment.</div>
<div>
<br /></div>
<div>
Similar to a Smoke Test, a Spider Test runs with real data, but it works like a crawler that randomly clicks on any available link or button. One of our systems contains so many combinations of inputs that it cannot be tested by a human (close to 100,000 combinations of inputs). </div>
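The random-combination idea can be sketched as follows. This is not our actual crawler, just a plain-Java illustration of picking one random value per input dimension instead of enumerating the full cross product; the dimension values are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Picks one random value per input dimension, so each run exercises a
// different combination. Seeding the generator lets a failing run be replayed.
public class RandomCombinationPicker {
    private final Random random;

    public RandomCombinationPicker(long seed) {
        this.random = new Random(seed);
    }

    public List<String> pick(List<List<String>> dimensions) {
        List<String> combination = new ArrayList<>();
        for (List<String> dimension : dimensions) {
            combination.add(dimension.get(random.nextInt(dimension.size())));
        }
        return combination;
    }

    public static void main(String[] args) {
        List<List<String>> dimensions = List.of(
                List.of("iPhone", "Galaxy", "Pixel"),
                List.of("en", "fr", "de"),
                List.of("prepaid", "postpaid"));
        RandomCombinationPicker picker = new RandomCombinationPicker(42L);
        // Each picked combination would drive one crawl through the UI.
        System.out.println(picker.pick(dimensions));
    }
}
```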
<div>
<br /></div>
<div>
Our Smoke Test randomly chooses some combinations of data to test. If it manages to run for a few hours without any defect, we proceed with our daily/weekly deployment.</div>
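The sampling idea can be sketched as below. This is a minimal illustration, not our actual test harness; the class name and input dimensions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of sampling random input combinations when exhaustively
// testing close to 100,000 combinations is impractical. All names are hypothetical.
class CombinationSampler {

    // Pick 'count' random combinations, one value per input dimension.
    static List<List<String>> sample(List<List<String>> dimensions,
                                     int count, long seed) {
        Random random = new Random(seed); // fixed seed keeps a failing run reproducible
        List<List<String>> picks = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            List<String> combination = new ArrayList<>();
            for (List<String> dimension : dimensions) {
                combination.add(dimension.get(random.nextInt(dimension.size())));
            }
            picks.add(combination);
        }
        return picks;
    }
}
```

Each sampled combination would then be fed through the normal smoke-test flow; fixing the seed makes it possible to replay a failing run.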
<div>
<br /></div>
<div>
<b><span style="font-size: large;">The Test Culture in our environment</span></b></div>
<div>
<br /></div>
<div>
To make it short, WDS is a TDD temple. If you create the implementation before writing the test cases, you had better be quiet about it. If you look at the WDS self-introduction, TDD is mentioned only after Agile and XP:</div>
<div>
<br /></div>
"<a href="https://www.facebook.com/wdsdev/info">We are:- agile & XP, TDD & pairing, Java & JavaScript, git & continuous deployment, Linux & AWS, Jeans & T-shirts, Tea & cake</a>"<br />
<div>
<br /></div>
<div>
Many high-level executives in WDS started their careers as developers. That helps foster our culture as an engineering-oriented company. Requests for resources to improve test coverage or infrastructure are common here. </div>
<div>
<br /></div>
<div>
We do not have QA. In the worst case, the Product Owner or customers detect bugs. In the best case, we detect bugs with test cases or through teammates during the peer review stage. </div>
<div>
<br /></div>
<div>
As for the Singapore office, most of our team members grew up absorbing Kent Beck's and Martin Fowler's books and philosophy. That is why most of them are hardcore TDD worshippers. </div>
<div>
<br /></div>
<div>
The focus on testing in our working environment has borne fruit: the WDS production defect rate is relatively low.</div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">My own experience and personal view with testing</span></b></div>
<div>
<br /></div>
<div>
That is enough self-appraisal. Now, let me share my experience with testing.</div>
<div>
<br /></div>
<div>
<b>Generally, Automated Testing works better than QA </b></div>
<div>
<br /></div>
<div>
Comparing the output of a traditional software house packed with an army of QA against a modern Agile team that delivers fully test-covered products, the latter normally outperforms in terms of quality and even cost effectiveness. Will QA jobs become extinct soon?</div>
<div>
<br /></div>
<div>
<b>Over monitoring may hint lack of quality</b></div>
<div>
<br /></div>
<div>
It sounds strange, but over the years I have developed an insecure feeling whenever I see a project that has too many layers of monitoring. Over-monitoring may hint at a lack of confidence, and indeed, these systems crash very often for unknown reasons. </div>
<div>
<br /></div>
<div>
<b>Writing test cases takes more time than developing features</b></div>
<div>
<br /></div>
<div>
DHH is definitely right on this. Writing test cases means that you need to mock inputs and assert lots of things. Unless you keep writing spaghetti code, developing a feature takes much less time than writing its tests.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAat9fv2yjjfzp7ZOUFa7BhuHBP9lhtTkp_3gkibOZPDHjRVnH1Ru3WBmLwAuXcI7v6bg5_to4O8BHfuzU3MeEwMY8wtuRlMbmHtWKWaOrsNEU1yUlP4hKOHuO6dvJlsFR28ppUXqwi6EJ/s1600/dilbert_tdd.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAat9fv2yjjfzp7ZOUFa7BhuHBP9lhtTkp_3gkibOZPDHjRVnH1Ru3WBmLwAuXcI7v6bg5_to4O8BHfuzU3MeEwMY8wtuRlMbmHtWKWaOrsNEU1yUlP4hKOHuO6dvJlsFR28ppUXqwi6EJ/s1600/dilbert_tdd.gif" height="123" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>UI Testing with javascript is painful</b></div>
<div>
<br /></div>
<div>
You know it if you have done it. Life is much better if you only need to test RESTful APIs or static HTML pages. Unfortunately, the trend in modern web application development involves lots of JavaScript on the client side. For UI testing, asynchrony is evil. </div>
<div>
<br /></div>
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin6enVJMSrqvetDmdkR9EzjRGBxOKXQ1yUBDaS6h554_tba3XgXyBP7bDBq4IqIYSUjiY5LkTz40ZJJp5MPQ98Zp6KKLg2PC5f8TEALMEA1PoK8ongYDqKNlHxzc9UBxtA3Rmm3T2tewuW/s1600/random_failure.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin6enVJMSrqvetDmdkR9EzjRGBxOKXQ1yUBDaS6h554_tba3XgXyBP7bDBq4IqIYSUjiY5LkTz40ZJJp5MPQ98Zp6KKLg2PC5f8TEALMEA1PoK8ongYDqKNlHxzc9UBxtA3Rmm3T2tewuW/s1600/random_failure.jpg" /></a>Whether you go with a full-control testing framework like HtmlUnit or a more practical, generic one like Selenium, it would be a great surprise to me if you never encountered random failures. </div>
<div>
<br /></div>
<div>
I guess every developer knows the feeling of failing to get the build to pass at the end of the week because of randomly failing test cases.</div>
<div>
<br /></div>
<div>
<b>Developers always over-estimate their software quality</b></div>
<div>
<br /></div>
<div>
It applies to me as well because I am an optimistic person. We tend to think that our implementation is perfect until the tests fail or someone points out a bug.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy2enGCDvN0AwWppD_kpYNABaT874mLQ4XfoWhpavPxqKLJSiIhl8syRK56HSJRsOF0gKGWSJSlGZQGGyQXHICm_ijjJIgbFRnouJyI7cfERjgz3gS_b86i_OtZTM98nOJv8ONtxvRM7il/s1600/is-your-product-marketing-strong-or-weak.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy2enGCDvN0AwWppD_kpYNABaT874mLQ4XfoWhpavPxqKLJSiIhl8syRK56HSJRsOF0gKGWSJSlGZQGGyQXHICm_ijjJIgbFRnouJyI7cfERjgz3gS_b86i_OtZTM98nOJv8ONtxvRM7il/s1600/is-your-product-marketing-strong-or-weak.jpg" height="320" width="247" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>Sometimes, we change our code to make writing test cases easier</b></div>
<div>
<br /></div>
<div>
Like it or not, we must agree with DHH on this point. In the Java world, I have seen people expose internal variables and create dummy wrappers for framework objects (like HttpSession, HttpRequest, ...) so that it is easier to write unit tests. DHH found this so uncomfortable that he chose to walk away from unit testing.</div>
<div>
<br /></div>
<div>
On this point, I half agree and half disagree with him. In my own view, altering the design or implementation for the sake of testing is not favourable. It is better if developers can write the code without worrying about how to mock the inputs.</div>
<div>
<br /></div>
<div>
However, abandoning unit testing for the sake of a simple and convenient life is too extreme. The right solution is to design the system in such a way that the business logic is not tightly coupled to the framework or infrastructure. </div>
<div>
<br /></div>
<div>
This is what is called <a href="http://en.wikipedia.org/wiki/Domain-driven_design">Domain-Driven Design</a>.</div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Domain Driven Design</span></b></div>
<div>
<br /></div>
<div>
For newcomers, Domain-Driven Design gives us a system with the following layers.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqPgkceWpC7s0tyKTkrjecCoXRxY0Ufv_o2oHcHqY49JIQjn27aZ9ZjxGF3ADYmq-EO9a6v8HJejMVB9p7G3B2T73qIdBk8qWlN6oA4v8BNdk4PwuwVoFa9YfY14DWp0o_uVQknbeV0SrD/s1600/ddd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqPgkceWpC7s0tyKTkrjecCoXRxY0Ufv_o2oHcHqY49JIQjn27aZ9ZjxGF3ADYmq-EO9a6v8HJejMVB9p7G3B2T73qIdBk8qWlN6oA4v8BNdk4PwuwVoFa9YfY14DWp0o_uVQknbeV0SrD/s1600/ddd.png" height="225" width="400" /></a></div>
<div>
<br /></div>
<div>
If you notice, the above diagram has more abstraction layers than Rails or the Java adaptation of Rails, the <a href="http://www.playframework.com/">Play framework</a>. I understand that creating more abstraction layers can lead to a bloated system, but for DDD it is a reasonable compromise. </div>
<div>
<br /></div>
<div>
Let us elaborate further on the content of each layer:</div>
<div>
<br /></div>
<div>
<b>Infrastructure</b></div>
<div>
<br /></div>
<div>
This layer is where you keep your repository implementations and any other environment-specific concerns. For infrastructure, keep the API as simple and dumb as possible, and avoid implementing any business logic here. </div>
<div>
<br /></div>
<div>
For this layer, unit testing is a joke. If there is anything to write, it should be an integration test that works with a real database.</div>
<div>
<br /></div>
<div>
<b>Domain</b></div>
<div>
<br /></div>
<div>
The Domain layer is the most important layer. It contains all the system's business logic without any framework, infrastructure, or environment concerns. Your implementation should look like a direct translation of the user requirements. Every input, output, and parameter is a <a href="http://en.wikipedia.org/wiki/Plain_Old_Java_Object">POJO</a>. </div>
<div>
<br /></div>
<div>
The Domain layer should be the first layer to be implemented. To fully complete the logic, you may need the interface/API of the Infrastructure layer. It is best practice to keep the API in the Domain layer and the concrete implementation in the Infrastructure layer. </div>
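A minimal Java sketch of this split (all class names are hypothetical): the repository API is declared in the Domain layer, the Infrastructure layer implements it, and the business logic depends only on the interface, so it stays unit-testable with POJOs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Domain layer: a pure POJO -- no framework imports.
class User {
    final String id;
    final String email;
    User(String id, String email) { this.id = id; this.email = email; }
}

// Domain layer: the repository API lives here as an interface.
interface UserRepository {
    Optional<User> findById(String id);
}

// Domain layer: business logic written against the interface only.
class GreetingService {
    private final UserRepository repository;
    GreetingService(UserRepository repository) { this.repository = repository; }

    String greet(String userId) {
        return repository.findById(userId)
                .map(user -> "Hello, " + user.email)
                .orElse("Hello, guest");
    }
}

// Infrastructure layer: a concrete implementation (an in-memory stand-in here;
// in production this would talk to the real database).
class InMemoryUserRepository implements UserRepository {
    private final Map<String, User> store = new HashMap<>();
    void save(User user) { store.put(user.id, user); }
    public Optional<User> findById(String id) { return Optional.ofNullable(store.get(id)); }
}
```

In a unit test, GreetingService is exercised with the in-memory repository; no framework object ever needs to be mocked.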
<div>
<br /></div>
<div>
The best kind of test case for the Domain layer is the unit test, as your concern here is neither the system UI nor the environment. This spares developers the dirty work of mocking framework objects. </div>
<div>
<br /></div>
<div>
For mocking the internal state of an object, my preferred choice is to use a reflection utility to set up the object rather than exposing internal variables through setters.</div>
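A minimal sketch of this approach using plain JDK reflection; the Counter class, field name, and utility are illustrative, not from our codebase.

```java
import java.lang.reflect.Field;

// Production class keeps its state private -- no setter added just for tests.
class Counter {
    private int value;
    int next() { return ++value; }
}

// Test-only utility: seed a private field via reflection.
class ReflectionTestUtils {
    static void setField(Object target, String name, Object value) {
        try {
            Field field = target.getClass().getDeclaredField(name);
            field.setAccessible(true); // bypass 'private' for test setup only
            field.set(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot set field: " + name, e);
        }
    }
}
```

The test can now start the object in any state it needs without the production API ever exposing that state.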
<div>
<br /></div>
<div>
<b>Application Layer/User Interface</b></div>
<div>
<br /></div>
<div>
The Application layer is where you start thinking about how to present your business logic to the customer. If the logic is complex or involves many consecutive requests, it is possible to create <a href="http://en.wikipedia.org/wiki/Facade_pattern">Facades</a>.</div>
<div>
<br /></div>
<div>
At this point, developers should think more about the clients than about the system. The major concerns are the customer's devices, UI responsiveness, load balancing, stateless versus stateful sessions, and RESTful APIs. This is the place for developers to showcase framework talent and knowledge.</div>
<div>
<br /></div>
<div>
For this layer, the better kinds of test cases are functional/integration tests. </div>
<div>
<br /></div>
<div>
As above, try your best to avoid having any business logic in the Application layer.</div>
<div>
<br /></div>
<div>
<b>Why is it hard to write unit tests in Rails?</b></div>
<div>
<br /></div>
<div>
Now, if you look back at Rails or the Play framework, there is no clear separation of layers as above. The controllers render inputs and outputs and may contain business logic as well. The same applies if you use the Servlet API without adding any additional layer. </div>
<div>
<br /></div>
<div>
The domain object in Rails is an Active Record and is tightly coupled to the database schema. </div>
<div>
<br /></div>
<div>
Hence, for whatever unit of code developers want to write test cases against, the inputs and outputs are not POJOs. This makes writing unit tests tough.</div>
<div>
<br /></div>
<div>
We should not blame DHH for this design, as he follows another philosophy of software development with many benefits, such as simple design, low development effort, and quick feedback. However, I myself do not adopt all of his ideas for developing enterprise applications. </div>
<div>
<br /></div>
<div>
Some of his ideas, like convention over configuration, are great and did cause a major mindset change in the developer world, but others end up as trade-offs. Being able to bring up a website quickly may later turn into trouble when implementing features that Rails/Play do not support. </div>
<div>
<br /></div>
<div>
<b><span style="font-size: large;">Conclusion</span></b></div>
<div>
<ul>
<li>Unit tests are hard to write if your business logic is tightly coupled to the framework.</li>
<li>Focusing on and developing the business logic first may help you create a better design.</li>
<li>Each kind of component suits a different kind of test case.</li>
</ul>
</div>
<div>
This is my own view of testing. If you have any other opinions, please share your feedback.</div>
Tony Nguyenhttp://www.blogger.com/profile/13091606959016104357noreply@blogger.com14tag:blogger.com,1999:blog-2481600383537713152.post-5318771610076755302014-05-23T02:10:00.001-07:002014-05-23T02:16:05.199-07:00Software Development and Newton's Laws of Motion<h1>Intro</h1>
<p>I have no idea when the word <strong>velocity</strong> found a new home in software development, but it is nevertheless popular these days. However, I am pretty sure that Mr Isaac Newton would not be happy if you talked about <strong>motion</strong> without mentioning his laws. </p>
<h1>First Law</h1>
<blockquote>
<p>When viewed in an inertial reference frame, an object either remains at rest or continues to move at a constant velocity, unless acted upon by an external force.</p>
</blockquote>
<p>There are a lot of external forces </p>
<ul>
<li>developers are fixing bugs </li>
<li>developers are adding new features </li>
<li>developers are introducing more bugs (lol)</li>
<li>business requests to cut down the operation cost </li>
<li>third party competition is changing the market </li>
<li>users are changing</li>
<li>this list goes on and on </li>
</ul>
<p>However, a team/product is either dead (therefore remaining at rest) or moving at a constant velocity (let's say generating a certain amount of revenue or eating a certain amount of budget per day). </p>
<p>Now I declare, it is against the law to talk about team <strong>velocity</strong>, because what should you do to <strong>maintain</strong> the team's velocity? Nothing, you should do <strong>nothing</strong>! </p>
<p>Well, that will upset most of the managers, "I'd rather my developers do <strong>something</strong>". </p>
<p>So we need another law. </p>
<h1>Second Law</h1>
<blockquote>
<p>F = ma. The vector sum of the forces F on an object is equal to the mass m of that object multiplied by the acceleration vector a of the object.</p>
</blockquote>
<p><strong>Acceleration</strong> is the ability to change the velocity. The <strong>F</strong> is treated as a constant here because, come on, let's be honest, your team is pretty much fixed in size, unless you are Google. Your time is pretty much fixed at 24 hours per day, unless you live on Mars, where it is slightly longer, 24.622962 hours to be exact. Now we are screwed ... there is only one variable left to play with. According to the second law, for a given force F, the acceleration is inversely proportional to the mass. Mass is the burden; it works <strong>against</strong> acceleration. </p>
<p>Here is a short list of how to gain some mass</p>
<ul>
<li>too many good-to-have features </li>
<li>too much technical debt</li>
<li>too many abstractions, layers upon layers, ORM, DAO, service, controller, view. We need all of them to get some trivial {"user_id": 123} out of that database. oh forget to mention, there is SQL, and NoSQL ... </li>
<li>too many processes </li>
<li>too many patterns, EnterprisyStrategyFactoryBuilderAdapterListenerInterceptor</li>
<li>too many communication delegations, business -> project manager -> business analyst -> team leader -> developer (add more roles at your own will)</li>
<li>too many frameworks. JavaEE, Spring, Hibernate, Struts, Bootstrap, jQuery, Angular.js, Ember.js. Dare to lookup JavaEE? There are <strong>39</strong> JSRs listed under JavaEE7!</li>
<li>too many servers. Web servers, relational database servers, NoSQL servers, cache servers, message queue servers, third party integration servers ... </li>
</ul>
<p>Yet, in the end you <strong>do</strong> want to make a change, don't you? If your answer is <strong>NO</strong>, grats, you can stop reading here. Even if the answer is <strong>yes</strong>, you can only say so after you read the third law.</p>
<h1>Third Law</h1>
<blockquote>
<p>To every action there is always opposed an equal reaction: or the mutual actions of two bodies upon each other are always equal, and directed to contrary parts.</p>
</blockquote>
<p>A: "Can we remove feature XYZ? so that the codes can be greatly simplified"<br/>
R: "Please no, that is Shareholder ABC's favorite" <br/>
A: "Ooookie, nvm"</p>
<p>A: "Can we change to git?"<br/>
R: "Nah, zip and email is our best friend" <br/>
A: "Maybe next time"</p>
<p>A: "Can we upgrade java 1.4?"<br/>
R: "There are too many servers in production"<br/>
A: "Fine, let's stick to manual casting"</p>
<p>Aaaaah, I still want to type some more words but there is an equal reaction preventing me from doing that ... So let's call this a day.</p>
<p>Thanks for wasting your time reading my rants. </p>
<p><strong>Happy Coding ...</strong></p>
<hr/>
<p>Reference </p>
<ul>
<li>http://en.wikipedia.org/wiki/Velocity_(software_development)</li>
<li>http://en.wikipedia.org/wiki/Newton's_laws_of_motion</li>
</ul>Anonymoushttp://www.blogger.com/profile/17772064492702767078noreply@blogger.com5tag:blogger.com,1999:blog-2481600383537713152.post-28340791852775982572014-05-21T07:45:00.001-07:002014-05-21T07:52:44.880-07:00MySQL Transaction Isolation Levels and Locks<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px;">
<div style="margin-bottom: 10px;">
Recently, an application that my team was working on encountered a MySQL deadlock, and it took us some time to figure out the reasons behind it. The application was running on a 2-node cluster, with both nodes connected to an AWS MySQL database. The MySQL tables are mostly based on InnoDB, which supports transactions (meaning all the usual commit and rollback semantics) as well as the row-level locking that the MyISAM engine does not provide. The problem arose when our users, due to a poorly designed user interface, were able to execute the same long-running operation twice on the database.</div>
<div style="margin-bottom: 10px;">
As it turned out, because we have a dual-node cluster, each user operation originated from a different web application (which in turn meant two different transactions running the same queries). The deadlocking query happened to be an “INSERT INTO T… SELECT FROM S WHERE” query, which introduced shared locks on the records used in the SELECT. It did not help that both T and S in this case happened to be the same table. In effect, both the shared locks and the exclusive locks were applied to the same table. A possible cause of the deadlock is illustrated by the following table. This assumes the default REPEATABLE_READ transaction isolation level (I will explain the concept of transaction isolation later).</div>
<div style="margin-bottom: 10px;">
Assuming that we have a table as such</div>
<table id="example" style="border-collapse: collapse; border-spacing: 0px; color: #333333; font-size: 14px; line-height: 20px; margin: 20px; max-width: 100%; padding: 5px; width: 689px;"><thead>
<tr><th style="background-color: aquamarine; border: 1px solid black; padding: 2px;">RowId</th><th style="background-color: aquamarine; border: 1px solid black; padding: 2px;">Value</th></tr>
</thead><tbody>
<tr><td style="border: 1px solid black; padding: 2px;">1</td><td style="border: 1px solid black; padding: 2px;">Collection 1</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">2</td><td style="border: 1px solid black; padding: 2px;">Collection 2</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">…</td><td style="border: 1px solid black; padding: 2px;">Collection N</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">450000</td><td style="border: 1px solid black; padding: 2px;">Collection 450000</td></tr>
</tbody></table>
<div style="margin-bottom: 10px;">
The following is a sample sequence that could cause a deadlock when two transactions run an SQL query like “INSERT INTO T SELECT FROM T WHERE … ”:</div>
<table id="scenario" style="border-collapse: collapse; border-spacing: 0px; color: #333333; font-size: 14px; line-height: 20px; margin: 20px; max-width: 100%; padding: 5px; width: 689px;"><thead>
<tr><th style="background-color: aquamarine; border: 1px solid black; padding: 2px;">Time</th><th style="background-color: aquamarine; border: 1px solid black; padding: 2px;">Transaction 1</th><th style="background-color: aquamarine; border: 1px solid black; padding: 2px;">Transaction 2</th><th style="background-color: aquamarine; border: 1px solid black; padding: 2px;">Comment</th></tr>
</thead><tbody>
<tr><td style="border: 1px solid black; padding: 2px;">T1</td><td style="border: 1px solid black; padding: 2px;">Statement executed</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">Statement executed. A shared lock is applied to records that are read by selection</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">T2</td><td style="border: 1px solid black; padding: 2px;">Read lock s1 on Row 10-20</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">The lock on the index across a range. InnoDB has a concept of gap locks.</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">T3</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">Statement executed</td><td style="border: 1px solid black; padding: 2px;">Transaction 2 statement executed. Similar shared lock to s1 applied by selection</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">T4</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">Read lock s2 on Row 10-20</td><td style="border: 1px solid black; padding: 2px;">Shared read locks allow both transaction to read the records only</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">T5</td><td style="border: 1px solid black; padding: 2px;">Insert lock x1 into Row 13 in index wanted</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">Transaction 1 attempts to get exclusive lock on Row 13 for insertion but Transaction 2 is holding a shared lock</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">T6</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">Insert lock x2 into Row 13 in index wanted</td><td style="border: 1px solid black; padding: 2px;">Transaction 2 attempts to get exclusive lock on Row 13 for insertion but Transaction 1 is holding a shared lock</td></tr>
<tr><td style="border: 1px solid black; padding: 2px;">T7</td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;"> </td><td style="border: 1px solid black; padding: 2px;">Deadlock!</td></tr>
</tbody></table>
<div style="margin-bottom: 10px;">
The above scenario occurs only when we use REPEATABLE_READ (which introduces shared read locks). If we were to lower the transaction isolation level to READ_COMMITTED, we would reduce the chances of a deadlock happening. Of course, this would mean relaxing the consistency of the database reads. In our case, the data requirements do not call for such strong consistency, so it is acceptable for one transaction to read records that have been committed by other transactions.</div>
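As an illustration, here is a JDBC sketch of lowering the isolation level on a single connection. The class name is hypothetical and the query is a placeholder echoing the scenario above, not our production code.

```java
import java.sql.Connection;
import java.sql.Statement;

class IsolationDemo {

    // Run the problematic statement under READ_COMMITTED instead of the
    // default REPEATABLE_READ, so shared read locks are released after the
    // SELECT rather than held for the whole transaction.
    static void runWithReadCommitted(Connection connection) throws Exception {
        connection.setAutoCommit(false);
        connection.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        try (Statement statement = connection.createStatement()) {
            statement.executeUpdate("INSERT INTO T SELECT * FROM S WHERE ..."); // placeholder query
            connection.commit();
        } catch (Exception e) {
            connection.rollback();
            throw e;
        }
    }

    // Pure helper mapping a MySQL level name to the JDBC constant.
    static int jdbcLevel(String name) {
        switch (name) {
            case "READ-UNCOMMITTED": return Connection.TRANSACTION_READ_UNCOMMITTED;
            case "READ-COMMITTED":   return Connection.TRANSACTION_READ_COMMITTED;
            case "REPEATABLE-READ":  return Connection.TRANSACTION_REPEATABLE_READ;
            case "SERIALIZABLE":     return Connection.TRANSACTION_SERIALIZABLE;
            default: throw new IllegalArgumentException("Unknown level: " + name);
        }
    }
}
```

The same effect can also be set per session on the MySQL side; choosing the connection-level JDBC call keeps the decision visible in the application code.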
<div style="margin-bottom: 10px;">
So, to delve deeper into the idea of transaction isolation: this concept has been defined by ANSI/ISO SQL with the following levels, from highest isolation to lowest:</div>
<ol style="margin: 0px 0px 10px 25px; padding: 0px;">
<li>
<dl style="margin-bottom: 20px;">
<dt style="font-weight: bold;">Serializable</dt>
<dd style="margin-left: 10px;">This is the highest isolation level and usually requires the use of shared read locks and exclusive write locks (as in the case of MySQL).</dd><dd style="margin-left: 10px;">What this means in essence is that any query made requires a shared read lock on the records, which prevents another transaction’s query from modifying these records. Every update statement requires an exclusive write lock.</dd><dd style="margin-left: 10px;">Also, range locks must be acquired when a select statement with a WHERE condition is used. This is implemented as a gap lock in MySQL.</dd></dl>
</li>
<li>
<dl style="margin-bottom: 20px;">
<dt style="font-weight: bold;">Repeatable Reads</dt>
<dd style="margin-left: 10px;">This is the default level used in MySQL. It is largely similar to Serializable apart from the fact that a range lock is not used. However, the way MySQL implements this level seems to me a little different. Based on Wikipedia’s <a href="http://en.wikipedia.org/wiki/Isolation_(database_systems)#Isolation_levels" style="color: #0088cc; text-decoration: none;">article</a> on transaction isolation, a range lock is not implemented, so phantom reads can still occur. Phantom reads refer to the possibility that the same select query, re-run within a transaction, returns additional records. However, what I understand from MySQL’s <a href="http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html#isolevel_serializable" style="color: #0088cc; text-decoration: none;">documentation</a> is that range locks are still used, and the same select queries made in the same transaction will always return the same records. Maybe I am mistaken in my understanding, and if there are any mistakes in my interpretations, I stand ready to be corrected.</dd></dl>
</li>
<li>
<dl style="margin-bottom: 20px;">
<dt style="font-weight: bold;">Read Committed</dt>
<dd style="margin-left: 10px;">This isolation level maintains a write lock until the end of the transaction, but read locks are released at the end of the SELECT statement. It does not promise that a SELECT statement will find the same data if it is re-run in the same transaction. It does, however, guarantee that the data read is not “dirty” and has been committed.</dd></dl>
</li>
<li>
<dl style="margin-bottom: 20px;">
<dt style="font-weight: bold;">Read Uncommitted</dt>
<dd style="margin-left: 10px;">This is an isolation level that I doubt is useful for most use cases. Basically, it allows a transaction to see all data that has been modified, including “dirty”, uncommitted data. This is the lowest isolation level.</dd></dl>
</li>
</ol>
<div style="margin-bottom: 10px;">
Having gone through the different transaction isolation levels, we can see how the choice of isolation level determines the database's locking behaviour. From a practical standpoint, the default MySQL isolation level (REPEATABLE_READ) might not always be a good choice when you are dealing with a scenario like ours, where there is really no need for such strong consistency in the data reads. I believe that lowering the isolation level is likely to reduce the chances that your database queries meet with a deadlock. It might even allow higher concurrent access to your database, which improves the performance of your queries. Of course, this comes with the caveat that you need to understand how important consistent reads are for your application. If you are dealing with data where precision is paramount (e.g. your bank accounts), then it is definitely necessary to impose as much isolation as possible so that you do not read inconsistent information within your transaction.</div>
</div>
Lim Hanhttp://www.blogger.com/profile/00238107374789601761noreply@blogger.com0