Tuesday, 24 February 2015

AngularJS as an alternative choice for building web interfaces

Recently, we were invited to conduct a joint talk with Andrew, a UX designer from Viki, on improving communication between design and development for Singasug. The main focus of the talk was to introduce AngularJS and elaborate on how it helps to improve collaboration between developers and designers, from both points of view. To illustrate the topic, we worked together on revamping the Pet Clinic Spring sample application.

Over one month of working remotely, mostly in our spare time, we managed to refresh the Pet Clinic application with a new interface built on AngularJS and Bootstrap. Based on that experience, we delivered the talk to share what had been done with the Spring community.

In this article, we want to re-share the story with a different focus: using AngularJS as an alternative choice for building web applications.

Project kick-off

The project was initiated last year by Michael Isvy and Sergiu Bodiu, the two organizers of the Singapore Spring User Group. Michael asked for our help to deliver a talk about AngularJS for the web night. He went further by introducing us to a UX designer named Andrew and asking if we could collaborate on revamping the Pet Clinic application. Interested in the idea, we agreed to work on the project and deliver the talk together.

We set the initial scope of the project to one month and aimed to show as much functionality as possible rather than fully complete the application. Also, because of Andrew's involvement, we took the opportunity to revamp the information architecture of the website and give it a new layout.

Project Delivery

The biggest problem we had to solve was geography. All of us were working in our spare time and could afford only limited face-to-face communication. Due to personal schedules, it was difficult to set up a common working place or time. To discuss the project and plan, we only scheduled a weekly meeting during the weekend.

To replace direct communication, we contacted each other through WhatsApp. We hosted the project on Github and turned on the notification feature so that each member was informed of every check-in. Even though there was a bit of concern at the beginning, each member played their part and the project progressed well.

To prepare in advance, we converted the Spring MVC controllers with Jsp views to Restful controllers. After that, Andrew showed us the wireframe flow that laid out a common understanding of how the application should behave. Based on that, we built the respective controllers and integrated them with the html templates Andrew provided.

Halfway through the project, Andrew picked up some understanding of Angular directives and was able to work directly on the project source code. That helped to boost our development speed.

When the project was halfway through its timeline, Michael came back to us and asked if it was possible to make the website functional without the Restful API. He thought that would help designers develop html templates without deploying them to an application server. We bought into this idea and provided the feature in the application. By turning on a flag, the AngularJS application no longer makes any http call to the server. Instead, a mock http service returns pre-defined static Json files.
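
We do not reproduce the project's exact wiring here, but a minimal sketch of the idea with AngularJS's ngMockE2E module might look like the following (the module names are our assumptions, and the static data is inlined instead of loaded from Json files):

// Registered only when the offline flag is on; requires angular-mocks.js.
angular.module('petClinicMock', ['petClinicApp', 'ngMockE2E'])
 .run(['$httpBackend', function($httpBackend) {
   // Pre-defined data that would normally come from the Restful API.
   var owners = [
     { id: 1, firstName: 'George', lastName: 'Franklin', city: 'Madison' }
   ];
   $httpBackend.whenGET(/\/api\/owners$/).respond(owners);
   // Templates and other static assets still go to the real web server.
   $httpBackend.whenGET(/\.html$/).passThrough();
 }]);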

What we have achieved

After one month, we have a new website with a beautiful theme that looks like a commercial website rather than a sample application. The new Pet Clinic is now a single page application that accesses a Restful API.

We also figured out an effective way of working remotely, even though we did not know each other before the project started.

We also found many benefits of building web interfaces with AngularJS. That is what we want to share with you today.

Understanding AngularJS

If you are new to AngularJS, think of it as a new MVC framework for front-end applications. MVC is a well known pattern for server side applications, but it is applicable to front-end applications as well. With AngularJS, developers build a model based on the data received from the server. After that, Angular binds the model data to the html controls and displays the output on screen.

The good thing here is that AngularJS provides two-way binding instead of one-way binding. If the user updates the value in an html control, the model is automatically updated without developer effort.

Here is a rough illustration of the boring one-way binding versus the two-way binding.
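
As a textual sketch of the difference (our own example, not the original diagram): the paragraph below only displays the model, while the input with ng-model also updates it as the user types.

<!-- one-way: the view shows the model, edits on the page never reach it -->
<p data-ng-bind="owner.firstName"></p>

<!-- two-way: typing in the input updates owner.firstName immediately,
     and any change made in the controller shows up in the input as well -->
<input type="text" data-ng-model="owner.firstName"/>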




Angular also provides a scope for each model, so that the two-way binding is only active within the boundary of a controller. To instruct the binding, AngularJS uses directives, which are embedded directly in the html controls. The Angularised html will look similar to this:

<div class="col-md-3" data-ng-repeat="owner in owners">
  <div class="thumbnail">
    <div class="caption">
      <h3 class="caption-heading">
        <a href="show.html" data-ng-bind="owner.firstName + ' ' + owner.lastName"></a>
      </h3>
      <p data-ng-bind="owner.city"></p>
      <p data-ng-bind="owner.address"></p>
    </div>
  </div>
</div>

For the above binding to work, you need to prepare the data for the model "owners" in your controller:

var OwnerController = ['$scope','$state','Owner',function($scope,$state,Owner) {
 $scope.owners = Owner.query();
}];

If you are curious why the controller code looks so short, it is because we implemented an Owner service with JPA-like methods:

var Owner = ['$resource','context', function($resource, context) {
 return $resource(context + '/api/owners/:id');
}];

Looking at the above template, if we take out all the Angular directives, which start with data-ng or ng, we get back the original html template. The only exception here is the directive data-ng-repeat, which functions like a for loop and helps to shorten the html code.

After Angularising the template file and creating the controller, the last thing one needs to do is declare them in the global app.js file:

....
.state({ name: "owners",
  url: "/owners",
  templateUrl: "components/owners/owners.html",
  controller: "OwnerController",
  data: { requireLogin : true }
})

....
app.controller('OwnerController', OwnerController);
...
app.factory('Owner', Owner);

So far, that is the effort needed to Angularise one owners page template. The above example is mostly about showing data, but if we replace the <p/> tags with <input/> tags, users will be able to edit and view the owner at the same time.
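
For instance, swapping a single line of the template above (a small sketch of ours) is enough to make the city editable, thanks to the two-way binding:

<!-- before: display only -->
<p data-ng-bind="owner.city"></p>

<!-- after: editable, and the owner in $scope.owners stays in sync automatically -->
<input type="text" data-ng-model="owner.city"/>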

Pros and cons of AngularJS

We evaluated AngularJS before adopting it and we evaluated it again after using it. To recap our experience, we will share some benefits and issues of AngularJS.

Facilitate adoption of Restful API

Obviously, one needs to introduce a Restful API to work with AngularJS. Given the changes that have happened in this industry over the last decade, Restful API is slowly but surely becoming the standard practice for future applications.

In the past, a typical Spring framework developer needed to know JDBC, JPA, Spring and Jsp in order to develop web applications. These are no longer enough. At first, big players like Twitter, Facebook and Google introduced APIs for third-party web applications to integrate with their services. Later, there was a booming trend of mobile applications that no longer render the UI based on html.

Because of that, there is an increasing demand for building applications that serve data instead of serving html. For example, any start-up that wants to play it safe will start with building a Restful API before building a front-end application. That can save a lot of effort if there is a need to build for another client besides the browser in the future.

There is another contributing factor from the development point of view. Splitting back-end and front-end applications makes parallel development easier. It is not necessary to wait until the back-end services are completed before building the web interface. It is also beneficial in terms of utilizing resources. In the past, we needed developers who knew both Java and Javascript to develop a Java web application. Similarly, we needed developers who knew both .NET and Javascript to develop a .NET application. However, web applications nowadays use a lot more Javascript than in the past, which makes developers who are good at both languages harder to find. With a Restful API, it is possible to recruit front-end developers with a strong understanding of Javascript and Css to build the web interface, while back-end developers focus on security, scalability and performance issues.

Because we adopted the Restful API for its development benefits, it was important for us to showcase the ability to run the Angular application without a back-end service. Because AngularJS injects modules by declaration, we have a single point of integration to configure the http service for the whole application. This looks very favourable, especially for a Spring developer, because we have better control over technology usage.

Faster development

Jsp by nature is only for viewing. Changing data happens by other means, such as form submission or Ajax requests. That is why most of us think the MVC pattern only includes one-way binding for displaying the view. That does not have to be the case, if you remember desktop applications. On the front end, MVC is not that popular. Some developers still like to manually populate values of html controls after each Ajax request. We do too, but only sometimes. Other times, we prefer automation over control. That is why we love AngularJS.

Like any other MVC framework, AngularJS provides a model behind the scenes for each html control. But it is a wow factor to have it work the reverse way, where the html control updates the model. Think about it practically: sometimes our html controls are inter-related. For example, updating the value of country in the first drop down is supposed to display the corresponding states or provinces in the next drop down. AngularJS allows us to do this without coding.
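
Here is a hedged sketch of that country and province example (the controller name and data shape are made up for illustration): the second drop down re-renders whenever the model behind the first one changes, with no event-handling code.

<div data-ng-controller="AddressController">
  <!-- selecting a country updates the model "selectedCountry"... -->
  <select data-ng-model="selectedCountry" data-ng-options="c.name for c in countries"></select>
  <!-- ...and this drop down re-renders automatically from the updated model -->
  <select data-ng-model="selectedProvince" data-ng-options="p for p in selectedCountry.provinces"></select>
</div>

The controller only needs to prepare $scope.countries as an array of objects, each with a name and a list of provinces; no change listener is required.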

Directives are cleaner compared to Jsp or jQuery

I think the biggest benefit that AngularJS offers is the custom directive. We adopted AngularJS because of two-way binding, but custom directives are what made us committed.

Directives captured our interest at first sight. They are a clean solution to two long-term problems that we face: the continuity problem and the tracking problem.

Let us start with a short example of a for loop in a jsp file:


<% for (int i = 0; i < owners.size(); i++) { %>
  ...
<% } %>

The code between the opening and closing of the for loop renders one element. Whenever we write this kind of code, we break the original html template into smaller chunks. Therefore, it is understandable that designers don't like to maintain jsp files. They are meant to be looked at by developers, not designers.

To avoid inserting the opening and closing of the for loop in different places, we can use a jsp tag. Then we have another issue: the html code goes missing from the jsp file. There is no clean and easy way for us to present these jsp contents to designers. If there is a minor change to be made to the web interface, it normally comes as a UI patch request from the designers.

Things did not get better when developers moved to jQuery-based UIs. We always have concerns regarding DOM manipulation and event handler registration. In a typical jQuery web application, viewing the source gives us very limited information on what the user would see. Instead of being the source of truth, the HTML becomes the material for Javascript developers to perform magic on. That shuns designers away even more than Jsp.

Even for developers, things do not go well if you are not the author of the code. It may take guesswork to find the code that is executed when the user clicks a button, unless you know the convention. The Css selector is too flexible. It allows one developer to define behaviour in a non-deterministic way and lets other developers go around searching for it.

Directives help us get rid of both problems.

The directive data-ng-repeat="owner in owners" is inserted directly into the html tag. That makes it easier to keep track of where the loop actually ends, both for us and for the designers. The html inside the div remains similar to the original template, which makes minor modifications possible. For tracking purposes, seeing an ng-click directive on an html tag tells us immediately which method to look for. The ng-controller directive or the routing configuration gives us a clue which controller the method should be defined in. That makes reaching the source much easier.

Finally, the AngularJS team allows us to define our own directives. This is amazing because it allows us to create new directives to solve unforeseen circumstances in the future and to tap into the existing directive libraries created by the community.

Let us get familiar with custom directives through this example. In our Pet Clinic app, clicking some links on the banner will scroll the page down to the respective part of the website.

To achieve this behaviour in a jQuery app, the fastest solution is to make it a link and put

href="javascript:scrollToTarget(destination)"

For an Angular app, it does not need to be a link and the code looks slightly different

ng-click="scrollToTarget(destination)"

If we decided to make a directive for this function, the code would look like

scroll-to-target="destination"

For us, the last one is the best. The directive is a familiar concept for any designer, as they have already used class, id, name or onclick. A new, well-named directive will not be that hard to pick up, even for a designer. For us, it is fun and cool. We no longer depend on the World Wide Web Consortium to release the features that we want. If we need something, we can hunt for a directive, and we can also modify it or create it ourselves. Slowly, we can have a community-backed directive library that keeps improving.
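
Such a directive is not hard to write. Below is a minimal sketch of how a scroll-to-target directive could be implemented (the directive name mirrors the attribute above; the use of scrollIntoView is our assumption, and the real project may animate the scrolling differently):

app.directive('scrollToTarget', function() {
  return {
    restrict: 'A',
    link: function(scope, element, attrs) {
      element.on('click', function() {
        // the attribute value identifies the element to scroll to
        var target = document.getElementById(attrs.scrollToTarget);
        if (target) {
          target.scrollIntoView();
        }
      });
    }
  };
});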

Automation versus Control

After an AngularJS presentation, I think not everyone will be ready to jump on board. AngularJS delivers a lot of magic with a big trade-off: you need to relinquish control of your application to AngularJS.

Instead of doing the work yourself, you provide information for AngularJS to do the work. Two-way binding and injection of services all happen behind the scenes. When we need to modify the way it works, it is not so convenient to do so.

For us, relinquishing control to build a configurable application is tolerable. However, not being able to alter the behaviour when we need to is much less tolerable. This is where the Spring framework shines. It does not simply give you the bean; it allows you to do whatever you want with it, like executing bean life-cycle functions, injecting the bean factory or defining scopes.

In contrast, I find it hard to step out of the Angular way when I want to. For example, I would like Angular to populate the content of a textarea and later run the CkEditor Javascript to convert this textarea into an html editor. To get this done, the application needs to load the CkEditor Javascript only after the binding is completed. It can be done by altering AngularJS behaviour or by converting the textarea into a directive.

But I am not satisfied with either: the former solution does not look clean and the latter seems tedious. It would be cooler if we had life-cycle support to inject additional behaviour, like we have with the Spring framework.
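
For reference, a rough sketch of the directive option (our own code, assuming CkEditor 4's global CKEDITOR API) would defer the editor creation until after the initial binding has been rendered:

app.directive('ckEditor', ['$timeout', function($timeout) {
  return {
    restrict: 'A',
    require: 'ngModel',
    link: function(scope, element, attrs, ngModel) {
      // wait one digest cycle so Angular renders the bound value first
      $timeout(function() {
        var editor = CKEDITOR.replace(element[0]);
        editor.on('change', function() {
          scope.$apply(function() {
            // push the edited content back into the model
            ngModel.$setViewValue(editor.getData());
          });
        });
      });
    }
  };
}]);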

FAQ about AngularJS

Up to this point, I hope you already have some idea and can decide for yourself whether AngularJS is the right choice. To contribute to your decision making, I will share our answers to the questions we received during the presentation.

Is it possible to protect confidential data when you build a web interface with AngularJS?

For me, this is not an AngularJS issue; it is about how you design your Restful API. If there is something you do not want users to see, don't ever expose it in your Restful API. For a Spring developer, this means avoiding using an entity as the @ResponseBody if you are not ready to fully expose it. At the very least, put in an annotation like @JsonIgnore to hide the confidential fields.
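
A small illustration of the @JsonIgnore approach (the entity and field names here are hypothetical):

import com.fasterxml.jackson.annotation.JsonIgnore;

public class Owner {

  private Long id;

  private String firstName;

  @JsonIgnore  // never serialized into the Json sent to the browser
  private String socialSecurityNumber;

  // getters and setters omitted
}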

How do you pass objects between different pages?

There are many ways to do so. Each page has its own controller, but the whole application shares a single rootScope where you can place the object. However, we should not use rootScope unless there is no other way. Our preferred solution is using the url.

For example, our routing configuration specifies:

       
.state({
  name: "owners",
  url: "/owners",
  templateUrl: "components/owners/owners.html",
  controller: "OwnerController",
  data: { requireLogin : true }
}).state({
  name: "ownerDetails",
  url: "/owners/:id",
  templateUrl: "components/owners/owner_details.html",
  controller: "OwnerDetailsController",
  data: { requireLogin : true }
})


When the user routes to the owner detail page, the controller loads the owner again from the id in the url. It may look redundant, but it allows browser bookmarking.
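
The detail controller is not shown above; a sketch of how it could reload the owner from the :id parameter (using ui-router's $stateParams, consistent with the .state() configuration above) looks like this:

var OwnerDetailsController = ['$scope', '$stateParams', 'Owner',
  function($scope, $stateParams, Owner) {
    // the id comes from the url /owners/:id, so a bookmarked url still works
    $scope.owner = Owner.get({ id: $stateParams.id });
}];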

Is it really possible for designers to commit to the project?

Yes, that is what we tried and it worked. One day before the presentation, our designer Andrew redesigned the landing page without our involvement. He kept all the directives we had put in and there was no impact on functionality after the design changed.

Should I adopt AngularJS 1 or wait until AngularJS 2 is available?

I think you should go ahead with AngularJS 1. For us, AngularJS 2 is so different that it can be treated as another technology stack, and support for AngularJS 1 will continue for at least a year after AngularJS 2 is released (slated for 2016). We feel that, due to community support, AngularJS 1 will continue to thrive, much like Play framework 1 after the release of Play framework 2.


Conclusion

So, we have done our part to introduce a new way of building web interfaces to Spring framework users. We hope you are convinced that Restful API is the way to go and that AngularJS is worth trying. If you have any other ideas, please share your feedback.

If you are keen to see the project or the original talk, here are the references:

https://github.com/singularity-sg/spring-petclinic

http://petclinic.hopto.org/petclinic/slides/#/

http://petclinic.hopto.org/petclinic/#/

Enjoy!

Tuesday, 4 November 2014

Exploration of ideas

There are many professions for an individual to choose from, and I believe one should follow the profession that he likes most or hates least. The chances of success and the quality of life are both much better that way. So, if you ask me why I chose software development as a career, I can assure you that programming is a fun career. Of course, it is not fun because of sitting in front of the monitor and typing endlessly on the keyboard. It is fun because we control what we write and we can let our innovation run wild.

In this article, I want to share some ideas that I have tried; please see for yourself if any of them fit your work.

TOOLS

Using OpenExtern and AnyEdit plugins for Eclipse

Of all the plugins that I have tried, these two are my favourites. They are small, fast and stable.

OpenExtern gives you a shortcut to open a file explorer or console for any Eclipse resource. In my early days of using Eclipse, I often found myself opening the project properties just to copy the project folder path in order to open a console or file explorer. The OpenExtern plugin turns that tedious 5 to 10 second process into a one-second mouse click. This simple tool actually helps a lot because many of us run Maven or Git commands from the console.

The other plugin that I find useful is AnyEdit. It adds a handful of converting, comparing and sorting tools to the Eclipse editor. It eliminates the need for an external editor or comparison tool in Eclipse. I also like to turn on auto-formatting and removal of trailing spaces on save. This works pretty well if all of us have the same formatter configuration (line wrapping, indentation, ...). Before we standardized our code formatter, we had a hard time comparing and merging source code after each check-in.

Other than these two, in the past I often installed the Decompiler and Json Editor plugins. However, as most Maven artifacts nowadays are uploaded with source code and Json content can be viewed easily using a Chrome Json plugin, I do not find these plugins useful anymore.

Using JArchitect to monitor code quality

In the past, we monitored the code quality of projects by eye. That seems good enough when you have time to take care of all the modules. Things get more complicated when the team is growing or multiple teams are working on the same project. Eventually, the code still needs to be reviewed and we need to be alerted if things go wrong.

Fortunately, I got an offer from JArchitect to try out their latest product. First and foremost, this is a standalone product rather than the traditional IDE integration. For me, that is a plus point because you may not want to make your IDE too heavy. The second good thing is that JArchitect understands Maven, which is a rare feature in the market. The third good thing is that JArchitect creates its own project file in its own workspace, which does no harm to the original Java project. Currently, the product is commercial, but you may want to take a look to see if the benefit justifies the cost.



SETTING UP PROJECTS

As we all know, a Java web project has both unit tests and functional tests. For functional tests, if you use a framework like Spring MVC, it is possible to create tests for controllers without bringing up the server. Otherwise, it is quite normal that developers need to start up the server, run the functional tests, then shut down the server. This process is a bit tedious given that the project may have been created by someone else, whom we have never communicated with before. Therefore, we try to set up projects in such a way that people can just download them and run them without much hassle.

Server

In the past, we set the server up manually for each local development box and for the integration server (Cruise Control or Hudson). Gradually, our common practice shifted toward checking the server into every project. The motivation behind this move is saving the effort of setting up a new project after checking it out. Because the server is embedded inside the project, there is no need to download or set up the server for each box. Moreover, this practice discourages sharing a server among projects, which is less error prone.

Other than the server, there are two other elements inside a project that are environment dependent: the properties files and the database. For each of these elements, we have slightly different practices, depending on the situation.

Properties file

Checking in a properties template rather than the properties file

Each developer needs to clone the template file and edit it when necessary. For the continuous integration server, it is a bit trickier. Developers can manually create the file in the workspace or simply check the build server properties file into the project.

The former practice is not used any more because it is too error prone. Any attempt to clean the workspace will delete the properties file, and we cannot trace back modifications of the properties in the file. Moreover, as we now set up Jenkins as a cluster rather than a single node like in the past, it is not applicable any more.

For the second practice, rather than checking in my-project.properties, I would rather check in my-project.properties-template and my-project.properties-jenkins. The first file can be used as guidance to set up the local properties file, while the second can be renamed and used for Jenkins.

Using host name to define external network connection

This practice may work better when we need to set up similar network connections for various projects. Let's say we need to configure the database for around 10 similar projects pointing to the same database. In this case, we can hard-code a database host name in the properties file and manually set up the hosts file on each build node to point that pre-defined domain to the database.

For non-essential properties, provide as many default values as possible

There is nothing much to say about this practice. We only need to remind ourselves to be less lazy so that other people may enjoy handling our projects.

Using LandLord Service

This is a pretty new practice that we have only applied since last year. Under the regulations in our office, the Web Service team is the only team that manages and has access to the UAT and PRODUCTION servers. That is why we need to guide them, and they need to do the repetitive setup for at least 2 environments, both normally requiring clustering. It was quite tedious for them until the day we consolidated the properties of all environments and all projects into a single landlord service.

From then on, each project starts up and connects to the landlord service with an environment id and an application id. The landlord happily serves them all the information they need.

Database

Using DBUnit to set up the database once and letting each test case automatically roll back its transaction

This is the traditional practice and it still works quite well nowadays. However, it still requires the developer to create an empty schema for DBUnit to connect to. For it to work, the project must have a transaction management framework that supports automatic rollback in the test environment. It also requires that the database operations happen within the test environment. For example, in a functional test we may attempt to send an HTTP request to the server; in this case, the database operations happen in the server itself rather than in the test environment, so we cannot do anything to roll them back.
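
As a rough sketch of this setup in a Spring project (the repository and entity are hypothetical, and the DBUnit dataset is assumed to be loaded once up front), each test method runs inside a transaction that Spring rolls back automatically:

import static org.junit.Assert.assertNotNull;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.transaction.annotation.Transactional;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:test-context.xml")
@Transactional  // every test method is rolled back after it returns
public class OwnerRepositoryTest {

  @Autowired
  private OwnerRepository ownerRepository;  // hypothetical DAO under test

  @Test
  public void insertedOwnerIsVisibleInsideTheTestTransaction() {
    ownerRepository.save(new Owner("George"));
    assertNotNull(ownerRepository.findByFirstName("George"));
    // no cleanup needed: the changes never reach the shared schema
  }
}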

Running database in memory

This is a built-in feature of the Play framework. Developers work with an in-memory database in development mode and an external database in production mode. This is doable as long as developers only work with JPA entities and have no knowledge of the underlying database system.
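
For teams outside Play, a comparable setup can be put together by hand; here is a sketch for a Spring project using an embedded H2 database (the schema script name is an assumption):

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseBuilder;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseType;

@Configuration
public class DevDataSourceConfig {

  // development profile only: a throw-away in-memory database;
  // the production profile would declare a real DataSource instead
  @Bean
  @Profile("dev")
  public DataSource dataSource() {
    return new EmbeddedDatabaseBuilder()
        .setType(EmbeddedDatabaseType.H2)
        .addScript("schema.sql")
        .build();
  }
}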

Database evolutions

This is an idea borrowed from Ror. Rather than setting up the database from the beginning, we can simply check the current version and sequentially run the delta scripts so that the database ends up with the desired schema. Similar to the above, it is expensive to do this yourself unless there is native support from a framework like Ror or Play.

CODING

I have been in the industry for 10 years and I can tell you that software development is like fashion. There is no clear separation between good and bad practice. Things that are classified as bad practice may come back another day as new best practice. Let me summarize some of the heated debates we have had.

Access modifiers for class attributes

Most of us were taught that we should hide class attributes from external access. Instead, we are supposed to create a getter and setter for each attribute. I strictly followed this rule in my early days, even before I knew that most IDEs can auto-generate getters and setters.

However, later I was introduced to another idea: that the setter is dangerous. It allows other developers to spoil your object and mess up the system. In that case, you should create an immutable object and not provide attribute setters.

The challenge is how to write our unit tests if we hide the setters for class attributes. Sometimes the attributes are inserted into the object using an IOC framework like Spring. If there is no framework around, we may need to use a reflection util to insert a mock dependency into the test object.

I have seen many developers solving the problem this way, but I think it is over-engineering. If we compromise a bit, it is much more convenient to use the package modifier for the attribute. As a best practice, test cases are always in the same package as the implementation, so we should have no issue injecting mock dependencies. The package is normally controlled and contributed to by the same individual or team; therefore the chance of spoiling the object is minimal. Finally, as the package modifier is the default modifier, it saves a few bytes of your code.
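
A short illustration of the compromise (the types here are hypothetical): there is no public setter, yet the test, living in the same package, can still inject a stub dependency directly.

public class OwnerService {

  OwnerRepository ownerRepository;  // package-private: no setter, but visible to tests

  public Owner findOwner(long id) {
    return ownerRepository.findById(id);
  }
}

// in src/test/java, same package
public class OwnerServiceTest {

  @org.junit.Test
  public void findsOwnerThroughInjectedStub() {
    OwnerService service = new OwnerService();
    service.ownerRepository = new InMemoryOwnerRepository();  // hypothetical stub
    org.junit.Assert.assertNotNull(service.findOwner(1L));
  }
}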

Monolithic application versus Microservice architecture

When I joined the industry, the word "Enterprise" meant lots of xml, lots of deployment steps, a huge application, many layers, and code that is very domain oriented. As things evolved, we adopted ORM, learned to split business logic from presentation logic, and learned how to simplify and enrich our applications with AOP. It went well until I was told that the way to move forward is the Microservice architecture, which makes a Java Enterprise application function similarly to a Php application. The biggest benefit we have with Java now may be the performance advantage of the Java Virtual Machine.

When adopting a Microservice architecture, it is obvious that our applications will be database oriented. Removing the layers also minimizes the benefit of AOP. The code base will surely shrink, but it will be harder to write unit tests.

Tuesday, 16 September 2014

Agile is a simple topic

The Agile manifesto is probably one of the best written manifestos in software development, if not the best. Simple and elegant. Good vs bad, 1 2 3 4, done. It is so simple that I am constantly disappointed by the amount of stuff floating around the Internet about what is agile and what is not, how to do agile, Scrum, Kanban, and who knows what will pop up next year claiming to be another king of agile.

If I ever tell you we are the purest agile team and we don't have sprints, we don't have stand-up meetings, we don't have story boards, we don't have burn-down charts, we don't have planning poker cards, we don't have any of the buzzwords, most of the so-called IT consultants will hang me on the spot.

Let's face it, being pure isn't about what you have, it is about what you don't! Pure gold has nothing but gold; that's why it is super valuable. We should build our teams on developers, code and business needs. These are the three pure ingredients of a team; take any one away and the team is no more.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. 
Antoine de Saint-Exupery

This is exactly what the manifesto is saying, "we value processes and tools less", and yet we have seen all kinds of weird superimposed processes and tools everywhere. "Look, we have standups, we have sprints, we have story boards, therefore we are agile". NO, absolutely NOT. You can walk like a duck, quack like a duck, but you are still not a duck.

But why the hype anyway?

Partly, the consulting companies are to blame. They try to sell the buzzwords to the management so that they can make $$$ by simply asking the developers to do what they already know, writing code, but in a different way.

The biggest enemies are the developers themselves, especially the team leaders and managers. Because they are too lazy to know the developers (the people), too lazy to learn the code (the working software), too lazy to analyse the business needs. Because "at the end of the day I need to show my developers that I am doing a manager's work", "what is the shortcut?", "look, I just got this scrum thing from a random blog post, standups 5 mins, no problem. Poker cards, easy. Story boards, no big deal ...". "Done, now we are scrum, now we are agile, and if things fail, it is the developers' problem". Goodbye, there goes a team.

So now you question me, "you said agile is simple, why does it look so hard now?"

Any fool can make something complicated. It takes a genius to make it simple.
Woody Guthrie

People are born equal; a genius doesn't magically pop up, it takes real hard work to reach that level. Let's go back to the origin, the mighty manifesto.

Get rid of all unnecessary processes and tools, and go talk to people. "What is Jimmy's strength? What can we do to make up for Sam's weakness? Are David and Carl a good pair?"

Stop typing inside Word or Excel, go read the real code. "What can we do to enhance the clarity of the code? How do we improve the performance without too much sacrifice? What are the alternative ways to extend our software?"

Stop coming up with imaginary use cases, go meet the customer. "What are your pain points? What are the 3 most important features that need to be enhanced and delivered? Based on our statistics, we believe that if we build feature X in such a way, the business can grow by Y%; do you think we should do this?"

Stop wasting our lives on keeping a useless backlog, go see the 3 biggest opportunities and threats and work on them, rinse and repeat. In fact, that is exactly how evolution brought us humans to this stage: "eliminate the immediate threat to ensure short-term survival, and seek the opportunities for long-term growth". As we are all descendants of mother nature, we are incapable of outsmarting her, so learn from her.

A real process/methodology grows from the team; it is not superimposed on the team.

A real process/methodology does not have a name because it is unique to each team.

Grow your own dream team!



Thanks for wasting your time reading my rant

Sunday, 7 September 2014

Stateless Session for multi-tenant application using Spring Security

Once upon a time, I published an article explaining the principles of building a Stateless Session. Coincidentally, we are working on the same task again, but this time for a multi-tenant application. This time, instead of building the authentication mechanism ourselves, we integrated our solution into the Spring Security framework.

This article will explain our approach and implementation.





Business Requirement

We need to build an authentication mechanism for a SaaS application. Each customer accesses the application through a dedicated sub-domain. Because the application will be deployed on the cloud, it is pretty obvious that Stateless Session is the preferred choice because it allows us to deploy additional instances without hassle.

In the project glossary, each customer is one site and each application is one app. For example, a site may be Microsoft or Google; an app may be Gmail, GooglePlus or Google Drive. The sub-domain that a user uses to access the application includes both app and site. For example, it may look like microsoft.mail.somedomain.com or google.map.somedomain.com.

Once a user logs in to one app, he can access any other app as long as it belongs to the same site. The session will time out after a certain inactive period.

Background

Stateless Session

A stateless application with timeout is nothing new. The Play framework has been stateless since its first release in 2007. We also switched to Stateless Session many years ago. The benefit is pretty clear: your Load Balancer does not need stickiness and hence is easier to configure. As the session is on the browser, we can simply bring in new servers to boost capacity immediately. However, the disadvantage is that your session is not so big and not so confidential anymore.

Compared to a stateful application, where the session is stored on the server, a stateless application stores the session in an HTTP cookie, which cannot grow beyond 4KB. Moreover, as it is a cookie, it is recommended that developers only store text or digits in the session rather than complicated data structures. The session is stored in the browser and transferred to the server in every single request. Therefore, we should keep the session as small as possible and avoid placing any confidential data in it. To put it short, stateless session forces developers to change the way the application uses the session. It should be a user identity rather than a convenient store.

Security Framework

The idea behind a Security Framework is pretty simple: it helps to identify the principal that is executing the code, checks whether it has permission to execute some services, and throws exceptions if it does not. In terms of implementation, the security framework integrates with your services in an AOP-style architecture. Every check is done by the framework before the method call. The mechanism for implementing the permission check may be a filter or a proxy.

Normally, a security framework stores the principal information in thread storage (ThreadLocal in Java). That is why it can give developers static method access to the principal at any time. I think this is something developers should know well; otherwise, they may implement a permission check or retrieve the principal in background jobs that run in separate threads. In this situation, it is obvious that the security framework will not be able to find the principal.
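
A bare-bones illustration of that thread storage (Spring Security's real SecurityContextHolder is more elaborate than this sketch):

public final class PrincipalHolder {

  private static final ThreadLocal<Object> CURRENT_PRINCIPAL = new ThreadLocal<Object>();

  public static void set(Object principal) {
    CURRENT_PRINCIPAL.set(principal);
  }

  public static Object get() {
    // returns null on a background thread that never set a principal,
    // which is exactly the pitfall described above
    return CURRENT_PRINCIPAL.get();
  }

  public static void clear() {
    CURRENT_PRINCIPAL.remove();
  }
}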

Single Sign On

Single Sign On is mostly implemented using an Authentication Server. It is independent of the mechanism used to implement the session (stateless or stateful). Each application still maintains its own session. On the first access to an application, it contacts the authentication server to authenticate the user and then creates its own session.

Food for Thought

Framework or build from scratch

As stateless session was already the standard for us, the biggest concern was whether or not to use a security framework. If we use one, then Spring Security is the cheapest and fastest solution because we already use the Spring Framework in our application. As for the benefit, any security framework gives us a quick and declarative way to declare access rules. However, these will not be business-logic-aware access rules. For example, we can define that only an Agent can access products, but we cannot define that one agent can only access the products that belong to him.

In this situation, we had two choices: building our own business-logic permission check from scratch or building 2 layers of permission check, one purely role based and one business-logic aware. After comparing the two approaches, we chose the latter because it is cheaper and faster to build. Our application functions similarly to any other Spring Security application. It means that a user is redirected to the login page when accessing protected content without a session. If the session exists, the user gets status code 403. If the user accesses protected content with a valid role but unauthorized records, he gets 401 instead.

Authentication

The next concern is how to integrate our authentication and authorization mechanism with Spring Security. A standard Spring Security application may process a request like below:



The diagram is simplified but still gives us a rough idea of how things work. If the request is login or logout, the top two filters update the server-side session. After that, another filter checks access permission for the request. If the permission check succeeds, another filter stores the user session in thread storage. After that, the controller executes code within a properly set up environment.

For us, we prefer to create our own authentication mechanism because the credentials need to contain the website domain. For example, we may have Joe from Xerox and Joe from WDS accessing the SaaS application. As Spring Security takes control of preparing the authentication token and authentication provider, we found it cheaper to implement login and logout ourselves at the controller level rather than spending effort on customizing Spring Security.

As we implement stateless session, there are two pieces of work we need to implement here. First, we need to construct the session from the cookie before any authorization check. We also need to update the session timestamp so that the session is refreshed every time the browser sends a request to the server.

Because of the earlier decision to do authentication in the controller, we face a challenge here. We should not refresh the session before the controller executes because we do authentication there. However, some controller methods are attached to a View Resolver that writes to the output stream immediately. Therefore, we have no chance to refresh the cookie after the controller has executed. Finally, we chose a slightly compromised solution using a HandlerInterceptorAdapter. This handler interceptor allows us to do extra processing before and after each controller method. We refresh the cookie after the controller method if the method is for authentication, and before the controller method for any other purpose. The new diagram looks like this:



Cookie

To be meaningful, a user should have only one session cookie. As the session timestamp changes after each request, we need to update the session cookie in every single response. Per the HTTP protocol, this can only be done if the cookies match in name, path and domain.

When we got this business requirement, we preferred to try a new way of implementing SSO by sharing the session cookie. If all applications are under the same parent domain and understand the same session cookie, effectively we have a global session. Therefore, there is no need for an authentication server any more. To achieve that vision, we must set the cookie domain to the parent domain of all applications.

To illustrate this global session, let us come back to the earlier example where we have two applications with the domain names microsoft.mail.somedomain.com and google.map.somedomain.com.

For the session cookie to be global, we set the domain to somedomain.com. Obviously, the session cookie can be seen and maintained by both applications as long as they share the same secret key to sign it.
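
The two ingredients, a cookie scoped to the parent domain plus a signature computed with the shared secret, can be sketched like this (our illustration, not the project's actual SessionCookieService):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import javax.servlet.http.Cookie;
import java.util.Base64;

public class SharedSessionCookieSketch {

  public Cookie buildSessionCookie(String sessionValue, byte[] sharedSecret) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA256");
    mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
    String signature = Base64.getUrlEncoder().encodeToString(mac.doFinal(sessionValue.getBytes("UTF-8")));

    Cookie cookie = new Cookie("SESSION", sessionValue + "." + signature);
    cookie.setDomain(".somedomain.com");  // parent domain: visible to every app under it
    cookie.setPath("/");
    return cookie;
  }
}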

Performance

Theoretically, stateless session should be slower. Assuming that the server implementation stores the session table in memory, passing in the JSESSIONID cookie only triggers a one-time read of the object from the session table and an optional one-time write to update the last access (for calculating session timeout). In contrast, for stateless session, we need to calculate the hash to validate the session cookie, load the principal from the database, assign a new timestamp and hash again.

However, with today's server performance, hashing should not add much delay to the server response time. The bigger concern is querying data from the database, and for this, we can speed things up by using a cache.

In the best-case scenario, stateless session can perform close to stateful session if no DB call is made. Instead of loading from the session table, which is maintained by the container, the session is loaded from an internal cache, which is maintained by the application. In the worst-case scenario, requests are routed to many different servers and the principal object is stored in many instances. This adds the effort of loading the principal into the cache once per server. While the cost may be high, it occurs only once in a while.

If we apply sticky routing at the load balancer, we should be able to achieve best-case performance. With this, we can see the stateless session cookie as a mechanism similar to JSESSIONID but with the fallback ability to reconstruct the session object.

Implementation

I have published a sample of this implementation to the https://github.com/tuanngda/sgdev-blog repository. Kindly check the stateless-session project. The project requires a MySQL database to work. Hence, kindly set up a schema following build.properties or modify the properties file to fit your schema.

The project includes a Maven configuration to start up a Tomcat server at port 8686. Therefore, you can simply type mvn cargo:run to start up the server.

Here is the project hierarchy:


I packed both the Tomcat 7 server and the database configuration so that the project works without any other installation except MySQL. The Tomcat configuration file TOMCAT_HOME/conf/context.xml contains the DataSource declaration and the project properties file.

Now, let us look closer at the implementation.

Session

We need two session objects: one represents the session cookie; the other represents the session object that we build internally in the Spring Security framework:

public class SessionCookieData {
 
 private int userId;
 
 private String appId;
 
 private int siteId;
 
 private Date timeStamp;
}

and

public class UserSession {
 
 private User user;
 
 private Site site;

 public SessionCookieData generateSessionCookieData(){
  return new SessionCookieData(user.getId(), user.getAppId(), site.getId());
 }
}

With this combo, we have objects to store the session in both the cookie and memory. The next step is to implement a method that allows us to build the session object from the cookie data:

public interface UserSessionService {
 
 public UserSession getUserSession(SessionCookieData sessionData);
}

Now, one more service to retrieve and generate the cookie from the cookie data:

public class SessionCookieService {

 public Cookie generateSessionCookie(SessionCookieData cookieData, String domain);

 public SessionCookieData getSessionCookieData(Cookie sessionCookie);

 public Cookie generateSignCookie(Cookie sessionCookie);
}

Up to this point, we have the services that help us do the conversions:

Cookie --> SessionCookieData --> UserSession

and

UserSession --> SessionCookieData --> Cookie

Now, we should have enough material to integrate stateless session with the Spring Security framework.

Integrate with Spring security

First, we need to add a filter to construct the session from the cookie. Because this should happen before the permission check, it is better to use AbstractPreAuthenticatedProcessingFilter:

@Component(value="cookieSessionFilter")
public class CookieSessionFilter extends AbstractPreAuthenticatedProcessingFilter {
 
...
 
 @Override
 protected Object getPreAuthenticatedPrincipal(HttpServletRequest request) {
  SecurityContext securityContext = extractSecurityContext(request);
  
  if (securityContext.getAuthentication()!=null  
     && securityContext.getAuthentication().isAuthenticated()){
   UserAuthentication userAuthentication = (UserAuthentication) securityContext.getAuthentication();
   UserSession session = (UserSession) userAuthentication.getDetails();
   SecurityContextHolder.setContext(securityContext);
   return session;
  }
  
  return new UserSession();
 }
 ...
 
}

The filter above constructs the principal object from the session cookie. The filter also creates a PreAuthenticatedAuthenticationToken that will be used later for authentication. Obviously, Spring will not understand this principal. Therefore, we need to provide our own AuthenticationProvider that manages to authenticate the user based on this principal.

public class UserAuthenticationProvider implements AuthenticationProvider {
  @Override
  public Authentication authenticate(Authentication authentication) throws AuthenticationException {
    PreAuthenticatedAuthenticationToken token = (PreAuthenticatedAuthenticationToken) authentication;

    UserSession session = (UserSession)token.getPrincipal();

    if (session != null && session.getUser() != null){
      SecurityContext securityContext = SecurityContextHolder.getContext();
      securityContext.setAuthentication(new UserAuthentication(session));
      return new UserAuthentication(session);
    }

    throw new BadCredentialsException("Unknown user name or password");
  }
}

This is the Spring way: a user is authenticated if we manage to provide a valid Authentication object. Practically, we let the user log in by session cookie for every single request.

However, there are times when we need to alter the user session, and we can do that as usual in a controller method. We simply overwrite the SecurityContext, which was set up earlier in the pre-authentication filter.


public ModelAndView login(String login, String password, String siteCode) throws IOException{
    
    if(StringUtils.isEmpty(login) || StringUtils.isEmpty(password)){
      throw new HttpServerErrorException(HttpStatus.BAD_REQUEST, "Missing login and password");
    }
    
    User user = authService.login(siteCode, login, password);
    if(user!=null){
      SecurityContext securityContext = SecurityContextHolder.getContext();
      UserSession userSession = new UserSession();
      userSession.setSite(user.getSite());
      userSession.setUser(user);
      securityContext.setAuthentication(new UserAuthentication(userSession));
    }else{
      throw new HttpServerErrorException(HttpStatus.UNAUTHORIZED, "Invalid login or password");
    }
    
    return new ModelAndView(new MappingJackson2JsonView());
    
  }

Refresh Session

Up to now, you may notice that we have never mentioned writing the cookie. Provided that we have a valid Authentication object and our SecurityContext contains the UserSession, it is important that we send this information back to the browser.

Before the HttpServletResponse is generated, we must generate and attach the session cookie to it. This new session cookie, which has the same domain and path, will replace the older session cookie that the browser is keeping.

As discussed above, refreshing the session is better done after the controller method because we implement authentication at that layer. However, there is a challenge caused by the ViewResolver of Spring MVC. Sometimes, it writes to the OutputStream so soon that any attempt to add a cookie to the response is useless.

After consideration, we came up with a compromise solution that refreshes the session before the controller method for normal requests and after the controller method for authentication requests. To know whether a request is for authentication, we place a newly defined annotation on the authentication methods.
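
The annotation itself is just a marker; presumably it looks roughly like this (our reconstruction, since the post does not show it):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marks controller methods that perform authentication, so the interceptor below
// knows to refresh the session cookie after, rather than before, the method runs.
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface SessionUpdate {
}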

  @Override
  public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
    if (handler instanceof HandlerMethod){
      HandlerMethod handlerMethod = (HandlerMethod) handler;
      SessionUpdate sessionUpdateAnnotation = handlerMethod.getMethod().getAnnotation(SessionUpdate.class);
      
      if (sessionUpdateAnnotation == null){
        SecurityContext context = SecurityContextHolder.getContext();
        if (context.getAuthentication() instanceof UserAuthentication){
          UserAuthentication userAuthentication = (UserAuthentication)context.getAuthentication();
          UserSession session = (UserSession) userAuthentication.getDetails();
          persistSessionCookie(response, session);
        }
      }
    }
    return true;
  }

  @Override
  public void postHandle(HttpServletRequest request, HttpServletResponse response, Object handler,
      ModelAndView modelAndView) throws Exception {
    if (handler instanceof HandlerMethod){
      HandlerMethod handlerMethod = (HandlerMethod) handler;
      SessionUpdate sessionUpdateAnnotation = handlerMethod.getMethod().getAnnotation(SessionUpdate.class);
      
      if (sessionUpdateAnnotation != null){
        SecurityContext context = SecurityContextHolder.getContext();
        if (context.getAuthentication() instanceof UserAuthentication){
          UserAuthentication userAuthentication = (UserAuthentication)context.getAuthentication();
          UserSession session = (UserSession) userAuthentication.getDetails();
          persistSessionCookie(response, session);
        }
      }
    }
  }
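
The persistSessionCookie helper is not shown above; a sketch of what it needs to do, reusing the SessionCookieService from earlier (the cookie domain field is our assumption), would be:

  private void persistSessionCookie(HttpServletResponse response, UserSession session) {
    // rebuild the cookie from the in-memory session so the browser gets a fresh timestamp
    SessionCookieData cookieData = session.generateSessionCookieData();
    Cookie sessionCookie = sessionCookieService.generateSessionCookie(cookieData, cookieDomain);
    response.addCookie(sessionCookie);
    // attach the signature cookie as well so the next request can be validated
    response.addCookie(sessionCookieService.generateSignCookie(sessionCookie));
  }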

Conclusion

The solution works well for us, but we are not confident that it is the best practice possible. However, it is simple and did not cost us much effort to implement (around 3 days including testing).

Kindly give us feedback if you have any better idea for building stateless sessions with Spring.

Thursday, 28 August 2014

Distributed Crawling

Around 3 months ago, I posted an article explaining our approach and considerations for building a Cloud Application. Starting from this article, I will gradually share our practical designs to solve this challenge.

As mentioned before, our final goal is to build a SaaS big data analysis application, which will be deployed on AWS servers. In order to fulfil this goal, we need to build distributed crawling, indexing and distributed training systems.

The focus of this article is how to build the distributed crawling system. The fancy name for this system is Black Widow.

Requirements

As usual, let us start with the business requirements for the system. Our goal is to build a scalable crawling system that can be deployed on the cloud. The system should be able to function in an unreliable, high-latency network and recover automatically from partial hardware or network failure.

For the first release, the system can crawl from 3 kinds of sources: DataSift, the Twitter API and Rss feeds. The data crawled back are called Comments. The Rss crawlers are supposed to read public sources like websites or blogs; this is free of charge. DataSift and Twitter both provide proprietary APIs to access their streaming services. DataSift charges its users by comment count and by the complexity of CSDL (Curated Stream Definition Language, their own query language). Twitter, on the other hand, offers the free Twitter Sampler stream.

In order to control cost, we need to implement a mechanism to limit the amount of comments crawled from commercial sources like DataSift. As DataSift provides Twitter comments, it is possible to have a single comment coming from different sources. At the moment, we do not try to eliminate this and accept it as data duplication. However, the problem can be avoided manually by user configuration (avoid choosing both Twitter and DataSift Twitter together).

For future extension, the system should be able to link up related comments to form a conversation.

Food for Thought

Centralized Architecture

Our first thought when getting the requirements was to do the crawling on the nodes, which we call Spawns, and let the hub, which we call Black Widow, manage the collaboration among nodes. This idea was quickly accepted by team members as it allows the system to scale well, with the hub doing limited work.

As with any other centralized system, Black Widow suffers from the single point of failure problem. To help ease this problem, we allow the nodes to function independently for a short period after losing connection to Black Widow. This gives the support team breathing room to bring up a backup server.

Another bottleneck in the system is data storage. For the volume of data being crawled (easily a few thousand records per second), NoSQL is clearly the choice for storing the crawled comments. We have experience working with Lucene and MongoDB. However, after research and some minor experiments, we chose Cassandra as the NoSQL database.

With those few thoughts, we visualized the distributed crawling system to be built following this prototype:



In the diagram above, Black Widow, or the hub, is the only server that has access to the SQL database system. This is where we store the configuration for crawling. Therefore, all the Spawns, or crawling nodes, are fully stateless. A Spawn simply wakes up, registers itself to Black Widow and does the assigned jobs. After getting the comments, the Spawn stores them in the Cassandra cluster and also pushes them to some queues for further processing.

Brainstorming of possible issues

To explain the design to non-technical people, we like to relate the business requirement to a similar problem in real life so that it is easier to understand. The similar problem we chose is the coordination of effort among volunteers.

Imagine we need to do a lot of preparation work for the upcoming Olympics and decide to recruit volunteers all around the world to help. We do not know the volunteers, but the volunteers know our email, so they can contact us to register. Only then do we know their emails and can send tasks to them through email. We would not want to send one task to two volunteers or leave some tasks unattended. We want to distribute the tasks evenly so that no volunteer suffers too much.

Due to cost, we would not contact them by mobile phone. However, because email is less reliable, when sending out tasks to volunteers, we request a confirmation. A task is considered assigned only when the volunteer has replied with a confirmation.

In the above example, the volunteers represent Spawn nodes while email communication represents the unreliable and high-latency network. Here are some problems that we need to solve:

1/ Node failure

For this problem, the best way is to check regularly. If a volunteer stops responding to the regular progress-check emails, the task should be re-assigned to someone else.

2/ Optimization of task assignment

Some tasks are related. Therefore, assigning related tasks to the same person can help to reduce total effort. This happens with our crawling as well, because some crawling configurations have similar search terms; grouping them together to share the streaming channel helps to reduce the final bill.

Another concern is fairness, or the ability to distribute the amount of work evenly among volunteers. The simplest strategy we can think of is Round Robin, but with a minor tweak of remembering earlier assignments. Therefore, if a task is pretty similar to a task we assigned before, it can be skipped from the Round Robin selection and assigned directly to the same volunteer.
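
A simplified sketch of that selection logic (our own illustration, not Black Widow's actual code; the similarity key would be derived from the search terms):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobDistributor {

  public interface Node { boolean isAlive(); }
  public interface CrawlerJob { String similarityKey(); }  // e.g. normalized search terms

  private final List<Node> nodes = new ArrayList<Node>();  // populated when Spawns register (omitted)
  private final Map<String, Node> history = new HashMap<String, Node>();
  private int cursor = 0;

  public Node pickNode(CrawlerJob job) {
    Node previous = history.get(job.similarityKey());
    if (previous != null && previous.isAlive()) {
      return previous;  // affinity: reuse the existing streaming channel
    }
    Node chosen = nodes.get(cursor++ % nodes.size());  // plain round robin otherwise
    history.put(job.similarityKey(), chosen);
    return chosen;
  }
}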

3/ The hub is not working

If for some reason our email server is down and we cannot contact the volunteers any more, it is better to let the volunteers stop working on the assigned tasks. The main concern here is over-running cost or wasted effort. However, stopping work immediately is too hasty, as a temporary infrastructure issue may cause the communication problem.

Hence, we need to find a reasonable amount of time for a node to continue functioning after being detached from the hub.

4/ Cost control

Due to business requirements, there are two kinds of cost control that we need to implement. The first is the total number of comments crawled per crawler, and the second is the total number of comments crawled by all crawlers belonging to the same user.

This is where we had a debate about the best approach to implement cost control. It is very straightforward to implement the limit for each crawler: we can simply pass the limit to the Spawn node, which will automatically stop the crawler when the limit is reached.

However, the limit per user is not so straightforward, and we had two possible approaches. The simpler choice is to send all the crawlers of one user to the same node. Then, similar to the earlier problem, the Spawn node knows the number of comments collected and stops all crawlers when the limit is reached. This approach is simple but limits the ability to distribute jobs evenly among nodes. The alternative is to let all the nodes retrieve and update a global counter. This approach creates heavy internal network traffic and adds considerable delay to the comment processing time.

For now, we have chosen the global counter approach. This can be revisited if performance becomes a serious concern.
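
To make the trade-off concrete, here is a minimal sketch of what a hub-side global counter could look like (the class and method names are illustrative, not taken from our code base). Every call corresponds to one internal request per crawled comment, which is exactly the overhead discussed above:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical hub-side counter that all Spawn nodes report to (e.g. over HTTP).
public class GlobalCommentCounter {

    private final ConcurrentHashMap<String, AtomicLong> countsByUser = new ConcurrentHashMap<>();

    // Record one crawled comment for the user and report whether he is still under his limit.
    public boolean recordAndCheck(String userId, long limitForUser) {
        AtomicLong counter = countsByUser.computeIfAbsent(userId, id -> new AtomicLong());
        return counter.incrementAndGet() <= limitForUser;
    }
}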

5/ Deploy on the cloud

As with any other cloud application, we cannot put too much trust in the network or infrastructure. Here is how we make our application conform to the checklist mentioned in the last article:
  • Stateless: Our Spawn node is stateless but the hub is not. Therefore, in our design, the nodes do the actual work and the hub only coordinates the effort.
  • Idempotence: We implement hashCode and equals methods for every crawler configuration and store the configurations in a Map or Set. Therefore, a crawler configuration can be sent multiple times without any side effect. Moreover, our node selection approach ensures that a job will always be sent to the same node (see the sketch after this list).
  • Data Access Object: We apply a JsonIgnore filter on every model object to make sure no confidential data flies around the network.
  • Play Safe: We implement a health-check API for each node and for the hub itself. The first level of support gets notified immediately when anything goes wrong.
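
As an illustration of the Idempotence point, a crawler configuration only needs a stable identity for this to work. A minimal sketch (the field names are made up for the example) could look like this:

import java.util.Objects;

// Illustrative crawler configuration; the real class carries more fields.
public class CrawlerConfig {

    private final String userId;
    private final String searchTerms;

    public CrawlerConfig(String userId, String searchTerms) {
        this.userId = userId;
        this.searchTerms = searchTerms;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CrawlerConfig)) return false;
        CrawlerConfig that = (CrawlerConfig) other;
        return Objects.equals(userId, that.userId)
                && Objects.equals(searchTerms, that.searchTerms);
    }

    @Override
    public int hashCode() {
        return Objects.hash(userId, searchTerms);
    }
}

With identity defined this way, storing configurations in a Set or Map makes re-sending the same configuration a harmless no-op.
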
6/ Recovery

We try our best to make the system heal itself from partial failures. Here are some types of failure we can recover from:
  • Hub failure: A node registers itself to the hub when it starts up. From then on, the communication is one-way: the hub sends jobs to the node and polls it for status updates. A node considers itself detached if it has not been contacted by the hub for a pre-defined period. A detached node clears all its job configurations and starts registering itself to the hub again. If the incident was caused by a hub failure, a new hub will fetch the crawling configurations from the database and start distributing jobs again; all existing jobs on the Spawn nodes are cleared when the nodes go into detached mode.
  • Node failure: When the hub fails to poll a node, it does a hard reset, removing all of that node's working jobs and re-distributing them from the beginning over the working nodes. This re-distribution process helps to keep the distribution optimized.
  • Job failure: Two kinds of failure can happen while the hub is sending and polling jobs. If a job fails in the polling process but the Spawn node is still working well, Black Widow can re-assign the job to the same node again. The same can be done if sending the job fails.

Implementation

Data Source and Subscriber

Our initial thought was that each crawler could open its own channel to retrieve data, but this no longer makes sense on closer inspection. For RSS, we can scan all URLs once and find the keywords that may belong to multiple crawlers. Twitter supports up to 200 search terms in one single query, so it is possible for us to open a single channel that serves multiple crawlers. For DataSift it is quite rare, but due to human mistake or coincidence, it is possible to have crawlers with identical search terms.

This situation led us to split the crawler into two entities: subscriber and data source. The subscriber is in charge of consuming the comments while the data source is in charge of crawling them. With this design, if there are two crawlers with similar keywords, a single data source is created to serve two subscribers, each processing the comments in its own way.

A data source is created when, and only when, no similar data source exists. It starts working when the first subscriber subscribes to it and retires when the last subscriber unsubscribes. With Black Widow sending similar subscribers to the same node, we minimize the number of data sources created and, indirectly, the crawling cost.
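
A minimal sketch of this life-cycle (simplified; the real classes also manage the streaming channels and error handling) could look like this:

import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

// Illustrative subscriber contract: each subscriber processes comments its own way.
interface Subscriber {
    void onComment(String comment);
}

// Illustrative data source, shared by all subscribers with the same search terms.
public class DataSource {

    private final String searchTerms;
    private final Set<Subscriber> subscribers = new CopyOnWriteArraySet<>();

    public DataSource(String searchTerms) {
        this.searchTerms = searchTerms;
    }

    public synchronized void subscribe(Subscriber subscriber) {
        boolean wasIdle = subscribers.isEmpty();
        subscribers.add(subscriber);
        if (wasIdle) {
            startStreaming();          // open the channel only for the first subscriber
        }
    }

    public synchronized void unsubscribe(Subscriber subscriber) {
        subscribers.remove(subscriber);
        if (subscribers.isEmpty()) {
            stopStreaming();           // retire when the last subscriber leaves
        }
    }

    // Called by the streaming channel whenever a new comment arrives.
    void deliver(String comment) {
        for (Subscriber subscriber : subscribers) {
            subscriber.onComment(comment);
        }
    }

    private void startStreaming() { /* open the Twitter/RSS/DataSift channel for searchTerms */ }

    private void stopStreaming()  { /* close the channel */ }
}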

Data Structure

The biggest concern about the data structures is thread safety. In a Spawn node, we must store all running subscribers and data sources in memory. There are a few scenarios in which we need to modify or access this data:

  • When a subscriber hits its limit, it automatically unsubscribes from the data source, which may lead to the deactivation of that data source.
  • When Black Widow sends a new subscriber to the Spawn node.
  • When Black Widow sends a request to unsubscribe an existing subscriber.
  • When the health-check API exposes all running subscribers and data sources.
  • When Black Widow regularly polls the status of each assigned subscriber.
  • When the Spawn node regularly checks for and disables orphan subscribers (subscribers that are no longer polled by Black Widow).
Another concern with the data structures is the idempotence of operations. Any of the operations above can go missing or be duplicated. To handle this, here is our approach (a small demo follows this list):
  • Implement hashCode and equals methods for every subscriber and data source.
  • Store the collections of subscribers and data sources in a Set or Map. For records that are equal, a Map replaces the stored value on a new insertion while a Set skips the new record. Therefore, if we use a Set, we need to make sure a new record can replace the old one.
  • Use synchronized in the data access code.
  • If a Spawn node receives a new subscriber that is similar to an existing one, it compares the two and prefers to update the existing subscriber instead of replacing it. This avoids unsubscribing and re-subscribing identical subscribers, which could interrupt the data source streaming.
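
The following small demo illustrates the Set versus Map behaviour described above. It reuses the illustrative CrawlerConfig class from the earlier sketch:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SetVersusMapDemo {

    public static void main(String[] args) {
        CrawlerConfig first  = new CrawlerConfig("user-1", "angularjs");
        CrawlerConfig second = new CrawlerConfig("user-1", "angularjs"); // equal to first

        Set<CrawlerConfig> set = new HashSet<>();
        set.add(first);
        set.add(second);               // skipped: the set keeps the first instance

        Map<CrawlerConfig, String> map = new HashMap<>();
        map.put(first, "version 1");
        map.put(second, "version 2");  // value replaced, original key instance kept

        System.out.println(set.size());     // 1
        System.out.println(map.size());     // 1
        System.out.println(map.get(first)); // version 2
    }
}
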
Routing

As mentioned before, we need a routing mechanism that serves two purposes:
  • Distribute the jobs evenly among Spawn nodes.
  • Route similar jobs to the same node.
We solved this problem by generating a unique representation of each query, named uuid. After that, we can use a simple modulo function to find the node to route to:


int size = activeBwsNodes.size();
int hashCode = uuid.hashCode();
// Math.floorMod keeps the index in [0, size) even when hashCode is negative
int index = Math.floorMod(hashCode, size);
assignedNode = activeBwsNodes.get(index);

With this implementation, subscribers with the same uuid will always be sent to the same node and each node has an equal chance of being selected to serve a subscriber.

This whole scheme breaks down when the collection of active Spawn nodes changes. Therefore, Black Widow must clear all running jobs and reassign them from the beginning whenever there is a node change. However, node changes should be quite rare in a production environment.

Handshake

Below is the sequence diagram of the collaboration between Black Widow and a node:


Black Widow does not know about a Spawn node in advance. It waits for the Spawn node to register itself with Black Widow. From there, Black Widow has the responsibility to poll the node to maintain connectivity. If Black Widow fails to poll a node, it removes the node from its container. The orphan node will eventually go into detached mode because it is no longer being polled. In this mode, the Spawn node clears its existing jobs and tries to register itself again.
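
To make the detached-mode behaviour concrete, here is a minimal sketch of a watchdog on the Spawn node (the class name, threshold and scheduling interval are our own assumptions, not the actual values):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative watchdog: if Black Widow stops polling, clear jobs and re-register.
public class HubWatchdog {

    private static final long DETACH_THRESHOLD_MS = 5 * 60 * 1000; // assumed value

    private final AtomicLong lastPollTimestamp = new AtomicLong(System.currentTimeMillis());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called by the endpoint that Black Widow polls.
    public void onPolledByHub() {
        lastPollTimestamp.set(System.currentTimeMillis());
    }

    public void start() {
        scheduler.scheduleAtFixedRate(this::checkDetached, 1, 1, TimeUnit.MINUTES);
    }

    private void checkDetached() {
        long silence = System.currentTimeMillis() - lastPollTimestamp.get();
        if (silence > DETACH_THRESHOLD_MS) {
            clearAllJobs();     // drop every subscriber and data source
            registerToHub();    // start the handshake from scratch
        }
    }

    private void clearAllJobs() { /* clear subscribers and data sources */ }

    private void registerToHub() { /* send the node details to Black Widow */ }
}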

The next diagram shows the subscriber life-cycle:



Similar to the above, Black Widow has the responsibility of polling the subscribers it sends to a Spawn node. If a subscriber is no longer being polled by Black Widow, the Spawn node treats it as an orphan and removes it. This practice helps to eliminate the risk of a Spawn node running an obsolete subscriber.

On Black Widow, when polling a subscriber fails, it tries to get a new node to assign the job to. If the subscriber's Spawn node is still available, it is likely that the same job will go back to the same node because of the routing mechanism we use.

Monitoring

In the happy scenario, all the subscribers are running, Black Widow is polling and nothing else happens. However, this is unlikely in real life. There will be changes in Black Widow and the Spawn nodes from time to time, triggered by various events.

For Black Widow, there will be changes under the following circumstances:

  • A subscriber hits its limit
  • A new subscriber is found
  • An existing subscriber is disabled by the user
  • Polling of a subscriber fails
  • Polling of a Spawn node fails
To handle changes, the Black Widow monitoring tool offers two services: hard reload and soft reload. A hard reload happens on a node change while a soft reload happens on a subscriber change. The hard reload process takes back all running jobs and redistributes them from the beginning over the available nodes. The soft reload process removes obsolete jobs, assigns new jobs and re-assigns failed jobs. A sketch of both is shown below.
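
The two reload paths can be summarised with the following outline (an illustration of the control flow only, not the actual implementation):

// Illustrative outline of the two reload operations on Black Widow.
public class MonitoringService {

    // Hard reload: triggered by a node change.
    public void hardReload() {
        takeBackAllRunningJobs();          // clear every assignment on every node
        redistributeAllJobsFromScratch();  // re-run the routing over the current node list
    }

    // Soft reload: triggered by a subscriber change.
    public void softReload() {
        removeObsoleteJobs();   // subscribers disabled by users or over their limit
        assignNewJobs();        // subscribers created since the last reload
        reassignFailedJobs();   // subscribers whose polling failed
    }

    private void takeBackAllRunningJobs() { }
    private void redistributeAllJobsFromScratch() { }
    private void removeObsoleteJobs() { }
    private void assignNewJobs() { }
    private void reassignFailedJobs() { }
}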


Compared to Black Widow, the monitoring of a Spawn node is simpler. The two main concerns are maintaining connectivity to Black Widow and removing orphan subscribers.


Deployment Strategy

The deployment strategy is straightforward. We need to bring up Black Widow and at least one Spawn node, and the Spawn node needs to know the URL of Black Widow. From then on, the Health Check API gives us the number of subscribers per node. We can integrate the Health Check with the AWS API to automatically bring up new Spawn nodes if the existing nodes are overloaded; the Spawn node image needs to have the Spawn application running as a service. Similarly, when the nodes are under-utilized, we can bring down redundant Spawn nodes.
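
As an illustration, a Spawn node could expose its Health Check like this (a sketch assuming a Spring MVC stack; the endpoint name and the returned fields are our own examples, not the actual API):

import java.util.HashMap;
import java.util.Map;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical view over the node's in-memory state.
interface SpawnNodeStatus {
    int subscriberCount();
    int dataSourceCount();
}

// Illustrative health-check endpoint on a Spawn node.
@RestController
public class HealthCheckController {

    private final SpawnNodeStatus status;

    @Autowired
    public HealthCheckController(SpawnNodeStatus status) {
        this.status = status;
    }

    @RequestMapping("/health")
    public Map<String, Object> health() {
        Map<String, Object> body = new HashMap<>();
        body.put("subscriberCount", status.subscriberCount());
        body.put("dataSourceCount", status.dataSourceCount());
        return body;
    }
}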

Black Widow needs special treatment due to its importance. If Black Widow fails, we can restart the application. This causes all existing jobs on the Spawn nodes to become orphans and all the Spawn nodes to go into detached mode. Slowly, all the nodes will clean themselves up and try to register again. Under the default configuration, the whole restart process happens within 15 minutes.

Threats and possible improvements

When choosing a centralized architecture, we knew that Black Widow would be the biggest risk to the system. While a Spawn node failure only causes a minor interruption for the affected subscribers, a Black Widow failure eventually leads to Spawn node restarts, which take much longer to recover from.

Moreover, even though the system can recover from partial failure, there is still an interruption of service during the recovery process. Therefore, if the polling requests fail too often due to unstable infrastructure, operation will be greatly hampered.

Scalability is another concern for a centralized architecture. We do not yet have a concrete figure for the maximum number of Spawn nodes that Black Widow can handle. Theoretically, it should be quite high because Black Widow only does minor processing; most of its effort goes into sending out HTTP requests. It is possible that the network is the main limiting factor for this architecture. Because of this, we let Black Widow poll the nodes rather than letting the nodes poll Black Widow (other systems, such as Hadoop, do it the other way around). With this approach, Black Widow can work at its own pace rather than under pressure from the Spawn nodes.

One of the first questions we got was whether this is a MapReduce problem, and the answer is no. Each subscriber in our Distributed Crawling System processes its own comments and does not report results back to Black Widow. That is why we do not use any MapReduce product like Hadoop. Our monitor is aware of the business logic rather than being purely infrastructure monitoring, which is why we chose to build it ourselves instead of using tools like ZooKeeper or Akka.

For future improvement, it would be better to move away from the centralized architecture by having multiple hubs collaborating with each other. This should not be too difficult, given that the only time Black Widow accesses the database is when loading subscribers. Therefore, we can slice the data and let each Black Widow load a portion of it.

Another point that leaves me unsatisfied is the checking of the global counter for the user limit. As the check happens for every comment crawled, it greatly increases internal network traffic and limits the scalability of the system. A better strategy would be to divide the quota based on processing speed: Black Widow can regulate and redistribute the quota for each subscriber (on different nodes).

Wednesday, 20 August 2014

The Emergence of DevOps and the Fall of the Old Order

Software Engineering has always been dependent on IT Operations to take care of deploying software to a production environment. In the various roles I have been in, IT Operations has come under various monikers, from "Data Center" to "Web Services". An organisation delivering software used to be able to separate these roles cleanly: Software Engineering and IT Operations could work in a somewhat isolated manner, with neither really needing the knowledge the other held in their respective domains. Software Engineering would communicate with IT Operations through "Deployment Requests", usually raised after ensuring that adequate tests had been conducted on the software.
However, the traditional way of organising departments in a software delivery organisation is starting to look obsolete. The reason is that software infrastructure has moved in the direction of being "agile". The same buzzword that gripped the software development world has started to exert its effect on IT infrastructure. The evidence of this seismic shift can be seen in the fastest-growing (and most disruptive) companies today. Companies like Netflix and WhatsApp, along with many other tech companies, have moved onto what we call "cloud" infrastructure, a space dominated by Amazon Web Services.
There has been huge progress in the virtualization of hardware resources. This has in turn allowed companies like AWS and Rackspace to convert their server farms into discrete units of computing resources that can be diced, parcelled and redistributed as a service to their customers in an efficient manner. It is inevitable that all these configurable "hardware" resources will eventually become some form of "software" resource that can be maximally utilized by businesses. This has in turn bred a whole new genre of skills required to manage, control and deploy Infrastructure as a Service (IaaS). Some of the tools used with these services include provisioning tools like Chef or Puppet. Together with the software APIs provided by the IaaS vendors, infrastructure can be brought up or down as required.
The availability of large quantities of computing resources without the upfront costs associated with capital expenditure on hardware has led to an explosion in the number of startups trying to solve problems of every kind imaginable, and, coupled with the prevalence of powerful mobile devices, has led to a digital renaissance for many industries. However, this renaissance has also created demand for a different kind of software organisation. As someone who has been part of software engineering and development, I am a witness to the rapid evolution of the profession.
The increasing scale of data and processing needs requires a complete shift in paradigm from the old software delivery organisation to a new one that melds software engineering and IT operations together. This is where the role of "DevOps" comes into the picture. Recruiting DevOps engineers and restructuring IT operations around such roles enables a business to be agile. Some businesses whose survival depends on the availability of their software on the Internet will find it imperative to model their software delivery organisation around DevOps. Having the ability to capitalise on software automation to deploy infrastructure within minutes allows a business to scale up quickly, and being able to practise continuous delivery of software allows features to get to market quickly and creates a feedback loop through which the business can improve itself.
We are witnessing a new world order, and software delivery organisations that cannot successfully transition to this Brave New World will find themselves falling behind quickly, especially when a competitor is able to scale and deliver software faster, more reliably and with fewer personnel.

Sunday, 3 August 2014

Information is money

When people ask me what I do, my immediate response is IT. Even though the answer is not very specific, it is the easiest to understand and it still describes what we are doing. In fact, it does not matter which programming languages we use; our responsibility is to build information systems that deliver information to end users. Therefore, we should value information more than anyone else. However, in reality, I feel there is so much wasted information in modern information systems.

In this article, I would like to discuss the opportunity to collect user behaviour and measure user happiness when building an information system. I also want to share my ideas on how to improve user experience based on the data collected.

How important is user behaviour information?

Let me begin with a story from early in my career. We needed to implement an online betting system for a customer, which functioned similarly to a stock market. In this system, there are no traditional bookmakers like William Hill; instead, users offer and accept bets from one another. Because it is a mass market with a big pool of users, the rates offered are quite accurate and the commission is pretty small. However, the betting system is not our focus today. What captured my attention most was not the technical aspect of the project, even though it was quite challenging. Instead, I was fascinated by the way the system silently but legally made a huge amount of profit from the information it collected.

The system captured the betting history of every user and, through that, identified the top winners and top losers of each month. Based on that information, the system automatically placed bets following the winners and against the losers. Can you imagine being the only person in the world who knows Warren Buffett's activities in real time? It would then be quite simple to replicate his performance, even without his knowledge. Needless to say, this hidden feature generated profit at the level of hundreds of thousands of dollars every single day.

In the open market, information is everything, and this is why the law punishes insider trading, or any other attempt to gain an unfair information advantage, so strictly. However, there is no such law for online gambling activities yet, and this practice is still legal. That early experience left me with a deep impression of how important information is.

Later, I became interested in applying psychology when dealing with customers. To persuade a person or to make a sale happen, one needs to observe and understand the client. Relating what I have learnt to the information systems I have built, I feel it is a waste to implement a system that serves only as an information provider or a selling tool. We actually have the chance to do much better if we really want to.

Website owners know the importance of user experience, and they do try their best to collect user information using online surveys. Personally, however, I feel this approach will never work. I have never answered a survey myself. Any time I see a popup, no matter how polite the wording or how beautiful the design, I just click the close button.

We should not forget that no matter how important user feedback is, it is not in the user's interest to answer our survey. In fact, no salesperson approaches a client to ask them to fill in a customer experience survey unless there is an incentive to do so.

Hence, the information still needs to be collected, but in a way that the user does not notice (remember how Google silently monitors everyone using its services?).

How should we use the information?

We should not waste effort collecting information if we do not even know what to do with it. However, this is nothing new. Whenever I visit a professional selling site like Amazon, I find it quite impressive how they manage to use every single piece of information they have to push sales. One time I went there searching for a helmet; the next time, I saw all the items a rider like me would want. They remember every single item a user has viewed or bought and regularly offer new things based on the data they have collected.

Google and Facebook do similar things. They try to guess what you like or care about before delivering any ads to you. The million-dollar question is: can we do any better than this?

I vote yes. That does not mean I do not appreciate the talent and professionalism of the product teams at Amazon, Google or Facebook. However, I feel there is still a gap between these products and an experienced salesperson. Imagine a real person sharing the desktop view with the customer, seeing every mouse click, movement and key entered. If this person could pause the user for a while in order to think, analyse and decide what the user should see next, what would he do?

Obviously, the information we collect from the user's screen cannot compare with the information gained from face-to-face communication, but we have not fully used even this information yet. Most systems automatically assume that any product the customer clicks on is something he likes. A person can do better than that. If a user opens a phone page and moves on to other phones within three seconds, he may have clicked on it accidentally rather than intentionally. On the other hand, if he spends more time on a phone, keeps coming back to it and even opens the specification, we can be quite sure this is what he is looking for.

How should we collect the information?

As mentioned above, it will never work if we interrupt users to ask questions. The right mechanism for collecting information is observation. Among the solutions available in the market, I think what is missing is the ability to record the timestamp of each event and to connect individual events into a user journey. Without connecting the dots, there is no line; without connecting the events, there is no user journey. And without the timestamps, it is very hard to measure user satisfaction and concern.

Capturing user actions is not very challenging provided that we own the website. Google Analytics can help to capture user actions, but it is a bit hard to use in our case because of the limited information it can carry (an HTTP GET request). We should understand that this is the only choice the Google Analytics team has, because any other kind of request would be blocked by cross-site scripting prevention.

A better way to carry this information is through an HTTP POST request, which can carry the full event object, serialized in JSON format. This is perfectly acceptable as the events are sent back to the same domain. To link the events together, it is best to assign a unique but temporary id to the user. We do not need to remember or identify the user, so this information does not need to be stored as a persistent cookie in the browser. With a temporary id, two separate visits to the website by the same user will be logged as two different journeys. While this is not optimal, it still offers some benefits over the usual kind of tracking.
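
For illustration, the receiving side of such a tracking call could be as small as the sketch below (assuming a Spring MVC back end; the field names are only examples of the kind of data worth capturing):

import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

// Illustrative event payload; the fields are examples, not a fixed schema.
class TrackingEvent {
    public String visitorId;   // unique but temporary id generated per visit
    public String eventType;   // e.g. "view-product" or "open-specification"
    public String target;      // e.g. the id of the product being viewed
    public long timestamp;     // client-side time of the event, in milliseconds
}

// Illustrative same-domain endpoint that receives the POSTed events.
@RestController
public class TrackingController {

    @RequestMapping(value = "/events", method = RequestMethod.POST)
    public void record(@RequestBody TrackingEvent event) {
        // append the event to the visitor's journey; storage is out of scope for this sketch
    }
}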

If you can persist the cookie in the browser, or if the user logs in, things get more interesting, as we can link the individual journeys into one.

After this comes the biggest and most challenging part of the system, where you need to figure out a mechanism to optimize the customer experience based on his journey. Unfortunately, this part is so specific to each system that our experience and methods may not be very useful to you. In general, however, we can measure user satisfaction and happiness based on the time users spend at each step, and we can gauge user interest by measuring the time spent on each product. From there, please build and optimize your own analysis tool. It is a challenging but interesting task.