Tuesday, 27 June 2017

High volume, low latency system

We are currently optimizing the bidding platform for PocketMath, one of the largest suppliers of mobile programmatic inventory in the world. With a fleet of 15 to 35 bidders around the world, this platform serves 40 billion to 70 billion requests per day. Latency is good: the 95th percentile of response time falls below 2 milliseconds.

The optimization process gives us a precious opportunity to think about which factors are crucial to building a high-performance, low-latency system.

In this article, let us share the principles that guide our development process.

Background


As some of you may know, real-time bidding (RTB) is a means by which advertising inventory is bought and sold on a per-impression basis, via a programmatic instantaneous auction. In this ecosystem, PocketMath is a demand-side platform (DSP), which helps buyers purchase impressions from ad exchanges.


Because all buying and selling happens per impression, the latency requirement for RTB is very strict. Most ad exchanges will not accept a response from a DSP after 100 milliseconds. This constraint is tight once the network round trip is taken into account, because network transfer normally contributes more to the total response time than processing the bid request. Therefore, to reduce the risk of timeouts, a DSP will normally self-impose a much lower limit.
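To make the arithmetic concrete, here is a minimal sketch of how a self-imposed processing limit could be derived. The function name, the 60 ms round trip and the 10 ms safety margin are assumptions for illustration only; the 100 ms timeout is the figure mentioned above.

```python
def processing_budget_ms(exchange_timeout_ms, round_trip_ms, safety_margin_ms):
    """Time left for the bidder itself once network transfer is accounted for.

    All three inputs are illustrative assumptions, not measured values.
    """
    budget = exchange_timeout_ms - round_trip_ms - safety_margin_ms
    # A negative budget means the request cannot be answered in time at all.
    return max(budget, 0.0)


# Example: 100 ms exchange timeout, assumed 60 ms round trip, assumed 10 ms margin.
budget = processing_budget_ms(100, 60, 10)
```

With these assumed numbers the bidder has 30 ms to work with, less than a third of the limit the exchange advertises.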

Besides fast response time, the other requirement for a DSP is stability. It is common practice for an ad exchange to throttle traffic to a DSP if timeouts happen too often.

Architecture Guideline


A design mistake can be very costly in the long run, so it is better to get the design right from the start.


Knowing the limit


Compared to other domains like FinTech, the non-functional requirements of RTB are unusual: the maximum response time is enforced very strictly, but processing every request is not mandatory. The server should still send a no-bid response when it intends to skip a request. In RTB, skipping a request only costs a lost opportunity, which is mild compared to what may happen in a mission-critical system. Failing to process a request in time is much worse, because it causes not only a lost opportunity but also wasted resources.

Therefore, we designed our system to always operate at its optimal throughput regardless of traffic volume. To keep each component functioning at its optimal level, surge protectors are added at each component's client so that excess load is discarded automatically, as early as possible.

Every quarter, while reviewing the volume of inventory, we also adjust hardware and recalibrate all the rate limiters to keep the system running at the best value for money. If the load suddenly surges because of a traffic spike or a campaign configuration, the system should continue to process traffic at its designed capacity and skip the load it cannot handle. The response time stays consistent regardless of the load.
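One common way to build such a surge protector is a token bucket. The sketch below is an illustration under assumptions, not our production code: the class name `SurgeProtector`, the `NO_BID` value and the `handle` helper are hypothetical, and the limiter is not thread-safe as written.

```python
import time

NO_BID = {"bid": None}  # hypothetical stand-in for a no-bid response


class SurgeProtector:
    """Token bucket: admits `rate_per_sec` requests on average, bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock          # injectable so the limiter can be tested
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill tokens for the time elapsed since the previous call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(request, protector, process):
    # Shed excess load before doing any work, but still answer the caller.
    if not protector.try_acquire():
        return NO_BID
    return process(request)
```

The rate and burst values are the knobs that would be recalibrated when capacity changes. Because rejected requests cost almost nothing, the requests that are admitted keep a stable response time.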


Knowing when to apply microservices architecture


Microservices architecture gives us a lot of flexibility to develop and maintain the system. However, it also adds network latency to time-critical tasks, so we need to think twice before applying it. For time-critical requests, it is better to minimize the number of network hops that information must travel before the response can be generated. We keep the component that faces the exchanges as a nearly monolithic application. It is a huge component with a lot of logic and information embedded, so that it can process the majority of requests on its own. Only when the required information is too big to cache or needs to be real-time does this component make a network call to other components in the system. Moreover, this component can operate partially: if some of the external components are not available, it discards the requests it cannot handle.
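The local-first behavior described above can be sketched roughly as follows. Everything here is an assumption for illustration: the class name, the `remote_lookup` callable standing in for another component, and the 5 ms default timeout.

```python
from concurrent.futures import ThreadPoolExecutor

NO_BID = None  # hypothetical stand-in for a no-bid response


class ExchangeFacingComponent:
    def __init__(self, local_data, remote_lookup, remote_timeout_s=0.005, workers=4):
        self.local_data = local_data            # information embedded in the process
        self.remote_lookup = remote_lookup      # hypothetical call to another component
        self.remote_timeout_s = remote_timeout_s
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def handle(self, key):
        value = self.local_data.get(key)
        if value is not None:
            return {"bid": value}               # common case: no network hop
        future = self.pool.submit(self.remote_lookup, key)
        try:
            value = future.result(timeout=self.remote_timeout_s)
        except Exception:
            # Timeout or remote failure: keep operating partially by skipping.
            future.cancel()
            return NO_BID
        return {"bid": value}
```

Only the requests that genuinely need remote data pay for the network hop, and an outage in the remote component degrades into no-bid responses rather than timeouts.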

Stream Processing


It is a simple fact that a centralized architecture won't scale. Therefore, to be scalable, the system architecture should resemble a graphics card design more than a CPU design: information should be processed in parallel and independently. To achieve that, all components must be stateless, and each information package must be self-sufficient for processing.

Try your best to avoid any processing that requires a shared resource such as a physical database. For example, if the data is immutable, we can clone it to many read-only databases or caches to avoid centralized processing. It is even more crucial to avoid scenarios where information cannot be processed independently, such as locking on unique indexes.

If these conditions cannot be fulfilled, we should try to reduce the impact by minimizing the shared part, for example by applying MapReduce processing or in-memory computing. We should also add some redundancy to the components that handle the bottleneck to minimize the risk.
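As a small illustration of stateless processing over self-sufficient packages, consider the sketch below. The names `FLOOR_PRICES`, `evaluate` and `evaluate_all` and the data values are invented for this example. Threads are used only to show the structure; in CPython they do not give CPU parallelism for pure-Python work.

```python
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType

# Read-only replica of immutable reference data (hypothetical values).
FLOOR_PRICES = MappingProxyType({"banner": 0.10, "video": 0.50})


def evaluate(package):
    """Pure function: the result depends only on the package and read-only data."""
    floor = FLOOR_PRICES[package["format"]]
    return {"id": package["id"], "bid": package["value"] >= floor}


def evaluate_all(packages, workers=4):
    # No shared mutable state, so packages can be handled in any order,
    # by any worker, on any machine holding a copy of the read-only data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, packages))
```

Because `evaluate` never writes to anything shared, adding workers or machines does not introduce locks or coordination.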

Auto Recovery


Even after applying all of these good practices, maintaining system stability is still very challenging because too many unknown factors can affect system throughput. For example, average processing time can be highly variable, and system performance can be temporarily degraded by backups, hardware upgrades or intermittent network issues.

The easy way to improve system stability is to increase the redundancy. However, redundancy only works best when the system is so critical that efficiency is not a concern at all. Otherwise, developers should resort to a smarter method to cope with this challenge.

Fortunately, the approach in this use case is fairly straightforward. When overloaded, you are left with two choices: upgrading the hardware or reducing the load. If possible, we should do both. Autoscaling can be regarded as a DevOps responsibility, but reducing load is a challenge that should be tackled at the architecture level. To do that, we need to build a feedback mechanism so that the front-end components slow down or stop taking on new requests when the back-end components are overloaded, allowing the system to return to a balanced state.
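One possible shape for such a feedback mechanism is additive-increase, multiplicative-decrease (AIMD) control of the admitted request rate. This is a sketch of a well-known technique under assumed parameters, not a description of our implementation; the class name and all default values are hypothetical.

```python
class FeedbackThrottle:
    """Adjusts the front end's admitted rate from back-end health reports (AIMD)."""

    def __init__(self, max_rate, min_rate=1.0, decrease_factor=0.5, increase_step=10.0):
        self.max_rate = max_rate
        self.min_rate = min_rate
        self.decrease_factor = decrease_factor  # cut sharply when overloaded
        self.increase_step = increase_step      # recover gradually when healthy
        self.allowed_rate = max_rate

    def on_backend_report(self, overloaded):
        if overloaded:
            self.allowed_rate = max(self.min_rate,
                                    self.allowed_rate * self.decrease_factor)
        else:
            self.allowed_rate = min(self.max_rate,
                                    self.allowed_rate + self.increase_step)
        return self.allowed_rate
```

The resulting `allowed_rate` would be fed into whatever rate limiter guards the front end. Cutting quickly and recovering slowly lets an overloaded back end drain its queue before the load returns.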

Implementation Guideline


For a high-performance system, poor code is punished as soon as it is exposed to traffic. Therefore, it is never excessive to optimize your implementation twice before rolling it out to production. Here is some of our experience from developing high-performance systems.

Monitoring


For whatever purpose, it is always good practice to implement health checks on the system. For a high-performance system, however, the monitoring requirement is more sophisticated: we need to collect insight into how the system operates. This information can be crucial for detecting anomalies, preventing crashes and tuning the system. We should care not only whether the implementation works but also how well it executes.

There are well-known APM (application performance monitoring) products on the market, such as New Relic or Datadog, that help collect operational metrics and raise alerts when bad things occur. An APM license may not be cheap, but it is well worth paying for because of the long-term benefit.

In addition to APM, it is also a good idea to embed a debug API into the health check so that developers can investigate in depth on the production environment whenever they need to. This practice has proven very useful in resolving outages and troubleshooting user inquiries.
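As an example of the kind of insight a debug API might expose, the sketch below records response times and reports percentiles using the nearest-rank method. The class and field names are hypothetical. A real system would bound memory with a fixed-size window or a histogram instead of keeping every sample.

```python
import math


class LatencyRecorder:
    """Keeps response-time samples and reports nearest-rank percentiles."""

    def __init__(self):
        self.samples_ms = []

    def record(self, elapsed_ms):
        self.samples_ms.append(elapsed_ms)

    def percentile(self, p):
        if not self.samples_ms:
            return None
        ordered = sorted(self.samples_ms)
        # Nearest rank: the smallest sample with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def debug_snapshot(self):
        # Hypothetical payload that a health-check endpoint could return.
        return {
            "count": len(self.samples_ms),
            "p50_ms": self.percentile(50),
            "p95_ms": self.percentile(95),
        }
```

Tracking a high percentile rather than the average matters here, because timeouts are caused by the slow tail, not by the typical request.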

Testing on Production


This practice will surely raise some concern, as it was considered taboo in the IT world until recently. Simply speaking, the landscape of the software industry has changed. In the past, software development was normally a side function of big corporations, with the mission of building domain-specific applications. Nowadays, software development tends to play a much bigger role, transforming daily life through startups around the world. In this new role, the pace of change may matter more to the success of an IT project than the stability of the system. Therefore, we need to judge the balance between time to market and quality control case by case.

For high-performance systems, testing in staging is less effective because most performance issues only appear under heavy load. Load testing with simulated data is also difficult, because too many possible combinations of inputs may expose hidden bugs. Therefore, similar to what Facebook and Google have done, it is not necessarily harmful to sacrifice a small part of the traffic to test new features. The key requirement for this practice is the ability to identify and contain the damage when it happens.
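A simple way to dedicate a small, stable slice of traffic to a new feature is to hash a request identifier into buckets. The sketch below is an assumption-laden illustration: the function names, the use of SHA-256 and the fallback policy are our choices for the example, not a prescribed method.

```python
import hashlib


def in_canary(request_id, canary_percent):
    """Deterministically place roughly `canary_percent`% of ids in the canary group."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


def route(request_id, canary_percent, stable_handler, canary_handler):
    if in_canary(request_id, canary_percent):
        try:
            return canary_handler(request_id)
        except Exception:
            # Contain the damage: a failing canary falls back to the stable path.
            return stable_handler(request_id)
    return stable_handler(request_id)
```

Because the split is deterministic, the affected requests can be identified afterwards, and setting `canary_percent` to 0 switches the experiment off immediately.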

Understanding Machine

At the beginning of this century, there were many initiatives to make programming easier by isolating the business logic from the underlying machine execution. With these additional layers of abstraction, developing complicated applications became much easier. As a side effect, however, developers can go through many projects without acquiring fundamental knowledge about the underlying execution.

If you are lucky enough to work on a high-performance system, this knowledge becomes important and relevant again. We have seen tremendous benefit from well-optimized code that makes good use of the hardware to get the work done. The benefit sometimes comes as higher throughput or a smaller memory footprint, and sometimes as even more critical outcomes like lower latency and more stable performance. The latter outcomes are not easy to achieve with more hardware; they require a better implementation.

At the basic level, developers should understand how the programming language implements its common constructs. At the advanced level, it is important to pick up knowledge about the operating system, the network and the hardware infrastructure as well.

Understandable Code


The biggest source of performance and functional bugs in our system is code complexity. It is difficult to add new features if the existing codebase is too hard to understand. A patching mentality keeps increasing the technical debt until it is almost impossible to avoid making mistakes.

Moreover, complicated code has a negative impact on CPU utilization, as the biggest performance penalty comes from CPU cache misses rather than raw processing speed. Hence, a high-performance implementation is also a clean, straightforward and easy-to-understand implementation.

In-memory computing


In-memory computing has become a hot trend as memory gets much cheaper and bigger. In the past, when we needed more performance, the most common trick was to increase the concurrency level. Most web servers had a high number of CPU cores but a relatively low amount of memory per core, which implies that the role of the web server was to process web requests rather than data. Most data processing happened in the data warehouse rather than in the application. For a low-latency system, however, retrieving and processing data remotely is too expensive. Therefore, if an application cannot deliver higher throughput or lower latency when given more memory, it might not be well optimized to use all the available hardware. In the perfect scenario, the system reaches maximum CPU and memory utilization at the same time. If, under load, only one of them is the bottleneck while the other still has plenty of spare capacity, that may indicate a wrong hardware configuration or poor optimization.

In a high-performance system, both CPU and memory should be treated as precious resources. We should be careful to conserve both with well-known techniques like object pools, primitive types, suitable data structures and efficient implementation.
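To illustrate one of these techniques, here is a minimal object pool that reuses byte buffers instead of allocating a new one per request. The class name and sizes are hypothetical, and the pool is not thread-safe as written. The benefit of pooling depends heavily on the runtime and its allocator, so treat this as a sketch of the idea rather than a guaranteed win.

```python
class BufferPool:
    """Reuses fixed-size bytearrays to avoid allocating one per request."""

    def __init__(self, size, buffer_bytes):
        self._buffer_bytes = buffer_bytes
        self._free = [bytearray(buffer_bytes) for _ in range(size)]

    def acquire(self):
        # Fall back to a fresh allocation if the pool is exhausted.
        if self._free:
            return self._free.pop()
        return bytearray(self._buffer_bytes)

    def release(self, buf):
        # The caller is expected to overwrite the contents before reading them.
        self._free.append(buf)


pool = BufferPool(size=2, buffer_bytes=16)
first = pool.acquire()
pool.release(first)
second = pool.acquire()   # the same object is handed out again
```

The pool trades a fixed amount of memory held up front for fewer allocations on the hot path.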

Conclusion


Developing a low-latency, high-throughput application requires special skill sets that are not easy to find in the mass market. A common perception is that good developers will write performant code. This is true most of the time. However, many experienced developers who shine in building other applications still struggle with high-performance systems because of old habits and a lack of performance awareness. The key to success in this area lies in self-reliant analysis, fundamental understanding and logical thinking.

It is also worth highlighting that following the trends in the development world is not always better, because many new methods serve purposes other than performance. Therefore, it is good to keep learning new things, but always weigh the cost against the benefit of each based on your priorities.
