Embracing an On-Call System in the Software Development Process

Timothy Agustian
Tokopedia Engineering
Jul 4, 2020


Every downtime your system has had,
Every mistake that a user has experienced,
Every risk of data loss that goes unresolved,
Is a fatal blow to the trust of your customers.

Every day in any software development lifecycle, there is always a chance that an incident will happen. It could be a memory leak that suddenly causes downtime, a lost database connection that makes every process fail, or simply a bad code deployment.

Whatever the reason, an engineer who is focused on a task still needs to shift attention when things go awry, prioritizing stopping the bleeding before returning to that task.

But is there a better way to manage incident handling? Ideally, we would resolve issues faster while still delivering the committed tasks, and make sure that engineers don’t burn out in the long marathon of the software development lifecycle.

Introducing the On-Call system.

What is an On-Call System?

An On-Call system is a schedule that rotates through a team over a set period of time and determines who is the first person to respond to or acknowledge a request or incident. This covers every incident or request, whether it happens inside or outside work hours.

Why Do We Need It?

In 2013, Amazon.com went down for around 30 minutes. Any user who tried to access it was greeted with an error message:

Oops! We’re very sorry, but we’re having trouble doing what you just asked us to do. Please give us another chance — click the Back button on your browser and try your request again

The outage theoretically cost Amazon around $66,240 per minute, or nearly $2 million in total, based on Amazon’s 2012 net sales. And I think it is about more than just the money. The risk of a transaction deducting money from a buyer without showing any order will always haunt the customers’ trust, and losing customers is much worse than losing a lot of money.
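
As a quick sanity check of the figures quoted above, the total follows from multiplying the per-minute estimate by the length of the outage. The sketch below only uses the two numbers from the text; the function name is made up for illustration.

```python
# Figures quoted in the article; nothing here is measured independently.
COST_PER_MINUTE_USD = 66_240
OUTAGE_MINUTES = 30


def outage_cost(cost_per_minute: int, minutes: int) -> int:
    """Estimated revenue lost during an outage of the given length."""
    return cost_per_minute * minutes


total = outage_cost(COST_PER_MINUTE_USD, OUTAGE_MINUTES)
print(f"${total:,}")  # $1,987,200, i.e. "nearly $2 million"
```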

To summarise and add a few more points, the main reasons are:

  • Decreasing Mean Time To Repair (MTTR), thus increasing customer satisfaction
    Whoever is scheduled as the On-Call engineer is responsible for switching context immediately when an incident happens during work hours and for acknowledging it swiftly when it happens outside work hours. This directly decreases the time needed to stop the bleeding.
  • Increasing the responsibility, awareness, skill and ownership of an engineer
    Being On-Call is much more tiring than being a regular engineer. Handling critical matters where every second counts is a big responsibility. It makes us more aware of every piece of code we write and every deployment we make, of how the system can impact the whole business process, and of how well we really know the system we own. It is a truly impactful job for any engineer.
  • A scheduled system that minimises engineer burnout
    The schedule keeps the On-Call time fairly shared between engineers. Depending on the size of your team, it’s best to keep a gap of at least 2 weeks before the same engineer is On-Call again.
    This prevents one engineer from handling too many incidents over a long period of time, which causes burnout and makes the work feel like a burden.
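
The 2-week gap mentioned above can be checked mechanically. The following is a minimal sketch, assuming a simple round-robin rotation with one-week shifts; the engineer names, the start date and both function names are hypothetical and not part of any particular On-Call tool.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, shifts, shift_days=7):
    """Round-robin schedule: a list of (shift_start, engineer) tuples."""
    return [
        (start + timedelta(days=i * shift_days), engineers[i % len(engineers)])
        for i in range(shifts)
    ]


def min_gap_days(schedule, shift_days=7):
    """Smallest number of off-duty days between two shifts of one engineer."""
    last_end = {}
    gaps = []
    for shift_start, engineer in schedule:
        if engineer in last_end:
            gaps.append((shift_start - last_end[engineer]).days)
        last_end[engineer] = shift_start + timedelta(days=shift_days)
    return min(gaps) if gaps else None


team = ["ani", "budi", "citra"]  # hypothetical team of three
schedule = build_rotation(team, date(2020, 7, 6), shifts=6)
gap = min_gap_days(schedule)
print(gap)  # 14 days off between two shifts of the same engineer
```

With one-week shifts, a team of three is the smallest that reaches the 2-week gap in this sketch; a team of two would only get 7 days off between shifts.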

Procedure of Incident Response

We can use the following as a guideline for improving incident management and for how the On-Call works when things go bad.

  • Alert
    This triggers the first step of the incident response. Make sure that your alert system works correctly and can directly notify the On-Call engineer (using On-Call software such as PagerDuty or Opsgenie). Every alert should also be an early warning that fires before the real problem occurs, while staying relevant and excluding false alarms as much as possible.
  • Triage
    After the On-Call engineer gets the alert, he/she needs to close the wound as soon as possible. If you are stuck and don’t know what to do, escalate the incident and ask for help. Focus on solving the issue first, and never be ashamed of asking.
  • Notify
    After the wound is healed, don’t forget to notify your team, operations, and any other impacted stakeholders. They may need to check their own systems and notify customers.
  • Investigation
    After the system has recovered, the engineer needs to investigate the incident thoroughly. What is the real root cause? What triggered it in the first place? What flow caused the system to break?
  • Resolution
    After the investigation, the On-Call engineer can hold a discussion with the team to decide on short-term and long-term solutions for this kind of problem, and to plan their execution.
  • Documentation
    Wrap up all the information above in complete, detailed documentation. This will help other On-Call engineers when a similar problem arises.
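
The six steps above can be modelled as an ordered lifecycle, so that an incident cannot be closed before it has been documented. This is only an illustrative sketch: the `Stage` and `Incident` names are invented here and do not come from PagerDuty, Opsgenie or any other tool.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The steps of the incident response, in the order listed above."""
    ALERT = 1
    TRIAGE = 2
    NOTIFY = 3
    INVESTIGATION = 4
    RESOLUTION = 5
    DOCUMENTATION = 6


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.ALERT
    notes: dict = field(default_factory=dict)

    def advance(self, note: str) -> Stage:
        """Record a note for the current stage, then move to the next one."""
        self.notes[self.stage.name] = note
        if self.stage is not Stage.DOCUMENTATION:
            self.stage = Stage(self.stage + 1)
        return self.stage

    @property
    def closed(self) -> bool:
        """An incident is closed only once its documentation note exists."""
        return Stage.DOCUMENTATION.name in self.notes


incident = Incident("checkout returns errors")  # hypothetical incident
incident.advance("alert received and acknowledged")
incident.advance("rolled back the last deployment")
print(incident.stage.name)  # NOTIFY
```

Keeping one note per stage also gives the Documentation step a ready-made skeleton to fill in.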

How to Implement the Recommended On-Call System?

There are several things that need to be highlighted before implementing the On-Call system, and they will define how good the incident management in your team is.

A. On-Call Engineer and Regular Engineer
Giving the On-Call engineer the same number of tasks and the same treatment as a regular engineer is not a good way of planning. While a regular engineer can focus on any task without limitation, an On-Call engineer should be given low-urgency tasks (preferably reliability tasks), since there is always a chance that a nonstop train of issues will hit the On-Call engineer repeatedly.
Make sure that the On-Call schedule is distributed fairly between team members, with a reasonable gap between each engineer’s shifts.

B. Determine The Priority Levels
Not every incident or request must be handled immediately; depending on the urgency, some things are better handled during work hours. Anything that directly impacts customers must be categorized as first priority, for which the On-Call engineer must act as fast as possible. This minimizes how much work the On-Call engineer has to handle outside work hours.
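
The rule above can be written down as a small decision function. The sketch below assumes only two levels (P1 for customer-facing issues, P2 for everything else) and work hours of 09:00 to 18:00 on weekdays; both are assumptions for illustration, so adjust them to whatever your team agrees on.

```python
from datetime import datetime, time

# Assumed work hours; not taken from the article.
WORK_START, WORK_END = time(9, 0), time(18, 0)


def priority(customer_facing: bool) -> str:
    """Anything that directly impacts customers is first priority."""
    return "P1" if customer_facing else "P2"


def in_work_hours(now: datetime) -> bool:
    """True on Monday to Friday between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def should_page_now(customer_facing: bool, now: datetime) -> bool:
    """P1 always pages the On-Call; anything else waits for work hours."""
    return priority(customer_facing) == "P1" or in_work_hours(now)


saturday_night = datetime(2020, 7, 4, 2, 0)
print(should_page_now(True, saturday_night))   # True: customers are impacted
print(should_page_now(False, saturday_night))  # False: can wait until Monday
```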

C. Primary and Secondary
This might not be applicable to everyone, but it is worth considering depending on the size of the team, the frequency of incidents and the complexity of the problems.
The idea is to assign 2 engineers who work as the Primary and Secondary On-Call engineers.
The Primary is the first person to receive and acknowledge alerts and keeps things safe inside and outside work hours, while the Secondary acts as a backup who responds and solves things if for any reason the Primary is unreachable. If there is a critical issue, however, it is recommended that both of them work together.
The Primary and Secondary must always be able to get to work on an incident quickly, so taking a trip to the mountains, going to the movies or visiting any place where you cannot be contacted might not be a wise decision.
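
The Primary and Secondary roles described above amount to a simple escalation rule. Here is a rough sketch, assuming the Secondary is paged once an alert has gone unacknowledged for 10 minutes; that timeout and the function name are hypothetical, and real On-Call software usually lets you configure this as an escalation policy instead.

```python
def responders_to_page(critical: bool, minutes_unacknowledged: int,
                       ack_timeout: int = 10) -> list:
    """Decide who should be paged for an alert that is still unacknowledged.

    ack_timeout is an assumed value, not a recommendation from the article.
    """
    if critical:
        # Critical issues: both engineers work together from the start.
        return ["primary", "secondary"]
    if minutes_unacknowledged >= ack_timeout:
        # The Primary looks unreachable, so the Secondary steps in as backup.
        return ["primary", "secondary"]
    return ["primary"]


print(responders_to_page(critical=False, minutes_unacknowledged=0))
# ['primary']
print(responders_to_page(critical=False, minutes_unacknowledged=15))
# ['primary', 'secondary']
```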

D. Great On-Call Culture
Almost every engineer has a negative view of the On-Call job. The uncertainty of problems that could impact the whole company, the heavy responsibility and the need to be available for the whole shift might feel like too much. But it is a critical job that has to be done, and the team has to build a culture in which being On-Call is a privilege and a great responsibility that deepens knowledge and maintains reliability. There are several things to think about and discuss as a team regarding the On-Call mindset:

  • Blameless culture
    From my point of view, every piece of code that is already in production is owned by the team. In every incident, let’s focus on the problem, the investigation, the solution and the improvement rather than on blaming each other.
  • Dispensation
    On-Call engineers who have to stay up until morning need to be given a dispensation in line with company regulations. The exact rules can be discussed and decided within the team.
  • Evaluation
    A periodic evaluation is needed to keep improving how the On-Call system is implemented, so that it benefits everyone without sacrificing the On-Call engineers.

In Summary

On-Call is a critical system to implement for tackling any unexpected incident that might occur. It helps engineers grow while maintaining customer satisfaction. With the correct procedure and mindset, the On-Call job can achieve its maximum benefit and potential without making engineers feel burned out.

At Tokopedia, we believe that we can focus on our customers’ needs through the reliability and availability of our systems. The On-Call system helps us handle problems swiftly, thus increasing the satisfaction and trust of our customers.

As always, we have openings at Tokopedia.
We are an Indonesian technology company with a mission to democratize commerce through technology and help everyone achieve more.
Find your Dream Job with us at Tokopedia!
https://www.tokopedia.com/careers/
