
What is On-Call Management?

On-Call Management helps organizations contain the impact of critical incidents by getting the right person to start working on an incident fast. It does so by automating escalation as per pre-configured calendars, rosters, communication channels, and notification rules.


Why do we need On-Call Management?

Mathew is an L1 agent at Bank of Universe. On a nondescript Friday evening he clocks in for his shift. The numerous wall-mounted monitors light up the NOC. He grabs a coffee, settles into a chair, and keeps an eye on the notifications scrolling past on the screens surrounding him. His gaze flickers from one screen to another. Years of experience ensure that anything out of the ordinary will catch his eye. And it does.


The notifications for the database server convey a warning. Mathew studies the alert content, and executes the standard operating procedure. That, peculiarly, turns the situation from bad to worse. Now the email server is down as well. Mathew strains to make sense of the alert logs while simultaneously pinging his teammates on Slack. Edwina and Ajay log in. A huddle later they decide to escalate the issue. But to whom?  


Mathew: “Who exactly is in charge of the database server? Ann?”

Ajay: “Yes, but would she be available at this hour?”

Mathew: “Let’s check.”

Edwina: “She is on a flight. Who else can we call?”

Mathew: “I’m not sure. Who does Ann report to? Let’s use the org chart.”

Ajay: “Good idea… Here it is, and looks like it is Rupert.”

Edwina: “Cool. Dialing…. No response!”

Ajay: “Should we text him?”

Edwina: “Yes, but send an email as well. And copy Ira and her team.”

Mathew: “But how would we know for sure if someone is looking into this issue?”


An hour later, after multiple phone calls, texts, Slack messages, and emails, and hundreds of thousands of dollars in lost revenue, Bill, Ann’s team member, acknowledges the incident and starts working on it.

 

If only Mathew had known exactly whom to escalate the issue to as soon as it came to his notice, the business would not have incurred such enormous losses.


What is Freshservice On-Call Management?

Freshservice On-Call Management enables organizations to limit the disruption caused by critical incidents and restore business operations fast by streamlining incident response. 


It provides a unified platform for Dev, ITOps, and Business teams to collaborate on while addressing any critical issue. It eliminates ambiguity about accountability, minimizes the time and effort wasted in reaching out to relevant team members, and accelerates incident resolution by keeping all associated information in one place, all while sidestepping burnout.


Components of Freshservice On-Call Management


Freshservice On-Call Management is built on five pillars:

  • On-call schedules

  • On-call rotations

  • Escalation paths

  • Notification channels and rules

  • On-call calendar


Let’s delve into each of them in detail.


On-Call Schedules

An on-call schedule is an availability plan that ensures that the most suitable person is always available – whether day or night, weekday, weekend, or on a holiday – to address critical incidents.


  • An organization could have multiple on-call schedules

  • Each on-call schedule is mapped to an agent group that specializes in a domain, is responsible for a location, or both

  • A schedule covers all hours and all days, i.e. someone is available to address an incident at all times


For example, you could have an on-call schedule for Database, mapped to the Database agent group. Similarly, you could have a Service Design schedule mapped to the Service Design agent group. Likewise, you could have schedules for Release, Supplier Management, Email, etc., each mapped to its associated agent group.
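
The mapping between schedules and agent groups can be pictured as a simple lookup. The Python sketch below is only an illustration of that idea, not Freshservice's API: the `OnCallSchedule` class, its fields, and the `schedule_for` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OnCallSchedule:
    """Hypothetical model: a schedule tied to an agent group, a location, or both."""
    name: str
    agent_group: Optional[str] = None
    location: Optional[str] = None

    def __post_init__(self) -> None:
        # A schedule must be anchored to at least one of the two.
        if self.agent_group is None and self.location is None:
            raise ValueError("a schedule needs an agent group, a location, or both")


# Some of the schedules named in the example, each mapped to its agent group.
SCHEDULES = [
    OnCallSchedule("Database", agent_group="Database"),
    OnCallSchedule("Service Design", agent_group="Service Design"),
    OnCallSchedule("Email", agent_group="Email"),
]


def schedule_for(agent_group: str) -> Optional[OnCallSchedule]:
    """Return the first schedule mapped to the given agent group, if any."""
    for schedule in SCHEDULES:
        if schedule.agent_group == agent_group:
            return schedule
    return None
```

With this layout, an incident routed to the Database agent group would be matched with `schedule_for("Database")`.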



On-Call Rotation

On-call rotation is the process of rotating shift work across all team members responsible for a specific domain so that everybody gets to contribute, learn, and share accountability.


  • Each on-call schedule could feature one or more shifts

  • A shift is a time slot defined by day, time zone, or both. For example, there could be a weekday shift, a weekend shift, an APAC shift, a North America shift, or a North America weekday shift.

  • Each shift is mapped to a set of on-call agents categorized as primary, secondary, and tertiary according to their experience and hierarchy

  • All the agents categorized as primary on-call make up the primary on-call roster. Similarly, all the agents categorized as secondary on-call constitute the secondary on-call roster. Likewise, all the agents labelled tertiary on-call comprise the tertiary on-call roster.

  • Each roster accommodates up to 25 agents

  • Shifts could be rotated by day, week, or month to equally distribute ‘work’ during off-work hours, weekends, and holidays


Let’s consider on-call management for the Database EMEA service. You create a weekday shift running from October 11, 2021 to December 31, 2021, without excluding holidays. This means that the agents mapped to this shift are on call Monday to Friday, including holidays, between these dates. You can, however, customize the shift timings. The agents could be on call 24 hours a day, or for a particular time slot such as 9AM to 5PM or 6PM to 11PM.
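
To make the shift definition concrete, here is a minimal Python sketch of how such a shift could be checked against a point in time. It is an illustration under stated assumptions rather than how Freshservice implements shifts: the `Shift` class and its `covers` method are hypothetical, times are naive (no time zone handling), and time slots that cross midnight are not handled.

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import FrozenSet


@dataclass(frozen=True)
class Shift:
    """Hypothetical model of a shift: a date range, weekdays, and a daily time slot."""
    start_date: date
    end_date: date
    weekdays: FrozenSet[int]        # 0 = Monday ... 6 = Sunday
    slot_start: time = time.min     # the defaults describe a 24-hour shift
    slot_end: time = time.max

    def covers(self, moment: datetime) -> bool:
        """True if agents mapped to this shift are on call at `moment`.

        Holidays are not excluded, matching the example above.
        """
        in_dates = self.start_date <= moment.date() <= self.end_date
        in_days = moment.weekday() in self.weekdays
        in_slot = self.slot_start <= moment.time() <= self.slot_end
        return in_dates and in_days and in_slot


MONDAY_TO_FRIDAY = frozenset(range(5))

# The Database EMEA weekday shift from the example, limited to 9AM to 5PM.
database_emea = Shift(
    start_date=date(2021, 10, 11),
    end_date=date(2021, 12, 31),
    weekdays=MONDAY_TO_FRIDAY,
    slot_start=time(9, 0),
    slot_end=time(17, 0),
)
```

For instance, `database_emea.covers(datetime(2021, 10, 11, 10, 0))` is true because that Monday morning falls inside the date range, the weekday set, and the time slot, whereas the same time on Saturday, October 16 is not covered.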



The shift is rotated between the agents in a roster. In the above example, the shift is rotated daily. So, if Ali from the Primary roster is on call on Monday, it would be Ann’s turn on Tuesday, and Bill would be on call on Wednesday. Thereafter, it would once again be Ali’s turn. However, if Ann is unavailable on her day, the associated secondary on-call agent would be on call. In this case, that would be Christine, who is the secondary on-call agent for that entire week. If Christine is also unavailable, the agent on the Tertiary on-call roster, i.e. Masaba, would be contacted.
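
The rotation and fallback just described can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, the rotation is assumed to advance by one agent per calendar day from the shift's start date, and the secondary and tertiary rosters are assumed to hold a single agent each, as in the example.

```python
from datetime import date
from typing import Optional, Sequence, Set

SHIFT_START = date(2021, 10, 11)   # a Monday, taken from the example shift
PRIMARY = ["Ali", "Ann", "Bill"]   # rotated daily
SECONDARY = ["Christine"]          # on call for the entire week in the example
TERTIARY = ["Masaba"]


def rostered_agent(roster: Sequence[str], day: date) -> str:
    """Pick whose turn it is, assuming a daily rotation that advances
    by one agent per calendar day since SHIFT_START."""
    return roster[(day - SHIFT_START).days % len(roster)]


def on_call_agent(day: date, unavailable: Set[str]) -> Optional[str]:
    """Walk primary -> secondary -> tertiary and return the first available agent."""
    candidates = [
        rostered_agent(PRIMARY, day),
        SECONDARY[0],   # single secondary agent, as in the example
        TERTIARY[0],    # single tertiary agent, as in the example
    ]
    for agent in candidates:
        if agent not in unavailable:
            return agent
    return None  # nobody on any roster is available
```

On Tuesday, October 12, `on_call_agent(date(2021, 10, 12), set())` picks Ann. If Ann is marked unavailable the result is Christine, and if both are unavailable it is Masaba.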







Escalation Paths

An escalation path is the order in which team members belonging to an on-call shift are notified about an incident. It could also feature individuals who do not belong to an agent group but are subject matter experts.


  • Freshservice currently offers 5 levels of escalation

  • Escalation ends when an incident is acknowledged or resolved

  • If the incident remains unacknowledged or unresolved even after agents across all the levels have been contacted, the entire escalation path is repeated, up to 5 times

  • If the incident still remains unacknowledged, notifications stop and the activity is captured under the Activities tab on the Incident detail page

  • The module also makes it possible to inform specific stakeholders when an incident is assigned to an agent


In our example, if Ann does not acknowledge the incident on a given day, the issue is escalated to level 2, i.e. to both the secondary and tertiary on-call agents for that shift. Both Christine and Masaba would be notified. The escalation path makes clear exactly who is to be notified if the incident remains unacknowledged.


An incident could also be manually escalated by an agent. If an agent feels that they are unable to address the issue, they can manually escalate it to the next level. If a level has multiple agents and several of them take action, only the first action is registered.
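
The automatic escalation behaviour described in this section can be summarised as a loop over levels that is repeated a limited number of times. The Python sketch below is a simplified model, not Freshservice's implementation: `escalate` and the `acknowledges` callback are hypothetical, "repeated up to 5 times" is assumed to mean five passes over the path in total, and manual escalation is not modelled.

```python
from typing import Callable, Optional, Sequence

MAX_LEVELS = 5    # escalation levels offered per path
MAX_REPEATS = 5   # assumed: total passes over the path before notifications stop


def escalate(path: Sequence[Sequence[str]],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Notify each level in order and return the first agent who acknowledges.

    `acknowledges` is a hypothetical stand-in for an agent's response.
    Returns None if nobody acknowledges after MAX_REPEATS passes.
    """
    if not 1 <= len(path) <= MAX_LEVELS:
        raise ValueError(f"an escalation path needs 1 to {MAX_LEVELS} levels")
    for _ in range(MAX_REPEATS):
        for level in path:
            # Agents on a level are notified together; checking them in
            # order stands in for "only the first action is registered".
            for agent in level:
                if acknowledges(agent):
                    return agent   # escalation ends on acknowledgement
    return None  # notifications stop; the activity would be logged


# Level 1 is the primary agent; level 2 holds the secondary and tertiary agents.
PATH = [["Ann"], ["Christine", "Masaba"]]
```

Calling `escalate(PATH, lambda agent: agent == "Christine")` returns `"Christine"`: Ann does not acknowledge at level 1, so the incident moves to level 2, where Christine responds.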




Note: The levels of escalation need not coincide with the primary, secondary, and tertiary on-call agents. An escalation path offers another level of customization: a single level can include up to 10 individuals. If none of them acknowledges the incident even after 5 rounds of notifications, the issue is escalated to the next level, where even subject matter experts could be brought in. This setup is completely customizable.



Notification Channels & Rules

Notification channels are the means of communication used to notify agents and stakeholders. In Freshservice, notifications can be sent over channels such as email, SMS, phone call, and Slack.



Notification Rules are the conditions for notifying agents and stakeholders until someone acknowledges the incident. There are four kinds of Notification Rules:

  • Specifying the agents to be notified at each escalation level

  • Specifying the channels of communication for each level of agents 

  • Setting the time gaps for triggering successive channels of communication

  • Specifying the number of times agents at a particular level must be notified before escalating the issue to the next level



In this example, the primary on-call agent is notified over email, SMS, and phone call simultaneously. The notification is repeated every 5 minutes for as long as the incident remains unacknowledged. If the primary on-call agent has not acknowledged the incident after being notified 3 times, the issue is escalated to the secondary on-call agent.


The secondary on-call agent, in this example, would be notified over email, SMS, and Slack (this configuration is customizable). Once again, the agent would be notified three times at 5-minute intervals for as long as the incident remains unacknowledged. If the incident still remains unacknowledged, the issue would be escalated to the tertiary on-call agent, who would be notified using the channels and at the frequency specified.
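
The timing implied by these rules can be worked out with a short Python sketch. It is a hedged illustration, not Freshservice's notification engine: `LevelRule` and `notification_timeline` are hypothetical names, and the sketch assumes that a full 5-minute interval passes after the last attempt at a level before the next level is notified. The tertiary level is left out because its channels and frequency are not specified in the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LevelRule:
    """Hypothetical notification rule for one escalation level."""
    agent: str
    channels: Tuple[str, ...]   # all channels fire simultaneously
    attempts: int = 3           # notifications sent before escalating
    interval_min: int = 5       # minutes to wait for an acknowledgement


def notification_timeline(rules: List[LevelRule]) -> List[Tuple[int, str, Tuple[str, ...]]]:
    """List (minute, agent, channels) for every notification that would be
    sent if nobody ever acknowledged the incident."""
    timeline = []
    minute = 0
    for rule in rules:
        for _ in range(rule.attempts):
            timeline.append((minute, rule.agent, rule.channels))
            minute += rule.interval_min   # assumed: wait one interval after each attempt
    return timeline


RULES = [
    LevelRule("primary on-call agent", ("email", "SMS", "phone call")),
    LevelRule("secondary on-call agent", ("email", "SMS", "Slack")),
]

for minute, agent, channels in notification_timeline(RULES):
    print(f"t+{minute:>2} min: notify {agent} via {', '.join(channels)}")
```

Under these assumptions the primary agent is notified at 0, 5, and 10 minutes, and the secondary agent is first notified 15 minutes after the incident is raised.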


Note: You won't be charged extra for notifications across any or all channels, as they are already part of your existing plan (Growth, Pro, and Enterprise).




On-Call Calendar

An on-call calendar provides a bird’s-eye view of the availability of agents and the schedules they are associated with.


  • An on-call calendar shows the availability of an agent or an agent group

  • The calendar can be viewed by day, week, or month for a specific schedule

  • It can be exported and viewed from any device of the user’s choice

  • This calendar makes it possible for users to check which members of any group are on call at any time of the day








Takeaway

Any organization would want to minimize the business impact of critical incidents while driving growth. Simultaneously, DevOps requires a flexible and resilient setup to scale services while collaborating with ITOps and Biz teams. In such an environment, critical incidents are a necessary evil. 


The smart approach is to design a system that uses a common platform to share relevant, up-to-date information on demand with all stakeholders, makes everyone accountable for solving problems without burning out, and enables everyone to share their knowledge and learn from others. The magic lies in designing a plan that gets the right person to take notice of the incident in the minimum amount of time. This is exactly what Freshservice On-Call Management makes possible. It is a system housed on the Freshservice ITSM platform with integrated Alert Management. It enables organizations to deal with critical incidents efficiently. The time saved by not having to chase the right individual to address an incident is better invested in root cause analysis. This, in turn, paves the way for higher-quality long-term fixes, thereby fostering an agile, resilient, and scalable growth environment.


What to read next:

How to use Freshservice On-Call Management