
What is On-Call Management?

On-Call Management helps organizations contain the impact of critical incidents by getting the right person to start working on an incident fast. It does so by automating escalation as per pre-configured calendars, rosters, communication channels, and notification rules.


Why do we need On-Call Management?

Mathew is an L1 agent at Bank of Universe. On a nondescript Friday evening, he clocks in for his shift. Numerous wall-mounted monitors light up the NOC. He grabs a coffee, settles into a chair, and keeps an eye on the notifications scrolling past on the screens surrounding him. His gaze flickers from one screen to another. Years of experience ensure that anything out of the ordinary would catch his eye. And it does.


The notifications for the database server convey a warning. Mathew studies the alert content, and executes the standard operating procedure. That, peculiarly, turns the situation from bad to worse. Now the email server is down as well. Mathew strains to make sense of the alert logs while simultaneously pinging his teammates on Slack. Edwina and Ajay log in. A huddle later they decide to escalate the issue. But to whom?  


Mathew: “Who exactly is in charge of the database server? Ann?”

Ajay: “Yes, but would she be available at this hour?”

Mathew: “Let’s check.”

Edwina: “She is on a flight. Who else can we call?”

Mathew: “I’m not sure. Who does Ann report to? Let’s use the org chart.”

Ajay: “Good idea… Here it is, and looks like it is Rupert.”

Edwina: “Cool. Dialing…. No response!”

Ajay: “Should we text him?”

Edwina: “Yes, but send an email as well. And copy Ira and her team.”

Mathew: “But how would we know for sure if someone is looking into this issue?”


An hour, multiple phone calls, texts, Slack messages, and emails, and hundreds of thousands of dollars of lost revenue later, Bill, Ann’s team member, acknowledges the incident and starts working on it.

 

If only Mathew knew who exactly to escalate the issue to as soon as it came to his notice, the business would not have had to incur such enormous losses.


What is Freshservice On-Call Management?

Freshservice On-Call Management enables organizations to limit the disruption caused by critical incidents and restore business operations fast by streamlining incident response. 


It provides a unified platform on which Dev, ITOps, and Business teams can collaborate when addressing any critical issue. It eliminates ambiguity about accountability, minimizes the time and effort wasted in reaching out to relevant team members, and accelerates incident resolution by having all associated information in one place – all while sidestepping burnout.


Components of Freshservice On-Call Management


Freshservice On-Call Management is built on four pillars:

  • On-call schedules

  • On-call rotations

  • Escalation policies

  • On-call calendar


Let’s delve into each of them in detail.


On-call schedules

An on-call schedule is an availability plan that ensures that the most suitable person is always available – whether day or night, weekday, weekend, or on a holiday – to address critical incidents.


  • On-call management for an organization could have multiple on-call schedules

  • Each on-call schedule is mapped to an agent group that specializes in a domain, is responsible for a location, or both

  • A schedule covers all hours and all days, i.e., someone is available to address an incident at all times

  • Each on-call schedule can have one or more shifts

  • A shift is a pattern for time slots by day or time zone or both. For example, there could be a weekday shift, a weekend shift, APAC shift, North America shift, or North America weekday shift.


For example, you could have an on-call schedule for Database management mapped to the Database agent group. This on-call schedule could have two shifts – one to cover weekends, and the other for weekdays. 
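The relationship between a schedule, its agent group, and its shifts can be sketched as a small data model. This is only an illustration of the concepts above, not Freshservice's internal representation or API; the class names `Shift` and `OnCallSchedule` and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Optional


@dataclass
class Shift:
    """A pattern of time slots (hypothetical model, not a Freshservice type)."""
    name: str
    weekdays: set            # 0 = Monday ... 6 = Sunday
    start: time = time.min   # defaults cover the whole day
    end: time = time.max

    def covers(self, moment: datetime) -> bool:
        return (moment.weekday() in self.weekdays
                and self.start <= moment.time() <= self.end)


@dataclass
class OnCallSchedule:
    """A schedule mapped to one agent group, holding one or more shifts."""
    name: str
    agent_group: str
    shifts: list = field(default_factory=list)

    def active_shift(self, moment: datetime) -> Optional[Shift]:
        # Return the first shift whose pattern covers the given moment.
        for shift in self.shifts:
            if shift.covers(moment):
                return shift
        return None


# The Database management example: one weekday shift and one weekend shift.
database_schedule = OnCallSchedule(
    name="Database management",
    agent_group="Database",
    shifts=[
        Shift("Weekday shift", weekdays={0, 1, 2, 3, 4}),
        Shift("Weekend shift", weekdays={5, 6}),
    ],
)
```

Because the two shifts together cover all seven days, `active_shift` finds a shift for any moment, which mirrors the requirement that a schedule covers all hours and all days.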



On-call rotation

On-call rotation is the process of rotating shift work across all team members responsible for a specific domain so that everybody gets to contribute and learn, and is held accountable.


  • Each shift is mapped to a set of on-call agents categorized as primary, secondary, and tertiary as per their experience, hierarchy, and/or availability

  • All the agents categorized as primary on-call make up the primary on-call roster. Similarly, all the agents categorized as secondary on-call constitute the secondary on-call roster. Likewise, all the agents labelled tertiary on-call comprise the tertiary on-call roster.

  • Each roster accommodates up to 25 agents

  • Shifts could be rotated by day, week, month, or custom periods to equally distribute ‘work’ during off-work hours, weekends, and holidays


Let’s consider on-call management for the EMEA Schedule. You create a weekday shift that runs from September 20, 2023 to December 31, 2023, including holidays. This means that the agents mapped to this shift are on call Monday to Friday, including holidays, between these dates. You can, however, customize the shift timings. The agents could be on call 24 hours a day, for a particular time slot such as 9 AM to 5 PM or 6 PM to 11 PM, or during Business Hours.



The shift is rotated between the agents in a roster. In the above example, the shift is rotated daily. So, if Ali from the Primary roster is on call on Monday, it would be Ann's turn on Tuesday, and Bill would be on call on Wednesday. Thereafter, it would once again be Ali’s turn. However, if Ann is unavailable on her day, the associated secondary on-call agent would be on call. In this case, that would be Christine, who is the secondary on-call agent for that entire week. If Christine is unavailable as well, the agent on the Tertiary on-call roster, i.e. Masaba, would be contacted.
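The daily rotation and the fallback from primary to secondary to tertiary can be sketched as follows. This is a minimal illustration that assumes a plain round-robin over calendar days (it does not skip weekends) and a hypothetical rotation start date; the function names are invented for this example and are not part of Freshservice.

```python
from datetime import date
from typing import Optional

PRIMARY_ROSTER = ["Ali", "Ann", "Bill"]
SECONDARY_ON_CALL = "Christine"     # secondary on-call agent for the whole week
TERTIARY_ON_CALL = "Masaba"
ROTATION_START = date(2023, 9, 25)  # assumption: a Monday on which Ali starts


def primary_for(day: date) -> str:
    """Pick the primary agent by daily round-robin from the rotation start."""
    days_elapsed = (day - ROTATION_START).days
    return PRIMARY_ROSTER[days_elapsed % len(PRIMARY_ROSTER)]


def resolve_on_call(day: date, unavailable=frozenset()) -> Optional[str]:
    """Return the first available agent: primary, then secondary, then tertiary."""
    for agent in (primary_for(day), SECONDARY_ON_CALL, TERTIARY_ON_CALL):
        if agent not in unavailable:
            return agent
    return None  # nobody on any roster is available


tuesday = date(2023, 9, 26)
print(resolve_on_call(tuesday))                                  # Ann
print(resolve_on_call(tuesday, unavailable={"Ann"}))             # Christine
print(resolve_on_call(tuesday, unavailable={"Ann", "Christine"}))  # Masaba
```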



Escalation Policies

An escalation policy specifies the conditions for escalating an incident to specific on-call agents, rosters, or subject matter experts (SMEs) using certain communication channels, with a frequency of notification suitable for that incident. An escalation policy can be applied to any shift within an on-call schedule. 


Structurally, an escalation policy comprises one or more conditions combined with an escalation path.


Conditions

The conditions for following an escalation policy could be based on one or more ticket fields such as:

  • Priority
  • Status
  • Urgency
  • Impact
  • Source
  • Category
  • Subject
  • Custom fields


For example, the screenshot below displays the condition on which the Urgent Priority Escalation Policy is designed. It also shows the other ticket fields available for fine-tuning the reason for escalation even further.
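Conceptually, checking whether a policy applies to a ticket is a match on ticket field values. The sketch below assumes that all configured conditions must hold at once (AND semantics), which the article does not state explicitly; the field names, values, and the `policy_applies` function are hypothetical and not taken from the Freshservice API.

```python
def policy_applies(ticket: dict, conditions: dict) -> bool:
    """Return True when every condition field has one of its accepted values.

    `conditions` maps a ticket field name to the set of values that should
    trigger the policy. Assumption: all conditions must match.
    """
    return all(
        ticket.get(field_name) in accepted_values
        for field_name, accepted_values in conditions.items()
    )


# A policy keyed on priority alone, and a finer-grained variant.
urgent_priority_conditions = {"priority": {"Urgent"}}
urgent_database_conditions = {"priority": {"Urgent"}, "category": {"Database"}}

ticket = {"priority": "Urgent", "category": "Email", "source": "Alert"}

print(policy_applies(ticket, urgent_priority_conditions))  # True
print(policy_applies(ticket, urgent_database_conditions))  # False
```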



Escalation Path

When an incident meets certain conditions, an escalation path kicks into action. It specifies:

  • The on-call agent/s, rosters, and/or subject matter experts to be notified
  • The notification channels over which they must be notified
  • The frequency at which they should be notified until the incident is acknowledged


In the example shared below, an incident will immediately be escalated to the primary on-call roster along with a subject matter expert (Amit Kumar Singh) over email. If the incident remains unacknowledged or unresolved for five minutes, the same recipients are notified over phone call and SMS. If the incident remains unacknowledged or unresolved after a further five minutes, the recipients are notified yet again over phone call, SMS, and WhatsApp.


If level 1 escalation fails to elicit the requisite response from the on-call agents, level 2 escalation kicks in. It reaches a wider audience by including secondary on-call agents and additional notification channels such as push notifications on the Freshservice mobile app.
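The level 1 example can be written down as a list of timed steps, which makes it easy to see which notifications go out before an acknowledgement arrives. This is an illustrative sketch only: the `Step` class and `notifications_sent` function are hypothetical, and the timings are the ones from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    after_minutes: int   # minutes since the incident was escalated
    recipients: tuple
    channels: tuple


LEVEL_1_RECIPIENTS = ("primary on-call roster", "Amit Kumar Singh")

LEVEL_1 = [
    Step(0, LEVEL_1_RECIPIENTS, ("email",)),
    Step(5, LEVEL_1_RECIPIENTS, ("phone call", "SMS")),
    Step(10, LEVEL_1_RECIPIENTS, ("phone call", "SMS", "WhatsApp")),
]


def notifications_sent(steps, acknowledged_after: Optional[int]):
    """Return the steps that fire before the incident is acknowledged.

    `acknowledged_after` is minutes until acknowledgement, or None if the
    incident is never acknowledged at this level.
    """
    return [
        step for step in steps
        if acknowledged_after is None or step.after_minutes < acknowledged_after
    ]


# Acknowledged after 7 minutes: the email and the first call/SMS went out.
for step in notifications_sent(LEVEL_1, acknowledged_after=7):
    print(step.after_minutes, step.channels)
```

If the incident is never acknowledged at this level (`acknowledged_after=None`), all three steps fire and the next level takes over.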

  

Points to note:

  • Escalation ends when an incident is acknowledged or resolved 

  • If the incident remains unacknowledged or unresolved even after agents across all the levels have been contacted, the same escalation path is repeated

  • The module also makes it possible to inform specific stakeholders when an incident is assigned to an agent

  • Freshservice currently offers 5 levels of escalation

  • If an incident remains unacknowledged even after exhausting all the levels, the entire path is repeated up to 5 times

  • If the incident still remains unacknowledged, notifications are terminated and the activity is captured under the Activities tab on the Incident detail page.

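  • In Freshservice, notifications can be sent over channels such as email, phone call, SMS, WhatsApp, and push notifications on the Freshservice mobile app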



Note: You won't be charged extra for notifications over any or all channels, as they are already included in your existing plan (Growth, Pro, or Enterprise).
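The points above describe a loop with two exits: an acknowledgement, or exhaustion of the allowed repeats. The sketch below models that behaviour under one assumption that the article leaves open, namely that "repeated up to 5 times" means at most five passes through the path in total. The function `run_escalation` and its arguments are hypothetical.

```python
MAX_LEVELS = 5    # Freshservice currently offers 5 levels of escalation
MAX_PASSES = 5    # assumption: at most five passes through the whole path


def run_escalation(levels, is_acknowledged, max_passes=MAX_PASSES):
    """Walk the escalation path until acknowledged or passes run out.

    `levels` is an ordered list of level names. `is_acknowledged(pass_no,
    level)` is called after each level has notified its recipients.
    Returns the final status and a log of (pass_no, level) notifications.
    """
    if len(levels) > MAX_LEVELS:
        raise ValueError(f"at most {MAX_LEVELS} escalation levels are supported")

    log = []
    for pass_no in range(1, max_passes + 1):
        for level in levels:
            log.append((pass_no, level))
            if is_acknowledged(pass_no, level):
                return "acknowledged", log
    # Nobody responded: notifications stop and the activity is recorded.
    return "terminated", log


levels = ["Level 1", "Level 2"]

# Acknowledged during the second pass, right after level 1 notifies.
status, log = run_escalation(levels, lambda p, lvl: p == 2 and lvl == "Level 1")
print(status, len(log))   # acknowledged 3

# Never acknowledged: every level is tried on every pass, then it stops.
status, log = run_escalation(levels, lambda p, lvl: False)
print(status, len(log))   # terminated 10
```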




On-Call Calendar

An on-call calendar provides a bird’s eye view of the availability of agents and the schedules they are associated with. 


  • An on-call calendar shows the availability of an agent or an agent group

  • The calendar can be viewed by day, week, or month for a specific schedule

  • It can be exported and viewed from any device of the user’s choice

  • This calendar lets users check who the on-call members of any group are at any time of the day








Takeaway

Any organization would want to minimize the business impact of critical incidents while driving growth. Simultaneously, DevOps requires a flexible and resilient setup to scale services while collaborating with ITOps and Business teams. In such an environment, critical incidents are a necessary evil.


The smart approach is to design a system that uses a common platform to share relevant and updated information on demand with all stakeholders, makes everyone accountable for solving problems without burning out, and enables everyone to share their knowledge and learn from others. The magic lies in designing a plan that makes the right person take notice of the incident in the minimum amount of time. This is exactly what Freshservice On-Call Management makes possible. It is a system housed on the Freshservice ITSM platform with integrated Alert Management. It enables organizations to deal with critical incidents efficiently. The time that would otherwise be spent chasing the right individual to address an incident is better invested in root cause analysis. This, in turn, paves the way for higher quality long-term fixes, thereby fostering an agile, resilient, and scalable growth environment.


What to read next:

How to use Freshservice On-Call Management