Level up your production resiliency with Alert Escalation & Incident Management

Published on: Nov 30, 2023

Sandhiya Sivakumar

Author

Many businesses running BizTalk face the same challenges: they are unaware of server or service downtime, rely on periodic manual checks to find issues, and repeat the same resolution steps without the aid of automation. BizTalk360 does all this heavy lifting for BizTalk users and provides a complete picture of health by proactively monitoring the required servers, applications, Azure services, endpoints, and more, based on the configured thresholds and metrics. We have also recently made significant improvements to BizTalk360 monitoring and notification. In this blog, let’s take a look at one of those improvements: Alert Escalation.

First, let’s see the key topics covered in this blog.

Challenges of unnoticed and unresolved issues.
What is an Escalation Policy and how does escalation happen in BizTalk360?
Handling Incidents and tracking issue resolution status.
Send reminders and prevent duplicate Incidents in ServiceNow or PagerDuty.
Standard versus Escalation notifications.

Challenges of unnoticed and unresolved issues

In BizTalk360, the user can configure a threshold alarm by specifying the number of alerts that need to be sent upon detecting a violation and when to reset the alert count. A reset of the alert count can occur either upon detecting a new violation or after a predefined reset interval. This is the “Standard” approach we have been using to date.
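To make the Standard behaviour concrete, here is a minimal Python sketch of an alert counter with a limit and a reset. It is only a toy model for illustration: the class, its field names, and the minute-based clock are assumptions, not BizTalk360's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandardAlarm:
    """Toy model of a Standard threshold alarm (all names are illustrative)."""
    max_alerts: int                       # alerts to send per violation
    reset_minutes: Optional[int] = None   # reset interval; None disables it
    sent: int = 0
    first_alert_at: Optional[int] = None

    def on_check(self, now: int, violated: bool, new_violation: bool = False) -> bool:
        """Return True if an alert is sent at this monitoring cycle (minute `now`)."""
        if not violated:
            return False
        # Reset the count on a new violation or once the reset interval has elapsed.
        interval_elapsed = (
            self.reset_minutes is not None
            and self.first_alert_at is not None
            and now - self.first_alert_at >= self.reset_minutes
        )
        if new_violation or interval_elapsed:
            self.sent = 0
            self.first_alert_at = None
        if self.sent >= self.max_alerts:
            return False  # limit reached: the alarm stays silent
        if self.first_alert_at is None:
            self.first_alert_at = now
        self.sent += 1
        return True


# A violation that persists for 90 minutes, checked every 15 minutes.
alarm = StandardAlarm(max_alerts=2, reset_minutes=60)
timeline = [alarm.on_check(now=t, violated=True) for t in range(0, 90, 15)]
print(timeline)  # [True, True, False, False, True, True]
```

In this model the alarm goes quiet after two alerts and only speaks up again after the reset interval, and always to the same recipients.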

Whenever a problem is detected, say, for instance, a specific SQL Server is encountering downtime, BizTalk360 sends out real-time alerts notifying the relevant stakeholders about the potential issue. By alerting, teams can swiftly address emerging problems, mitigate downtime, and prevent operational disruptions.

Of course, that isn’t always the case in the real world. Imagine that an email alert reports the downtime of SQL Server and, due to unforeseen circumstances, the recipient doesn’t notice it. Once the maximum number of alerts for that specific alarm has been reached, no further alerts are sent about the server downtime. Choosing the “Alert Reset” option resets the alert count, either after a certain period or when a new issue is detected, and as a result the alerts are sent to the same recipient once again. But what if the intended recipient is on vacation or unable to address the problem in a timely manner? In such cases, even a minor issue left unaddressed can lead to more significant problems. This situation can significantly impact organizations that rely solely on BizTalk360 to monitor their BizTalk resources.

An optimal solution in such cases is to trigger an escalation through alternative communication channels, such as a Teams message or a Twilio phone call, to remind the primary recipient (Level 1) or to escalate to one or more higher levels (say, Level 2). In previous versions of BizTalk360, the only way to escalate a problem was to integrate with third-party tools like ServiceNow or PagerDuty.

Recognizing these challenges, with the unveiling of BizTalk360 version 10.8, we’ve rolled out a seamless method to escalate alerts across different levels and an efficient incident management capability without the need for any additional tools.

What is an Escalation Policy?

An Escalation Policy in BizTalk360 outlines the procedures for escalating alerts. Creating a policy is a straightforward process: simply provide a name and define the levels. These levels determine who should be notified at each stage of escalation. Each level comes with a maximum wait duration of 24 hours, and you can set up to three escalation levels per policy. If no one acknowledges or closes the incident, you can also repeat the policy up to 5 times.
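The limits above can be summed up in a small Python sketch. This is an illustrative data model only; the class names, field names, and the `validate` helper are assumptions made for this example and do not reflect BizTalk360's internal schema or API.

```python
from dataclasses import dataclass
from typing import List

MAX_LEVELS = 3               # default cap on escalation levels per policy
MAX_WAIT_MINUTES = 24 * 60   # each level waits at most 24 hours
MAX_REPEATS = 5              # a policy can be repeated up to 5 times


@dataclass
class Level:
    recipients: List[str]    # who is notified at this level (hypothetical field)
    wait_minutes: int        # wait duration before this level is notified


@dataclass
class EscalationPolicy:
    name: str
    levels: List[Level]
    repeat: int = 0          # how many times to repeat the whole policy

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the policy is valid."""
        errors = []
        if not self.name.strip():
            errors.append("a policy name is required")
        if not 1 <= len(self.levels) <= MAX_LEVELS:
            errors.append(f"a policy needs 1 to {MAX_LEVELS} levels")
        for number, level in enumerate(self.levels, start=1):
            if not 0 <= level.wait_minutes <= MAX_WAIT_MINUTES:
                errors.append(f"level {number}: wait must be within 24 hours")
        if not 0 <= self.repeat <= MAX_REPEATS:
            errors.append(f"repeat must be between 0 and {MAX_REPEATS}")
        return errors


policy = EscalationPolicy(
    name="API downtime",
    levels=[Level(["support@example.com"], 0), Level(["ops-teams-channel"], 30)],
    repeat=1,
)
print(policy.validate())  # []
```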

Note: By default, the maximum number of escalation levels in a policy is 3. If necessary, you can increase this limit by running the following SQL query against the BizTalk360 database. The query below sets the maximum number of escalation levels to ‘5’; change the number according to your requirement.

UPDATE [b360_admin_GlobalProperties] SET SettingValue = 5 WHERE SettingKey = 'MAX_ESCALATION_LEVELS_ALLOWED';

Escalation Process

When BizTalk360 detects a violation, it creates an incident and promptly sends alerts to the recipients in the first level. If the incident is neither acknowledged nor closed, the alerts will be escalated to the recipients in the subsequent levels based on the specified escalation time settings.

Note: Escalation Policy is specifically designed for Threshold monitoring alarms. In the case of Health checks and Data monitoring, the existing alert mechanism is used, and the Escalation Policy doesn’t come into play.

Let’s look at a sample business scenario and how escalation policies will help resolve the issues in real time.

Consider a commercial bank that offers services such as funds transfers and currency exchanges. This bank relies on both internal APIs within its business architecture and third-party APIs to provide these services to customers. The smooth operation of these APIs is crucial, as any disruptions can directly impact the bank’s business activities.

In this scenario, BizTalk360’s Web Endpoint Monitoring can be used to continuously monitor the availability and responsiveness of these APIs and send alerts when it detects an issue. But what if the alert notifications about API downtime go unnoticed or unaddressed? This could have a serious impact on vital operations such as bank transfers and currency exchanges. The situation may get worse if the internal team or customers are unaware of service failures and discover them only after a significant delay.

In such cases, if notifying the Level 1 (L1) stakeholders doesn’t resolve the problem and the endpoints remain down for a long period, it’s definitely time to escalate!

The animation below illustrates an example of creating an escalation policy with the following levels:

Level 1 (L1): Immediately notify the support team via email.
Level 2 (L2): Send a Microsoft Teams notification if the issue persists and the incident is neither acknowledged nor closed after 30 minutes.
Level 3 (L3): Finally, escalate to a personal mobile number via Twilio after an interval of 1 hour.
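The three levels above can be turned into a simple timeline. The Python sketch below is a rough illustration under one important assumption: each wait is counted from the moment the previous level was notified, so L3 fires 90 minutes after the incident opens. If the waits are measured differently in your setup, the numbers change accordingly. The function and variable names are made up for this example.

```python
from typing import List, Optional, Tuple

# (level, channel, minutes to wait after the previous level was notified)
LEVELS: List[Tuple[str, str, int]] = [
    ("L1", "email", 0),
    ("L2", "Microsoft Teams", 30),
    ("L3", "Twilio call", 60),
]


def notifications_sent(elapsed_minutes: int,
                       acknowledged_at: Optional[int] = None) -> List[str]:
    """List the levels notified up to `elapsed_minutes` after the incident opened."""
    sent = []
    due_at = 0
    for name, channel, wait in LEVELS:
        due_at += wait
        if due_at > elapsed_minutes:
            break  # this level is not due yet
        if acknowledged_at is not None and acknowledged_at <= due_at:
            break  # acknowledging (or closing) stops further escalation
        sent.append(f"{name} via {channel} at +{due_at} min")
    return sent


print(notifications_sent(45))
# ['L1 via email at +0 min', 'L2 via Microsoft Teams at +30 min']
print(notifications_sent(120, acknowledged_at=10))
# ['L1 via email at +0 min']
```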

The escalation policy can be mapped to the corresponding alarm. When BizTalk360 identifies a new issue in the web endpoint, it creates an incident and sends an alert to L1 through the specified email. The alert contains essential information, including the incident number, the current escalation level, and a link to view the full details of the incident. In addition, the ID of the incident under which each issue was reported is displayed next to the description of that issue.

Incident Handling Made Easy

Initially, the incident is in the “Open” state. Once your team is aware of the reported issues and has started working on them, they can simply acknowledge the incident to prevent further alert escalations. If the incident remains open and the reported issues persist after the specified wait duration, it is escalated further. In the case discussed above, it is escalated to the Teams notification channel, so the appropriate team can start resolving the issue without delay.

Once all the problems are solved, you can officially close the incident, marking the end of the issue resolution process. When acknowledging or closing an incident, you can add a comment to inform the team of the steps taken to remedy the issue. You can also choose to have incidents closed automatically once all issues have been resolved.
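The incident lifecycle described here behaves like a small state machine. The Python sketch below is a hedged illustration: only the “Open” state name comes from the text above, while the “Acknowledged” and “Closed” labels, the rule that a closed incident is final, and all class and method names are assumptions for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class State(Enum):
    OPEN = "Open"
    ACKNOWLEDGED = "Acknowledged"   # assumed label
    CLOSED = "Closed"               # assumed label


# Assumed transition rules: a closed incident is final.
ALLOWED = {
    State.OPEN: {State.ACKNOWLEDGED, State.CLOSED},
    State.ACKNOWLEDGED: {State.CLOSED},
    State.CLOSED: set(),
}


@dataclass
class Incident:
    number: int
    state: State = State.OPEN
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def transition(self, new_state: State, user: str, comment: str = "") -> None:
        """Move to `new_state` and record who did it and why."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        self.history.append((new_state.value, user, comment))

    @property
    def escalates(self) -> bool:
        """Only an open incident keeps escalating to the next level."""
        return self.state is State.OPEN


incident = Incident(number=101)
incident.transition(State.ACKNOWLEDGED, "alice", "Restarting the endpoint host")
print(incident.escalates)  # False
incident.transition(State.CLOSED, "alice", "Endpoint is healthy again")
print(incident.history)
```

Keeping the user and comment with each transition mirrors the incident history, where you can see who acknowledged or closed an incident.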

Check out the “Incident” page in the Alert History section for a precise overview of all incidents related to your environment. Here, you’ll find a clear display of incidents associated with specific alarms, the current escalation level, and the number of remaining issues that need to be resolved. This page acts as a central hub where BizTalk360 periodically updates which problems are fixed and which still need attention under each incident. Additionally, you can easily access the complete incident history, allowing you to determine who acknowledged or closed the incident and when it occurred.

Send reminders and prevent duplicate Incidents in ServiceNow or PagerDuty

Some companies may not want to escalate alerts to a higher level because a specific person or group is responsible for resolving the issue. However, they do want to send a reminder to that user if the previous alert goes unnoticed. Let’s consider a scenario where a BizTalk360 alarm of the Standard notification type is configured to send notifications to a designated email address and create an incident in ServiceNow. The user wants to receive two notifications for each violation, so the number of alerts per violation is set to ‘2’. When an issue arises, BizTalk360 sends two notifications to the user’s email and simultaneously triggers two ServiceNow incidents for that same issue. Once the issue is resolved, the user must close both incidents. All the user wanted was a reminder notification, but the result is duplicate incidents, and working through a pile of them is cumbersome.

With the help of the Escalation Policy, the user can notify the recipient multiple times and still create a single incident in ServiceNow or PagerDuty. To achieve this, create a policy whose first level notifies the user via email and creates a ServiceNow incident. The second level then sends a reminder alert to the same email address if the problem remains unresolved for a specified duration. Additionally, assigning high priority to the second-level email signals the delay and urgency.
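The difference between the two setups is easy to see with a little counting. The Python sketch below is purely illustrative: the function names and the channel labels are assumptions, and it models only what the scenario above describes (two alerts per violation versus a two-level policy).

```python
from typing import Dict, List, Set


def standard_outcome(alerts_per_violation: int) -> Dict[str, int]:
    """Standard alarm: every alert goes to all configured channels,
    so each alert also opens a new ServiceNow incident."""
    return {
        "emails": alerts_per_violation,
        "servicenow_incidents": alerts_per_violation,
    }


def escalation_outcome(levels: List[Set[str]]) -> Dict[str, int]:
    """Escalation policy: each level lists only the channels it should use."""
    return {
        "emails": sum("email" in level for level in levels),
        "servicenow_incidents": sum("servicenow" in level for level in levels),
    }


# Level 1: email + ServiceNow incident. Level 2: reminder email only.
reminder_policy = [{"email", "servicenow"}, {"email"}]

print(standard_outcome(2))
# {'emails': 2, 'servicenow_incidents': 2}
print(escalation_outcome(reminder_policy))
# {'emails': 2, 'servicenow_incidents': 1}
```

Both setups deliver two emails, but only the escalation policy keeps the ServiceNow side down to a single incident.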

You may be concerned that handling incidents in both BizTalk360 and ServiceNow means doing the same task twice. To keep operations seamless with minimal manual intervention, BizTalk360 can automatically close incidents once all issues have been resolved. You can opt for this by enabling the “Auto close incidents when violation becomes healthy” option in the escalation policy.

Below is an example of the escalation policy described above. This approach effectively prevents duplicate incidents while ensuring that unattended ones are brought to attention through reminders.

Standard vs Escalation notification

Before choosing between the “Standard” and “Escalation” notification types, it is worth fully understanding the differences between them. The following table offers a concise summary of the key differences. You can use this information to tailor your monitoring strategy, be it Standard or Escalation notifications, so that it better fits the dynamics of your team and your resolution processes.

Metric/Behavior	Standard Notification	Escalation Notification
Suitable Monitoring Type	Threshold Monitoring, Health Check Monitoring, Data Monitoring	Threshold Monitoring only
Use-case	Suitable for sending one or more alerts to the same recipients on detecting issues	Suitable for escalating incidents to different levels if issues are not resolved
Alert Generation	Generates “X” number of alerts on detecting any violation	Escalates one alert to the respective level
Recipient	Sends notifications to the same recipients	Sends notifications to the designated recipients of the escalated level
Auto reset of alert count	Applicable (either when a new violation occurs or when a reset interval is reached)	Not applicable
Incident Management	Does not create any incident	Creates an incident on detecting new issue(s)
Alert Termination	When the maximum alert count is reached, or when all issues are resolved	When the respective incident is acknowledged or closed, or when all issues reported under an incident are resolved
Auto Correction / Up Alert	Alert sent to the specified recipients	Alerts sent to the recipients specified in the levels escalated so far
Alert Schedule	Applicable (does not send alerts during restricted time)	Applicable (does not escalate incidents during restricted time)
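As a quick way to read the first row of the table, here is a small Python helper that picks a notification type. It is a sketch for illustration only; the function name and the string labels are assumptions, and the logic encodes nothing beyond the monitoring types listed in the table.

```python
# Monitoring types each notification type supports, as listed in the table.
SUPPORTED = {
    "Standard": {"Threshold", "Health Check", "Data"},
    "Escalation": {"Threshold"},
}


def choose_notification_type(monitoring: str, needs_escalation: bool) -> str:
    """Pick Escalation only where it applies; otherwise fall back to Standard."""
    if needs_escalation and monitoring in SUPPORTED["Escalation"]:
        return "Escalation"
    if monitoring in SUPPORTED["Standard"]:
        return "Standard"
    raise ValueError(f"unknown monitoring type: {monitoring}")


print(choose_notification_type("Threshold", needs_escalation=True))     # Escalation
print(choose_notification_type("Health Check", needs_escalation=True))  # Standard
```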

Conclusion

We hope these enhancements in threshold monitoring will contribute significantly to your overall business value. Choose between these options according to what aligns best with your team’s operational preferences and workflows.

If you’re new to BizTalk360, we also offer a free trial so you can get started directly!
