VIDEO BLOG SERIES: Monitoring Mission Critical SQL Servers - How to utilize alerting to predict system bottlenecks
In the first part of my video blog series, I talked about why SQL Servers are critical for businesses, what the main causes of SQL Server performance problems are, and how to solve them. In the next videos of the series, I will go through the different types of alerting mechanisms that fall under reactive monitoring methods.
Basic warnings and alerts in SQL Server monitoring
I like to divide the warnings and alerts used in SQL Server monitoring into categories, which I call levels. The higher the level, the more sophisticated the alerting mechanism. The most basic warnings and alerts use fixed threshold values derived from best practices.
The good thing about basic warnings and alerts is that they follow best practices. You can add more logic to such warnings and alerts by controlling them based on criticality. For example, if you have a high number of SQL Server instances, I would recommend dividing the alerts by criticality into critical, high-importance, and less important ones.
The downside of these basic alerts is that you can end up getting too many of them if many servers frequently break the threshold limits. Another con is that there can be a lot to configure: it can be hard, for example, to define the criticality levels for hundreds of servers.
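The basic, level-one mechanism described above can be sketched as follows. This is a minimal illustration, not an implementation from the blog series; the threshold value, server names, and criticality tiers are made-up examples.

```python
from typing import Optional

# Illustrative fixed threshold derived from a best practice (assumed value).
CPU_THRESHOLD_PCT = 80

# Hypothetical mapping of servers to their criticality tier.
CRITICALITY = {
    "prod-sql-01": "critical",
    "prod-sql-02": "high",
    "test-sql-01": "low",
}

def check_cpu(server: str, cpu_pct: float) -> Optional[str]:
    """Return an alert message if the fixed threshold is broken, else None."""
    if cpu_pct > CPU_THRESHOLD_PCT:
        tier = CRITICALITY.get(server, "low")
        return f"[{tier.upper()}] {server}: CPU at {cpu_pct:.0f}%"
    return None

print(check_cpu("prod-sql-01", 92))  # breaks the 80% threshold
print(check_cpu("test-sql-01", 55))  # within limits, no alert
```

Tagging each alert with the server's tier makes it straightforward to route or suppress the less important ones downstream.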
Self-adjusting alerts can be adapted to baselines
Self-adjusting alerts in SQL Server monitoring can be adapted to system baselines. This approach suits environments with many different database, instance, and server workloads.
In order to define valid threshold levels, we need the statistics from the full log history to see how the performance counters behave over time. One simple mechanism: take the average and maximum values of a certain performance counter, and let the average of those two be the warning level. The alert level, in turn, is the average of the warning level and the maximum value.
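The baseline calculation described above can be sketched in a few lines. The sample history below is made up for illustration; in practice the values would come from the collected counter log.

```python
def baseline_thresholds(history):
    """Derive warning and alert levels from a counter's history.

    warning = average of (average, maximum)
    alert   = average of (warning, maximum)
    """
    avg = sum(history) / len(history)
    peak = max(history)
    warning = (avg + peak) / 2
    alert = (warning + peak) / 2
    return warning, alert

# Hypothetical CPU % samples collected over time, including one spike.
cpu_history = [35, 40, 38, 42, 90, 41, 37]
warning, alert = baseline_thresholds(cpu_history)
print(round(warning, 2), round(alert, 2))  # 68.07 79.04
```

Because the thresholds are derived per counter, each server, instance, or database gets levels that reflect its own behavior rather than a one-size-fits-all limit.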
It’s good that self-adjusting alerts adapt to different kinds of servers, instances, and databases, depending on their behavioral characteristics over time. You will also get fewer, but more relevant, alerts with this method.
What you should be careful of is that this method might tolerate some unhealthy performance. Let's take an example: a SQL Server instance is already unhealthy, and we calculate the baselines for the self-adjusting alerts from that unhealthy performance. The resulting thresholds are then looser than they should be, which again allows the unhealthy performance to continue unnoticed.
Consecutive alerts inform about recurring problems
Consecutive alerts are a relatively simple, yet smart alerting mechanism. They inform about recurring problems in the system.
Imagine a SQL Server with several instances running on the same server. There is one particular instance whose workload you don't want to exceed 20% in any two-minute interval. Another instance, on the other hand, exceeds that 20% level quite often, and that is okay for that instance; you just don't want it to exceed the 20% level continuously. For this purpose, we can set a consecutive alert for the instance: you define the alert level so that the instance has to exceed the threshold for a certain number of consecutive intervals before the alert is given.
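The consecutive alert can be sketched like this, assuming two-minute workload samples for the second instance. The sample values and the run length of three are illustrative choices.

```python
def consecutive_alert(samples, threshold=20.0, required_run=3):
    """Return True once `threshold` is exceeded `required_run` intervals in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= required_run:
            return True
    return False

spiky = [25, 10, 30, 12, 28, 9]   # breaks 20% often, but never back to back
sustained = [25, 27, 31, 22, 15]  # four consecutive breaches

print(consecutive_alert(spiky))      # False: no run of three breaches
print(consecutive_alert(sustained))  # True: 25, 27, 31 exceed 20% in a row
```

A plain threshold alert would fire on both series; the consecutive rule stays quiet for the spiky instance and only flags the sustained breach.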
This method is good for picking up persistent problems. On the other hand, you might find it difficult to prioritize these alerts against other alerts.
SLA-driven alerts are based on technical service levels
SLA-driven alerts are based on technical service levels over a given time window.
Let’s take, for example, the peak CPU utilization over time, measured in two-minute intervals, for a certain SQL Server. You would like to set a threshold level of 60% for the server. But the server also has a defined service level that says the value should stay below 60% for 99% of the time. In this case, you want an alert only if the value exceeds 60% for more than 1% of the time within a given time window.
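The SLA rule above can be sketched as a breach-ratio check over a window of two-minute samples. The window size and sample values are illustrative; 720 two-minute samples roughly cover one day.

```python
def sla_alert(samples, threshold=60.0, allowed_breach_ratio=0.01):
    """Return True if more than the allowed share of samples breach the threshold."""
    breaches = sum(1 for value in samples if value > threshold)
    return breaches / len(samples) > allowed_breach_ratio

# 8 breaches out of 720 samples is about 1.1% of the window: SLA endangered.
bad_day = [50.0] * 712 + [75.0] * 8
# 7 breaches out of 720 is about 0.97%: still within the 1% allowance.
ok_day = [50.0] * 713 + [75.0] * 7

print(sla_alert(bad_day))  # True
print(sla_alert(ok_day))   # False
```

Note that individual breaches never fire an alert on their own; only the aggregate ratio over the window matters, which is what keeps the alert volume low.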
To get the best coverage over the whole platform, my recommendation is to combine SLA-driven alerts with consecutive alerts. This mechanism alerts only when the overall service level is endangered, and thus it creates fewer alerts.
Jani K. Savolainen
Founder & CTO
DB Pro / SQL Governor