VIDEO BLOG SERIES: Monitoring Mission Critical SQL Servers - How to utilize alerting to predict system bottlenecks
UPDATED 23.7.2020
In the previous part of the Monitoring Mission Critical SQL Servers video blog series, I described the different types of alerting mechanisms that can be categorized as reactive monitoring methods. Now it's time to talk about the proactive monitoring methods and how they can be used to prevent performance problems from occurring in the data platform. Proactive monitoring is my favorite topic, as I think it's really the "gold nugget" in performance monitoring.
Check out the previous parts as well:
Part 1-3: What are the real reasons behind performance problems
Part 4-7: Reactive SQL Server monitoring methods
Part 8-10: Proactive SQL Server monitoring methods
Part 11-12: The future of performance monitoring
In this blog post we cover parts 8-10 of the series and introduce three proactive monitoring methods. Predictive alerting, pattern-based alerting and anomaly detection are all featured in the SQL Governor software for Microsoft SQL Server optimization.
Trend based alerting mechanism: Predictive alerting
Predictive alerting is a trend-based alerting mechanism: a curve, such as a linear regression line or an exponential curve, is fitted to the history of a certain performance counter and used to forecast where the counter is heading.
Linear regression can be used to forecast a trend-based change in a certain performance counter. For example, if there is an even growth trend in CPU utilization, linear regression can show with high probability how the growth will continue, assuming that the resource utilization keeps growing linearly. This forecast data can then be used to raise a predictive alert.
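To make the idea concrete, here is a minimal sketch of a linear regression forecast that raises a predictive alert. It is only an illustration under my own assumptions, not the SQL Governor implementation: the function names, the one-sample-per-day CPU history, the 80% threshold and the 30-day alert horizon are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of the values against their index 0..n-1.

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def samples_until_threshold(values: Sequence[float],
                            threshold: float) -> Optional[float]:
    """Samples left until the fitted line reaches the threshold.

    Returns None when the trend is flat or falling, because the line
    then never reaches the threshold.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing_index = (threshold - intercept) / slope
    return max(0.0, crossing_index - (len(values) - 1))


# Hypothetical history: average CPU utilization (%), one sample per day.
daily_cpu = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]
remaining = samples_until_threshold(daily_cpu, threshold=80.0)

# Assumed alert horizon: warn when the threshold is less than 30 days away.
if remaining is not None and remaining <= 30:
    print(f"Predictive alert: CPU forecast to reach 80% in {remaining:.0f} days")
```

The forecast is only as good as the assumption that the growth stays linear, which is exactly the caveat mentioned above.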
It is also important to understand that different kinds of curves, such as an exponential curve, can be combined in predictive alerting to get the best fit for the history that you have been monitoring. If the system is very volatile, it might also make sense to shorten the time window over which the average momentum of the performance counter is analyzed. Linear regression could, for example, show no change in the average momentum of a performance counter over a long period of time, even though there actually is a change when you look at the data in a shorter time window.
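Both points can be sketched in a few lines. The example below is a simplified illustration with made-up data and hypothetical function names: it picks between a linear and an exponential curve by comparing the sum of squared errors, and then shows how a long window can hide a trend that a shorter window reveals. The exponential fit here is done by fitting a line to the logarithm of the values, which is one common shortcut and requires positive values.

```python
import math
from typing import Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit against the index 0..n-1; returns (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_sse(values: Sequence[float]) -> float:
    """Sum of squared errors of the straight-line fit."""
    slope, intercept = fit_line(values)
    return sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(values))


def exponential_sse(values: Sequence[float]) -> float:
    """Sum of squared errors of y = a * exp(b * x), fitted on log(y)."""
    b, log_a = fit_line([math.log(y) for y in values])
    return sum((y - math.exp(log_a + b * x)) ** 2 for x, y in enumerate(values))


def best_curve(values: Sequence[float]) -> str:
    """Name of the curve with the smaller error; ties go to the linear fit."""
    if exponential_sse(values) < linear_sse(values):
        return "exponential"
    return "linear"


# Made-up counter histories.
doubling = [5.0, 10.0, 20.0, 40.0, 80.0]
steady = [10.0, 12.0, 14.0, 16.0, 18.0]
print(best_curve(doubling), best_curve(steady))

# A counter that fell and then rose again: flat over the long window,
# clearly growing over the last five samples.
v_shape = [50.0, 45.0, 40.0, 35.0, 30.0, 35.0, 40.0, 45.0, 50.0]
long_slope, _ = fit_line(v_shape)
short_slope, _ = fit_line(v_shape[-5:])
print(f"long window slope: {long_slope:.1f}, short window slope: {short_slope:.1f}")
```

In practice the window length and the set of candidate curves are tuning decisions that depend on how volatile the monitored system is.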
"Predictive alerting gives a proactive view to the future and helps to prevent performance problems even before they occur."
Predictive alerting gives a proactive view of the future, and therefore it's possible to prevent performance problems even before they occur. Typically, linear regression and exponential curves are very good and simple statistical alerting mechanisms for proactive predictions. The downside is that they might sometimes generate false alerts, and they may not be able to predict rapid changes. Linear regression is always a compromise across the data points that are monitored, so very rapid changes in, for example, CPU utilization can't necessarily be tracked. Luckily, we have some other mechanisms for these situations.
Pattern-oriented alerting is based on machine learning
Pattern-oriented alerting is an alerting model that is based on machine learning. That means that the algorithm of the monitoring software is able to predict a certain behavioral pattern of a performance counter over time.
The simplest way to understand a pattern is to look at a situation of peaking CPU utilization on server level over time. Let's take an example where the peaking CPU utilization is zigzagging between 20% and 60% every two minutes. And it always does that, meaning that it will continue to do so with a 100% probability. This is one of the simplest patterns there can be.
Pattern-oriented alerting will not only tell you the probability of a certain pattern occurring, but also when a certain threshold point is exceeded.
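The zigzag example can be turned into a toy model. The sketch below is a deliberately simple stand-in for real machine learning, and not the algorithm SQL Governor uses: it learns from history how often one utilization level is followed by another, and uses those counts to estimate the probability that the next sample exceeds a threshold. The 10-point bucket width, the function names and the sample history are assumptions made for this illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def bucket(value: float, width: float = 10.0) -> int:
    """Map a counter value to a coarse level, e.g. 20-29.99% -> level 2."""
    return int(value // width)


def learn_transitions(history: Sequence[float],
                      width: float = 10.0) -> Dict[int, Counter]:
    """Count how often each level is followed by each other level."""
    transitions: Dict[int, Counter] = defaultdict(Counter)
    for current, following in zip(history, history[1:]):
        transitions[bucket(current, width)][bucket(following, width)] += 1
    return transitions


def probability_of_exceeding(transitions: Dict[int, Counter],
                             current_value: float,
                             threshold: float,
                             width: float = 10.0) -> float:
    """Estimated probability that the next sample is at or above the threshold.

    A following level counts only when its whole bucket lies at or above
    the threshold. Returns 0.0 for a level never seen in the history.
    """
    counts = transitions.get(bucket(current_value, width))
    if not counts:
        return 0.0
    total = sum(counts.values())
    exceeding = sum(n for level, n in counts.items() if level * width >= threshold)
    return exceeding / total


# Peak CPU zigzagging between 20% and 60%, one sample every two minutes.
history = [20.0, 60.0] * 30
model = learn_transitions(history)

p_after_low = probability_of_exceeding(model, 20.0, threshold=50.0)
p_after_high = probability_of_exceeding(model, 60.0, threshold=50.0)
print(f"after 20%: {p_after_low:.0%}, after 60%: {p_after_high:.0%}")
```

With this perfectly regular history the model is certain that a 20% sample is followed by one above the threshold; real counters are noisier, so the probabilities land somewhere between 0 and 1.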
One constraint of pattern-oriented alerting is that you need very detailed log data from at least one month in order to understand what patterns occur there. The good news is that you don't have to understand those patterns yourself: the pattern-oriented alerting mechanism finds the most common patterns and trains the result sets. You can even re-train the result sets for next month, the month after that, and so on, in order to improve the accuracy of the forecast and keep it up to date.
Pattern-oriented alerting is good for finding hidden patterns and predicting the future. You get more time to react and to figure out the root cause of a problem that the mechanism forecasts, and you may even be able to fix it before it occurs, ultimately resulting in higher up-time and better service levels.
Anomaly detection identifies irregularities
I think that anomaly detection is the most magical one out of the methods of proactive monitoring. Anomaly detection has the ability to find hidden anomalies in the monitoring data by utilizing machine learning, AI, and different kinds of mathematical methods, based on time dimension analysis.
One way to do this is to cross-query all the levels of the time dimension over time, and to predict with different kinds of statistical mechanisms the probability of a certain hidden event occurring. This type of forecasting would not be possible with ordinary alerts, continuous alerts, SLA-driven alerts, by prioritizing the alerts, or with proactive alerts such as predictive alerts or pattern-oriented alerts.
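As a very reduced illustration of analysis along the time dimension, the sketch below groups samples by two levels of that dimension (weekday and hour), builds a baseline for each slot, and flags values that deviate strongly from it. This is a plain statistical baseline with a z-score, chosen by me as the simplest possible example; it is far less capable than the anomaly detection described here and is not how SQL Governor works. The function names, the sample values and the limit of three standard deviations are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple

Slot = Tuple[int, int]  # (weekday, hour); Monday is weekday 0


def build_baseline(samples: Iterable[Tuple[datetime, float]]
                   ) -> Dict[Slot, Tuple[float, float]]:
    """Mean and population standard deviation per (weekday, hour) slot."""
    slots = defaultdict(list)
    for timestamp, value in samples:
        slots[(timestamp.weekday(), timestamp.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in slots.items()}


def is_anomaly(baseline: Dict[Slot, Tuple[float, float]],
               timestamp: datetime,
               value: float,
               z_limit: float = 3.0) -> bool:
    """True when the value is more than z_limit deviations from its slot mean."""
    slot = baseline.get((timestamp.weekday(), timestamp.hour))
    if slot is None:
        return False  # no history for this slot, so nothing to compare with
    average, spread = slot
    if spread == 0:
        return value != average
    return abs(value - average) / spread > z_limit


# Made-up history: CPU utilization (%) on four Fridays at 2:00 p.m.
first_friday = datetime(2020, 7, 3, 14, 0)
history = [(first_friday + timedelta(weeks=i), cpu)
           for i, cpu in enumerate([50.0, 52.0, 48.0, 50.0])]
baseline = build_baseline(history)

next_friday = datetime(2020, 7, 31, 14, 0)
print(is_anomaly(baseline, next_friday, 90.0))
print(is_anomaly(baseline, next_friday, 51.0))
```

A fuller approach would combine more levels of the time dimension (month, week, day, hour) and more counters, which is where the machine learning methods mentioned above come in.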
"Anomaly detection discovers anomalies that could endanger our system's service levels and up-time before they have ever occurred."
One of the challenges of pattern-oriented alerting is that it's only able to find something that has occurred earlier. Of course, it can also understand the trends of those patterns, but the pattern still needs to have occurred before. Anomaly detection can find those anomalies that could endanger our system's service levels and up-time before they have ever occurred. It can, for example, predict that "next Friday at 2:00 p.m. there will be too high processor queues, too high CPU levels, too low page life expectancy, etc." that would slow down your system.
Like the other proactive monitoring methods, anomaly detection lets you predict the future. On the other hand, it can sometimes cause false alerts if the prediction is not right. In any case, anomaly detection can give you a really long time window to react to the alert. For example, you could find out already on the previous day that at 2:00 p.m. the next day there will be, with a certain probability, a problem in your production environment.
What's next?
In the next part of this series I will tell you about the future of SQL Server monitoring methods, and we will summarize the series.
Read the last part of the blog series here – Monitoring Mission Critical SQL Servers - Part 11-12
Jani K. Savolainen
Founder & CTO
DB Pro / SQL Governor