VIDEO BLOG SERIES: Monitoring Mission Critical SQL Servers - How to utilize alerting to predict system bottlenecks
In the previous part of this video blog series I described the different alerting mechanisms that fall under reactive monitoring. Now it's time to talk about proactive monitoring methods and how they can be used to prevent performance problems from occurring. Proactive monitoring is my favorite topic, as I think it's really the "gold nugget" of performance monitoring.
Trend based alerting mechanism: Predictive alerting
Predictive alerting is a trend-based alerting mechanism: it fits a trend, such as a linear regression line or an exponential curve, to the history of a certain performance counter.
Linear regression can be used to forecast a trend-based change in a certain performance counter. For example, if there is a steady growth trend in CPU utilization, linear regression can show, with high probability, how the growth will continue, assuming that resource utilization keeps growing linearly. This forecast data can then be used to raise a predictive alert.
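To make the idea concrete, here is a minimal Python sketch of a trend-based predictive alert. It is only an illustration under stated assumptions, not the implementation of any particular monitoring product: the sample data, the 80% threshold, the 45-day alert horizon, and the function names `fit_line` and `days_until_threshold` are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(
    days: Sequence[float], cpu_pct: Sequence[float], threshold_pct: float
) -> Optional[float]:
    """Days from the last sample until the fitted line reaches the threshold.

    Returns None when the trend is flat or falling, and 0.0 when the fitted
    line is already at or above the threshold.
    """
    slope, intercept = fit_line(days, cpu_pct)
    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical daily average CPU utilization (%), growing about 1 point per day.
days = [0, 1, 2, 3, 4, 5, 6]
cpu = [40.0, 41.2, 41.9, 43.1, 44.0, 44.8, 46.0]

remaining = days_until_threshold(days, cpu, threshold_pct=80.0)
if remaining is not None and remaining <= 45:  # hypothetical alert horizon
    print(f"Predictive alert: CPU trend reaches 80% in about {remaining:.0f} days")
```

The forecast is only as good as the assumption that the growth stays linear, which is why the alert message reports an estimate rather than a fixed date.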
It is also important to understand that predictive alerting can combine different kinds of curves, such as an exponential curve, to get the best fit for the history you have been monitoring. If the system is very volatile, it might also make sense to shorten the time window used to analyze the average momentum of the performance counter. Over a long period, linear regression could, for example, show no change in the average momentum of a performance counter, even though there actually are changes when you look at the data in a shorter time window.
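The effect of the time window can be shown with a small Python sketch. The hourly data below is made up for illustration (utilization falls for 12 hours and then climbs back), and the helper name `slope` and the 6-hour window are assumptions, not recommendations.

```python
from typing import Sequence


def slope(ys: Sequence[float]) -> float:
    """Least-squares slope of ys against the sample indices 0..n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


# Hypothetical hourly CPU utilization (%): falls for 12 hours, then climbs back.
history = [60 - 2 * h for h in range(12)] + [38 + 2 * h for h in range(12)]

long_window = slope(history)        # trend over all 24 hours
short_window = slope(history[-6:])  # trend over the last 6 hours only

print(f"24-hour slope: {long_window:+.2f} points/hour")
print(f"6-hour slope:  {short_window:+.2f} points/hour")
```

Over the full day the fall and the rise cancel out, so the fitted line is essentially flat, while the last six hours show a clear climb of about two points per hour.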
Predictive alerting gives a proactive view of the future, which makes it possible to prevent performance problems before they occur. Linear regression and exponential curves are typically good, simple statistical mechanisms for proactive predictions. The downside is that they may sometimes generate false alerts, and they may not be able to predict rapid changes. Linear regression is always a compromise across the monitored data points, so very rapid changes in, for example, CPU utilization can't necessarily be tracked. For these situations we have other mechanisms.
Pattern-oriented alerting is based on machine learning
Pattern-oriented alerting is an alerting model that is based on machine learning. That means that the algorithm is able to predict a certain behavioral pattern of a performance counter over time.
The simplest way to understand a pattern is to look at, for example, peaking CPU utilization at the server level over time. Let's take an example where the peak CPU utilization zigzags between 20% and 60% every two minutes. And it always does that, meaning that it will continue to do so with 100% probability. This is one of the simplest patterns there can be.
Pattern-oriented alerting will tell you not only the probability of a certain pattern occurring, but also when a certain threshold point will be exceeded.
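The zigzag example can be sketched in a few lines of Python. This is a deliberately naive stand-in for a machine learning model: it assumes the cycle length is already known, that the counter is sampled once a minute, and that a 50% threshold is of interest. The function names `exceed_probability_by_phase` and `next_likely_breach` are hypothetical.

```python
from typing import List, Optional, Sequence


def exceed_probability_by_phase(
    samples: Sequence[float], period: int, threshold: float
) -> List[float]:
    """For each position within the repeating cycle, the share of past
    samples at that position that were above the threshold."""
    if period < 1 or len(samples) < period:
        raise ValueError("need at least one full cycle of samples")
    hits = [0] * period
    counts = [0] * period
    for i, value in enumerate(samples):
        phase = i % period
        counts[phase] += 1
        if value > threshold:
            hits[phase] += 1
    return [h / c for h, c in zip(hits, counts)]


def next_likely_breach(
    probabilities: Sequence[float], samples_seen: int, min_probability: float
) -> Optional[int]:
    """Sampling intervals until the next slot whose learned breach probability
    is at least min_probability (1 means the very next sample)."""
    period = len(probabilities)
    for step in range(1, period + 1):
        phase = (samples_seen + step - 1) % period
        if probabilities[phase] >= min_probability:
            return step
    return None


# Assumed: one sample per minute, peak CPU alternating 20% / 60% for an hour.
samples = [20.0, 60.0] * 30

probabilities = exceed_probability_by_phase(samples, period=2, threshold=50.0)
steps = next_likely_breach(probabilities, len(samples), min_probability=0.9)

print(f"Breach probability per slot in the cycle: {probabilities}")
print(f"Next expected breach of 50% in {steps} minute(s)")
```

A real pattern-oriented mechanism has to discover the cycles itself from the log data; the sketch only shows how a learned pattern yields both a probability and a point in time.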
One constraint of pattern-oriented alerting is that you need very detailed log data from at least one month in order to understand which patterns occur there. The good news is that you don't have to understand those patterns yourself: the pattern-oriented alerting mechanism finds the most common patterns and trains the result sets. You can even re-train the result sets for the next month, the month after that, and so on, in order to improve the accuracy of the forecast and keep it up to date.
Pattern-oriented alerting is good for finding hidden patterns and predicting the future. You get more time to react, to figure out the root cause of a problem that the mechanism forecasts, and even to fix it before it occurs, ultimately resulting in higher uptime and better service levels.
Anomaly detection identifies irregularities
I think that anomaly detection is the most magical of the proactive monitoring methods. Anomaly detection can find hidden anomalies in the monitoring data by utilizing machine learning, AI, and different kinds of mathematical methods based on time dimension analysis. One way to do this is to cross-query all the levels of the time dimension over time and to predict, with different kinds of statistical mechanisms, the probability of a certain hidden event occurring. This type of forecasting would not be possible with ordinary alerts, continuous alerts, SLA-driven alerts, alert prioritization, or proactive alerts such as predictive alerts or pattern-oriented alerts.
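To give a feel for what looking at several levels of the time dimension can mean, here is a small Python sketch. It is a much simpler statistical stand-in, not the machine learning approach described above, and it scores a given value rather than forecasting future events: the value is compared against baselines built per hour of day, per weekday, and per weekday-and-hour combination. The level definitions, the `z_limit` of 3.0, the synthetic history, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Iterable, List, Tuple

Sample = Tuple[datetime, float]

# Hypothetical levels of the time dimension: each maps a timestamp to a bucket key.
LEVELS = {
    "hour_of_day": lambda ts: ts.hour,
    "weekday": lambda ts: ts.weekday(),
    "weekday_hour": lambda ts: (ts.weekday(), ts.hour),
}


def build_baselines(history: Iterable[Sample]) -> dict:
    """Mean and population standard deviation per bucket, for every level."""
    grouped = {name: defaultdict(list) for name in LEVELS}
    for ts, value in history:
        for name, key_of in LEVELS.items():
            grouped[name][key_of(ts)].append(value)
    return {
        name: {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}
        for name, buckets in grouped.items()
    }


def anomalous_levels(
    baselines: dict, ts: datetime, value: float, z_limit: float = 3.0
) -> List[str]:
    """Levels at which value deviates from its bucket mean by more than z_limit sigmas."""
    flagged = []
    for name, key_of in LEVELS.items():
        stats = baselines[name].get(key_of(ts))
        if stats is None:
            continue  # no history for this bucket
        mu, sigma = stats
        if sigma > 0 and abs(value - mu) / sigma > z_limit:
            flagged.append(name)
    return flagged


# Synthetic four weeks of hourly CPU data (%): about 50 in office hours, 30 otherwise.
start = datetime(2024, 1, 1)  # a Monday
history = []
for h in range(24 * 28):
    ts = start + timedelta(hours=h)
    base = 50.0 if 8 <= ts.hour < 18 else 30.0
    history.append((ts, base + (h % 5) - 2))  # small deterministic jitter

baselines = build_baselines(history)
friday_2pm = datetime(2024, 2, 2, 14)
print(anomalous_levels(baselines, friday_2pm, 90.0))  # far above every baseline
print(anomalous_levels(baselines, friday_2pm, 52.0))  # within normal variation
```

Checking several levels at once matters because a value can look normal for the weekday as a whole and still be unusual for that specific hour, or the other way around.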
One of the challenges of pattern-oriented alerting is that it can only find something that has occurred earlier. Of course, it can also understand the trends of those patterns, but the pattern still needs to have occurred before. Anomaly detection can find anomalies that could endanger our system's service levels and uptime before they have ever occurred. It can, for example, predict that "next Friday at 2:00 p.m. there will be too high processor queues, too high CPU levels, too low page life expectancy, etc." that would slow down your system.
Like the other proactive monitoring methods, anomaly detection lets you predict the future. However, it can also sometimes cause false alerts if the prediction is not right. In any case, with anomaly detection you can have a really long time window to react to the alert. For example, you could find out the day before that at 2:00 p.m. the next day there will be, with a certain probability, a problem in your production environment.
In the next part of this series I will tell you about the future of SQL Server monitoring methods, and we will summarize the series.
Jani K. Savolainen
Founder & CTO
DB Pro / SQL Governor