Predictive monitoring features in SQL Governor

If you had an expensive office full of people and valuable stuff in it, wouldn’t it be totally crazy not to take care of your fire alarm system? But what if you were able to predict a potential fire even many days before it would actually occur? At its best, this would give enough time to prevent the fire and therefore save priceless lives and valuable resources.

In the world of databases, SQL Servers hold the most valuable and business-critical data - discontinuity in the workings of the data platform can often lead to severe financial losses or even worse. This is the fact, especially in so-called hard systems – the environments taking care of peoples’ health and safety.

For this concrete business need, SQL Governor has developed predictive monitoring features for years. These features include:

  • Regression-based alerts
  • Machine Learning-based pattern alerts
  • Time-series-based anomaly detection for monitoring and T-SQL query plans
  • Early Warnings

All these predictive alerting mechanisms together give good coverage on recognizing potential SQL Server threats and unavailability situations in various business scenarios.

 

Regression-based alerts

The idea of a regression-based alert is very simple: It follows the average momentum of a server-, instance, and database monitoring performance counters over time and forecasts when an alert threshold level is predicted to exceed, based on a linear regression model. Especially memory and CPU leaks, as well as constant dropping of the SQL Server instance Page Life Expectancy, are often caught by this type of alert. The forecast time window is adjustable in hours.

 

Example of linear regression data.

Linear regression data

 

Machine Learning-based pattern alerts

These alerts work at their best in short-term alerting. This feature is able to identify undesired growth patterns in monitoring performance counter data just before it happens. You will also get a probability calculation on potential threats to actualize, such as 90%. These patterns are componentized into several distinct data points and the algorithm predicts not only the continuity of a pattern but is also able to identify the growth component of a pattern. The forecast time window is self-adjusting and is typically from 5 to 60 minutes, therefore having the best fit for a DBA's “fire fighting” scenarios.

 

Pattern alert in action

Pattern alert in actions

 

Time-series-based anomaly detection (international patent pending)

These warnings are based on so-called brute-forcing of hour-series regression curves fitting over different time dimension levels. The algorithm inspects the following continuity in the monitoring data time series:

  • Consecutive weekdays within any given hour
  • Same weekdays of consecutive weeks within any given hour
  • First days of consecutive months within any given hour
  • Last days of consecutive months within any given hour

So, for example, if there is a good fit, based on R-squared analysis, in regression over the server CPU usage % trend of following Fridays at 4 pm, an alert is thrown if the threshold level is exceeded for the following Friday at 4 pm. The forecast time window is typically from 1 day to 30 days.

This same feature can be applied against a selected problematical database to identify bad T-SQL query plans. It inspects the following performance counters in a similar fashion on a T-SQL batch/query plan basis:

  • EXECUTIONS_PER_SEC
  • WORKER_TIME_PER_SEC
  • PHYSICAL_READS_PER_SEC
  • LOGICAL_READS_PER_SEC
  • LOGICAL_WRITES_PER_SEC
  • ELAPSED_TIME_PER_SEC
  • ROWS_PER_SEC

 

One way to visualize brute-forcing paths of anomaly detection.

Anomaly detection brute forcing paths

 

Early Warnings (international patent pending)

The idea of early warnings is to identify such server performance counters, which are changing in a worse direction over time in terms of workload type from a statistical viewpoint. The change in workload type is inspected from the preceding month and baselining is done from the time series 3 months before that. The following statistical components are inspected:

  • Regression slope
  • Workload Volatility
  • Skewness
  • Kurtosis

Based on these variables, the algorithm is able to identify performance counters which are at preliminary risk: We want to understand why the workload is changing in a worse direction before it starts to affect the system performance.

Early warnings

As you can see, all these predictive alerting mechanisms together compose a strong base to prevent many of the potential threats on your SQL Servers, instances, and databases before they even exist.

Jani K. Savolainen

Founder & CTO

SQL Governor