VIDEO BLOG SERIES: Monitoring Mission Critical SQL Servers - How to utilize alerting to predict system bottlenecks
In my video blogs, I share my thoughts regarding monitoring of Microsoft SQL Server platforms. I have a long history of Microsoft SQL Server technology, from software development to business intelligence and data science, to name a few. In these first videos we will be focusing on the reasons, why SQL Servers are critical for businesses, what are the main reasons behind SQL Server performance problems, and how to solve those.
Why are SQL Servers critical for business?
For understanding the reasons, why SQL Servers are critical for businesses, let's take an example. Let's think about a large manufacturing company that has thousands of employees working in production. This production environment has an ERP database residing on SQL Server. That database, on the other hand, handles tens of millions of transactions daily, taking terabytes of data in volume.
What happens, if this system crashes? If the server would be down for more than a day, the manufacturing company would probably go out of business. And even if the downtime is only a few minutes, it will have a severe impact for the business. Even just slowness in the system would have some effect.
This is just one example of a business that has critical information and big data volumes. But there are many businesses having the same situation, and these businesses need best possible service levels and highest possible uptimes.
The main reasons behind SQL Server performance problems
Most of the SQL Server problems that you face in production are performance related. There are two main reasons behind the performance problems.
About 70% of performance issues in SQL Servers are related to bad coding. Maybe the query logic isn't written in an optimal way, maybe the schema design is not good enough, or maybe it is related to missing, unused or fragmented indexes. Then the rest 30% of the problems are caused by poor capacity planning and infrastructure issues. These issues are usually linked to the fact that the capacity plans don't meet the actual current situation.
What is then needed to solve the 70-30 problem? For fixing the 70% of performance problems that are related to bad coding, you need a proper performance monitoring software that helps to analyze the system's bottlenecks, what is the root cause of them, and thus helps in fixing those issues. For the rest 30%, you need a capacity planning software that can understand server benchmark ratios, seasonal characteristics and trends, in addition to utilize advanced mathematics such as machine learning.
What are the differences in reactive and proactive monitoring methods?
Reactive monitoring answers to the question: where do we have potential performance issues at the moment? Because of this, it is often in practice constant firefighting. You have a threshold limit that is exceeded, and then you get an alert and start fixing the issue.
Proactive monitoring, on the other hand, answers the question: where will we have potential performance issues in the future? In order to answer such question, applied mathematics is needed. With the help of statistical analysis, or even further things such as anomaly detection or pattern-oriented analysis, we can forecast problems before they occur. Even the kind of hidden problems we could not see ourselves from the data.
Stay tuned for the next part of this series, where we will dive more deep into the reactive monitoring methods!
Jani K. Savolainen
Founder & CTO
DB Pro / SQL Governor