It was supposed to be like any other day at the office. Then something went wrong, and a huge number of users tried to contact your help desk at once. You were not prepared for the sudden spike in traffic, and neither was your server, which eventually crashed. The whole team was assigned to find and fix the problem, which you eventually did, but not before your service had been down for two hours. The root cause turned out to be an unnoticed bottleneck in a database server.
Does the story sound familiar? It is not uncommon for IT organizations to work in reactive mode, solving issues once they happen instead of proactively preventing them. But would it be possible to prevent a situation like the one described above from happening at all?
Why is it so easy to get stuck in reactive mode?
There are various reasons why organizations end up managing their database platform in reactive mode.
In IT infrastructure, priorities often center on the application level, the side that’s visible to users and customers. With other priorities always keeping you busy, it is tempting to leave the database servers to run on their own until something goes wrong.
Sometimes you might lack the competence or tools to help you in the process. Do you have the knowledge and resources to take proper care of your SQL Servers? Do you use modern tools, or have you settled for basic options supplemented by manual work?
Why does it make sense to be more proactive?
The key difference between reactive and proactive management is whether you react to problems once they happen or identify them from early indicators and prevent them from happening.
Preventing problems can save the business a lot of money. Have you calculated what unexpected server downtime costs in your situation? Besides saving money, a more stable and reliable service keeps your users and customers happier.
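To make the cost question concrete, here is a minimal back-of-the-envelope sketch in Python. The function name, the two cost components (lost revenue and staff time spent on recovery), and all the figures are illustrative assumptions, not numbers from any real incident; your own calculation may need other components, such as contractual penalties.

```python
def downtime_cost(hours_down, revenue_per_hour, staff_count, staff_hourly_rate):
    """Rough outage cost: lost revenue plus staff time spent on recovery.

    All inputs are assumptions you supply for your own business.
    """
    lost_revenue = hours_down * revenue_per_hour
    recovery_labor = hours_down * staff_count * staff_hourly_rate
    return lost_revenue + recovery_labor


# A two-hour outage with made-up figures: 5000 per hour in lost revenue
# and six people at 80 per hour working on the fix.
cost = downtime_cost(hours_down=2, revenue_per_hour=5000,
                     staff_count=6, staff_hourly_rate=80)
print(cost)  # 10960
```

Even a rough figure like this gives you something to compare against the cost of preventive work.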
It is surprisingly common to assume that the probability of a database server failure is small or almost nonexistent, because SQL technology is considered very robust. However, the reliability of database servers tends to degrade over time: the more they are used and the more queries they process, the more wear accumulates. First, system and database settings that were ideal when the server was first put into use may no longer match the current situation. Second, the sheer volume of database queries leads to disorder, a bit like the mess left in the kitchen after cooking. Third, growth in data and users can exhaust resources and make data recovery more difficult.
It is true that with good luck a database server can run smoothly for a long time, but the risk of problems gradually increases, and one day the server can slow down seriously or crash. It is impossible for a system administrator or DBA to anticipate every scenario that could affect the databases.
For these reasons, it pays to give the database platform enough attention before any problems occur.
How to implement a proactive process for data platform management
Ready to start being more proactive? Follow these steps and you are well on your way.
- Map out your SQL estate.
If you haven’t mapped out your SQL estate for a long time, or ever, this is where you should start. You need to understand what servers you have and what they’re used for, their age and version, their workload and capacity situation, and what the overall architecture looks like.
- Conduct diagnostics and a health check.
Platform diagnostics show the current performance and availability status of your database servers. A proper health check covers all critical aspects, such as the state of the system settings and the condition of the database servers. It reveals the underlying problems and risks affecting throughput and reliability, and creates a baseline for improvement actions. It is good practice to run diagnostics and health checks regularly to keep the database platform in optimal condition.
- Anticipate and prevent risks.
Create a Disaster Recovery Plan (DRP). A DRP describes how to recover the IT infrastructure in the event of a disaster. It increases reliability, helps prevent problems, and makes problem solving more effective when something does go wrong. Also, use High Availability (HA) solutions where applicable to keep server operation durable and resilient to failures. In an HA solution, elements such as fault-tolerant hardware (clusters) and replicated databases (availability groups) are used to increase the availability, and in some cases even the scalability, of the system.
- Optimize the platform capacity and architecture—don’t rely on guesswork.
It is important to run the servers at optimal capacity levels. If server capacity falls short, you will run into performance issues. If there is too much of it, costs end up unnecessarily high.
- If you are running a large platform, consolidate.
A large, scattered data platform multiplies the risk of problems. A compact platform is easier to manage, in terms of both administration and disaster recovery. Backup and failover investments also become more cost-effective.
- Implement modern tools for data platform monitoring and life cycle management.
Performing all the activities listed above is very cumbersome, or even impossible, without proper tools. Modern tools bring two main benefits to data platform life cycle management. The best-in-class tools provide monitoring data and predictive alerts that let you forecast and prevent problems before they occur. They also help you drill down to the root causes when something does happen, so you can resolve the situation much faster.
- Ensure you have enough SQL-centered resources and know-how.
Assess the level of resources you have and consider adding more to become more proactive. Consider the best strategy for your business: keeping the skills in-house, outsourcing them, or some balance of the two. If you have outsourced your system administration, make sure the provider has enough senior-level knowledge of SQL Servers. You can also offset the need for additional resources by using advanced tools that automate many routine tasks.
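To give a feel for what a predictive alert from the steps above might look like, here is a minimal Python sketch that fits a straight line to daily disk-usage samples and estimates how many days remain before the disk fills up. The function names, the 14-day threshold, and the sample figures are all hypothetical, and a simple linear trend is an assumption; real monitoring tools use richer models and many more signals.

```python
def days_until_full(used_gb_per_day, capacity_gb):
    """Estimate days until usage reaches capacity, using a least-squares line.

    used_gb_per_day: one usage sample per day, oldest first (hypothetical data).
    Returns None when usage is flat or shrinking.
    """
    n = len(used_gb_per_day)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_gb_per_day) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_gb_per_day))
    slope = sxy / sxx  # estimated growth in GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index where the fitted line hits capacity, counted from the last sample.
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - (n - 1))


def should_alert(used_gb_per_day, capacity_gb, threshold_days=14):
    """Raise an alert when the forecast falls within the threshold (assumed 14 days)."""
    remaining = days_until_full(used_gb_per_day, capacity_gb)
    return remaining is not None and remaining <= threshold_days


samples = [100, 102, 104, 106, 108]  # made-up daily usage in GB
print(days_until_full(samples, capacity_gb=128))
print(should_alert(samples, capacity_gb=128))
```

The point of the sketch is the shift in timing: the alert fires while there is still time to add capacity or clean up data, rather than after the server has run out of space.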
As we have discussed, putting enough effort into proactive database platform management pays off. If you avoid the problems, you avoid the costs of unexpected downtime. You also save resources when there are fewer critical incidents, which sometimes happen outside office hours.
When your server architecture is well planned and the data platform is compact, you save on database platform life cycle costs. You also need less time and fewer resources for administration work and backups. Eventually you will see the benefits of proactiveness on the bottom line.