This blog post continues my series on advanced capacity planning techniques for data migration. In my previous post I introduced the topic. This post is about scaling up on-premises Microsoft SQL Servers.
Sometimes the business need is straightforward: there is an urgent need to improve a certain SQL Server's performance due to problems with workload capacity. Users may have complained about the system slowing down or applications not responding to their requests. Most of the time, scaling up the SQL Server is the easiest and fastest way to overcome this obstacle. That said, acquiring new hardware parts for an existing server can be very time consuming, depending on your own stock and the market situation.
Reasons for this can vary:
Constant growth on Microsoft SQL Server workloads
This root cause is typically due to users, transactions, and/or data growing over time. With predictive monitoring software such as SQL Governor, it is easy to follow SQL Server telemetry data, watch the workloads evolve, and forecast how much capacity is needed over time.
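To make the idea concrete, here is a minimal sketch of trend-based capacity forecasting: fit a least-squares line to historical monthly CPU averages and extrapolate it forward. This is an illustrative simplification, not SQL Governor's actual algorithm, which also accounts for seasonality and volatility.

```python
# Minimal sketch: forecast future CPU demand from historical monthly
# averages using a least-squares linear trend (illustration only).

def forecast_cpu(history, months_ahead):
    """history: monthly average CPU % values, oldest first."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + months_ahead)

# Example: average CPU has grown about 2 percentage points per month.
history = [30, 32, 34, 36, 38, 40]
print(round(forecast_cpu(history, 6), 1))  # projected average 6 months out: 52.0
```

If the projected value crosses your healthy baseline threshold within the planning horizon, that is the signal to schedule a scale-up before users notice.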
Preparing for a seasonal workload peak
As is well known, the great majority of SQL Server workloads are very seasonal and predictable over time. In many businesses there are seasonal peaks that challenge SQL Server performance for a short but critical period. A good example is a Black Friday sale, which can be very intensive for web shops and the like.
Poor Microsoft SQL Server capacity planning
Based on my experience, about 30% of all SQL Server performance problems are due to poor capacity planning. Reasons include overprovisioning of VMs, or insufficient CPU, RAM, or storage capacity. With predictive capacity planning software such as SQL Governor, you can mitigate these problems significantly by allocating the right amount of hardware resources for the predicted workload over a given period and scaling up when needed.
Microsoft SQL Server related issues
Quite often, in my experience about 70% of the time, SQL Server related issues are the ones that lead to poor server performance. Root causes in these cases vary a lot. Some of the most typical problems are SQL Server or application version upgrades, poor indexing or index maintenance, suboptimal instance configurations or database options, slow T-SQL queries, or even SQL Server licensing, when there is simply too much data for the Standard edition of SQL Server.
Growth burst in business
Sometimes business grows faster than anticipated. These situations may lead to insufficient hardware resources due to rapidly increased workloads. Such events can be hard to anticipate, and the early warning feature in SQL Governor software helps address these hidden troublemakers. Still, sometimes even scaling up is not enough. This is when data platform elasticity comes into question and scaling out (for example to a hybrid cloud) would be a better option. I will cover this in a later blog post, which focuses on scaling out workloads.
Microsoft SQL Server estate TCO and Time-To-Solution
From the total cost of ownership standpoint, scaling up SQL Server is typically a low-cost measure that can improve performance significantly, up to a certain point. For example, if you have four CPU sockets and only two of them are in use, it is easy to double the existing computational power just by adding more processors, and to double the RAM capacity as well, within the limits of your server hardware. It is also trivial to scale up VM provisioning, for example from 8 virtual cores to 16, and to assign some extra RAM. Of course, adding processors or virtual cores will most likely increase your SQL Server license expenditure, especially with SQL Server Enterprise licenses, so quite often SQL Server performance optimization is a sufficient alternative to scaling up when it comes to overall costs. Also, when adding RAM, note that Standard edition instances have a 128 GB limit on buffer pool memory, so adding memory may create the need to upgrade from Standard to Enterprise edition.
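The Standard edition buffer pool cap is worth checking before buying memory. The sketch below illustrates the arithmetic; the 8 GB OS reserve is an assumed placeholder, and real sizing should follow your own OS and workload reserves.

```python
# Sketch: check how much of a planned RAM upgrade a SQL Server instance
# can actually use for the buffer pool. The OS reserve is an assumption.
STANDARD_BUFFER_POOL_LIMIT_GB = 128  # Standard edition buffer pool cap

def usable_buffer_pool_gb(server_ram_gb, os_reserve_gb=8, edition="Standard"):
    """Rough estimate of buffer pool memory an instance can use."""
    available = server_ram_gb - os_reserve_gb
    if edition == "Standard":
        return min(available, STANDARD_BUFFER_POOL_LIMIT_GB)
    return available  # Enterprise: limited only by what the OS provides

# Doubling server RAM from 128 GB to 256 GB:
print(usable_buffer_pool_gb(128))                        # 120
print(usable_buffer_pool_gb(256))                        # 128 -> extra RAM wasted on Standard
print(usable_buffer_pool_gb(256, edition="Enterprise"))  # 248
```

The example shows why a RAM upgrade on Standard edition can silently buy you nothing: past the cap, only an edition upgrade unlocks the added memory for the buffer pool.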
If the bottleneck is on the storage side, scaling up may not be as trivial a task and will need more analysis with software such as SQL Governor, which can show the current state and the predicted needs for storage workload elements such as:
- Disk I/O
- Reserved and allocated capacity
- Disk latencies
Storage can be costly, regardless of whether you own it or rent it from a Managed Service Provider (MSP), so knowing your optimized storage needs now and in the future will give you optimized CAPEX/OPEX on storage costs over a longer period. Scaling up storage can be fast, especially on a hyperconverged platform such as Nutanix with some storage reserve.
I strongly recommend investing in comprehensive SQL Server performance and capacity optimization software such as SQL Governor to optimize your scale-up needs over the whole lifecycle of your server and storage hardware. This can reduce your TCO significantly, resulting in up to 50-60% savings on licensing and hardware costs, as well as a smaller carbon footprint and lower electricity and rack space costs.
Hint: Never blindly scale up your server hardware without first diagnosing the root cause and, where possible, fixing the underlying SQL Server issue first. This way your long-term TCO is always optimized!
Referential TCO / Time-To-Solution matrix.
Typically, company politics do not affect scaling up SQL Servers that much. Still, in some companies I have come across a few policy scenarios, some of them even surprising; it is just the way it goes:
- Always scale up as much as possible before migrating to new server(s).
- If there is insufficient server capacity, check and optimize SQL Server configurations and T-SQL code before acquiring any additional hardware components.
- Always buy bigger / better hardware when facing performance / availability issues.
Data center and data platform architecture
In general, data center costs can differ greatly depending on whether you have your own data center or use an MSP or public cloud vendor. CAPEX vs. OPEX structuring is also very different in these scenarios: an on-premises data center deals more with CAPEX, whereas the public cloud uses a pay-as-you-go model (OPEX). When scaling up your SQL estate, costs are only slightly adjusted. A certain minimalism applies to the data architecture as well: typically, nothing needs to be changed.
If you upgrade your SQL Server licensing from Standard to Enterprise edition, in most cases you will need to refactor and optimize your databases and T-SQL code to take advantage of the Enterprise features. Some Enterprise features, such as Resource Governor, require changes to your existing SQL Server configuration and code. Microsoft SQL Server 2022 Enterprise edition brings a bunch of performance and scale enhancements, such as intelligent database features, NUMA-aware and large-page memory and buffer array allocation, and in-memory database improvements for better OLTP workload handling.
Other technical constraints
Sometimes, when scaling up a SQL Server, the hardware can be so old that it is wiser to buy a new server with faster CPUs and hardware support that is still in effect.
From a non-functional perspective, scaling up your SQL Servers gives you better performance and availability with relatively small effort, in most cases.
How much would I need to scale up, and when should I scale down?
How much to scale your SQL Server up or down depends on the following factors:
- What is my current baseline?
- What is the business criticality of my SQL Server?
- What technical SLA level do we need to reach on a daily / monthly / annual basis?
- What is the workload growth rate over time?
- When do we have service time?
- When do we have patching / service breaks?
All these questions are easy to answer with SQL Governor software. You can follow your workload baselines, set target workload levels for a given server with a selected criticality, and calculate future workload needs based on trends, service time, service breaks, and seasonality data of your SQL Server.
What is a healthy Microsoft SQL Server baseline?
Typically, baselines can be divided into 3 different categories:
- Critical workloads (i.e., hard systems)
- Medium criticality workloads (such as CRM, DW)
- Low criticality workloads (test, development)
There are many things that affect the definition of a healthy baseline, such as workload volatility. I prefer to use the following conservative guidelines for CPU workload baselines in general:
- For critical workloads, SQL Server average CPU should not exceed 30 to 40% and maximum CPU should not exceed 60% over any consecutive 5 minutes more than 0.1% of all service time.
- For workloads of medium criticality, SQL Server average CPU should not exceed 40 to 50% and maximum CPU should not exceed 70% over any consecutive 5 minutes more than 0.5% of all service time.
- For workloads of low criticality, SQL Server average CPU should not exceed 60% and maximum CPU should not exceed 80% over any consecutive 5 minutes more than 1% of all service time.
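The guidelines above can be expressed as a simple health check. The sketch below is my own illustration of the rules, not SQL Governor code: it takes per-5-minute CPU averages collected over service time and verifies both the average limit and the allowed share of peak breaches.

```python
# Sketch: check 5-minute CPU samples against the baseline guidelines above.
# Per criticality: (average CPU limit %, peak limit %, allowed breach share).
GUIDELINES = {
    "critical": (40, 60, 0.001),  # peaks over 60% in at most 0.1% of windows
    "medium":   (50, 70, 0.005),  # peaks over 70% in at most 0.5% of windows
    "low":      (60, 80, 0.010),  # peaks over 80% in at most 1% of windows
}

def baseline_healthy(samples, criticality):
    """samples: average CPU % per consecutive 5-minute service-time window."""
    avg_limit, peak_limit, allowed = GUIDELINES[criticality]
    avg = sum(samples) / len(samples)
    breach_share = sum(s > peak_limit for s in samples) / len(samples)
    return avg <= avg_limit and breach_share <= allowed

# Low-criticality server averaging ~50% with one 85% spike in 200 windows:
samples = [50] * 199 + [85]
print(baseline_healthy(samples, "low"))       # True  (0.5% of windows > 80%)
print(baseline_healthy(samples, "critical"))  # False (average exceeds 40%)
```

The same data can pass a low-criticality baseline and fail a critical one, which is exactly why criticality classification comes before the thresholds.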
Excerpt from SQL Governor: Healthy, predictable CPU baseline (~50%) on low criticality server.
Excerpt from SQL Governor: Technical SLA prediction, CPU usage / month to date.
As for SQL Server memory baselines, a good rule of thumb is to leave at least a 10% extra memory reserve in addition to the OS reserve and the SQL Server allocation. Furthermore, because SQL Server is a memory hog, you may want to reserve future RAM in the same proportion as the growth rate of the SQL databases on that server over time. One way to estimate SQL Server RAM usage is to inspect buffer pool and plan cache allocations at the database level. This is doable in SQL Governor software.
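The rule of thumb above reduces to simple arithmetic. This sketch, under the stated assumptions (SQL RAM grows in proportion to database growth, plus the OS reserve, plus a 10% buffer), shows one way to size the target:

```python
# Sketch of the memory rule of thumb: grow the SQL Server allocation with
# the expected database growth ratio, add the OS reserve, then leave at
# least a 10% extra reserve on top.

def required_ram_gb(sql_alloc_gb, os_reserve_gb, db_growth_ratio):
    """db_growth_ratio: e.g. 0.25 for 25% database growth over the period."""
    future_sql = sql_alloc_gb * (1 + db_growth_ratio)  # scale SQL RAM with data
    return (future_sql + os_reserve_gb) * 1.10          # plus 10% reserve

# 96 GB allocated to SQL Server, 8 GB OS reserve, 25% data growth expected:
print(round(required_ram_gb(96, 8, 0.25), 1))  # 140.8
```

So a server comfortably running on 128 GB today would, on these assumptions, want roughly 141 GB by the end of the period, pointing at a 160 GB or 192 GB configuration in practice.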
It is equally important to understand storage needs at the SQL Server, instance, and database level, even down to the data file volume allocations. Those baselines should be compared to the current storage specifications. I personally prefer the following storage volume latency guidelines for SQL Servers:
- 0–1 milliseconds: Excellent storage performance
- 2–10 milliseconds: Good storage performance
- 11–20 milliseconds: Satisfactory storage performance
- 21–50 milliseconds: Poor storage performance
- 50+ milliseconds: Problematic storage performance
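For monitoring scripts, the bands above translate directly into a classifier. A minimal sketch of that mapping:

```python
# Sketch: map an average volume latency (milliseconds) to the rating
# bands listed above.
def latency_rating(ms):
    if ms <= 1:
        return "excellent"
    if ms <= 10:
        return "good"
    if ms <= 20:
        return "satisfactory"
    if ms <= 50:
        return "poor"
    return "problematic"

print(latency_rating(0.8))  # excellent
print(latency_rating(15))   # satisfactory
print(latency_rating(65))   # problematic
```

Running such a classifier per volume, rather than per server, is what exposes the one mis-provisioned data file volume hiding behind an otherwise healthy average.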
Identifying CPU contention in SQL Server capacity planning
One of the worst problems regarding SQL Server capacity is a phenomenon called CPU contention. This happens when the SQL Server CPU is too busy to handle all the requests. On a physical server, this means an average CPU workload over 80% while the average CPU queue is simultaneously at least two times the count of logical CPU cores, for 5 consecutive minutes or more. On a virtual server, the CPU queue limit is typically a bit higher (three times or more the count of logical CPU cores). SQL Governor can identify such events and calculate technical SLAs for them. CPU contention slows down SQL Server and can even halt the whole server, so by default you should aim for a 100% SLA on a critical server and better than 99.95% on even a medium criticality server.
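The contention definition above is a sustained-window condition, so a naive "any sample over 80%" check would over-report. This sketch, assuming one-minute samples, implements the rule as stated (it is an illustration, not SQL Governor's detector):

```python
# Sketch: flag CPU contention per the definition above - average CPU over
# 80% AND a CPU queue of at least `factor` x logical cores, sustained for
# 5 or more consecutive one-minute samples (factor 2 physical, 3 virtual).

def has_cpu_contention(cpu_pct, queue_len, logical_cores, virtual=False,
                       window=5):
    factor = 3 if virtual else 2
    streak = 0
    for cpu, q in zip(cpu_pct, queue_len):
        if cpu > 80 and q >= factor * logical_cores:
            streak += 1
            if streak >= window:
                return True
        else:
            streak = 0  # condition must hold without interruption
    return False

# 8-core physical server, both conditions hold for 6 straight minutes:
cpu   = [85, 90, 88, 92, 95, 86, 40]
queue = [20, 18, 17, 22, 25, 19, 2]
print(has_cpu_contention(cpu, queue, logical_cores=8))  # True
```

Note the streak reset: a single healthy sample breaks the window, which is what distinguishes genuine contention from a short, harmless burst.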
Excerpt from SQL Governor: Trend-extrapolated workload prediction for next one year on a server having growing CPU trend.
Excerpt from SQL Governor: Workload latency issues on server storage workload prediction.
Excerpt from SQL Governor: CPU contention analysis on SQL server right sizing.
Do you have your Microsoft SQL Servers under continuous monitoring and optimization? If not, contact me, and we can discuss how to improve your TCO and SLAs with SQL Governor software!
Jani K. Savolainen
CEO & Chairman,