In the third part of the SQL Server capacity planning blog series, we conduct a best fit analysis of the environment based on the heatmap analysis and right-size the servers.
Be sure to check out the previous blogpost of this series where we conducted a thorough heatmap analysis of the SQL Server FCI environment and some interesting findings on server CPU usage were made.
As a reminder, here's the distribution of the actual server level CPU usage on discrete scale:
Right-sizing the servers
Because of the high CPU utilization on SOURCEDB2, we want to drill down into server log details and check out the processor queues related to the CPU utilization as well. As a rule of thumb for critical server workloads, there should not be...
- consecutive time series having more than 3 times the logical CPU core count in terms of processor queue length, and
- no more than 60% of peaking CPU usage over equal or greater than 4 consecutive minutes of log data in 2-minute sampling sets.
Let’s dig deeper into the CPU side of these servers next.
In “Best fit” -functionality of the SQL Governor software, we can calculate the right size for existing servers without changing the hardware or VM setup, in order to quickly estimate the current CPU needs on our SQL Servers. Correct and accurate calculations are naturally very important when it comes to the SQL Server licensing.
With SQL Governor, we can select the criticality level of the servers, the monitoring data date range, and for how many years we want the current server workloads to be forecasted.
It is also possible to bypass the trend calculation for the time series data. For example, in the case of short-term snapshot data such as sudden true-up with recently installed software and time series data from only a couple of weeks or so. In such a case, no trustworthy trend for long-term resource usage can yet be calculated, so we need to bypass it.
By analysing the last 12 months of monitoring time series data, taking in count the growth trend of the CPU workloads, we notice that instead of 16 cores, the current server should have 27 calculational cores (round this up to 28) to fulfil the requirements given for total server level workload. As mentioned earlier in the analysis, the requirements include a maximum of 40% of average workload and maximum of 60% of peaking workload.
Also, as a remarkable finding, we can see that current technical service levels are very low. Only about 28% of all the average CPU rows and 14% of all the peaking CPU rows are within the agreed limits. On top of that, the maximum processor queue is very high (over 21% of all the monitoring data rows exceed allowed threshold level) and and maximum processor queue to CPU utilization ratio is way too high (71 / 76%). This is a sign of CPU contention.
The good news is that there isn't an uplifting trend for the CPU usage % for this server monitoring data. This can be seen from the results below: if the trend is negative or 0, we are not taking that in count in our capacity planning project.
From the results above, we can see that even though SOURCEDB1 is in much better shape than SOURCEDB2 according to the CPU utilization levels and processor queues, it still has some high peaking CPU utilization (100 / 115%).
The CPU utilization service times look much better than on SOURCEDB2: average CPU service level is 99.74% and peaking CPU service level is 95.72%. There is room for improvement, though.
On the other hand, too high processor queue lengths are relatively rare and somewhat reasonable (4 x logical core count). Therefore, significant slowness of the CPU workload processing should occur very rarely during those high peaks – less than or equal to 0.09% of time, to be exact – and according to the processor queue statistics, bad QUEUE TO CPU ratio is never exceeded, which is better than high CPU at the same time.
To be able to fill the technical service levels of this server, we would need 24 CPU cores instead of 12.
Conclusion of the heatmap and best fit analysis
All in all, it is obvious that combining the Active-Active FCI workloads of the SOURCEDB1 and SOURCEDB2 would be in a worst case fatal, due the hard-hitting workloads on SOURCEDB2 combined with some high peaks on SOURCEDB1. We should next investigate the causality in between the weekdays in the system, before going to the actual cluster failover analysis.
Note! It is also possible to analyse the probability of the CPU time series collision in between the Active-Active server nodes in SQL Governor, but that is beyond the scope of this blogpost.
SQL Governor process for SQL Server capacity planning
With SQL Governor software, the overall process for the capacity planning is as follows:
- Diagnosing existing FCI nodes and instances
- Conducting a heatmap analysis of the environment
- Conducting best fit analysis of the environment
- Conducting a recurring workload analysis of the environment
- Conducting FCI failover analysis and selecting the right-sized hardware
In my next blogpost, I will go through the step 4: Conducting recurring workload analysis of the environment.
Founder & CTO