In part one of this series I introduced the Microsoft Call Quality Methodology (CQM) and gave a high-level overview of what CQM is and how you can start your journey to improving user experience for your Lync or Skype for Business Server deployment. In Part 2 we dive into running Key Health Indicators (KHI for short) on your servers. Microsoft has published KHI guidance for Lync and Skype for Business Servers. The notion is that if your servers are not running optimally, you can expect a less than desirable user experience.
KHIs, or Key Health Indicators, are a refined list of critical Performance Counters (with accompanying thresholds) considered vital to the health of a Lync Server 2013 or Skype for Business Server deployment.
The KHIs are grouped into three categories or “rings” which outline the priority of the KHIs. While all the KHIs are important, the logical groupings correspond to which KHIs are expected to have the biggest impact.
Ring 0 comprises all System KHIs (e.g. CPU, Disk, Memory and Network). When KHIs in this ring are beyond threshold, there is a high likelihood that KHIs in Ring 1 and Ring 2 are also beyond threshold. Therefore, when developing a remediation plan, focus first on Ring 0 issues before taking action on KHI issues in the other rings.
Ring 1 mostly contains request queue latencies, SIP problem indicators and network response queues. Often KHI problems in Ring 0 contribute to Ring 1 issues.
Ring 2 is all remaining KHIs.
Deploying and Running KHI on your Servers
Start by downloading the Key Health Indicators for Lync Server 2013 and Skype for Business Server 2015. The download contains the bits you need to run the KHI as well as a good explanation of how to do this. The basic steps to run the KHI on your servers are as follows:
1. Create the Performance Monitor Data Collectors on each server. This is done by running the following on each server:
#Create the KHI Data Collector on the local server (use the -version value that matches your deployment)
.\Create_KHI_Data_Collector.ps1 -version Skype4B
.\Create_KHI_Data_Collector.ps1 -version LyncServer2013
#Create the KHI Data Collector on a remote server
.\Create_KHI_Data_Collector.ps1 -version Skype4B -computer fe1pool1.contoso.com
.\Create_KHI_Data_Collector.ps1 -version LyncServer2013 -computer fe1pool1.contoso.com
The above step should be done on all Lync/Skype servers in your environment and only needs to be done once.
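With more than a handful of servers, it can help to generate the command lines from a server list rather than typing each one. The following is a minimal sketch, assuming the script is in the current folder and accepts the -version and -computer parameters shown above; the function name Get-KhiCreateCommand and the server names are hypothetical.

```powershell
# Hypothetical server list; replace with the FQDNs of your own servers.
$servers = 'fe1pool1.contoso.com', 'fe2pool1.contoso.com'

# Builds one Create_KHI_Data_Collector.ps1 command line per server.
# It only returns strings, so nothing runs until you choose to run it.
function Get-KhiCreateCommand {
    param(
        [Parameter(Mandatory)][string[]]$ComputerName,
        [ValidateSet('Skype4B', 'LyncServer2013')][string]$Version = 'Skype4B'
    )
    foreach ($name in $ComputerName) {
        ".\Create_KHI_Data_Collector.ps1 -version $Version -computer $name"
    }
}

# Review the generated command lines before running them.
Get-KhiCreateCommand -ComputerName $servers -Version Skype4B
```

Once the list looks right, each line can be run from the folder that contains the script, for example by piping the output to Invoke-Expression.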
2. Next you are ready to start a collection run. Typically, we want to run KHI sample collections for a period of 24 hours so you can get a good idea of how the servers are performing over the course of an entire day. The following shows the commands required to do this either on the server in question or remotely from a machine that has the appropriate access to all your servers:
#Start KHI Data Collector on the local server
logman start KHI
#Start KHI Data Collector on a remote server (logman takes the remote computer name with -s)
logman start KHI -s fe1pool1.contoso.com
3. Stop the collection after 24 hours.
#Stop KHI Data Collector on the local server
logman stop KHI
#Stop KHI Data Collector on a remote server (logman takes the remote computer name with -s)
logman stop KHI -s fe1pool1.contoso.com
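Starting and stopping the collector across a whole pool is the same command repeated per server. Here is a minimal sketch that builds those command lines, assuming the collector is named KHI as in the commands above and using logman's -s option to target a remote computer; the function name Get-KhiLogmanCommand and the server names are hypothetical.

```powershell
# Builds one "logman start" or "logman stop" command line per server.
# It only returns strings, so the commands can be reviewed before they are run.
function Get-KhiLogmanCommand {
    param(
        [Parameter(Mandatory)][ValidateSet('start', 'stop')][string]$Action,
        [Parameter(Mandatory)][string[]]$ComputerName,
        [string]$CollectorName = 'KHI'
    )
    foreach ($name in $ComputerName) {
        "logman $Action $CollectorName -s $name"
    }
}

# Hypothetical server list; replace with the FQDNs of your own servers.
$servers = 'fe1pool1.contoso.com', 'fe2pool1.contoso.com'

Get-KhiLogmanCommand -Action start -ComputerName $servers
```

Calling the same function with -Action stop after 24 hours gives the matching stop commands, which keeps the start and stop lists in sync.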
4. After stopping the LyncKHI Data Collectors, you will want to move the output files to a central location so you can start the analyzing process. By default they are located at c:\PerfLogs\Admin\LyncKHI\ on all the servers you deployed the KHI data collectors to. Move all the results to your workstation and keep the results organized per server to keep things easy on you as there could be a lot of data you need to go through.
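Gathering the output into one folder per server can also be scripted. The sketch below is one way to do it, assuming the default output folder mentioned above is reachable through each server's c$ administrative share and that your account has access to it; the function name Copy-KhiResult and the destination folder are hypothetical.

```powershell
# Copies the KHI output of one server into <Destination>\<ComputerName>.
# SourcePath defaults to the server's admin share (an assumption; override it
# if your output lives elsewhere or the share is not reachable).
function Copy-KhiResult {
    param(
        [Parameter(Mandatory)][string]$ComputerName,
        [Parameter(Mandatory)][string]$Destination,
        [string]$SourcePath
    )
    if (-not $SourcePath) {
        $SourcePath = "\\$ComputerName\c`$\PerfLogs\Admin\LyncKHI"
    }
    # One subfolder per server keeps the results organized.
    $target = Join-Path $Destination $ComputerName
    New-Item -ItemType Directory -Path $target -Force | Out-Null
    Copy-Item -Path (Join-Path $SourcePath '*') -Destination $target -Recurse -Force
    $target   # return the folder that now holds this server's results
}

# Example call (hypothetical server names and destination):
# 'fe1pool1.contoso.com', 'fe2pool1.contoso.com' |
#     ForEach-Object { Copy-KhiResult -ComputerName $_ -Destination 'C:\KHI-Results' }
```

Copying rather than moving leaves the originals on the servers until you have confirmed the import worked.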
Analyzing the KHI Data
Included in the KHI download is an Excel spreadsheet called Key_Health_Indicators_-_Analysis_and_Definitions_Workbook_-_v1.1.xlsm. Open this workbook and on the first tab you will find the macro that imports the KHI data you have collected. I like to keep a separate workbook for each server role, so I would have one workbook for all servers in a Front-End Pool, for example, and the same goes for Edge Pools and Mediation Pools. This keeps the files reasonably small and lets you view your server environment by server role. Once you hit the start button, it can take a while to parse all the data you are pointing it to.
After the data has been imported, you can move through the tabs of the workbook. The second tab, “KHI Definitions”, explains all the counters the workbook is using. This is useful for reference.
The third tab, “Charts”, shows the following data:
The first chart shows the total number of servers analyzed, listing the optimal vs sub-optimal servers per Ring, with Ring 0 in the center moving to Ring 2 on the outer edge. In this case, of the 18 servers analyzed, all had sub-optimal counters. The chart on the right shows the optimal vs sub-optimal counters per ring. It is important to note that the counters in the charts show the number of times a counter sample exceeded a prescribed threshold in a given sample period.
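To make that counting concrete, here is a minimal sketch of the idea, not the workbook's actual macro. The function name Measure-ThresholdExceeded, the sample values and the threshold of 80 are all made up for illustration; the real thresholds are the ones listed on the KHI Definitions tab.

```powershell
# Counts how many samples in a period are above the threshold.
function Measure-ThresholdExceeded {
    param(
        [double[]]$Samples,
        [Parameter(Mandatory)][double]$Threshold
    )
    @($Samples | Where-Object { $_ -gt $Threshold }).Count
}

# Five hypothetical samples against a hypothetical threshold of 80:
# 85, 91 and 88 are above it, so the result is 3.
Measure-ThresholdExceeded -Samples 42, 85, 91, 60, 88 -Threshold 80
```

A counter with a count of zero for every sample period would be reported as optimal; any non-zero count makes it sub-optimal for that period.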
The fourth tab, “Timeline”, shows the breakdown of counter thresholds being exceeded by timeline sample period. This is helpful in identifying the peak times when servers may be performing poorly.
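The same idea can be broken down by time of day to find busy periods. The sketch below groups threshold breaches by hour; it illustrates the concept and is not how the workbook builds its timeline. The function name Get-ExceedanceByHour, the Timestamp and Value property names, the sample data and the threshold are hypothetical.

```powershell
# Groups the samples that exceed the threshold by the hour they were taken.
function Get-ExceedanceByHour {
    param(
        [Parameter(Mandatory)][object[]]$Samples,
        [Parameter(Mandatory)][double]$Threshold
    )
    $Samples |
        Where-Object { $_.Value -gt $Threshold } |
        Group-Object -Property { $_.Timestamp.Hour } |
        Sort-Object -Property { [int]$_.Name } |
        ForEach-Object { [pscustomobject]@{ Hour = [int]$_.Name; Exceeded = $_.Count } }
}

# Hypothetical samples: two breaches at 09:00, none at 10:00, one at 14:00.
$samples = @(
    [pscustomobject]@{ Timestamp = [datetime]'2016-03-01T09:00:15'; Value = 91 }
    [pscustomobject]@{ Timestamp = [datetime]'2016-03-01T09:00:30'; Value = 85 }
    [pscustomobject]@{ Timestamp = [datetime]'2016-03-01T10:00:15'; Value = 40 }
    [pscustomobject]@{ Timestamp = [datetime]'2016-03-01T14:00:15'; Value = 88 }
)

Get-ExceedanceByHour -Samples $samples -Threshold 80
```

Hours with no breaches do not appear in the output, so the rows that do appear point straight at the periods worth investigating.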
The fifth tab, “Pivots”, shows all the sample periods where counters were sub-optimal. You can filter the view by the Ring you wish to sort by or by server name.
Lastly, the final tab, “Tables”, shows all the sample data collected, allowing you to dive in and view the raw data if required.
Remediation and Next Steps
So what’s next? Remediation will involve starting with Ring 0 and determining if you can remediate the issues flagged. For example, if the majority of the Ring 0 issues involved CPU or Disk related issues and you are in a virtualized environment, you may consider giving more CPU to the servers and increasing I/O to the disks. Once done you would re-run the KHI and see if this improves the results. Repeat the steps until all Ring 0 issues are remediated or you no longer experience the issues caused by the reported sub-optimal counter.
Move on to Ring 1 and repeat the remediation steps there, then finally do the same for Ring 2. In most cases solving Ring 0 issues will remediate Ring 1 and 2 issues. Remember that if you make changes, you will need to validate them by running another collection sample.
Once you have stabilized your environment and you are happy with your KHI results, remember to run them from time to time, perhaps once every six months, to ensure your servers remain healthy. If you have a significant change in your environment, like adding a large number of users, you will want to run the KHI collection sample again.
In the next part I will dive into running CQM reports to look at call quality health.