Understanding the "Health State" in the BI Platfor...

Toby_Johnston · 11-20-2014

One of the key concepts to understand in the BI Platform Monitoring application is the Health State metric. Several aspects of the monitoring application rely on this metric, and without a clear understanding of how it is supposed to work, monitoring and troubleshooting the application effectively becomes a frustrating task. In this article, I describe the concept of Health State in detail and explain how to correct the Health State in your BI Platform Monitoring application.

Health State Metric

In the Central Management Console, when creating a new watch, there are two types of Health State metrics that can be used as threshold criteria:

Server Health State - The Server Health State indicates the health of a particular server. This metric can be used to understand whether the server is up and running, whether the server is overloaded, and whether the server is still able to take additional requests. The Health State of the server can tell the BI administrator whether they need to take action to troubleshoot a problem on that particular server.

Topology Health State - The Topology Health State indicates the cumulative health of all servers of a particular type (Categories health) and also of all servers in a particular server group. The Service Categories include Crystal Reports, Analysis Services, Dashboard Services, Promotion Management Services, Core Services, Explorer Services, Connectivity Services, and SIA nodes.

How the value for Health State is determined

In the case of the Server Health State metric, the value is determined by the result of that particular server's watch. Any time you create a new server manually or use the System Configuration Wizard to create your Adaptive Processing Server configuration, the system automatically creates a new watch for each server using the naming convention NODENAME.SERVERNAME Watch. This is a "system" created watch and cannot be manually deleted. You may have noticed in the Central Management Console that the system-created server watches are also displayed, for ease of access, under CMC -> Home -> Servers -> Servers List.

Health State Evaluation

Depending on the value returned by the server's watch formula, the server health will display one of the following five states.

STATE	DEFINITION
GREEN	Server health is good and no action is necessary
AMBER	Server is slightly overloaded, nearing peak values as defined by the caution rule
RED	Server resources are over used, unable to take new requests, or the server is stopped or disabled
DISABLED	The watch is marked as disabled in the BI Monitoring application. Select the watch and click the enable button to re-enable the evaluation of this watch
FAILED	There is an error in the watch formula or the BI Monitoring service is disabled or not running
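
To make the five states concrete, here is a minimal Python sketch of how a watch result could map to a state. It is only an illustration, not the Monitoring Service's actual code: the function name `evaluate_watch`, its parameters, and the order in which the checks are made (disabled first, danger rule before caution rule) are my assumptions.

```python
from typing import Callable, Optional

def evaluate_watch(
    enabled: bool,
    caution_rule: Optional[Callable[[], bool]],
    danger_rule: Callable[[], bool],
    monitoring_running: bool = True,
) -> str:
    """Map a watch's rule results to one of the five states (illustrative only)."""
    if not enabled:
        # Assumption: a disabled watch is never evaluated.
        return "DISABLED"
    if not monitoring_running:
        # The monitoring service is disabled or not running.
        return "FAILED"
    try:
        danger = danger_rule()
        # A two state watch has no caution rule.
        caution = caution_rule() if caution_rule is not None else False
    except Exception:
        # An error in the watch formula.
        return "FAILED"
    if danger:
        # Assumption: the danger rule takes precedence over the caution rule.
        return "RED"
    if caution:
        return "AMBER"
    return "GREEN"
```

For example, `evaluate_watch(True, None, lambda: False)` models a healthy two state watch, while a rule that raises an error models a broken formula.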

Topology and Categories Health States

To give the BI administrator a quick path to troubleshoot issues in the BI Platform landscape, the server health states are aggregated into service category health states. This makes it much simpler to tell whether a particular product type is available to the end users of the system. For example, if your BI system mainly processes Crystal Reports view-on-demand requests, then achieving maximum up-time requires that all the Crystal Reports Processing Servers in the BI landscape are available to process these jobs. The Crystal Reports category health state depends on the aggregated health state of all the Crystal Reports server watches. You can see this by editing the Crystal Reports category watch formula, which contains the health state of every Crystal Reports server.

In the case of the Crystal Reports category, all of the servers required to process Crystal Reports are grouped together in the topology map so that you can tell at a glance which server watch may be causing the overall category state to change.

Fixing the Overall Health Watch and the Health State Hierarchy

On the BI Platform Monitoring Dashboard, there is an Overall Health state indicator (also known as the Consolidated Health Watch). You may have noticed that it quite often does not show a valid state (Green, Amber, or Red) and instead shows a state of Failed. To fix this, it is important to understand how this particular Health State is determined, and then to correct the underlying watch formulas that this watch depends on. In the monitoring application, there is a large hierarchy of Health State watches, and if any of these dependent watches is broken or invalid, the Overall Health will show a state of Failed. To help the BI administrator correct their BI Platform Monitoring application and Overall Health, I have created a diagram showing each level in the Overall Health hierarchy, which you can use to track down the broken watches and correct their formulas.

In this example, you can see that the Overall Health state is Failed.

If any of the dependent Health Watches below the Consolidated Health Watch is failed, then the watch in the next level up will also be failed. Therefore, you must start at the bottom of the hierarchy and correct the lowest failed watch first. In this example, the server APS 2 has a failed watch, therefore the SIA Node 2 watch is failed, the Enterprise Nodes watch is failed, and so on.

After correcting the APS 2 Health State watch formula, all of the parent watches also show a correct value and the Overall Health is Green (OK). Note that after you correct the child watch formula, you should wait a few minutes, because there is a metric refresh interval of 60 seconds (by default) at which the Monitoring Service updates the status of all watches in the system. In other words, the change in Overall Health will not happen immediately after correcting the dependent watches, so be patient.
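
The bottom-up approach described above can be sketched in a few lines of Python. This is a simplified model, not the monitoring application's implementation: the `Watch` class and the `root_causes` helper are hypothetical, the hierarchy only contains the levels named in the example, and the SIA Node 1 and APS 1 entries are added purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Watch:
    name: str
    own_failed: bool = False  # True when this watch's own formula is broken
    children: List["Watch"] = field(default_factory=list)

    def is_failed(self) -> bool:
        # A watch is failed if its own formula is broken
        # or if any watch it depends on is failed.
        return self.own_failed or any(c.is_failed() for c in self.children)

def root_causes(watch: Watch) -> List[str]:
    """Return the lowest failed watches, which are the ones to correct first."""
    failed_children = [c for c in watch.children if c.is_failed()]
    if not failed_children:
        return [watch.name] if watch.own_failed else []
    causes: List[str] = []
    for child in failed_children:
        causes.extend(root_causes(child))
    return causes

# Simplified hierarchy based on the example in this article.
aps1 = Watch("APS 1")
aps2 = Watch("APS 2", own_failed=True)
node1 = Watch("SIA Node 1", children=[aps1])
node2 = Watch("SIA Node 2", children=[aps2])
enterprise_nodes = Watch("Enterprise Nodes", children=[node1, node2])
overall = Watch("Overall Health", children=[enterprise_nodes])

print(overall.is_failed())   # the failure propagates to the top
print(root_causes(overall))  # the watch to repair first
```

Setting `aps2.own_failed = False` in this model corresponds to correcting the APS 2 watch formula, after which `overall.is_failed()` no longer reports a failure.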

Repairing the Server Watch formulas

When you create a new server or use the System Configuration Wizard, you will find that the automatic routine that generates the system watch is not perfect. Depending on which service you are creating, the automatically generated watch may contain the wrong server name reference or, in some cases (such as the Connection Server), the wrong metric altogether. When you edit the watch's danger rule or caution rule, the erroneous parts of the formula that need to be corrected are shown in red.

A server Health State watch should contain, at the very least, a check to make sure the server is running. Depending on the granularity you want, you can create a two state watch or a three state watch.

If you want to see an amber caution state while a server is stopping or starting, use a three state watch. If you are only interested in seeing a green state for running and red for any other state, a two state watch is sufficient. Using the server metric Server Running State, you can easily create a new server watch based on whether that server is available or not.

Server Running State Values

State	Value
Stopped	0
Starting	1
Initializing	2
Running	3
Stopping	4
Failed	5
Running With Errors	6
Running With Warnings	7

See below an example of both two state and three state watches that check for server availability. In this example, my SIA node name is NODE and the server name is SERVERNAME.

Two state watch formula:

Danger Rule

NODE.SERVERNAME$'Server Running State'!=3

Three state watch formula:

Caution Rule	NODE.SERVERNAME$'Server Running State'==1 || NODE.SERVERNAME$'Server Running State'==2 || NODE.SERVERNAME$'Server Running State'==4 || NODE.SERVERNAME$'Server Running State'==6 || NODE.SERVERNAME$'Server Running State'==7
Danger Rule	NODE.SERVERNAME$'Server Running State'==0 || NODE.SERVERNAME$'Server Running State'==5
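
To check which health state each Server Running State value produces under the two formulas above, the rules can be mirrored in a short Python sketch. The function names `two_state` and `three_state` are my own and this is not watch formula syntax. It only restates the rules so that the outcome for every value can be listed side by side.

```python
# Server Running State values as listed in the table above.
RUNNING_STATES = {
    0: "Stopped",
    1: "Starting",
    2: "Initializing",
    3: "Running",
    4: "Stopping",
    5: "Failed",
    6: "Running With Errors",
    7: "Running With Warnings",
}

def two_state(value: int) -> str:
    # Danger rule: 'Server Running State' != 3
    return "RED" if value != 3 else "GREEN"

def three_state(value: int) -> str:
    # Danger rule: state is 0 (Stopped) or 5 (Failed)
    if value in (0, 5):
        return "RED"
    # Caution rule: state is 1, 2, 4, 6 or 7
    if value in (1, 2, 4, 6, 7):
        return "AMBER"
    # Neither rule matched, so the server is Running (3).
    return "GREEN"

for value, name in RUNNING_STATES.items():
    print(f"{value}  {name:<22} {two_state(value):<6} {three_state(value)}")
```

In both variants only the value 3 (Running) yields green. The three state watch simply splits the remaining values into amber and red.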

Factoring in performance to the server health state

In some cases, such as the Central Management Server, the load on the server is used to determine the server health state. Depending on the type of server whose watch you are editing, a variety of different metrics can be used to determine load. You may also want to include thresholds for these metrics in your server watch formula, so that the server health state also depends on how well the service is performing and whether it is able to take on more jobs.
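
As a rough sketch of the idea, the following Python function combines the availability check with a load threshold. Everything here is hypothetical: `load`, `caution_at`, and `danger_at` stand in for whichever load metric and thresholds suit your server type, and are not names of actual BI Platform metrics.

```python
RUNNING = 3  # Server Running State value for "Running"

def server_health(running_state: int, load: float,
                  caution_at: float, danger_at: float) -> str:
    """Combine availability with a hypothetical load metric."""
    # Danger: the server is not running, or the load is at or past the danger threshold.
    if running_state != RUNNING or load >= danger_at:
        return "RED"
    # Caution: the server is running but nearing its peak load.
    if load >= caution_at:
        return "AMBER"
    return "GREEN"
```

With thresholds of 80 and 95, for instance, `server_health(3, 85, 80, 95)` evaluates to `"AMBER"`: the server is up, but it is close to the point where it can no longer take on more work.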

Refer to the BI Platform Administrator Guide for more information on server metrics to determine which metrics are suitable for your BI landscape.