Key Metrics

Tetrate Service Bridge collects a large number of metrics. This page is generated from dashboards run internally at Tetrate and will be updated periodically based on best practices learned from operational experience at Tetrate and from user deployments. Each heading represents a different dashboard, and each sub-heading is a panel on that dashboard. For this reason, you may see some metrics appear multiple times.

The metrics described in this document make up a series of Grafana dashboards that can be downloaded from here and imported into your own Grafana setup.

GitOps Operational Status

Operational metrics to indicate Cluster GitOps health

GitOps Status

Shows the status of the GitOps component for each cluster.

Metric Name Labels PromQL Expression
gitops_enabled N/A
gitops_enabled

Accepted Admission Requests

Accepted admission requests for each cluster. This is the rate at which operations are processed by the GitOps relay and sent to TSB.

Metric Name Labels PromQL Expression
gitops_admission_count allowed
sum(rate(gitops_admission_count{allowed="true"}[1h])) by (cluster_name)

Rejected Admission Requests

Rejected admission requests for each cluster. This is the rate at which operations processed by the GitOps relay are rejected and therefore not sent to TSB.

A spike in these metrics may indicate an increase in invalid TSB resources being applied to the Kubernetes clusters, or errors in the admission webhook processing.

Metric Name Labels PromQL Expression
gitops_admission_count allowed
sum(rate(gitops_admission_count{allowed="false"}[1h])) by (cluster_name)
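
If you want to be alerted when rejections occur, the panel expression can be wrapped in a Prometheus alerting rule. The following is a minimal sketch, not a definitive rule: the group name, the 10-minute duration and the warning severity are assumptions to adapt to your environment.

groups:
- name: tsb-gitops-alerts
  rules:
  - alert: GitOpsAdmissionRejections
    # Any sustained rate of rejected admission requests in a cluster.
    expr: sum(rate(gitops_admission_count{allowed="false"}[1h])) by (cluster_name) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GitOps admission requests are being rejected in cluster {{ $labels.cluster_name }}"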

Admission Review Latency

Admission review latency percentiles grouped by cluster.

The GitOps admission reviews make decisions by forwarding the objects to the Management Plane. This metric helps understand the time it takes to make such decisions.

A spike here may indicate network issues or connectivity issues between the Control Plane and the Management Plane.

Metric Name Labels PromQL Expression
gitops_admission_duration_bucket N/A
histogram_quantile(0.99, sum(rate(gitops_admission_duration_bucket[1h])) by (cluster_name, le))
gitops_admission_duration_bucket N/A
histogram_quantile(0.95, sum(rate(gitops_admission_duration_bucket[1h])) by (cluster_name, le))
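
If these percentiles are evaluated frequently (for example by both dashboards and alerts), they can be precomputed with Prometheus recording rules. A minimal sketch is shown below; the group and recorded metric names are illustrative.

groups:
- name: tsb-gitops-latency-records
  rules:
  # Precomputed 99th and 95th percentile admission review latency per cluster.
  - record: cluster:gitops_admission_duration:p99
    expr: histogram_quantile(0.99, sum(rate(gitops_admission_duration_bucket[1h])) by (cluster_name, le))
  - record: cluster:gitops_admission_duration:p95
    expr: histogram_quantile(0.95, sum(rate(gitops_admission_duration_bucket[1h])) by (cluster_name, le))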

Resources Pushed to TSB

Number of resources pushed to the Management Plane.

This should be equivalent to the admission requests in most cases, but this will also account for object pushes that are done by the background reconcile processes.

Metric Name Labels PromQL Expression
gitops_push_count success
sum(rate(gitops_push_count{success="true"}[1h])) by (cluster_name)

Failed pushes to TSB

Number of resource pushes to the Management Plane that failed.

This should be equivalent to the failed admission requests in most cases, but this will also account for object pushes that are done by the background reconcile processes.

Metric Name Labels PromQL Expression
gitops_push_count success
sum(rate(gitops_push_count{success="false"}[1h])) by (cluster_name)

Resource Conversions

Number of Kubernetes resources that have been read from the cluster and successfully converted into TSB objects to be pushed to the Management plane.

The values for this metric should match the Resources Pushed to TSB. A difference between them usually indicates an issue converting the Kubernetes objects into TSB objects.

Metric Name Labels PromQL Expression
gitops_convert_count success
sum(rate(gitops_convert_count{success="true"}[1h])) by (cluster_name)

Resource conversion errors

Number of Kubernetes resources that have been read from the cluster and failed to be converted into TSB objects.

A spike in this metric indicates that the Kubernetes objects could not be converted to TSB objects and that those resources were not sent to the Management Plane.

Metric Name Labels PromQL Expression
gitops_convert_count success
sum(rate(gitops_convert_count{success="false"}[1h])) by (cluster_name)
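
Since conversion errors mean resources are not reaching the Management Plane, this panel is a reasonable candidate for alerting. A minimal sketch, with an assumed group name, duration and severity:

groups:
- name: tsb-gitops-alerts
  rules:
  - alert: GitOpsConversionErrors
    # Kubernetes objects are failing to convert into TSB objects.
    expr: sum(rate(gitops_convert_count{success="false"}[1h])) by (cluster_name) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GitOps resource conversion errors in cluster {{ $labels.cluster_name }}"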

Global Configuration Distribution

These metrics indicate the overall health of Tetrate Service Bridge and should be considered the starting point for any investigation into issues with Tetrate Service Bridge.

Connected Clusters

This details all clusters connected to and receiving configuration from the management plane.

If this number drops below 1, or a given cluster does not appear in this table, it means that the cluster is disconnected. This may happen for a brief period of time during upgrades or re-deploys.

Metric Name Labels PromQL Expression
xcp_central_current_edge_connections N/A
xcp_central_current_edge_connections
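
A disconnected cluster is usually worth alerting on. The sketch below is illustrative only; the 10-minute duration is an assumption chosen to ride out the brief disconnects expected during upgrades and re-deploys.

groups:
- name: tsb-global-config-distribution
  rules:
  - alert: ClusterDisconnectedFromManagementPlane
    # An onboarded cluster has no active edge connection to XCP central.
    expr: xcp_central_current_edge_connections < 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "A cluster has been disconnected from the management plane for more than 10 minutes"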

TSB Error Rate (Humans)

Rate of failed requests to the TSB apiserver from the UI and CLI.

Metric Name Labels PromQL Expression
grpc_server_handled_total component grpc_code grpc_method grpc_type
sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code) OR on() vector(0)

Istio-Envoy Sync Time (99th Percentile)

Once XCP has synced with the management plane, it creates resources for Istio to configure Envoy. Istio usually distributes these within a second.

If this number starts to exceed 10 seconds, you may need to scale out istiod. In small clusters, the value may be too small to be captured by the histogram buckets and may therefore appear as nil.

Metric Name Labels PromQL Expression
pilot_proxy_convergence_time_bucket N/A
histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name))
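
The 10-second guidance above can be turned into an alert. A minimal sketch, assuming a warning severity and a 5-minute duration:

groups:
- name: tsb-istio-sync
  rules:
  - alert: IstioEnvoySyncTimeHigh
    # 99th percentile proxy convergence time above the 10s guidance.
    expr: histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name)) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Istio-Envoy sync time p99 exceeds 10s; consider scaling out istiod"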

XCP central -> edge Sync Time (99th Percentile)

The MPC component translates TSB configuration into XCP objects. XCP central then sends these objects to every Edge connected to it.

This is the time, in milliseconds, taken for XCP central to send the configs to the edges.

Metric Name Labels PromQL Expression
xcp_central_config_propagation_time_ms_bucket N/A
histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))

Istiod Errors

Rate of istiod errors broken down by cluster. This graph helps identify clusters that may be experiencing problems. Typically, there should be no errors. Any non-transient errors should be investigated.

Sometimes this graph will show “No data” or these metrics won’t exist. This is because istiod only emits these metrics if the errors occur.

Metric Name Labels PromQL Expression
pilot_total_xds_internal_errors N/A
sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_total_xds_rejects N/A
sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_expired_nonce N/A
sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_push_context_errors N/A
sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_pushes type
sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
pilot_xds_write_timeout N/A
sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)

Istio Operational Status

Operational metrics for istiod health.

Connected Envoys

Count of Envoys connected to istiod. This should represent the total number of endpoints in the selected cluster.

If this number significantly decreases for longer than 5 minutes without an obvious reason (e.g. a scale-down event) then you should investigate. This may indicate that Envoys have been disconnected from istiod and are unable to reconnect.

Metric Name Labels PromQL Expression
pilot_xds cluster_name
sum(pilot_xds{cluster_name="$cluster"})
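
One way to catch a sustained drop is to compare the current count against an earlier baseline. The sketch below replaces the dashboard's $cluster variable with a per-cluster grouping; the 20% threshold and one-hour baseline are assumptions, not recommendations.

groups:
- name: tsb-istio-operational
  rules:
  - alert: ConnectedEnvoysDropped
    # Connected Envoy count is more than 20% below where it was an hour ago.
    expr: sum by (cluster_name) (pilot_xds) < 0.8 * sum by (cluster_name) (pilot_xds offset 1h)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connected Envoy count in {{ $labels.cluster_name }} dropped significantly"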

Total Error Rate

The total error rate for Istio when configuring Envoy, including generation and transport errors.

Any errors (current and historic) should be investigated using the more detailed split below.

Metric Name Labels PromQL Expression
pilot_total_xds_internal_errors cluster_name
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_total_xds_rejects cluster_name
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_expired_nonce cluster_name
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_push_context_errors cluster_name
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_pushes cluster_name type
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)
pilot_xds_write_timeout cluster_name
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) +   sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)

Median Proxy Convergence Time

The median (50th percentile) delay between istiod receiving configuration changes and the proxy receiving all required configuration in the selected cluster. This number indicates how stale the proxy configuration is. As this number increases, it may start to impact application traffic.

This number is typically in the hundreds of milliseconds. In small clusters, this number may be zero.

If this number creeps up to 30s for an extended period, istiod likely needs to be scaled out (or up).

Metric Name Labels PromQL Expression
pilot_proxy_convergence_time_bucket cluster_name
histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
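
The 30-second guidance above can be expressed as an alerting rule. A minimal sketch, with the duration and severity as assumptions:

groups:
- name: tsb-istio-operational
  rules:
  - alert: ProxyConvergenceTimeHigh
    # Median proxy convergence time has reached the 30s guidance.
    expr: histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name)) > 30
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Median proxy convergence time exceeds 30s; istiod may need to be scaled"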

Istiod Push Rate

The rate of istiod pushes to Envoy grouped by discovery service. Istiod pushes clusters (CDS), endpoints (EDS), listeners (LDS) or routes (RDS) any time it receives a configuration change.

Changes are triggered by a user interacting with TSB or a change in infrastructure such as a new endpoint (service instance/pod) creation.

In small, relatively static clusters, these values can be zero most of the time.

Metric Name Labels PromQL Expression
pilot_xds_pushes cluster_name type
sum(irate(pilot_xds_pushes{cluster_name="$cluster", type=~"cds|eds|rds|lds"}[1m])) by (type)

Istiod Error Rate

The different error rates for Istio during general operations, including the generation and distribution of Envoy configuration.

pilot_xds_write_timeout Rate of connection timeouts between Envoy and istiod. This number indicates that an Envoy has taken too long to acknowledge a configuration change from Istio. An increase in these errors typically indicates network issues, Envoy resource limits or istiod resource limits (usually CPU).

pilot_total_xds_internal_errors Rate of errors thrown inside istiod whilst generating Envoy configuration. Check the istiod logs for more details if you see internal errors.

pilot_total_xds_rejects Rate of rejected configuration from Envoy. Istio should never produce any invalid Envoy configuration, so any errors here warrant investigation, starting with the istiod logs.

pilot_xds_expired_nonce Rate of expired nonces from Envoys. This number indicates that an Envoy has responded to the wrong request sent from Istio. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU).

pilot_xds_push_context_errors Rate of errors setting up a connection with an Envoy instance. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU). Check istiod logs for further details.

pilot_xds_pushes Rate of transport errors sending configuration to Envoy. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU).

Metric Name Labels PromQL Expression
pilot_total_xds_internal_errors cluster_name
sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m]))
pilot_total_xds_rejects cluster_name
sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m]))
pilot_xds_expired_nonce cluster_name
sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m]))
pilot_xds_push_context_errors cluster_name
sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m]))
pilot_xds_pushes cluster_name type
sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) by (type)
pilot_xds_write_timeout cluster_name
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m]))

Proxy Convergence Time

The delay between istiod receiving configuration changes and a proxy receiving all required configuration in the cluster, broken down by percentiles.

This number indicates how stale the proxy configuration is. As this number increases it may start to affect application traffic.

This number is typically in the hundreds of milliseconds. If this number creeps up to 30s for an extended period of time, it is likely that istiod needs to be scaled out (or up) as it is likely pinned up against its CPU limits.

Metric Name Labels PromQL Expression
pilot_proxy_convergence_time_bucket cluster_name
histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
pilot_proxy_convergence_time_bucket cluster_name
histogram_quantile(0.90, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
pilot_proxy_convergence_time_bucket cluster_name
histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
pilot_proxy_convergence_time_bucket cluster_name
histogram_quantile(0.999, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))

Configuration Validation

Success and failure rate of Istio configuration validation requests. This is triggered when TSB configuration is created or updated.

Any failures here should be investigated in the istiod and edge logs.

If there are TSB configuration changes being made that affect the selected cluster, and the success number is zero, then there is an issue with configuration propagation. Check the XCP edge logs to debug further.

Metric Name Labels PromQL Expression
galley_validation_failed cluster_name
sum(rate(galley_validation_failed{cluster_name="$cluster"}[1m]))
galley_validation_passed cluster_name
sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m]))

Sidecar Injection

Rate of sidecar injection requests. Sidecar injection is triggered whenever a new instance/pod is created.

Any errors displayed here should be investigated further by checking the istiod logs.

Metric Name Labels PromQL Expression
sidecar_injection_failure_total cluster_name
sum(rate(sidecar_injection_failure_total{cluster_name="$cluster"}[1m]))
sidecar_injection_success_total cluster_name
sum(rate(sidecar_injection_success_total{cluster_name="$cluster"}[1m]))
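
Because failed injections leave pods without sidecars, this is another candidate for alerting. A minimal sketch under assumed thresholds:

groups:
- name: tsb-istio-operational
  rules:
  - alert: SidecarInjectionFailures
    # Any sustained rate of sidecar injection failures.
    expr: sum by (cluster_name) (rate(sidecar_injection_failure_total[1m])) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Sidecar injection failures detected; check the istiod logs"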

MPC Operational Status

Operational metrics to indicate Management Plane Controller (MPC) health.

Config Update Messages every 5m

Config update messages sent over the gRPC stream from TSB and received by MPC.

This metric can help understand how messages are queued in MPC when it is under load. The value for both metrics should always be the same. If the Received by MPC metric has a value lower than the TSB one, it means MPC is under load and cannot process all messages sent by TSB as fast as TSB is sending them.

Metric Name Labels PromQL Expression
grpc_client_msg_received_total component grpc_method
sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetAllConfigObjects"}[5m])) or on() vector(0)
grpc_server_msg_sent_total component grpc_method
sum(increase(grpc_server_msg_sent_total{component="tsb", grpc_method="GetAllConfigObjects"}[5m])) or on() vector(0)

Config updates processed every 5m

The number of configuration updates received by the Management Plane Controller (MPC) that must be processed and sent to XCP.

TSB sends the config updates over a permanently connected gRPC stream to MPC, and this metric shows the number of messages received and processed by MPC on that stream.

Metric Name Labels PromQL Expression
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ConfigUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ConfigUpdates", error!=""}[5m])) or on() vector(0)

Config stream connection attempts every 5m

The number of connection (and reconnection) attempts on the config updates stream.

TSB sends the config updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name Labels PromQL Expression
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error!=""}[5m])) or on() vector(0)
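
Occasional reconnections are normal, but repeated reconnection errors suggest connectivity problems between TSB and MPC. A minimal alerting sketch; the 15-minute duration is an assumption:

groups:
- name: tsb-mpc-operational
  rules:
  - alert: MPCConfigStreamReconnectErrors
    # The ConfigUpdates stream keeps failing to (re)connect.
    expr: sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error!=""}[5m])) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Repeated connection errors on the MPC ConfigUpdates stream"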

XCP Config Push Duration

Time it took for configuration objects to be pushed to XCP.

This metric shows the time it takes for MPC to apply all the configuration objects in the XCP namespace once all the configuration objects have been received from TSB and translated into XCP objects.

Metric Name Labels PromQL Expression
mpc_xcp_config_push_time error
mpc_xcp_config_push_time{error=""} or on() vector(0)
mpc_xcp_config_push_time error
mpc_xcp_config_push_time{error!=""} or on() vector(0)

TSB to MPC sent configs

The number of resources sent from TSB to MPC.

This metric shows the number of configuration objects received from TSB that MPC will translate and push to XCP.

This metric can be used together with the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name Labels PromQL Expression
mpc_tsb_config_received_count N/A
mpc_tsb_config_received_count

XCP Resource conversion rate

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the conversion rate of TSB resources to XCP resources. It gives a good idea of the number of resources of each type in the runtime configuration.

Metric Name Labels PromQL Expression
mpc_xcp_conversion_count N/A
sum(rate(mpc_xcp_conversion_count[1m])) by (resource)

MPC to XCP pushed configs

The number of resources that are pushed to XCP.

This metric shows the number of objects that are created, updated, and deleted as part of a configuration push from MPC to XCP. It also shows how many fetch calls are made to the Kubernetes API server.

This metric can be used together with the TSB to MPC sent configs and the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name Labels PromQL Expression
mpc_xcp_config_create_ops N/A
sum(mpc_xcp_config_create_ops)
mpc_xcp_config_delete_ops N/A
sum(mpc_xcp_config_delete_ops)
mpc_xcp_config_fetch_ops N/A
sum(mpc_xcp_config_fetch_ops)
mpc_xcp_config_update_ops N/A
sum(mpc_xcp_config_update_ops)

XCP Resource conversion error rate

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the conversion error rate of TSB resources to XCP resources. It should always be zero. If there are errors reported in this graph, there are incompatibilities between the XCP resources and the TSB ones. This may be the result of a version incompatibility between TSB and XCP.

Metric Name Labels PromQL Expression
mpc_xcp_conversion_count error
sum(rate(mpc_xcp_conversion_count{error != ""}[5m])) by (resource) or on() vector(0)

MPC to XCP pushed configs errors

The number of resources that failed while pushing to XCP.

This metric shows the number of objects that fail to be created, updated, or deleted as part of a configuration push from MPC to XCP. It also shows the number of failed fetch calls to the Kubernetes API server.

This metric can be used together with the MPC to XCP pushed configs and the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name Labels PromQL Expression
mpc_xcp_config_create_ops_err N/A
sum(mpc_xcp_config_create_ops_err)
mpc_xcp_config_delete_ops_err N/A
sum(mpc_xcp_config_delete_ops_err)
mpc_xcp_config_fetch_ops_err N/A
sum(mpc_xcp_config_fetch_ops_err)
mpc_xcp_config_update_ops_err N/A
sum(mpc_xcp_config_update_ops_err)

Config Status updates every 5m

Config Status update messages sent over the gRPC streams, from XCP to MPC and from MPC to TSB.

This metric can help understand how messages are queued in TSB when it is under load. The value for both metrics should always be the same. If the Received by TSB metric has a value lower than the MPC one, it means TSB is under load and cannot process all messages sent by MPC as fast as MPC is sending them.

Metric Name Labels PromQL Expression
grpc_client_msg_received_total component grpc_method
sum(increase(grpc_client_msg_received_total{grpc_method="PullStatus",component="mpc"}[5m])) or on() vector(0)
grpc_client_msg_sent_total component grpc_method
sum(increase(grpc_client_msg_sent_total{grpc_method="PullStatus",component="mpc"}[5m])) or on() vector(0)
grpc_client_msg_sent_total component grpc_method
sum(increase(grpc_client_msg_sent_total{grpc_method="PushStatus",component="mpc"}[5m])) or on() vector(0)
grpc_server_msg_received_total component grpc_method
sum(increase(grpc_server_msg_received_total{grpc_method="PushStatus", component="tsb"}[5m])) or on() vector(0)

Config Status updates processed every 5m

This is the number of config status updates received from XCP that are processed by the Management Plane Controller (MPC) and sent to TSB.

There are two gRPC streams, one that connects XCP to MPC and another one that connects MPC to TSB.

Metric Name Labels PromQL Expression
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="StatusPush", error=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="StatusPush", error!=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="StatusPull", error=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="StatusPull", error!=""}[5m])) or on() vector(0)

Config Status stream connection attempts every 5m

The number of connection (and reconnection) attempts on the config status updates streams. MPC sends the config status updates over a permanently connected gRPC stream to TSB. At the same time, XCP sends them to MPC. This metric shows the number of connections and reconnections that happened on each stream.

Metric Name Labels PromQL Expression
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="StatusPull", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="StatusPull", error!=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="StatusPush", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="StatusPush", error!=""}[5m])) or on() vector(0)

Cluster Update Messages every 5m

Cluster update messages sent over the gRPC stream from TSB and received by MPC.

This metric can help understand how messages are queued in MPC when it is under load. The value for both metrics should always be the same. If the Received by MPC metric has a value lower than the TSB one, it means MPC is under load and cannot process all messages sent by TSB as fast as TSB is sending them.

Metric Name Labels PromQL Expression
grpc_client_msg_received_total component grpc_method
sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetAllClusters"}[5m])) or on() vector(0)
grpc_server_msg_sent_total component grpc_method
sum(increase(grpc_server_msg_sent_total{component="tsb", grpc_method="GetAllClusters"}[5m])) or on() vector(0)

TSB Cluster updates processed every 5m

The number of cluster updates received by the Management Plane Controller (MPC) that must be processed and sent to XCP.

TSB sends the cluster updates (e.g. new onboarded clusters, deleted clusters) over a permanently connected gRPC stream to MPC. This metric shows the number of messages received and processed by MPC on that stream.

Metric Name Labels PromQL Expression
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ClusterPush", error=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ClusterPush", error!=""}[5m])) or on() vector(0)

TSB Cluster stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster updates stream. TSB sends the cluster updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name Labels PromQL Expression
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ClusterPush", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ClusterPush", error!=""}[5m])) or on() vector(0)

Cluster Status Update from XCP every 5m

Cluster status update messages received from XCP over a gRPC stream.

Metric Name Labels PromQL Expression
grpc_client_msg_received_total component grpc_method
sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetClusterState"}[5m])) or on() vector(0)

Cluster updates from XCP processed every 5m

The number of cluster status updates received by the Management Plane Controller (MPC) from XCP that must be processed and sent to TSB.

XCP sends the cluster status updates (e.g. services deployed in the cluster) over a permanently connected gRPC stream to MPC. This metric shows the number of messages received and processed by MPC on that stream.

Metric Name Labels PromQL Expression
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ClusterStateFromXCP", error=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ClusterStateFromXCP", error!=""}[5m])) or on() vector(0)

Cluster updates from XCP stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates from XCP stream. XCP sends the cluster status updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name Labels PromQL Expression
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ClusterStateFromXCP", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ClusterStateFromXCP", error!=""}[5m])) or on() vector(0)

XCP cluster status updates processed every 5m

This is the number of cluster status updates that are processed by the Management Plane Controller (MPC) to be sent to TSB.

MPC sends the cluster status updates over a gRPC stream that is permanently connected to TSB, and this metric shows the number of cluster updates that are processed by MPC and sent to TSB on that stream.

Metric Name Labels PromQL Expression
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_operation error name
sum(increase(permanent_stream_operation{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

Cluster status updates to TSB stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates stream. MPC sends the cluster status updates over a permanently connected gRPC stream to TSB. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name Labels PromQL Expression
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
permanent_stream_connection_attempts error name
sum(increase(permanent_stream_connection_attempts{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

OAP Operational Status

Operational metrics to indicate Tetrate Service Bridge OAP stack health.

OAP Request Rate

The request rate to OAP, by status.

Metric Name Labels PromQL Expression
envoy_cluster_upstream_rq_xx envoy_cluster_name plane
sum by (envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="oap-grpc", plane="management"}[1m]))

OAP Request Latency

The OAP request latency.

Metric Name Labels PromQL Expression
envoy_cluster_upstream_rq_time_bucket envoy_cluster_name plane
histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket envoy_cluster_name plane
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket envoy_cluster_name plane
histogram_quantile(0.90, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket envoy_cluster_name plane
histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
envoy_cluster_upstream_rq_time_bucket envoy_cluster_name plane
histogram_quantile(0.50, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))

OAP Aggregation Request Rate

OAP Aggregation Request Rate, by type:

  • central aggregation service handler received
  • central application aggregation received
  • central service aggregation received
Metric Name Labels PromQL Expression
central_aggregation_handler N/A
sum(rate(central_aggregation_handler[1m]))
central_app_aggregation N/A
sum(rate(central_app_aggregation[1m]))
central_service_aggregation N/A
sum(rate(central_service_aggregation[1m]))

OAP Aggregation Rows

Cumulative rate of rows in OAP aggregation.

Metric Name Labels PromQL Expression
metrics_aggregation plane
sum(rate(metrics_aggregation{plane="management"}[1m]))

OAP Mesh Analysis Latency

The processing latency of the OAP service mesh telemetry streaming process.

Metric Name Labels PromQL Expression
mesh_analysis_latency_bucket component plane
histogram_quantile(0.99, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
mesh_analysis_latency_bucket component plane
histogram_quantile(0.95, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
mesh_analysis_latency_bucket component plane
histogram_quantile(0.90, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
mesh_analysis_latency_bucket component plane
histogram_quantile(0.75, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))

OAP Zipkin Trace Rate

The OAP Zipkin trace processing rate.

Metric Name Labels PromQL Expression
trace_in_latency_count plane protocol
sum(rate(trace_in_latency_count{protocol='zipkin-http',plane='control'}[1m]))

OAP Zipkin Trace Latency

The OAP trace processing latency.

Metric Name Labels PromQL Expression
trace_in_latency_bucket N/A
histogram_quantile(0.99, sum(rate(trace_in_latency_bucket[5m])) by (le))
trace_in_latency_bucket N/A
histogram_quantile(0.95, sum(rate(trace_in_latency_bucket[5m])) by (le))
trace_in_latency_bucket N/A
histogram_quantile(0.90, sum(rate(trace_in_latency_bucket[5m])) by (le))
trace_in_latency_bucket N/A
histogram_quantile(0.75, sum(rate(trace_in_latency_bucket[5m])) by (le))
trace_in_latency_bucket N/A
histogram_quantile(0.50, sum(rate(trace_in_latency_bucket[5m])) by (le))

OAP Zipkin Trace Error Rate

The OAP Zipkin trace processing error rate.

Metric Name Labels PromQL Expression
trace_analysys_error_count plane protocol
sum(rate(trace_analysys_error_count{protocol='zipkin-http',plane='control'}[1m]))

JVM Threads

Number of threads in the OAP JVM.

Metric Name Labels PromQL Expression
jvm_threads_current component plane
sum(jvm_threads_current{component="oap", plane="management"})
jvm_threads_daemon component plane
sum(jvm_threads_daemon{component="oap", plane="management"})
jvm_threads_deadlocked component plane
sum(jvm_threads_deadlocked{component="oap", plane="management"})
jvm_threads_peak component plane
sum(jvm_threads_peak{component="oap", plane="management"})

JVM Memory

JVM Memory stats of OAP JVM instances.

Metric Name Labels PromQL Expression
jvm_memory_bytes_max component plane
sum by (area, instance) (jvm_memory_bytes_max{component="oap", plane="management"})
jvm_memory_bytes_used component plane
sum by (area, instance) (jvm_memory_bytes_used{component="oap", plane="management"})

TSB Operational Status

Operational metrics to indicate Tetrate Service Bridge API server health.

Front Envoy Success Rate

Rate of successful requests to Front Envoy. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server, not whether the user or cluster making the request has the correct permissions.

Metric Name Labels PromQL Expression
envoy_cluster_internal_upstream_rq component envoy_response_code
sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code=~"2.*|3.*|401", component="front-envoy"}[1m])) by (envoy_cluster_name)

Front Envoy Error Rate

The error rate of requests to the Front Envoy server. This includes all user and cluster requests into the management plane. Note: This indicates the health of the AuthZ server, not whether the user or cluster making the request has the correct permissions.

Metric Name Labels PromQL Expression
envoy_cluster_internal_upstream_rq component envoy_response_code
sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code!~"2.*|3.*|401", component="front-envoy"}[1m])) by (envoy_cluster_name, envoy_response_code)

Front Envoy Latency

Front Envoy request latency percentiles.

Metric Name Labels PromQL Expression
envoy_cluster_internal_upstream_rq_time_bucket component
histogram_quantile(0.99, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))
envoy_cluster_internal_upstream_rq_time_bucket component
histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))

TSB Success Rate

Rate of successful requests to the TSB apiserver from the UI and CLI.

Metric Name Labels PromQL Expression
grpc_server_handled_total component grpc_code grpc_method grpc_type
sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_method)

TSB Error Rate

Rate of failed requests to the TSB apiserver from the UI and CLI.

Metric Name Labels PromQL Expression
grpc_server_handled_total component grpc_code grpc_method grpc_type
sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code)
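
Rather than alerting on any single failure, it is often more useful to alert on the error ratio. The sketch below divides failed unary calls by all unary calls; the 5% threshold and 5-minute window are assumptions to tune for your environment.

groups:
- name: tsb-apiserver
  rules:
  - alert: TSBAPIServerHighErrorRatio
    # More than 5% of TSB API requests are failing.
    expr: |
      sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[5m]))
        /
      sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[5m]))
        > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "TSB apiserver error ratio is above 5%"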

Authentication Success Rate

The success rate for authentication operations for each type of authentication provider.

Metric Name Labels PromQL Expression
iam_auth_time_count error
sum(rate(iam_auth_time_count{error=""}[1m])) by (provider)

Authentication Error Rate

The error rate for authentication operations for each type of authentication provider.

Spikes may indicate problems with the provider or the given credentials, such as expired JWT tokens.

Metric Name Labels PromQL Expression
iam_auth_time_count error
sum(rate(iam_auth_time_count{error!=""}[1m])) by (provider)
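
A sustained error rate for a provider usually means misconfiguration or expired credentials, so an alert can be useful here. A minimal sketch with assumed duration and severity:

groups:
- name: tsb-apiserver
  rules:
  - alert: AuthenticationErrors
    # Authentication operations are failing for a provider.
    expr: sum by (provider) (rate(iam_auth_time_count{error!=""}[1m])) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Authentication errors for provider {{ $labels.provider }}"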

Authentication Latency

The latency for authentication operations for each type of authentication provider.

Spikes in the latency may indicate that the authentication provider has a sub-optimal configuration (such as too wide LDAP queries).

Metric Name Labels PromQL Expression
iam_auth_time_bucket error
histogram_quantile(0.99, sum(rate(iam_auth_time_bucket{error=""}[1m])) by (le, provider))
iam_auth_time_bucket error
histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error=""}[1m])) by (le, provider))

Data Store Success Rate

Successful request rate for operations persisting data to the datastore grouped by method and kind.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Metric Name Labels PromQL Expression
persistence_operation error
sum(rate(persistence_operation{error=""}[1m])) by (kind, method)
persistence_transaction error
sum(rate(persistence_transaction{error=""}[1m]))

Data Store Latency

The request latency for operations persisting data to the datastore grouped by method.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Metric Name Labels PromQL Expression
persistence_operation_duration_bucket N/A
histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method))
persistence_transaction_duration_bucket N/A
histogram_quantile(0.99, sum(rate(persistence_transaction_duration_bucket[1m])) by (le))

Data Store Error Rate

The request error rate for operations persisting data to the datastore grouped by method and kind. This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Note: The graph explicitly excludes “resource not found” errors. A small number of “not found” responses is normal, as TSB often uses Get queries instead of Exists to determine resource existence as an optimization.

Metric Name Labels PromQL Expression
persistence_operation error kind
sum(rate(persistence_operation{error!="", kind!="iam_revoked_token"}[1m])) by (kind, method, error)
persistence_transaction error
sum(rate(persistence_transaction{error!=""}[1m])) by (error)
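
The same expressions can back an alert on persistent datastore failures. A minimal sketch, reusing the panel's exclusion of the iam_revoked_token kind; duration and severity are assumptions:

groups:
- name: tsb-apiserver
  rules:
  - alert: DataStoreErrors
    # Datastore operations are failing repeatedly.
    expr: sum by (kind, method, error) (rate(persistence_operation{error!="", kind!="iam_revoked_token"}[1m])) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Persistent datastore errors for {{ $labels.kind }}/{{ $labels.method }}"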

Active Transactions

The number of running transactions on the datastore.

This graph shows how many active transactions are running at a given point in time. It helps you understand the load of the system generated by concurrent access to the platform.

Metric Name Labels PromQL Expression
persistence_concurrent_transaction N/A
sum(persistence_concurrent_transaction)

PDP Success Rate

Successful request rate of PDP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph’s policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an “access granted” decision; it represents an access decision request for which a verdict was obtained.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, which is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being properly updated to the latest status, resulting in access decisions based on stale models.

Metric Name Labels PromQL Expression
ngac_pdp_operation error
sum(rate(ngac_pdp_operation{error=""}[1m])) by (method)

PDP Error Rate

Rate of errors for PDP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph’s policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an “access granted” decision; it represents an access decision request for which a verdict was obtained. Failed requests to the PDP show the number of requests from the PEP to the PDP that have failed. They do not represent “access denied” decisions; they represent the access decision requests for which a verdict could not be obtained.

A rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, which is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being correctly updated to the latest status, resulting in access decisions based on stale models.

Metric Name Labels PromQL Expression
ngac_pdp_operation error
sum(rate(ngac_pdp_operation{error!=""}[1m])) by (method)

PDP Latency

PDP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph’s policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an “access granted” decision; it represents an access decision request for which a verdict was obtained.

This metric shows the time it takes to get an access decision for authorization requests. Degradation in PDP operations may result in general degradation of the system. PDP latency represents the time it takes to make access decisions, and that will impact user experience since access decisions are made and enforced for every operation.

Metric Name Labels PromQL Expression
ngac_pdp_operation_duration_bucket N/A
histogram_quantile(0.99, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))
ngac_pdp_operation_duration_bucket N/A
histogram_quantile(0.95, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))

PIP Success Rate

Successful request rate of PIP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent “access granted” decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP) and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

Metric Name Labels PromQL Expression
ngac_pip_operation error
sum(rate(ngac_pip_operation{error=""}[1m])) by (method)

PIP Latency

PIP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent “access granted” decisions; they just represent the access decision requests for which a verdict was obtained.

This metric shows the time it takes for a PIP operation to complete and, in the case of write operations, to have data persisted in the NGAC graph.

Degradation in PIP operations may result in general degradation of the system. PIP latency represents the time it takes to access the NGAC graph, and this directly affects the PDP when running access decisions. A degraded PIP may result in a degraded PDP, and that will impact user experience, as access decisions are made and enforced for every operation.

Metric Name Labels PromQL Expression
ngac_pip_operation_duration_bucket N/A
histogram_quantile(0.99, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))
ngac_pip_operation_duration_bucket N/A
histogram_quantile(0.95, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))

PIP Error Rate

Rate of errors for PIP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent “access granted” decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

Note: the “Node not found” errors are explicitly excluded, as TSB often uses the GetNode method instead of Exists to determine node existence as an optimisation.

A general rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP) and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

Metric Name Labels PromQL Expression
ngac_pip_operation error
sum(rate(ngac_pip_operation{error!="", error!="Node not found"}[1m])) by (method)

Active PIP Transactions

The number of running transactions on the NGAC PIP.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represent the state of the system. The other components of NGAC use this graph to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the NGAC graph’s policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision.

Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent an “access granted” decision; it represents an access decision request for which a verdict was obtained.

This metric shows the number of active write operations against the NGAC graph. It can be useful to understand the load on the system generated by concurrent access to the platform.

Metric Name Labels PromQL Expression
ngac_pip_concurrent_transaction N/A
sum(ngac_pip_concurrent_transaction)

XCP Central Operational Status

Operational metrics to indicate XCP Central health.

Metric Name Labels PromQL Expression
process_start_time_seconds component plane
time() - process_start_time_seconds{component="xcp",plane="management"}

XCP Central Version

Metric Name Labels PromQL Expression
xcp_central_version N/A
label_replace(xcp_central_version, "xcp_version", "$1", "version", "(.*)")

Time since last cluster state received from the edge (seconds)

Since the default cluster state resync time is 10 minutes, any value higher than 600-700 seconds is considered abnormal.

Metric Name Labels PromQL Expression
xcp_central_current_onboarded_edge N/A
time() - max((increase(xcp_central_current_onboarded_edge[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received", type="cluster_state"} /1000) by (edge,type)
xcp_central_last_config_propagation_event_timestamp_ms edge status type
time() - max((increase(xcp_central_current_onboarded_edge[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received", type="cluster_state"} /1000) by (edge,type)
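
The 600-700 second guidance can also drive an alert. The sketch below uses a simplified form of the panel expression (it does not exclude recently onboarded edges, which the dashboard query does); the 700-second threshold follows the note above.

groups:
- name: tsb-xcp-central
  rules:
  - alert: ClusterStateStale
    # No cluster state received from an edge for longer than the expected resync window.
    expr: time() - max by (edge) (xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received", type="cluster_state"} / 1000) > 700
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No cluster state received from edge {{ $labels.edge }} in over 700 seconds"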

Time since cluster states were sent to the MPC and Edges clients (seconds)

Metric Name Labels PromQL Expression
xcp_central_current_onboarded_edge N/A
time() - max((xcp_central_last_cluster_state_event_timestamp_ms / 1000 unless on(peer_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"peer_cluster_name", "$1", "edge", "(.*)") == 0) unless on(cluster_state_event_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"cluster_state_event_cluster_name", "$1", "edge", "(.*)") == 0) by (peer_cluster_name, cluster_state_event_cluster_name)
xcp_central_last_cluster_state_event_timestamp_ms N/A
time() - max((xcp_central_last_cluster_state_event_timestamp_ms / 1000 unless on(peer_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"peer_cluster_name", "$1", "edge", "(.*)") == 0) unless on(cluster_state_event_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"cluster_state_event_cluster_name", "$1", "edge", "(.*)") == 0) by (peer_cluster_name, cluster_state_event_cluster_name)

Time since config resync request is received from the edge (seconds)

Because regular periodic resync requests are expected to keep arriving, a value higher than the resync period (60 seconds by default) is not normal.

Metric Name Labels PromQL Expression
xcp_central_current_onboarded_edge N/A
time() - max((increase(xcp_central_current_onboarded_edge[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received", type="config_resync_request"} /1000) by (edge,type)
xcp_central_last_config_propagation_event_timestamp_ms edge status type
time() - max((increase(xcp_central_current_onboarded_edge[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received", type="config_resync_request"} /1000) by (edge,type)
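
In the same spirit as the previous panel, an illustrative alert for a stale resync stream could compare this age against twice the default resync period. Both the simplified expression and the 120 second threshold are examples to adapt, not values taken from the dashboard:

  time() - max(xcp_central_last_config_propagation_event_timestamp_ms{status="received", type="config_resync_request"} / 1000) by (edge) > 120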

Time since config CRs sent to the edge (seconds)

Time since config resources such as workspaces and traffic groups were last sent to the edge. In steady state, a very high value is fine.

Metric Name Labels PromQL Expression
xcp_central_current_onboarded_edge N/A
time() - max((increase(xcp_central_current_onboarded_edge[1m]) unless increase(xcp_central_current_onboarded_edge[1m]) == 0) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="sent"} /1000) by (edge,type)
xcp_central_last_config_propagation_event_timestamp_ms edge status
time() - max((increase(xcp_central_current_onboarded_edge[1m]) unless increase(xcp_central_current_onboarded_edge[1m]) == 0) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="sent"} /1000) by (edge,type)

messages received by central from edges in last 5 min

Number of times any message is received by central from edges

Messages received by central from any edge are of three types:

  1. Periodic (per minute by default) config resync request
  2. Cluster state
  3. Header message to ack the config received

This number is the combined count of all three in the last 5 minutes.

Metric Name Labels PromQL Expression
xcp_central_config_propagation_event_count status type
increase(xcp_central_config_propagation_event_count{status="received",type="config_resync_request"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_config_propagation_event_count status type
increase(xcp_central_config_propagation_event_count{status="received",type="cluster_state"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
increase(xcp_central_config_propagation_event_count{status="received",type="config_resync_request"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
increase(xcp_central_config_propagation_event_count{status="received",type="cluster_state"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
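
If a single per-edge view of how much traffic each edge generates towards central is enough, the per-type expressions above can be collapsed into one sum. This is a simplified sketch that drops the onboarded-edge guard used by the dashboard and assumes the edge label is present on the counter:

  sum by (edge) (increase(xcp_central_config_propagation_event_count{status="received"}[5m]))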

Number of times config CRs sent by central to the edges in last 5m

Number of times config CRs such as workspaces, traffic groups, etc. were sent by central in the last 5m

Metric Name Labels PromQL Expression
xcp_central_config_propagation_event_count status
increase(xcp_central_config_propagation_event_count{status="sent"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
increase(xcp_central_config_propagation_event_count{status="sent"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0

Config Propagation Latency by Edge

Distribution of time to propagate updates from Central (Management plane) to Edges. If there has been no config push in the last minute, you will see all 0s, which is expected.

Metric Name Labels PromQL Expression
xcp_central_config_propagation_time_ms_bucket N/A
histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_config_propagation_time_ms_bucket N/A
histogram_quantile(0.95, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_config_propagation_time_ms_bucket N/A
histogram_quantile(0.90, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_config_propagation_time_ms_bucket N/A
histogram_quantile(0.75, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_config_propagation_time_ms_bucket N/A
histogram_quantile(0.50, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
histogram_quantile(0.95, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
histogram_quantile(0.90, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
histogram_quantile(0.75, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
xcp_central_current_onboarded_edge N/A
histogram_quantile(0.50, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
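
To turn the p99 line into an alert, the quantile can be compared against a latency budget. The 30000 ms (30 s) threshold below is purely illustrative and should be tuned to what is acceptable in your deployment:

  histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) > 30000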

Errors in config push REQUESTS to the edges in last 5 minutes

Central enqueues a config push request to the debouncer (an internal component of central) when:

  • It receives an event about config resources from the k8s apiserver, or
  • Any edge connects for the first time, or
  • It is handling a periodic resync request from any of the edges.

In any of these cases, if central encounters an error in the event handling before enqueuing the config push request to the debouncer, this metric is incremented. This panel is therefore inversely related to "config push(to the edges) requests enqueued to debouncer in last 5 min".

Metric Name Labels PromQL Expression
xcp_central_config_update_error_count N/A
 increase(xcp_central_config_update_error_count[5m]) OR on() vector(0)
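
Since any non-zero value here points at a problem handling config events, a minimal illustrative alert expression is simply:

  increase(xcp_central_config_update_error_count[5m]) > 0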

config push(to the edges) requests enqueued to debouncer in last 5 min

Number of times central enqueued a config push request (to the connected edges) to the debouncer in the last 5 min. Along with the count of requests, the reason for each config push request is also shown.

Note: This metric does not indicate the count of actual config pushes by central. Because of debouncing, the actual number of config pushes will generally be lower than this metric. In other words, this metric shows the input events for config pushes; the output (config pushes on the gRPC channels to the edges) will be lower because of the debouncer.

Reasons can be:

  1. ADD/DELETE/UPDATE: These are the events received by central from the k8s apiserver. Example: ADD/IngressGateway means the count of config push requests enqueued because new IngressGateway CRs were created at the k8s apiserver.
  2. EDGE_RESYNC: The count of config push requests where a periodic config resync request from an edge triggered a config push. This will be non-zero only in rare cases where, for whatever reason, an edge reported a stale set of configs and central triggered a config push to refresh them.
  3. EDGE_FIRST_CONNECTION: When any edge connects to central, central syncs configs to that edge. In steady state, this count should be 0. A non-zero count indicates that the gRPC stream between central and an edge is in error and is being reconnected.
  4. CENTRAL_RESYNC: Central enqueues a config push request every 5 minutes to reconcile configs at the edges. Note that this results in an actual config push only to those edges that are not actively sending their config version periodically. Since 1.4, edges request config resyncs themselves, so central will actually push configs over gRPC as a result of these requests only if the edge is older than 1.4.

Metric Name Labels PromQL Expression
xcp_central_config_update_push_count N/A
increase(xcp_central_config_update_push_count[5m])
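
As an example, a non-zero EDGE_FIRST_CONNECTION count (reason 3 above) is worth watching because it implies edges are reconnecting. The exact name of the label that carries the reason is not shown in the table above, so the reason label in this sketch is an assumption to verify against your metric's labels:

  increase(xcp_central_config_update_push_count{reason="EDGE_FIRST_CONNECTION"}[5m]) > 0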

Pending configurations (orphan configs)

Pending configurations are configs for which the target cluster could not yet be determined because the parent resource is missing. These metrics show which configurations are currently in the Pending state, and the missing parent group configuration that keeps them Pending.

More information on Pending configurations can be found using the XCP Central debug endpoint: /debug/cluster_scoped_configs/?pending=true

Metric Name Labels PromQL Expression
xcp_central_pending_configs N/A
sum(xcp_central_pending_configs)
xcp_central_pending_configs N/A
(xcp_central_pending_configs)
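
Because Pending configurations usually indicate a missing parent resource, a simple illustrative alert can fire whenever any configuration stays Pending:

  sum(xcp_central_pending_configs) > 0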

Number of connections(cluster state pushing and config pushing)

Central has two types of gRPC connections:

  1. edge_config_distribution: One gRPC connection with each edge for pushing user configs such as workspaces, traffic groups, etc.
  2. cluster_state: One gRPC connection with each edge for pushing the learned cluster state (service discovery) from all other peer edges. In addition, one more gRPC connection with the MPC for pushing all the learned cluster states to the TSB server.

The count of edge_config_distribution connections will be equal to the number of edges connected to central. The count of cluster_state connections will be one more than the count of edge_config_distribution connections because of the additional MPC connection.

IMPORTANT NOTE: If a cluster is not onboarded (the TSB cluster object is missing) but its edge is up and connected to central, the connection counts will include such edges.

Metric Name Labels PromQL Expression
xcp_central_current_edge_connections connection_type
xcp_central_current_edge_connections{connection_type="edge_config_distribution"} OR on() vector(0)
xcp_central_current_edge_connections connection_type
xcp_central_current_edge_connections{connection_type="cluster_state"} OR on() vector(0)
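
As a quick sanity check of the relationship described above, the two connection types can be compared directly; in a healthy setup this difference should be exactly 1 (the extra MPC connection):

  sum(xcp_central_current_edge_connections{connection_type="cluster_state"}) - sum(xcp_central_current_edge_connections{connection_type="edge_config_distribution"})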

validation webhook passed count in last 5 min

Count of requests that the validation webhook passed in the last 5 minutes, grouped by GVK

Metric Name Labels PromQL Expression
xcp_central_validation_webhook_passed_count N/A
increase(xcp_central_validation_webhook_passed_count[5m]) OR on() vector(0)

Rate of webhook validation errors

Rate of webhook validation errors by GVK

Metric Name Labels PromQL Expression
xcp_central_validation_webhook_failed_count N/A
increase(xcp_central_validation_webhook_failed_count[5m]) OR on() vector(0)
xcp_central_validation_webhook_http_error_count N/A
increase(xcp_central_validation_webhook_http_error_count[5m]) OR on() vector(0)
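
A minimal illustrative alert for webhook problems can simply fire on any validation failure or HTTP error in the window:

  increase(xcp_central_validation_webhook_failed_count[5m]) > 0 or increase(xcp_central_validation_webhook_http_error_count[5m]) > 0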

All goroutines

Metric Name Labels PromQL Expression
go_goroutines component plane
go_goroutines{component="xcp",plane="management"}

Central specific goroutines

This shows the number of active goroutines in XCP Central that are responsible for config pushes to edges.

Metric Name Labels PromQL Expression
go_goroutines component plane
increase(go_goroutines{component="xcp",plane="management"}[1m])
xcp_central_go_routine_count N/A
increase(xcp_central_go_routine_count[1m])

Central memory consumption

Metric Name Labels PromQL Expression
go_memstats_heap_inuse_bytes component plane
go_memstats_heap_inuse_bytes{component="xcp",plane="management"}
go_memstats_stack_inuse_bytes component plane
go_memstats_stack_inuse_bytes{component="xcp",plane="management"}

Edges’ memory consumption

This shows the current memory usage for all Edges

Metric Name Labels PromQL Expression
go_memstats_heap_inuse_bytes component plane
go_memstats_heap_inuse_bytes{component="xcp",plane="control"}

Central CPU consumption

Metric Name Labels PromQL Expression
process_cpu_seconds_total job
rate(process_cpu_seconds_total{job="central-xcp"}[1m])

All edges’ CPU consumption

Metric Name Labels PromQL Expression
process_cpu_seconds_total job
rate(process_cpu_seconds_total{job="edge-xcp"}[1m])

XCP Edge status

Metric Name Labels PromQL Expression
process_start_time_seconds cluster_name component
time() - process_start_time_seconds{cluster_name="$cluster",component="xcp"}

XCP Edge Version

Metric Name Labels PromQL Expression
xcp_edge_istio_versions cluster_name
label_replace(xcp_edge_istio_versions{cluster_name="$cluster"}, "istio_versions", "$1", "version", "(.*)")
xcp_edge_version cluster_name
label_replace(xcp_edge_version{cluster_name="$cluster"}, "xcp_version", "$1", "version", "(.*)")

Number of gatewayHost exposed

Metric Name Labels PromQL Expression
xcp_edge_gateway_hosts_count cluster_name
xcp_edge_gateway_hosts_count{cluster_name="$cluster"}

Active connections to central

Current peer connections this edge holds against remote edges.

Metric Name Labels PromQL Expression
xcp_edge_stream_connect_count cluster_name statusLabel
xcp_edge_stream_connect_count{statusLabel="ok", cluster_name="$cluster"} - ignoring(statusLabel) xcp_edge_stream_connect_count{statusLabel="close", cluster_name="$cluster"} OR on() xcp_edge_stream_connect_count{statusLabel="ok", cluster_name="$cluster"}

Time since any message sent to central on config stream (seconds)

Time since any of the following messages was sent by the edge to central:

  1. Periodic (per minute) config resync request
  2. Ack of the last config received
  3. Cluster state

Because regular periodic resync requests go out periodically, a value higher than the resync period (60 seconds by default) is not normal.

Metric Name Labels PromQL Expression
xcp_edge_last_push_to_central_timestamp_ms cluster_name
time() - xcp_edge_last_push_to_central_timestamp_ms{cluster_name="$cluster"} / 1000
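
An illustrative per-cluster alert based on the 60 second default mentioned above, using 120 seconds as an arbitrary safety margin:

  time() - xcp_edge_last_push_to_central_timestamp_ms{cluster_name="$cluster"} / 1000 > 120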

Number of times edge updates were sent to Central

The number of updates successfully sent from an edge to central. Updates can either be of the form cluster state or headers, depending upon whether the cluster state or just the header is being pushed.

These updates can be sent for one or more of the following reasons: registration_with_central, service_subset_update (subsets of one or more services changed), config (config change), node_update, service (service change).

The graph represents the number of times edge updates were sent to central, with the corresponding reason for the update. The last row shows how many times the gwHostNamespacesChanged flag was true while sending the above updates.

Metric Name Labels PromQL Expression
xcp_edge_updates_sent_to_central_count cluster_name trigger_reason
sum by (trigger_reason) (xcp_edge_updates_sent_to_central_count{cluster_name="$cluster",trigger_reason!="periodic_config_resync",trigger_reason!=""})
xcp_edge_updates_sent_to_central_count cluster_name gtw_host_namespace_changed
sum(xcp_edge_updates_sent_to_central_count{cluster_name="$cluster",gtw_host_namespace_changed="true"})

Number of times cluster states received by edge from central in last 1 min

Number of times cluster states are received by edge from central in the last 1 min.

Metric Name Labels PromQL Expression
xcp_edge_cluster_state_received_from_central_count cluster_name
increase(xcp_edge_cluster_state_received_from_central_count{cluster_name="$cluster"}[1m])

Number of times config CRs received by edge from central in last 5 min

Number of times config CRs are received by edge from central.

Metric Name Labels PromQL Expression
xcp_edge_config_updates_received_count cluster_name
increase(xcp_edge_config_updates_received_count{cluster_name="$cluster"}[5m])

Translation count

Number of Istio config translations in Edge per namespace

Metric Name Labels PromQL Expression
xcp_edge_istio_translations_count cluster_name
xcp_edge_istio_translations_count{cluster_name="$cluster"}

Number of times config status sent by edge to central in last 5 min

Number of times config statuses are sent by edge to central, with respective objects’ Kind.

Metric Name Labels PromQL Expression
xcp_edge_config_status_updates_sent_gvk cluster_name
increase(xcp_edge_config_status_updates_sent_gvk{cluster_name="$cluster"}[5m])

Istio config reconcile time (ms)

Time (in ms) to generate and apply Istio configurations when updates are received

Metric Name Labels PromQL Expression
xcp_edge_istio_config_reconcile_time_ms_bucket cluster_name
histogram_quantile(0.99, sum(rate(xcp_edge_istio_config_reconcile_time_ms_bucket{cluster_name="$cluster"}[5m])) by (le))
xcp_edge_istio_config_reconcile_time_ms_bucket cluster_name
histogram_quantile(0.95, sum(rate(xcp_edge_istio_config_reconcile_time_ms_bucket{cluster_name="$cluster"}[5m])) by (le))
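
If you want to alert when Istio config reconciliation becomes slow, the p99 above can be compared against a budget; the 5000 ms threshold here is only an example value:

  histogram_quantile(0.99, sum(rate(xcp_edge_istio_config_reconcile_time_ms_bucket{cluster_name="$cluster"}[5m])) by (le)) > 5000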

Number of configs created/updated by edge at k8s apiserver every 5 minutes

Shows the activity of Edge creating objects in K8s API, grouped by object kind.

Metric Name Labels PromQL Expression
xcp_edge_cr_added cluster_name
increase(xcp_edge_cr_added{cluster_name="$cluster"}[5m]) OR increase(xcp_edge_cr_updated{cluster_name="$cluster"}[5m])
xcp_edge_cr_updated cluster_name
increase(xcp_edge_cr_added{cluster_name="$cluster"}[5m]) OR increase(xcp_edge_cr_updated{cluster_name="$cluster"}[5m])

Number of configs deleted by edge from k8s apiserver every 5 minutes

Shows the activity of Edge deleting objects in K8s API, grouped by object kind.

Metric Name Labels PromQL Expression
xcp_edge_cr_deleted cluster_name
increase(xcp_edge_cr_deleted{cluster_name="$cluster"}[5m])

latency in cluster-state propagation to central (ms)

Time (in ms) taken to propagate a change in the local cluster state to remote central

Metric Name Labels PromQL Expression
xcp_edge_local_cluster_update_propagation_time_ms_bucket cluster_name
histogram_quantile(0.50, sum(rate(xcp_edge_local_cluster_update_propagation_time_ms_bucket{cluster_name="$cluster"}[5m])) by (le)) / 1000
xcp_edge_local_cluster_update_propagation_time_ms_bucket cluster_name
histogram_quantile(0.95, sum(rate(xcp_edge_local_cluster_update_propagation_time_ms_bucket{cluster_name="$cluster"}[5m])) by (le)) / 1000

All goroutines

Metric Name Labels PromQL Expression
go_goroutines cluster_name component
go_goroutines{cluster_name="$cluster", component="xcp"}

Edge specific goroutines

This shows the number of active goroutines in XCP Edge that are responsible for config translation.

Metric Name Labels PromQL Expression
xcp_edge_go_routine_count cluster_name
increase(xcp_edge_go_routine_count{cluster_name="$cluster"}[1m])

Edge CPU consumption

Metric Name Labels PromQL Expression
process_cpu_seconds_total cluster_name job
rate(process_cpu_seconds_total{job="edge-xcp",cluster_name="$cluster"}[1m])

Edge memory consumption

Metric Name Labels PromQL Expression
go_memstats_heap_inuse_bytes cluster_name component
go_memstats_heap_inuse_bytes{component="xcp",cluster_name="$cluster"}
go_memstats_stack_inuse_bytes cluster_name component
go_memstats_stack_inuse_bytes{component="xcp",cluster_name="$cluster"}