Statistics
General
The cluster manager has a statistics tree rooted at cluster_manager. with the following
statistics. Any :
character in the stats name is replaced with _
. Stats include
all clusters managed by the cluster manager, including both clusters used for data plane
upstreams and control plane xDS clusters.
Name |
Type |
Description |
---|---|---|
cluster_added |
Counter |
Total clusters added (either via static config or CDS) |
cluster_modified |
Counter |
Total clusters modified (via CDS) |
cluster_removed |
Counter |
Total clusters removed (via CDS) |
cluster_updated |
Counter |
Total cluster updates |
cluster_updated_via_merge |
Counter |
Total cluster updates applied as merged updates |
update_merge_cancelled |
Counter |
Total merged updates that got cancelled and delivered early |
update_out_of_merge_window |
Counter |
Total updates which arrived out of a merge window |
active_clusters |
Gauge |
Number of currently active (warmed) clusters |
warming_clusters |
Gauge |
Number of currently warming (not active) clusters |
In addition to the cluster manager stats, there are per worker thread local cluster manager statistics tree rooted at thread_local_cluster_manager.<worker_id>. with the following statistics.
Name |
Type |
Description |
---|---|---|
clusters_inflated |
Gauge |
Number of clusters the worker has initialized. If using cluster deferral this number should be <= (cluster_added - clusters_removed). |
Every cluster has a statistics tree rooted at cluster.<name>. with the following statistics:
Name |
Type |
Description |
---|---|---|
upstream_cx_total |
Counter |
Total connections |
upstream_cx_active |
Gauge |
Total active connections |
upstream_cx_http1_total |
Counter |
Total HTTP/1.1 connections |
upstream_cx_http2_total |
Counter |
Total HTTP/2 connections |
upstream_cx_http3_total |
Counter |
Total HTTP/3 connections |
upstream_cx_connect_fail |
Counter |
Total connection failures |
upstream_cx_connect_timeout |
Counter |
Total connection connect timeouts |
upstream_cx_connect_with_0_rtt |
Counter |
Total connections able to send 0-rtt requests (early data). |
upstream_cx_idle_timeout |
Counter |
Total connection idle timeouts |
upstream_cx_max_duration_reached |
Counter |
Total connections closed due to max duration reached |
upstream_cx_connect_attempts_exceeded |
Counter |
Total consecutive connection failures exceeding configured connection attempts |
upstream_cx_overflow |
Counter |
Total times that the cluster’s connection circuit breaker overflowed |
upstream_cx_connect_ms |
Histogram |
Connection establishment milliseconds |
upstream_cx_length_ms |
Histogram |
Connection length milliseconds |
upstream_cx_destroy |
Counter |
Total destroyed connections |
upstream_cx_destroy_local |
Counter |
Total connections destroyed locally |
upstream_cx_destroy_remote |
Counter |
Total connections destroyed remotely |
upstream_cx_destroy_with_active_rq |
Counter |
Total connections destroyed with 1+ active request |
upstream_cx_destroy_local_with_active_rq |
Counter |
Total connections destroyed locally with 1+ active request |
upstream_cx_destroy_remote_with_active_rq |
Counter |
Total connections destroyed remotely with 1+ active request |
upstream_cx_close_notify |
Counter |
Total connections closed via HTTP/1.1 connection close header or HTTP/2 or HTTP/3 GOAWAY |
upstream_cx_rx_bytes_total |
Counter |
Total received connection bytes |
upstream_cx_rx_bytes_buffered |
Gauge |
Received connection bytes currently buffered |
upstream_cx_tx_bytes_total |
Counter |
Total sent connection bytes |
upstream_cx_tx_bytes_buffered |
Gauge |
Send connection bytes currently buffered |
upstream_cx_pool_overflow |
Counter |
Total times that the cluster’s connection pool circuit breaker overflowed |
upstream_cx_protocol_error |
Counter |
Total connection protocol errors |
upstream_cx_max_requests |
Counter |
Total connections closed due to maximum requests |
upstream_cx_none_healthy |
Counter |
Total times connection not established due to no healthy hosts |
upstream_rq_total |
Counter |
Total requests |
upstream_rq_active |
Gauge |
Total active requests |
upstream_rq_pending_total |
Counter |
Total requests pending a connection pool connection |
upstream_rq_pending_overflow |
Counter |
Total requests that overflowed connection pool or requests (mainly for HTTP/2 and above) circuit breaking and were failed |
upstream_rq_pending_failure_eject |
Counter |
Total requests that were failed due to a connection pool connection failure or remote connection termination |
upstream_rq_pending_active |
Gauge |
Total active requests pending a connection pool connection |
upstream_rq_cancelled |
Counter |
Total requests cancelled before obtaining a connection pool connection |
upstream_rq_maintenance_mode |
Counter |
Total requests that resulted in an immediate 503 due to maintenance mode |
upstream_rq_timeout |
Counter |
Total requests that timed out waiting for a response |
upstream_rq_max_duration_reached |
Counter |
Total requests closed due to max duration reached |
upstream_rq_per_try_timeout |
Counter |
Total requests that hit the per try timeout (except when request hedging is enabled) |
upstream_rq_rx_reset |
Counter |
Total requests that were reset remotely |
upstream_rq_tx_reset |
Counter |
Total requests that were reset locally |
upstream_rq_retry |
Counter |
Total request retries |
upstream_rq_retry_backoff_exponential |
Counter |
Total retries using the exponential backoff strategy |
upstream_rq_retry_backoff_ratelimited |
Counter |
Total retries using the ratelimited backoff strategy |
upstream_rq_retry_limit_exceeded |
Counter |
Total requests not retried due to exceeding the configured number of maximum retries |
upstream_rq_retry_success |
Counter |
Total request retry successes |
upstream_rq_retry_overflow |
Counter |
Total requests not retried due to circuit breaking or exceeding the retry budget |
upstream_flow_control_paused_reading_total |
Counter |
Total number of times flow control paused reading from upstream |
upstream_flow_control_resumed_reading_total |
Counter |
Total number of times flow control resumed reading from upstream |
upstream_flow_control_backed_up_total |
Counter |
Total number of times the upstream connection backed up and paused reads from downstream |
upstream_flow_control_drained_total |
Counter |
Total number of times the upstream connection drained and resumed reads from downstream |
upstream_internal_redirect_failed_total |
Counter |
Total number of times failed internal redirects resulted in redirects being passed downstream. |
upstream_internal_redirect_succeeded_total |
Counter |
Total number of times internal redirects resulted in a second upstream request. |
membership_change |
Counter |
Total cluster membership changes |
membership_healthy |
Gauge |
Current cluster healthy total (inclusive of both health checking and outlier detection) |
membership_degraded |
Gauge |
Current cluster degraded total |
membership_excluded |
Gauge |
Current cluster excluded total |
membership_total |
Gauge |
Current cluster membership total |
retry_or_shadow_abandoned |
Counter |
Total number of times shadowing or retry buffering was canceled due to buffer limits |
config_reload |
Counter |
Total API fetches that resulted in a config reload due to a different config |
update_attempt |
Counter |
Total attempted cluster membership updates by service discovery |
update_success |
Counter |
Total successful cluster membership updates by service discovery |
update_failure |
Counter |
Total failed cluster membership updates by service discovery |
update_duration |
Histogram |
Amount of time spent updating configs |
update_empty |
Counter |
Total cluster membership updates ending with empty cluster load assignment and continuing with previous config |
update_no_rebuild |
Counter |
Total successful cluster membership updates that didn’t result in any cluster load balancing structure rebuilds |
version |
Gauge |
Hash of the contents from the last successful API fetch |
warming_state |
Gauge |
Current cluster warming state |
max_host_weight |
Gauge |
Maximum weight of any host in the cluster |
bind_errors |
Counter |
Total errors binding the socket to the configured source address |
assignment_timeout_received |
Counter |
Total assignments received with endpoint lease information. |
assignment_stale |
Counter |
Number of times the received assignments went stale before new assignments arrived. |
HTTP/3 protocol statistics
HTTP/3 protocol stats are global with the following statistics:
Name |
Type |
Description |
---|---|---|
upstream.<tx/rx>.quic_connection_close_error_code_<error_code> |
Counter |
A collection of counters that are lazily initialized to record each QUIC connection close’s error code. |
upstream.<tx/rx>.quic_reset_stream_error_code_<error_code> |
Counter |
A collection of counters that are lazily initialized to record each QUIC stream reset error code. |
Health check statistics
If health check is configured, the cluster has an additional statistics tree rooted at cluster.<name>.health_check. with the following statistics:
Name |
Type |
Description |
---|---|---|
attempt |
Counter |
Number of health checks |
success |
Counter |
Number of successful health checks |
failure |
Counter |
Number of immediately failed health checks (e.g. HTTP 503) as well as network failures |
passive_failure |
Counter |
Number of health check failures due to passive events (e.g. x-envoy-immediate-health-check-fail) |
network_failure |
Counter |
Number of health check failures due to network error |
verify_cluster |
Counter |
Number of health checks that attempted cluster name verification |
healthy |
Gauge |
Number of healthy members |
Outlier detection statistics
If outlier detection is configured for a cluster, statistics will be rooted at cluster.<name>.outlier_detection. and contain the following:
Name |
Type |
Description |
---|---|---|
ejections_enforced_total |
Counter |
Number of enforced ejections due to any outlier type |
ejections_active |
Gauge |
Number of currently ejected hosts |
ejections_overflow |
Counter |
Number of ejections aborted due to the max ejection % |
ejections_enforced_consecutive_5xx |
Counter |
Number of enforced consecutive 5xx ejections |
ejections_detected_consecutive_5xx |
Counter |
Number of detected consecutive 5xx ejections (even if unenforced) |
ejections_enforced_success_rate |
Counter |
Number of enforced success rate outlier ejections. Exact meaning of this counter depends on outlier_detection.split_external_local_origin_errors config item. Refer to Outlier Detection documentation for details. |
ejections_detected_success_rate |
Counter |
Number of detected success rate outlier ejections (even if unenforced). Exact meaning of this counter depends on outlier_detection.split_external_local_origin_errors config item. Refer to Outlier Detection documentation for details. |
ejections_enforced_consecutive_gateway_failure |
Counter |
Number of enforced consecutive gateway failure ejections |
ejections_detected_consecutive_gateway_failure |
Counter |
Number of detected consecutive gateway failure ejections (even if unenforced) |
ejections_enforced_consecutive_local_origin_failure |
Counter |
Number of enforced consecutive local origin failure ejections |
ejections_detected_consecutive_local_origin_failure |
Counter |
Number of detected consecutive local origin failure ejections (even if unenforced) |
ejections_enforced_local_origin_success_rate |
Counter |
Number of enforced success rate outlier ejections for locally originated failures |
ejections_detected_local_origin_success_rate |
Counter |
Number of detected success rate outlier ejections for locally originated failures (even if unenforced) |
ejections_enforced_failure_percentage |
Counter |
Number of enforced failure percentage outlier ejections. Exact meaning of this counter depends on outlier_detection.split_external_local_origin_errors config item. Refer to Outlier Detection documentation for details. |
ejections_detected_failure_percentage |
Counter |
Number of detected failure percentage outlier ejections (even if unenforced). Exact meaning of this counter depends on outlier_detection.split_external_local_origin_errors config item. Refer to Outlier Detection documentation for details. |
ejections_enforced_failure_percentage_local_origin |
Counter |
Number of enforced failure percentage outlier ejections for locally originated failures |
ejections_detected_failure_percentage_local_origin |
Counter |
Number of detected failure percentage outlier ejections for locally originated failures (even if unenforced) |
ejections_total |
Counter |
Deprecated. Number of ejections due to any outlier type (even if unenforced) |
ejections_consecutive_5xx |
Counter |
Deprecated. Number of consecutive 5xx ejections (even if unenforced) |
Circuit breakers statistics
Circuit breakers statistics will be rooted at cluster.<name>.circuit_breakers.<priority>. and contain the following:
Name |
Type |
Description |
---|---|---|
cx_open |
Gauge |
Whether the connection circuit breaker is under its concurrency limit (0) or is at capacity and no longer admitting (1) |
cx_pool_open |
Gauge |
Whether the connection pool circuit breaker is under its concurrency limit (0) or is at capacity and no longer admitting (1) |
rq_pending_open |
Gauge |
Whether the pending requests circuit breaker is under its concurrency limit (0) or is at capacity and no longer admitting (1) |
rq_open |
Gauge |
Whether the requests circuit breaker is under its concurrency limit (0) or is at capacity and no longer admitting (1) |
rq_retry_open |
Gauge |
Whether the retry circuit breaker is under its concurrency limit (0) or is at capacity and no longer admitting (1) |
remaining_cx |
Gauge |
Number of remaining connections until the circuit breaker reaches its concurrency limit |
remaining_pending |
Gauge |
Number of remaining pending requests until the circuit breaker reaches its concurrency limit |
remaining_rq |
Gauge |
Number of remaining requests until the circuit breaker reaches its concurrency limit |
remaining_retries |
Gauge |
Number of remaining retries until the circuit breaker reaches its concurrency limit |
Note
Metrics starting with prefix remaining_
are not generated by default.
To track the number of resources remaining until a circuit breaker opens, set the parameter
track_remaining
to true in circuit breaker configuration.
Timeout budget statistics
If timeout budget statistic tracking is turned on, statistics will be added to cluster.<name> and contain the following:
Name |
Type |
Description |
---|---|---|
upstream_rq_timeout_budget_percent_used |
Histogram |
What percentage of the global timeout was used waiting for a response |
upstream_rq_timeout_budget_per_try_percent_used |
Histogram |
What percentage of the per try timeout was used waiting for a response |
Dynamic HTTP statistics
If HTTP is used, dynamic HTTP response code statistics are also available. These are emitted by various internal systems as well as some filters such as the router filter and rate limit filter. They are rooted at cluster.<name>. and contain the following statistics:
Name |
Type |
Description |
---|---|---|
upstream_rq_completed |
Counter |
Total upstream requests completed |
upstream_rq_<*xx> |
Counter |
Aggregate HTTP response codes (e.g., 2xx, 3xx, etc.) |
upstream_rq_<*> |
Counter |
Specific HTTP response codes (e.g., 201, 302, etc.) |
upstream_rq_time |
Histogram |
Request time milliseconds |
canary.upstream_rq_completed |
Counter |
Total upstream canary requests completed |
canary.upstream_rq_<*xx> |
Counter |
Upstream canary aggregate HTTP response codes |
canary.upstream_rq_<*> |
Counter |
Upstream canary specific HTTP response codes |
canary.upstream_rq_time |
Histogram |
Upstream canary request time milliseconds |
internal.upstream_rq_completed |
Counter |
Total internal origin requests completed |
internal.upstream_rq_<*xx> |
Counter |
Internal origin aggregate HTTP response codes |
internal.upstream_rq_<*> |
Counter |
Internal origin specific HTTP response codes |
internal.upstream_rq_time |
Histogram |
Internal origin request time milliseconds |
external.upstream_rq_completed |
Counter |
Total external origin requests completed |
external.upstream_rq_<*xx> |
Counter |
External origin aggregate HTTP response codes |
external.upstream_rq_<*> |
Counter |
External origin specific HTTP response codes |
external.upstream_rq_time |
Histogram |
External origin request time milliseconds |
TLS statistics
If TLS is used by the cluster the following statistics are rooted at cluster.<name>.ssl.:
Name |
Type |
Description |
---|---|---|
connection_error |
Counter |
Total TLS connection errors not including failed certificate verifications |
handshake |
Counter |
Total successful TLS connection handshakes |
session_reused |
Counter |
Total successful TLS session resumptions |
no_certificate |
Counter |
Total successful TLS connections with no client certificate |
fail_verify_no_cert |
Counter |
Total TLS connections that failed because of missing client certificate |
fail_verify_error |
Counter |
Total TLS connections that failed CA verification |
fail_verify_san |
Counter |
Total TLS connections that failed SAN verification |
fail_verify_cert_hash |
Counter |
Total TLS connections that failed certificate pinning verification |
ocsp_staple_failed |
Counter |
Total TLS connections that failed compliance with the OCSP policy |
ocsp_staple_omitted |
Counter |
Total TLS connections that succeeded without stapling an OCSP response |
ocsp_staple_responses |
Counter |
Total TLS connections where a valid OCSP response was available (irrespective of whether the client requested stapling) |
ocsp_staple_requests |
Counter |
Total TLS connections where the client requested an OCSP staple |
ciphers.<cipher> |
Counter |
Total successful TLS connections that used cipher <cipher> |
curves.<curve> |
Counter |
Total successful TLS connections that used ECDHE curve <curve> |
sigalgs.<sigalg> |
Counter |
Total successful TLS connections that used signature algorithm <sigalg> |
versions.<version> |
Counter |
Total successful TLS connections that used protocol version <version> |
was_key_usage_invalid |
Counter |
Total successful TLS connections that used an invalid keyUsage extension. (This is not avaiable in BoringSSL FIPS yet due to issue #28246) |
TCP statistics
The following TCP statistics, which are available when using the TCP stats transport socket, are rooted at cluster.<name>.tcp_stats.:
Note
These metrics are provided by the operating system. Due to differences in operating system metrics available and the methodology used to take measurements, the values may not be consistent across different operating systems or versions of the same operating system.
Name |
Type |
Description |
---|---|---|
cx_tx_segments |
Counter |
Total TCP segments transmitted |
cx_rx_segments |
Counter |
Total TCP segments received |
cx_tx_data_segments |
Counter |
Total TCP segments with a non-zero data length transmitted |
cx_rx_data_segments |
Counter |
Total TCP segments with a non-zero data length received |
cx_tx_retransmitted_segments |
Counter |
Total TCP segments retransmitted |
cx_rx_bytes_received |
Counter |
Total payload bytes received for which TCP acknowledgments have been sent. |
cx_tx_bytes_sent |
Counter |
Total payload bytes transmitted (including retransmitted bytes). |
cx_tx_unsent_bytes |
Gauge |
Bytes which Envoy has sent to the operating system which have not yet been sent |
cx_tx_unacked_segments |
Gauge |
Segments which have been transmitted that have not yet been acknowledged |
cx_tx_percent_retransmitted_segments |
Histogram |
Percent of segments on a connection which were retransmistted |
cx_rtt_us |
Histogram |
Smoothed round trip time estimate in microseconds |
cx_rtt_variance_us |
Histogram |
Estimated variance in microseconds of the round trip time. Higher values indicated more variability. |
Alternate tree dynamic HTTP statistics
If alternate tree statistics are configured, they will be present in the cluster.<name>.<alt name>. namespace. The statistics produced are the same as documented in the dynamic HTTP statistics section above.
Per service zone dynamic HTTP statistics
If the service zone is available for the local service (via --service-zone
)
and the upstream cluster,
Envoy will track the following statistics in cluster.<name>.zone.<from_zone>.<to_zone>. namespace.
Name |
Type |
Description |
---|---|---|
upstream_rq_<*xx> |
Counter |
Aggregate HTTP response codes (e.g., 2xx, 3xx, etc.) |
upstream_rq_<*> |
Counter |
Specific HTTP response codes (e.g., 201, 302, etc.) |
upstream_rq_time |
Histogram |
Request time milliseconds |
Load balancer statistics
Statistics for monitoring load balancer decisions. Stats are rooted at cluster.<name>. and contain the following statistics:
Name |
Type |
Description |
---|---|---|
lb_recalculate_zone_structures |
Counter |
The number of times locality aware routing structures are regenerated for fast decisions on upstream locality selection |
lb_healthy_panic |
Counter |
Total requests load balanced with the load balancer in panic mode |
lb_zone_cluster_too_small |
Counter |
No zone aware routing because of small upstream cluster size |
lb_zone_routing_all_directly |
Counter |
Sending all requests directly to the same zone |
lb_zone_routing_sampled |
Counter |
Sending some requests to the same zone |
lb_zone_routing_cross_zone |
Counter |
Zone aware routing mode but have to send cross zone |
lb_local_cluster_not_ok |
Counter |
Local host set is not set or it is panic mode for local cluster |
lb_zone_number_differs |
Counter |
No zone aware routing because the feature flag is disabled and the number of zones in local and upstream cluster is different |
lb_zone_no_capacity_left |
Counter |
Total number of times ended with random zone selection due to rounding error |
original_dst_host_invalid |
Counter |
Total number of invalid hosts passed to original destination load balancer |
Load balancer subset statistics
Statistics for monitoring load balancer subset decisions. Stats are rooted at cluster.<name>. and contain the following statistics:
Name |
Type |
Description |
---|---|---|
lb_subsets_active |
Gauge |
Number of currently available subsets |
lb_subsets_created |
Counter |
Number of subsets created |
lb_subsets_removed |
Counter |
Number of subsets removed due to no hosts |
lb_subsets_selected |
Counter |
Number of times any subset was selected for load balancing |
lb_subsets_fallback |
Counter |
Number of times the fallback policy was invoked |
lb_subsets_fallback_panic |
Counter |
Number of times the subset panic mode triggered |
lb_subsets_single_host_per_subset_duplicate |
Gauge |
Number of duplicate (unused) hosts when using single_host_per_subset |
Ring hash load balancer statistics
Statistics for monitoring the size and effective distribution of hashes when using the ring hash load balancer. Stats are rooted at cluster.<name>.ring_hash_lb. and contain the following statistics:
Name |
Type |
Description |
---|---|---|
size |
Gauge |
Total number of host hashes on the ring |
min_hashes_per_host |
Gauge |
Minimum number of hashes for a single host |
max_hashes_per_host |
Gauge |
Maximum number of hashes for a single host |
Maglev load balancer statistics
Statistics for monitoring effective host weights when using the Maglev load balancer. Stats are rooted at cluster.<name>.maglev_lb. and contain the following statistics:
Name |
Type |
Description |
---|---|---|
min_entries_per_host |
Gauge |
Minimum number of entries for a single host |
max_entries_per_host |
Gauge |
Maximum number of entries for a single host |
Request Response Size statistics
If request response size statistics are tracked, statistics will be added to cluster.<name> and contain the following:
Name |
Type |
Description |
---|---|---|
upstream_rq_headers_size |
Histogram |
Request headers size in bytes per upstream |
upstream_rq_body_size |
Histogram |
Request body size in bytes per upstream |
upstream_rs_headers_size |
Histogram |
Response headers size in bytes per upstream |
upstream_rs_body_size |
Histogram |
Response body size in bytes per upstream |