Management Server

Management Server Unreachability

When an Envoy instance loses connectivity with the management server, it keeps using the last configuration it received while retrying in the background to reestablish the connection.

It is important that Envoy detects when a connection to the management server is unhealthy so that it can try to establish a new connection. Configuring either TCP keepalives or HTTP/2 keepalives in the cluster that connects to the management server is recommended.

Each time Envoy attempts a connection to the management server and fails, it writes a message at debug log level.

The connected_state statistic provides a signal for monitoring this behavior.
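As a rough illustration of monitoring this signal, the sketch below parses stats text in the "name: value" line format and reports the connection state. It assumes the text was already retrieved (for example from the admin interface's stats output). The function names and the sample text are hypothetical and exist only for this example.

```python
from typing import Dict, Optional


def parse_stats(text: str) -> Dict[str, str]:
    """Parse 'name: value' lines into a dict, skipping lines without a separator."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def management_server_connected(stats: Dict[str, str]) -> Optional[bool]:
    """Return True/False from control_plane.connected_state, or None if it is absent."""
    value = stats.get("control_plane.connected_state")
    if value is None:
        return None
    return value == "1"


# Hypothetical sample output, made up for this example.
SAMPLE = """control_plane.connected_state: 0
control_plane.pending_requests: 0
control_plane.rate_limit_enforced: 0
"""

if management_server_connected(parse_stats(SAMPLE)) is False:
    print("management server unreachable; Envoy is serving its last known configuration")
```

Returning None when the statistic is missing keeps "disconnected" distinct from "no data", which usually calls for a different alert.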

Statistics

The management server has a statistics tree rooted at the control_plane. prefix, with the following statistics:

Name	Type	Description
connected_state	Gauge	A boolean (1 for connected and 0 for disconnected) that indicates the current connection state with the management server
rate_limit_enforced	Counter	Total number of times the rate limit was enforced for management server requests
pending_requests	Gauge	Total number of pending requests when the rate limit was enforced
identifier	TextReadout	The identifier of the control plane instance that sent the last discovery response

xDS subscription statistics

Envoy discovers its various dynamic resources via discovery services referred to as xDS. Resources are requested via subscriptions, by specifying a filesystem path to watch, initiating gRPC streams or polling a REST-JSON URL.

The following statistics are generated for all subscriptions.

Name	Type	Description
config_reload	Counter	Total API fetches that resulted in a config reload due to a different config
config_reload_time_ms	Gauge	Timestamp of the last config reload as milliseconds since the epoch
init_fetch_timeout	Counter	Total initial fetch timeouts
update_attempt	Counter	Total API fetches attempted
update_success	Counter	Total API fetches completed successfully
update_failure	Counter	Total API fetches that failed because of network errors
update_rejected	Counter	Total API fetches that failed because of schema/validation errors
update_time	Gauge	Timestamp of the last successful API fetch attempt as milliseconds since the epoch. Refreshed even after a trivial configuration reload that contained no configuration changes.
version	Gauge	Hash of the contents from the last successful API fetch
version_text	TextReadout	The version text from the last successful API fetch
control_plane.connected_state	Gauge	A boolean (1 for connected and 0 for disconnected) that indicates the current connection state with the management server
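One way to act on the subscription statistics is to compare two snapshots taken some time apart, since the counters only ever grow. The sketch below is a minimal example under that assumption: it takes two dicts holding one subscription's stats (with the subscription's prefix already stripped) and lists anything that looks wrong. The function name, the dict layout, and the five-minute staleness threshold are all hypothetical.

```python
from typing import Dict, List


def check_subscription(prev: Dict[str, int], curr: Dict[str, int],
                       now_ms: int, max_age_ms: int = 300_000) -> List[str]:
    """Compare two snapshots of one subscription's stats and list findings."""
    findings = []

    # Counters are cumulative, so only the change between snapshots matters.
    rejected = curr.get("update_rejected", 0) - prev.get("update_rejected", 0)
    failed = curr.get("update_failure", 0) - prev.get("update_failure", 0)
    if rejected > 0:
        findings.append(f"{rejected} update(s) rejected by schema/validation checks")
    if failed > 0:
        findings.append(f"{failed} fetch(es) failed because of network errors")

    # update_time is milliseconds since the epoch; a missing value counts as stale.
    age_ms = now_ms - curr.get("update_time", 0)
    if age_ms > max_age_ms:
        findings.append(f"no successful fetch for {age_ms} ms")
    return findings


# Hypothetical snapshots, made up for this example.
prev = {"update_failure": 0, "update_rejected": 0, "update_time": 1_700_000_000_000}
curr = {"update_failure": 1, "update_rejected": 1, "update_time": 1_700_000_000_000}

for finding in check_subscription(prev, curr, now_ms=1_700_000_400_000):
    print(finding)
```

Because update_time is refreshed even when a fetch carries no configuration changes, a stale value here points to fetches not succeeding at all rather than to an unchanged configuration.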