Performance
Envoy is architected to optimize scalability and resource utilization by running an event loop on a small number of threads. The “main” thread is responsible for control plane processing, and each “worker” thread handles a portion of the data plane processing. Envoy exposes two statistics to monitor performance of the event loops on all these threads.
Loop duration: Some amount of processing is done on each iteration of the event loop. This amount will naturally vary with changes in load. However, if one or more threads have an unusually long-tailed loop duration, it may indicate a performance issue. For example, work might not be distributed fairly across the worker threads, or there may be a long blocking operation in an extension that’s impeding progress.
Poll delay: On each iteration of the event loop, the event dispatcher polls for I/O events and “wakes up” either when some I/O events are ready to be processed or when a timeout fires, whichever occurs first. In the case of a timeout, we can measure the difference between the expected wakeup time and the actual wakeup time after polling; this difference is called the “poll delay.” It’s normal to see some small poll delay, usually equal to the kernel scheduler’s “time slice” or “quantum” (which depends on the operating system on which Envoy is running), but if this number rises substantially above its normal observed baseline, it likely indicates kernel scheduler delays.
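To make the two measurements concrete, here is a minimal sketch of a toy timer-driven loop in Python. It is not Envoy's implementation: the function name run_toy_loop and its parameters are hypothetical, the "poll" is a plain sleep with no I/O (so every wakeup is a timeout), and loop duration is taken as the time spent running the work callback after wakeup.

```python
import time


def run_toy_loop(iterations, timeout_s, work):
    """Run a toy timer-driven loop; return (loop_durations_us, poll_delays_us)."""
    loop_durations_us = []
    poll_delays_us = []
    for _ in range(iterations):
        # "Poll": there is no I/O in this toy, so the wait always ends by timeout.
        expected_wakeup = time.monotonic() + timeout_s
        time.sleep(timeout_s)
        actual_wakeup = time.monotonic()
        # Poll delay: how late the loop woke up relative to the timer deadline.
        # Clamped at zero to absorb clock-resolution effects.
        poll_delays_us.append(max(0.0, (actual_wakeup - expected_wakeup) * 1e6))
        # Loop duration: time spent processing work in this iteration.
        work()
        loop_durations_us.append((time.monotonic() - actual_wakeup) * 1e6)
    return loop_durations_us, poll_delays_us


if __name__ == "__main__":
    durations, delays = run_toy_loop(5, 0.01, lambda: sum(range(10_000)))
    print("max loop duration (us):", max(durations))
    print("max poll delay (us):", max(delays))
```

A blocking call inside work would show up as a long loop duration, while a busy or oversubscribed host would tend to show up as a larger poll delay.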
These statistics can be enabled by setting enable_dispatcher_stats to true in the bootstrap configuration.
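As a rough illustration, the following Python sketch adds the flag to a bootstrap represented as a dictionary and prints it as JSON. It assumes enable_dispatcher_stats is a top-level bootstrap field; the helper name with_dispatcher_stats is hypothetical, and the printed fragment is not a complete bootstrap on its own.

```python
import json


def with_dispatcher_stats(bootstrap):
    """Return a copy of a bootstrap dict with dispatcher stats enabled."""
    updated = dict(bootstrap)  # shallow copy; the input dict is left unchanged
    updated["enable_dispatcher_stats"] = True
    return updated


if __name__ == "__main__":
    # Fragment only: a real bootstrap carries many more fields than this.
    print(json.dumps(with_dispatcher_stats({}), indent=2))
```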
Warning
Note that enabling dispatcher stats records a value for each iteration of the event loop on every thread. This normally adds minimal overhead, but when statsd is used, each observed value is sent over the wire individually because the statsd protocol has no way to represent a histogram summary. Be aware that this can amount to a very large volume of data.
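A back-of-the-envelope estimate can help size this volume before enabling the stats with statsd. The sketch below is only arithmetic: the function name and the example figures (thread count, loop rate) are hypothetical, and it treats two histogram values per iteration as an upper bound.

```python
def statsd_samples_per_second(threads, loops_per_second_per_thread,
                              histograms_per_loop=2):
    """Upper-bound estimate of histogram samples sent to statsd each second."""
    return threads * loops_per_second_per_thread * histograms_per_loop


if __name__ == "__main__":
    # Hypothetical deployment: 1 main thread + 8 workers, 1000 loop iterations/s each.
    print(statsd_samples_per_second(threads=9, loops_per_second_per_thread=1000))
```

With these assumed figures the estimate is 18000 samples per second from a single Envoy process.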
Event loop statistics
The event dispatcher for the main thread has a statistics tree rooted at server.dispatcher., and the event dispatcher for each worker thread has a statistics tree rooted at listener_manager.worker_<id>.dispatcher., each with the following statistics:
| Name | Type | Description |
|---|---|---|
| loop_duration_us | Histogram | Event loop durations in microseconds |
| poll_delay_us | Histogram | Polling delays in microseconds |
Note that any auxiliary threads are not included here.
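For dashboards or alerts, it can be convenient to generate the full stat names from the two roots described above. This is a small sketch: the helper name dispatcher_stat_names is hypothetical, and it assumes worker ids are consecutive integers starting at 0.

```python
DISPATCHER_STATS = ("loop_duration_us", "poll_delay_us")


def dispatcher_stat_names(num_workers):
    """List dispatcher histogram names for the main thread and each worker."""
    prefixes = ["server.dispatcher."]
    # Assumption: worker ids are 0, 1, ..., num_workers - 1.
    prefixes += [f"listener_manager.worker_{i}.dispatcher."
                 for i in range(num_workers)]
    return [prefix + stat for prefix in prefixes for stat in DISPATCHER_STATS]


if __name__ == "__main__":
    for name in dispatcher_stat_names(2):
        print(name)
```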
Watchdog
In addition to event loop statistics, Envoy also includes a configurable watchdog system that can increment statistics when Envoy is not responsive and optionally kill the server. The system has two separate watchdog configs, one for the main thread and one for worker threads; this is helpful because the different threads have different workloads. The system also has an extension point that allows custom actions to be taken based on watchdog events. The statistics are useful for understanding at a high level whether Envoy’s event loop is unresponsive because it is doing too much work, blocking, or not being scheduled by the OS.
The watchdog emits aggregated statistics in both main_thread and workers. In addition, it emits per-thread statistics under server.<thread_name>. trees, where <thread_name> is main_thread, worker_0, worker_1, and so on.
| Name | Type | Description |
|---|---|---|
| watchdog_miss | Counter | Number of standard misses |
| watchdog_mega_miss | Counter | Number of mega misses |
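The relationship between the two counters can be sketched as a threshold check on how long a thread has gone without reporting in. This is a simplified illustration, not a description of Envoy's internals: the names WatchdogCounters and check_thread, the timestamp arguments, and the threshold values are all hypothetical, and the sketch counts on every check rather than once per stall.

```python
from dataclasses import dataclass


@dataclass
class WatchdogCounters:
    watchdog_miss: int = 0
    watchdog_mega_miss: int = 0


def check_thread(counters, now_s, last_touch_s, miss_timeout_s, megamiss_timeout_s):
    """Bump counters according to how long the thread has been silent."""
    stalled_for_s = now_s - last_touch_s
    if stalled_for_s > miss_timeout_s:
        counters.watchdog_miss += 1
    if stalled_for_s > megamiss_timeout_s:
        # A mega miss is a longer stall, so it is also counted as a miss above.
        counters.watchdog_mega_miss += 1
    return counters


if __name__ == "__main__":
    counters = WatchdogCounters()
    # Hypothetical thresholds: 0.2 s for a miss, 1.0 s for a mega miss.
    check_thread(counters, now_s=10.0, last_touch_s=9.5,
                 miss_timeout_s=0.2, megamiss_timeout_s=1.0)
    print(counters)
```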