Affects Version/s: None
Fix Version/s: DC/OS 1.13.0
Sprint: Security Sprint 43
Parent Initiative: D2IQ-44281 - [DC/OS] Instrument and Transmit Metrics for Critical Components (via dcos-telegraf) to enable operator visibility to customer workload from a single service
With DCOS_OSS-4596, we will have Admin Router response latency measurements available in Grafana.
These measurements are an arithmetic mean of the latency over the last minute, as provided by nginx-module-vts.
A one-minute mean does not show the real latency of individual requests, or how latency changes over intervals shorter than one minute.
We definitely want to reduce the scrape interval to get more precise measurements.
We must be careful not to make the scrape interval too small, or the next scrape will start before the previous one has finished.
A shorter interval also means collecting more data, so we may need to implement cleanup or rotation.
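
As a rough way to reason about the two constraints above, here is a minimal Python sketch. The function names, the 50% headroom factor, and the example numbers are all assumptions for illustration and are not taken from dcos-telegraf or its configuration.

```python
SECONDS_PER_DAY = 86400


def interval_is_safe(interval_s, observed_scrape_durations_s, headroom=0.5):
    """Return True if the slowest observed scrape fits in a fraction of the interval.

    `headroom` is an assumed safety factor: with 0.5, a scrape may take at
    most half of the interval before we consider the interval too small.
    """
    worst = max(observed_scrape_durations_s)
    return worst <= interval_s * headroom


def samples_per_day(interval_s, series_count):
    """Estimate how many samples are stored per day for a given interval."""
    return (SECONDS_PER_DAY // interval_s) * series_count


# Hypothetical measurements: scrape durations in seconds.
durations = [0.8, 1.2, 3.0]
ten_second_ok = interval_is_safe(10, durations)   # 3.0 <= 5.0
five_second_ok = interval_is_safe(5, durations)   # 3.0 <= 2.5 fails

# Going from a 60 s to a 10 s interval multiplies the stored samples by six.
per_day_60s = samples_per_day(60, series_count=100)
per_day_10s = samples_per_day(10, series_count=100)
```

The second function gives a first estimate of whether cleanup or rotation is needed at the chosen interval.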
We should also choose what to do with that data:
- Stick with the arithmetic mean provided by nginx-module-vts
- Change to the weighted moving average provided by nginx-module-vts
- Do our own maths to show a metric other than the arithmetic mean or weighted moving average
- Export Prometheus histograms that expose high latencies
- Metrics are exported and displayed in corresponding dashboards
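
To compare the options, the following Python sketch applies an arithmetic mean, a weighted moving average, and cumulative histogram buckets to the same set of latencies. The sample values and bucket bounds are made up, and the linear weighting used here is an assumption that may differ from what nginx-module-vts actually computes.

```python
from bisect import bisect_left


def arithmetic_mean(samples):
    """Plain average: every sample counts equally."""
    return sum(samples) / len(samples)


def weighted_moving_average(samples):
    """Linearly weighted average (assumed weighting): oldest sample has
    weight 1, newest has weight len(samples)."""
    weights = range(1, len(samples) + 1)
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)


def cumulative_histogram(samples, upper_bounds):
    """Prometheus-style cumulative buckets: for each upper bound, the
    number of samples less than or equal to it, plus a final +Inf bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for s in samples:
        # Index of the first bucket whose upper bound is >= s.
        counts[bisect_left(upper_bounds, s)] += 1
    result = {}
    running = 0
    for bound, count in zip(list(upper_bounds) + [float("inf")], counts):
        running += count
        result[bound] = running
    return result


# Hypothetical latencies in milliseconds: nine fast requests, then one slow one.
latencies_ms = [10] * 9 + [1000]

mean = arithmetic_mean(latencies_ms)
wma = weighted_moving_average(latencies_ms)
hist = cumulative_histogram(latencies_ms, [50, 100, 500])
```

With this input, both averages land on a value that no request actually experienced, while the histogram shows that nine requests finished within 50 ms and one took longer than 500 ms.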