

DC Telemetry & Observability


Telemetry and observability is the integrating discipline that aggregates operational data from across the facility, the network fabric, and the IT and application layers, and produces the observability platform that operators use to understand, troubleshoot, and improve the system. The discipline sits one layer above the source-layer monitoring pages (Power Monitoring, Cooling Monitoring, Water Monitoring, Emissions Monitoring) and one layer below the consuming disciplines (AIOps, Platform Reliability Engineering, SLA/SLO Management). Source-layer monitoring produces sensor data; observability platforms aggregate, correlate, and surface it for human and automated consumers.


The three pillars

Modern observability practice organizes data into three primary pillars - metrics, logs, and traces - plus events as a fourth category. Each has different storage, query, and analysis characteristics; mature platforms ingest all of these and correlate across them.

Pillar | What it is | Primary use
Metrics | Numeric measurements over time; CPU utilization, request rates, error counts, PUE, temperatures | Dashboards, alerting, capacity trending, SLO measurement
Logs | Discrete event records; application logs, system logs, audit logs | Incident investigation, audit trails, debugging, security analysis
Traces | End-to-end request paths through distributed systems; spans showing service-to-service calls | Latency analysis, dependency mapping, performance optimization
Events | State changes worth noting; deployments, incidents, configuration changes | Change correlation, deployment tracking, incident timeline reconstruction

Telemetry sources

Source | Examples | Consumed by
Application instrumentation | OpenTelemetry SDKs, custom instrumentation, framework auto-instrumentation | PRE, application teams, SLO platforms
Container and orchestration | Kubernetes metrics, container resource usage, pod lifecycle events | Platform teams, capacity planning, AIOps
Server and OS | CPU, memory, disk, network metrics; system logs; BMC telemetry | Hardware Fleet Management, capacity planning
Network fabric | SNMP, gNMI streaming telemetry, sFlow/NetFlow/IPFIX, switch logs | Network Operations, capacity planning
Facility infrastructure | EPMS, BMS, DCIM telemetry; power, cooling, environmental sensors | Facility Operations, sustainability reporting
Security tooling | SIEM events, EDR alerts, IDS/IPS, access logs, CCTV | Security Operations
GPU and accelerator telemetry | DCGM metrics, ECC errors, thermal data, NVLink utilization | AI training operations, Hardware Fleet, AIOps
Cloud provider services | AWS CloudWatch, Azure Monitor, GCP Cloud Monitoring native metrics | Cloud-native applications, hybrid observability

OpenTelemetry

OpenTelemetry has emerged as the dominant standard for observability instrumentation. The CNCF project provides vendor-neutral APIs, SDKs, and protocols (OTLP) for emitting metrics, logs, and traces from applications and infrastructure. Most major commercial observability platforms support OTLP ingestion; many provide their own SDKs that produce OpenTelemetry-compatible data. The standardization matters operationally because it decouples instrumentation from the observability platform - applications instrument once with OpenTelemetry and can ship that data to multiple backends without re-instrumentation. The broader vision of "instrument once, query anywhere" is still partial, but OpenTelemetry has made it substantially more achievable than the previous landscape of vendor-specific agents and protocols.
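The decoupling can be sketched as a minimal exporter interface: the application emits through one pipeline, and backends are swapped at configuration time. The class and method names here are illustrative stand-ins, not the actual OpenTelemetry SDK API:

```python
from typing import Protocol

class Exporter(Protocol):
    # Anything that can receive a signal; in OTLP terms, a backend endpoint.
    def export(self, signal: dict) -> None: ...

class InMemoryExporter:
    # Stand-in for a real backend; records what it receives.
    def __init__(self):
        self.seen = []
    def export(self, signal: dict) -> None:
        self.seen.append(signal)

class Pipeline:
    # One instrumentation point fans out to any number of backends,
    # mirroring OTLP's role as a common wire format.
    def __init__(self, exporters: list[Exporter]):
        self.exporters = exporters
    def emit(self, signal: dict) -> None:
        for e in self.exporters:
            e.export(signal)

backend_a, backend_b = InMemoryExporter(), InMemoryExporter()
pipe = Pipeline([backend_a, backend_b])
pipe.emit({"type": "metric", "name": "pue", "value": 1.21})
# Both backends receive the same signal without re-instrumentation.
```

Swapping or adding a backend is a configuration change to the exporter list, not a code change to the instrumented application - which is the operational payoff of the standard.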


eBPF and kernel-level observability

eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that allows observability and security tooling to run in kernel space without modifying the kernel itself. The technology has transformed observability over the past five years by enabling deep visibility into system behavior - syscall tracking, network packet analysis, performance profiling, security event detection - with minimal performance overhead and without requiring per-application instrumentation. Major observability and security platforms (Cilium for networking, Pixie for application observability, Tetragon for security, Datadog and Dynatrace for general observability) now use eBPF as a primary data source. The 2024-2026 wave of eBPF-native observability platforms is replacing earlier patterns that relied on heavyweight agents or per-application instrumentation.


Cardinality and cost

The dominant operational concern in large-scale observability is cardinality - the number of unique time series or unique log streams the platform must store and query. Cardinality grows multiplicatively with the number of dimensions in the data (per-pod metrics across thousands of pods across hundreds of services across dozens of regions can produce billions of unique time series). Storage cost, query performance, and ingestion cost all degrade roughly in proportion to cardinality. Mature observability practice includes cardinality management as a first-class engineering concern: dropping high-cardinality dimensions that aren't queried, sampling high-volume data, and tiering storage for older data. Several major observability vendor outages and customer cost surprises in 2023-2024 traced to cardinality explosions; the discipline of treating cardinality as a budget rather than an unlimited capability has become standard practice.
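The multiplicative growth is simple arithmetic. A sketch with assumed dimension sizes (illustrative numbers, not a real fleet) shows why dropping one high-cardinality label is so effective:

```python
from math import prod

# Assumed label dimensions for a single metric name (illustrative only):
dimensions = {
    "pod": 5000,         # unique pod names - the high-cardinality offender
    "service": 200,
    "region": 20,
    "status_code": 10,
}

# Worst case: every label combination produces its own time series.
series = prod(dimensions.values())
print(series)            # 200_000_000 series for one metric name

# Dropping the per-pod label (rarely queried at fleet scope) collapses it:
without_pod = prod(v for k, v in dimensions.items() if k != "pod")
print(without_pod)       # 40_000 series
```

The same arithmetic is why a "cardinality budget" works: each added label multiplies, rather than adds to, the series count, so the review question for any new dimension is whether its queries justify a multiplier.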


Observability platform landscape

Platform | Vendor | Distinctive characteristics
Datadog | Datadog | Unified metrics, logs, traces, RUM, security; broad enterprise adoption
New Relic | New Relic | All-in-one observability; consumption-based pricing model since 2020 reset
Dynatrace | Dynatrace | Davis AI for automated root-cause; strong enterprise APM heritage
Splunk Observability | Splunk (Cisco) | SignalFx-derived APM and infrastructure monitoring; pairs with Splunk logs
Honeycomb | Honeycomb | High-cardinality query focus; observability for distributed systems
Grafana Cloud | Grafana Labs | Open-source-rooted; Mimir, Loki, Tempo for metrics, logs, traces
Prometheus + Grafana (open source) | CNCF | Dominant open-source stack for Kubernetes observability; ubiquitous in cloud-native
Elastic Stack (ELK/Elastic Observability) | Elastic | Logs heritage extended to metrics, traces, APM
Lightstep | ServiceNow | Distributed tracing focus; OpenTelemetry-native
Cloud-native managed | AWS CloudWatch, Azure Monitor, Google Cloud Operations | Native to each cloud; tight integration with cloud services
Hyperscaler internal | Google internal, Meta Scuba, Microsoft, AWS internal | Custom-built for fleet scale; not commercially available; informed open-source design
eBPF-native observability | Cilium, Pixie (acquired by New Relic), Tetragon | Kernel-level visibility without per-application instrumentation

Cross-domain correlation

Cross-domain correlation is the analytical capability to connect signals across the IT/OT boundary - linking application slowness to a cooling event, connecting database latency to a network capacity issue, correlating power quality events with subsequent server failures. Most operators run separate observability stacks for IT (Datadog, Prometheus) and for the facility (DCIM, BMS, EPMS); cross-correlation requires either dedicated integration or platforms that ingest both domains. The discipline is operationally underdeveloped at most organizations - facility incidents that affect IT often go unrecognized as facility-caused for hours or days because the correlation isn't automated. Hyperscaler internal platforms increasingly run unified observability across both domains; the commercial market is still catching up.
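The core of the automation most organizations lack is a time-windowed join between the two event streams. A hedged sketch, with invented event names and timestamps purely for illustration:

```python
from datetime import datetime, timedelta

# Illustrative event streams; sources, names, and timestamps are assumptions.
facility_events = [
    {"ts": datetime(2025, 3, 1, 14, 2), "source": "BMS", "event": "CRAH-7 fan failure"},
]
it_anomalies = [
    {"ts": datetime(2025, 3, 1, 14, 6), "source": "APM", "event": "rack-12 CPU throttling"},
    {"ts": datetime(2025, 3, 1, 18, 0), "source": "APM", "event": "deploy latency spike"},
]

def correlate(facility, it, window=timedelta(minutes=15)):
    """Pair each IT anomaly with any facility event that preceded it
    within the window - a candidate facility cause, not proof of one."""
    pairs = []
    for anomaly in it:
        for fac in facility:
            if timedelta(0) <= anomaly["ts"] - fac["ts"] <= window:
                pairs.append((fac["event"], anomaly["event"]))
    return pairs

print(correlate(facility_events, it_anomalies))
# The CPU throttling 4 minutes after the fan failure survives the window
# test; the latency spike hours later does not.
```

Real implementations add rack-to-CRAH topology mapping and statistical baselining on top of this join, but the time window is the piece that turns two disconnected dashboards into a causal hypothesis.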


Retention and tiering

Observability data has different value at different ages. Real-time queries against the last hour of metrics are the dominant use case; investigations may reach back days; compliance and audit cases reach back years. Modern platforms tier storage to match - hot tier for real-time query (in-memory or fast SSD), warm tier for recent investigation (slower SSD or fast object storage), and cold tier for long-term retention (cheap object storage with slower query). The retention policy decisions cross-reference compliance frameworks - SOC 2, HIPAA, PCI-DSS, and GDPR all impose minimum retention requirements on different data categories - and the cost optimization decisions are some of the highest-leverage in observability operations.
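The tiering decision reduces to a lookup from data age to storage class. A minimal sketch - the tier boundaries and relative costs below are assumed for illustration, not any platform's defaults:

```python
from datetime import timedelta

# Assumed tier boundaries and relative cost per GB-month (illustrative only).
TIERS = [
    (timedelta(days=1),       "hot",  0.30),   # in-memory / fast SSD, real-time query
    (timedelta(days=30),      "warm", 0.05),   # slower SSD / fast object storage
    (timedelta(days=365 * 7), "cold", 0.004),  # cheap object storage, slow query
]

def tier_for(age: timedelta) -> str:
    """Return the storage tier for data of the given age. Data older than
    the final boundary - the compliance-driven retention ceiling - is
    eligible for deletion."""
    for boundary, name, _cost in TIERS:
        if age <= boundary:
            return name
    return "expired"

print(tier_for(timedelta(hours=2)))   # hot
print(tier_for(timedelta(days=10)))   # warm
print(tier_for(timedelta(days=400)))  # cold
```

The leverage comes from the cost spread across tiers: with a roughly two-orders-of-magnitude difference between hot and cold storage, moving a tier boundary by even a few days can dominate the platform's storage bill.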


Where this fits

Telemetry and observability is the integrating discipline above source-layer monitoring (Power Monitoring, Cooling Monitoring, Water Monitoring, Emissions Monitoring) and below the consuming disciplines (AIOps for ML-driven analysis, Platform Reliability Engineering for incident response, SLA/SLO Management for SLI measurement). Network telemetry feeds Network Operations; hardware telemetry feeds Hardware Fleet Management; security telemetry feeds Security. Compliance evidence flows to GRC:Auditability.