DC Telemetry & Observability
Telemetry and observability is the integrating discipline that aggregates operational data from the facility, the network fabric, and the IT and application layers, and produces the observability platform that operators use to understand, troubleshoot, and improve the system. The discipline sits one layer above the source-layer monitoring pages (Power Monitoring, Cooling Monitoring, Water Monitoring, Emissions Monitoring) and one layer below the consuming disciplines (AIOps, Platform Reliability Engineering, SLA/SLO Management). Source-layer monitoring produces sensor data; observability platforms aggregate, correlate, and surface it for human and automated consumers.
The three pillars
Modern observability practice organizes data into three primary pillars - metrics, logs, and traces - plus events as a fourth category. Each has different storage, query, and analysis characteristics; mature platforms ingest all of these and correlate across them. The sketch after the table below illustrates how a single request can surface in each category.
| Pillar | What it is | Primary use |
|---|---|---|
| Metrics | Numeric measurements over time; CPU utilization, request rates, error counts, PUE, temperatures | Dashboards, alerting, capacity trending, SLO measurement |
| Logs | Discrete event records; application logs, system logs, audit logs | Incident investigation, audit trails, debugging, security analysis |
| Traces | End-to-end request paths through distributed systems; spans showing service-to-service calls | Latency analysis, dependency mapping, performance optimization |
| Events | State changes worth noting; deployments, incidents, configuration changes | Change correlation, deployment tracking, incident timeline reconstruction |
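A minimal illustration of how one failed checkout request might surface in each category; the field names and values below are hypothetical, and real schemas vary by platform:

```python
# Hypothetical records for one failed HTTP request, one per pillar.

metric_sample = {                 # metrics: a numeric value with labels and a timestamp
    "name": "http_requests_total",
    "labels": {"service": "checkout", "status": "500", "region": "us-east-1"},
    "value": 1,
    "timestamp": "2025-06-01T12:00:03Z",
}

log_record = {                    # logs: a discrete, structured event record
    "timestamp": "2025-06-01T12:00:03.412Z",
    "level": "ERROR",
    "service": "checkout",
    "message": "payment provider timeout after 2000ms",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}

span = {                          # traces: one span in an end-to-end request path
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "53995c3f42cd8ad8",
    "name": "POST /charge",
    "duration_ms": 2003,
}

deploy_event = {                  # events: a state change used for correlation
    "type": "deployment",
    "service": "checkout",
    "version": "v2.14.1",
    "timestamp": "2025-06-01T11:58:00Z",
}
```

The shared trace_id is what lets a platform pivot from a metric-driven alert to the matching logs and spans.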
Telemetry sources
| Source | Examples | Consumed by |
|---|---|---|
| Application instrumentation | OpenTelemetry SDKs, custom instrumentation, framework auto-instrumentation | Platform Reliability Engineering, application teams, SLO platforms |
| Container and orchestration | Kubernetes metrics, container resource usage, pod lifecycle events | Platform teams, capacity planning, AIOps |
| Server and OS | CPU, memory, disk, network metrics; system logs; BMC telemetry | Hardware Fleet Management, capacity planning |
| Network fabric | SNMP, gNMI streaming telemetry, sFlow/NetFlow/IPFIX, switch logs | Network Operations, capacity planning |
| Facility infrastructure | EPMS, BMS, DCIM telemetry; power, cooling, environmental sensors | Facility Operations, sustainability reporting |
| Security tooling | SIEM events, EDR alerts, IDS/IPS, access logs, CCTV | Security Operations |
| GPU and accelerator telemetry | DCGM metrics, ECC errors, thermal data, NVLink utilization | AI training operations, Hardware Fleet, AIOps |
| Cloud provider services | AWS CloudWatch, Azure Monitor, GCP Cloud Monitoring native metrics | Cloud-native applications, hybrid observability |
OpenTelemetry
OpenTelemetry has emerged as the dominant standard for observability instrumentation. The CNCF project provides vendor-neutral APIs, SDKs, and a wire protocol (OTLP) for emitting metrics, logs, and traces from applications and infrastructure. Most major commercial observability platforms support OTLP ingestion; many provide their own SDKs that produce OpenTelemetry-compatible data. The standardization matters operationally because it decouples instrumentation from the observability platform - applications instrument once with OpenTelemetry and can ship that data to multiple backends without re-instrumentation. The broader vision of "instrument once, query anywhere" is still only partially realized, but OpenTelemetry has made it substantially more achievable than the previous landscape of vendor-specific agents and protocols.
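A minimal sketch of the "instrument once" pattern in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and an OTLP collector is reachable; the endpoint and service name are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service and wire up an OTLP/gRPC exporter to a collector.
resource = Resource.create({"service.name": "checkout"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel-collector.internal:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

# Application code emits spans through the vendor-neutral API.
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.provider", "example-psp")
    # ... business logic ...
```

Swapping the backend - or fanning out to several via an OpenTelemetry Collector - is a configuration change at the exporter or collector, not a re-instrumentation of the application.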
eBPF and kernel-level observability
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that lets observability and security tooling run sandboxed programs in kernel space without modifying kernel source or loading kernel modules. The technology has transformed observability over the past five years by enabling deep visibility into system behavior - syscall tracing, network packet analysis, performance profiling, security event detection - with minimal performance overhead and without requiring per-application instrumentation. Major observability and security platforms (Cilium for networking, Pixie for application observability, Tetragon for security, Datadog and Dynatrace for general observability) now use eBPF as a primary data source. The 2024-2026 wave of eBPF-native observability platforms is replacing earlier patterns that relied on heavyweight agents or per-application instrumentation.
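As a sketch of what instrumentation-free, kernel-level visibility looks like, the following assumes the BCC Python bindings, root privileges, and the raw_syscalls tracepoint; it counts syscalls per process for ten seconds without touching any application:

```python
from time import sleep
from bcc import BPF

# eBPF program (compiled and loaded into the kernel by BCC): count syscall
# entries per process ID in a BPF hash map.
program = r"""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)      # verify and attach in kernel space
sleep(10)                  # observe for ten seconds

# Read the map from user space and print the noisiest processes.
top = sorted(b["counts"].items(), key=lambda kv: kv[1].value, reverse=True)[:10]
for pid, count in top:
    print(f"pid={pid.value:<8} syscalls={count.value}")
```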
Cardinality and cost
The dominant operational concern in observability at scale is cardinality - the number of unique time series or unique log streams the platform must store and query. Cardinality grows multiplicatively with the number of dimensions in the data: per-pod metrics across thousands of pods, hundreds of services, and dozens of regions can produce billions of unique time series. Storage cost, ingestion cost, and query latency all grow roughly in proportion to it. Mature observability practice treats cardinality management as a first-class engineering concern: dropping high-cardinality dimensions that aren't queried, sampling high-volume data, and tiering storage for older data. Several major observability vendor outages and customer cost surprises in 2023-2024 traced back to cardinality explosions; treating cardinality as a budget rather than an unlimited capability has become standard practice.
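A back-of-the-envelope sketch of the multiplicative growth, with made-up fleet numbers:

```python
# Illustrative cardinality estimate; every figure below is an assumption.
pods_per_service = 200
services = 300
regions = 12
metrics_per_pod = 150      # distinct metric names each pod exports

active_series = pods_per_service * services * regions * metrics_per_pod
print(f"{active_series:,} active time series")              # 108,000,000

# Dropping a never-queried label (e.g. the per-pod ID) before ingestion
# collapses that dimension and cuts cardinality by the same factor.
aggregated_series = services * regions * metrics_per_pod
print(f"{aggregated_series:,} series after aggregation")    # 540,000
```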
Observability platform landscape
| Platform | Vendor | Distinguishing characteristics |
|---|---|---|
| Datadog | Datadog | Unified metrics, logs, traces, RUM, security; broad enterprise adoption |
| New Relic | New Relic | All-in-one observability; consumption-based pricing model since 2020 reset |
| Dynatrace | Dynatrace | Davis AI for automated root-cause; strong enterprise APM heritage |
| Splunk Observability | Splunk (Cisco) | SignalFx-derived APM and infrastructure monitoring; pairs with Splunk logs |
| Honeycomb | Honeycomb | High-cardinality query focus; observability for distributed systems |
| Grafana Cloud / Grafana Labs | Grafana Labs | Open-source-rooted; Mimir, Loki, Tempo for metrics, logs, traces |
| Prometheus + Grafana (open source) | CNCF | Dominant open-source stack for Kubernetes observability; ubiquitous in cloud-native |
| Elastic Stack (ELK/Elastic Observability) | Elastic | Logs heritage extended to metrics, traces, APM |
| Lightstep | ServiceNow | Distributed tracing focus; OpenTelemetry-native |
| Cloud-native managed | AWS CloudWatch, Azure Monitor, Google Cloud Operations | Native to each cloud; tight integration with cloud services |
| Hyperscaler internal | Google internal, Meta Scuba, Microsoft, AWS internal | Custom-built for fleet scale; not commercially available; designs have informed open-source tools (Google's Borgmon influenced Prometheus) |
| eBPF-native observability | Cilium, Pixie (acquired by New Relic), Tetragon | Kernel-level visibility without per-application instrumentation |
Cross-domain correlation
Cross-domain correlation is the analytical capability to connect signals across the IT/OT boundary - linking application slowness to a cooling event, connecting database latency to a network capacity issue, correlating power quality events with subsequent server failures. Most operators run separate observability for IT (Datadog, Prometheus) and facility (DCIM, BMS, EPMS); cross-correlation requires either dedicated integration or platforms that ingest both domains. The discipline is operationally underdeveloped at most organizations - facility incidents that affect IT often go unrecognized as facility-caused for hours or days because the correlation isn't automated. Hyperscaler internal platforms increasingly run unified observability across both domains; the commercial market is still catching up.
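In its simplest form the correlation is a time-window join between a facility event feed and application health signals; the sketch below assumes hypothetical feeds, field names, and a 15-minute lookback window:

```python
from datetime import datetime, timedelta

# Hypothetical inputs: one normalized feed from facility systems (BMS/EPMS),
# one from the IT observability platform.
facility_events = [
    {"time": datetime(2025, 6, 1, 12, 2), "source": "BMS", "detail": "CRAH-4 fan failure"},
]
latency_spikes = [
    {"time": datetime(2025, 6, 1, 12, 9), "service": "checkout", "p99_ms": 1840},
]

WINDOW = timedelta(minutes=15)   # how far back to look for a facility-side cause

for spike in latency_spikes:
    for event in facility_events:
        lag = spike["time"] - event["time"]
        if timedelta(0) <= lag <= WINDOW:
            print(f'{spike["service"]} p99 spike at {spike["time"]:%H:%M} '
                  f'follows {event["source"]} event: {event["detail"]} ({lag} earlier)')
```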
Retention and tiering
Observability data has different value at different ages. Real-time queries against the last hour of metrics are the dominant use case; investigations may reach back days; compliance and audit cases reach back years. Modern platforms tier storage to match - hot tier for real-time query (in-memory or fast SSD), warm tier for recent investigation (slower SSD or fast object storage), and cold tier for long-term retention (cheap object storage with slower query). Retention policy decisions cross-reference compliance frameworks - SOC 2, HIPAA, PCI-DSS, and GDPR each impose retention requirements on different data categories, minimums in some cases and, for personal data under GDPR, effective maximums - and the resulting cost optimization decisions are among the highest-leverage in observability operations.
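An illustrative tier-selection rule keyed on data age; the boundaries and the roughly seven-year cold retention are assumptions, not requirements of any particular framework:

```python
from datetime import timedelta

# (max age, tier name, storage characteristics) - ordered hot to cold.
TIERS = [
    (timedelta(hours=24),  "hot",  "in-memory / NVMe, sub-second query"),
    (timedelta(days=30),   "warm", "slower SSD or fast object storage"),
    (timedelta(days=2555), "cold", "cheap object storage, slow query (~7 years)"),
]

def tier_for(age: timedelta) -> str:
    """Return the storage tier for telemetry of a given age."""
    for max_age, name, _ in TIERS:
        if age <= max_age:
            return name
    return "expired"   # past retention: eligible for deletion

print(tier_for(timedelta(minutes=30)))   # hot
print(tier_for(timedelta(days=400)))     # cold
print(tier_for(timedelta(days=3000)))    # expired
```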
Where this fits
Telemetry and observability is the integrating discipline above source-layer monitoring (Power Monitoring, Cooling Monitoring, Water Monitoring, Emissions Monitoring) and below the consuming disciplines (AIOps for ML-driven analysis, Platform Reliability Engineering for incident response, SLA/SLO Management for SLI measurement). Network telemetry feeds Network Operations; hardware telemetry feeds Hardware Fleet Management; security telemetry feeds Security. Compliance evidence flows to GRC:Auditability.