DC Network Operations
Network operations is the discipline that runs the data center network fabric - the switches and routers, the protocols, the configurations, the traffic, and the relationships with carriers and exchanges. The discipline is operationally distinct from Stack:Networking and Fabrics, which covers the architectural design layer (fabric topology, protocol choices, hardware selection), and from WAN & Comms Ingress, which covers the physical infrastructure where external connectivity enters the facility. Stack:Networking decides what fabric to build; WAN Ingress builds the physical entry point; Network Operations runs the fabric and the WAN edge day-to-day.
Operational scope
| Domain | What it covers |
|---|---|
| Fabric operation | Spine, leaf, and ToR switch operation; configuration management; firmware lifecycle |
| Routing protocol management | BGP peering, IGP (OSPF, IS-IS), route policy, BGP communities |
| Traffic engineering | Path optimization, ECMP balancing, segment routing, traffic shaping |
| Capacity management | Link utilization tracking, congestion identification, growth planning |
| Configuration management | Source-of-truth network config; change validation; rollback |
| Network telemetry | SNMP, streaming telemetry, sFlow/NetFlow/IPFIX collection and analysis |
| Incident response | Network outage triage, fault isolation, restoration, post-incident analysis |
| Carrier and peering | Transit relationships, peering with exchanges and bilateral peers, capacity orders |
| Security operations | DDoS mitigation, BGP hygiene (RPKI, ROV), east-west traffic inspection |
Data center fabric architecture
Modern data center fabrics use a spine-leaf (Clos) topology with all-IP routing rather than the legacy three-tier (core-aggregation-access) Ethernet design. Routing on the host using BGP is increasingly standard at hyperscale: every server runs a BGP speaker that peers with its leaf switch, eliminating spanning tree and providing fast convergence on link failures. Hyperscaler fabrics typically run flat IP routing across thousands of switches with BGP as the primary protocol; some operators use IS-IS as the IGP and reserve BGP for inter-region routing. EVPN-VXLAN provides multi-tenancy on a shared underlay, enabling cloud-style network virtualization on the physical fabric. The architecture is mature; the operational discipline is what distinguishes well-run fabrics from problematic ones.
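To make the BGP-to-the-host pattern concrete, the sketch below derives per-leaf eBGP underlay stanzas for a small pod. The private-ASN scheme, the 100.64.0.0/24 point-to-point pool, and the FRR-style output are assumptions made for this example, not a prescribed design.

```python
"""Sketch: derive an eBGP underlay peering plan for a small leaf-spine pod.

The ASN scheme, address pool, and FRR-style output are illustrative
assumptions for this example, not a specific operator's design.
"""
from ipaddress import ip_network

SPINE_ASN = 4200000000          # shared private ASN for the spine tier (assumption)
LEAF_ASN_BASE = 4200000100      # one private ASN per leaf (assumption)
P2P_POOL = ip_network("100.64.0.0/24")  # pool carved into /31 point-to-point links


def leaf_bgp_config(num_spines: int, num_leaves: int) -> dict[str, str]:
    """Return an FRR-style 'router bgp' stanza for each leaf switch."""
    links = P2P_POOL.subnets(new_prefix=31)  # one /31 per leaf-spine link
    configs = {}
    for leaf in range(num_leaves):
        asn = LEAF_ASN_BASE + leaf
        lines = [f"router bgp {asn}", " bgp bestpath as-path multipath-relax"]
        for spine in range(num_spines):
            link = next(links)
            spine_ip, leaf_ip = tuple(link)  # spine gets the low address, leaf the high
            lines.append(f" ! link {link}: spine {spine_ip}, leaf {leaf_ip}")
            lines.append(f" neighbor {spine_ip} remote-as {SPINE_ASN}")
        configs[f"leaf{leaf}"] = "\n".join(lines)
    return configs


if __name__ == "__main__":
    print(leaf_bgp_config(num_spines=4, num_leaves=2)["leaf0"])
```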
AI fabric operations
AI training fabrics have requirements that traditional data center networking did not address. Per-GPU bandwidth (typical training clusters provision 400G or 800G per H100, with 1.6T per Rubin GPU) drives fabric scale beyond what general cloud workloads need. Synchronized collective operations (all-reduce across thousands of GPUs) require carefully tuned ECMP behavior to avoid hash collisions that strand bandwidth; a toy illustration follows the table below. The choice of InfiniBand (mature at HPC scale, vendor-locked to NVIDIA Quantum/Mellanox) versus Spectrum-X Ethernet (newer, standards-based, used at xAI Colossus) versus traditional Ethernet (cheaper but lower training efficiency) is one of the primary architectural decisions in AI cluster design.
| Fabric type | Where used | Operational concern |
|---|---|---|
| InfiniBand (Quantum-2 400G, Quantum-X800 800G) | Most major H100 superclusters; OpenAI Stargate; Microsoft Azure AI; Meta RSC | NVIDIA-specific tooling; SHARP for in-network reduction; UFM for fabric management |
| Spectrum-X Ethernet | xAI Colossus 100K-200K H100 fabric; growing deployments | Standards-based but NVIDIA-tuned; achieves 95% data throughput vs ~60% for standard Ethernet |
| NVLink Switch fabric (NVL72) | GB200 NVL72 reference designs; Rubin reference designs | Memory-coherent across 72 GPUs at rack scale; new operational paradigm |
| Traditional Ethernet RoCE | Cost-optimized AI deployments; some hyperscaler internal training | RDMA over converged Ethernet; PFC and ECN tuning critical for performance |
| Google ICI | Google TPU pods (all generations) | Google-internal proprietary; not commercially available |
| AWS EFA | AWS Trainium clusters; some H100 capacity on AWS | AWS-internal RDMA-equivalent; pairs with Trainium2 for Anthropic Project Rainier |
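The hash-collision problem called out above is easy to demonstrate with a toy simulation: a handful of long-lived elephant flows hashed onto equal-cost uplinks routinely leaves some links idle while others carry several flows. The flow and link counts below are arbitrary assumptions.

```python
"""Toy illustration of ECMP hash collisions stranding bandwidth.

A few long-lived elephant flows (as in an all-reduce) hashed onto a small
set of uplinks can leave some links idle while others carry two or more
flows. Flow counts and link counts are arbitrary assumptions.
"""
import random
from collections import Counter


def ecmp_link_load(num_flows: int, num_links: int, seed: int = 0) -> Counter:
    """Hash each flow's 5-tuple onto one of `num_links` equal-cost uplinks."""
    rng = random.Random(seed)
    loads = Counter()
    for _ in range(num_flows):
        # Stand-in for a 5-tuple: random src/dst IPs and ports, TCP protocol.
        five_tuple = (rng.getrandbits(32), rng.getrandbits(32),
                      rng.getrandbits(16), rng.getrandbits(16), 6)
        loads[hash(five_tuple) % num_links] += 1
    return loads


if __name__ == "__main__":
    links = 8
    loads = ecmp_link_load(num_flows=8, num_links=links)
    collided = sum(1 for v in loads.values() if v > 1)
    print(f"links carrying 2+ flows: {collided}, idle links: {links - len(loads)}")
```

With eight flows over eight links, roughly a third of the links sit idle on average, which is one reason AI fabrics lean toward adaptive routing or packet-level spraying rather than plain 5-tuple ECMP.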
Switch silicon and operating systems
| Class | Examples | Notes |
|---|---|---|
| Merchant silicon | Broadcom Tomahawk, Trident, Jericho; NVIDIA Spectrum (ex-Mellanox); Cisco Silicon One | Dominates spine and high-radix leaf; enables disaggregated NOS approach |
| Vendor-integrated silicon | Cisco proprietary ASICs, Juniper Trio/Express | Some vendors maintain proprietary silicon; declining outside specific use cases |
| NOS (vendor) | Cisco IOS-XR/NX-OS, Juniper Junos, Arista EOS, Nokia SR Linux | Mature operational tooling; vendor support and certification |
| NOS (open) | SONiC (Microsoft-originated, Linux Foundation), DENT, Cumulus (NVIDIA) | Disaggregated NOS on merchant silicon; hyperscaler-driven adoption |
| P4-programmable | Intel Tofino (legacy; Intel exited), AMD Pensando, NVIDIA BlueField DPUs | Specialized; mostly hyperscaler internal use; programmable data planes |
NetDevOps and configuration management
NetDevOps applies software engineering practice to network operations: configuration is held in version-controlled repositories, changes go through pull-request review, deployment is automated rather than driven from the CLI, and testing runs against simulated topologies before production rollout. The discipline reduces the human-error category of network outages (consistently one of the dominant causes in failure-mode analyses) and provides the audit trail that GRC requires. Major NetDevOps tools include Ansible (Red Hat), Cisco NSO, Nautobot (a NetBox fork), Terraform (HashiCorp) for infrastructure, and Batfish for configuration validation. The discipline overlaps with broader infrastructure-as-code and GitOps practice in the cloud-native ecosystem; a sketch of a pre-merge check appears below.
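A minimal sketch of the kind of pre-merge check such a pipeline might run against a YAML source of truth follows. The device schema and the two rules (duplicate point-to-point IPs, inconsistent fabric MTU) are illustrative assumptions, not the behavior of any tool named above; real pipelines typically add Batfish or lab validation on top.

```python
"""Sketch of a pre-merge sanity check against a YAML source of truth.

Requires PyYAML. The schema and the two rules shown are assumptions made
for this example.
"""
import sys
from collections import Counter

import yaml


def validate(source_of_truth: dict) -> list[str]:
    """Run two illustrative rules over the parsed source of truth."""
    errors: list[str] = []
    ip_counts = Counter()
    mtus = set()
    for device in source_of_truth["devices"].values():
        for iface in device.get("interfaces", {}).values():
            if "ip" in iface:
                ip_counts[iface["ip"]] += 1
            if "mtu" in iface:
                mtus.add(iface["mtu"])
    errors += [f"duplicate IP {ip}" for ip, n in ip_counts.items() if n > 1]
    if len(mtus) > 1:
        errors.append(f"inconsistent fabric MTU values: {sorted(mtus)}")
    return errors


if __name__ == "__main__":
    with open(sys.argv[1]) as fh:
        problems = validate(yaml.safe_load(fh))
    for p in problems:
        print(f"FAIL: {p}")
    sys.exit(1 if problems else 0)
```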
BGP hygiene and security
BGP route security has become an operational concern as route hijacks and route leaks have caused major internet incidents. The standard mitigations are RPKI (Resource Public Key Infrastructure) and ROV (Route Origin Validation): cryptographically signed ROAs attest that an AS is authorized to originate a prefix, and route origin validation drops announcements whose origin is not authorized by a covering ROA. Major content providers and carriers have deployed RPKI invalid-drop policies; data center operators with BGP peering to carriers and exchanges deploy similar policies. BGPsec extends the cryptographic protection to the AS path itself but has limited deployment due to performance concerns. The discipline lives in network operations but cross-references Security and Cybersecurity for the broader internet routing security context.
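The origin-validation logic itself is small. The sketch below classifies an announcement as valid, invalid, or not-found against a hypothetical set of validated ROA payloads (VRPs), following the RFC 6811 procedure; the prefixes and ASNs are documentation examples.

```python
"""Sketch of RFC 6811 route-origin validation against a set of VRPs.

The VRP list stands in for an export from an RPKI validator; the
classification (valid / invalid / not-found) follows RFC 6811.
"""
from ipaddress import ip_network

# (prefix, max_length, origin_asn) validated ROA payloads (illustrative values)
VRPS = [
    (ip_network("192.0.2.0/24"), 24, 64500),
    (ip_network("198.51.100.0/22"), 24, 64501),
]


def rov_state(prefix: str, origin_asn: int) -> str:
    """Classify a BGP announcement as 'valid', 'invalid', or 'not-found'."""
    ann = ip_network(prefix)
    covering = [v for v in VRPS if ann.subnet_of(v[0])]
    if not covering:
        return "not-found"          # no ROA covers this prefix
    for _net, max_len, asn in covering:
        if ann.prefixlen <= max_len and asn == origin_asn:
            return "valid"          # authorized origin, within maxLength
    return "invalid"                # covered, but wrong origin or too specific


if __name__ == "__main__":
    print(rov_state("192.0.2.0/24", 64500))     # valid
    print(rov_state("198.51.100.0/25", 64501))  # invalid: exceeds maxLength
    print(rov_state("203.0.113.0/24", 64502))   # not-found
```

An invalid-drop policy then simply filters announcements classified as invalid while still accepting not-found routes.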
Telemetry and observability
Network telemetry has shifted from SNMP polling (every few minutes, via the management plane) to streaming telemetry (sub-second push from the data plane) over the past decade. gNMI and OpenConfig provide standardized streaming telemetry; vendor-specific implementations remain common. sFlow, NetFlow, and IPFIX continue to provide flow-level traffic analysis. The aggregate telemetry from a hyperscale fabric (tens of thousands of switches generating sub-second telemetry) is its own data engineering challenge - hyperscalers run dedicated time-series and analytics platforms (Microsoft's Athena, Meta's Scuba, Google's internal monitoring) for network telemetry that's separate from their general observability platforms. Major commercial network observability platforms include Kentik, ThousandEyes (Cisco), Arista CloudVision, NetScout, and Cisco AppDynamics with network insights.
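Whatever the transport, the first derived metric is usually link utilization computed from octet-counter deltas. The sketch below shows that arithmetic, including 64-bit counter wrap; the counter names and sample values are assumptions chosen for the example.

```python
"""Sketch: turn raw interface octet counters (from SNMP polling or streaming
telemetry) into link-utilization percentages, handling 64-bit counter wrap.
"""
COUNTER_MAX = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, link_bps: float) -> float:
    """Percent utilization of a link over one polling/streaming interval."""
    delta = curr_octets - prev_octets
    if delta < 0:              # counter wrapped between samples
        delta += COUNTER_MAX
    bits = delta * 8
    return 100.0 * bits / (link_bps * interval_s)


if __name__ == "__main__":
    # 400G link, 30-second interval, 1.2 TB transferred in the interval -> 80%.
    print(f"{utilization_pct(0, 1_200_000_000_000, 30, 400e9):.1f}%")
```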
DDoS mitigation
DDoS mitigation operates at the WAN edge to detect and absorb volumetric and application-layer attacks before they reach customer infrastructure. The discipline includes upstream scrubbing (carrier-level mitigation), in-line appliances (Arbor Sightline / TMS, Radware DefensePro, F5 Silverline), cloud-based mitigation services (Cloudflare Magic Transit, AWS Shield, Akamai, Imperva), and the operational playbooks for activating mitigation during attacks. Modern attacks regularly exceed 1 Tbps; some have crossed 5 Tbps. The mitigation infrastructure has scaled accordingly. The discipline overlaps with Security for the cybersecurity attack response and with WAN Ingress for the physical infrastructure where mitigation happens.
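Detection precedes mitigation, and it often starts with a baseline over sampled flow data. The sketch below uses a simple EWMA baseline and a fixed multiplier, both arbitrary assumptions, to flag a volumetric spike toward one destination; production detection is far richer than this.

```python
"""Sketch: flag a volumetric anomaly from sampled flow records using an EWMA
baseline. Sampling rate, alpha, and the 3x threshold are arbitrary assumptions.
"""


class EwmaDetector:
    def __init__(self, alpha: float = 0.1, threshold: float = 3.0):
        self.alpha = alpha           # smoothing factor for the baseline
        self.threshold = threshold   # alert when rate exceeds threshold * baseline
        self.baseline_bps: float | None = None

    def observe(self, sampled_bytes: int, sampling_rate: int, interval_s: float) -> bool:
        """Return True if this interval's estimated rate looks anomalous."""
        est_bps = sampled_bytes * sampling_rate * 8 / interval_s
        if self.baseline_bps is None:
            self.baseline_bps = est_bps
            return False
        anomalous = est_bps > self.threshold * max(self.baseline_bps, 1.0)
        if not anomalous:   # only fold normal intervals into the baseline
            self.baseline_bps = (1 - self.alpha) * self.baseline_bps + self.alpha * est_bps
        return anomalous


if __name__ == "__main__":
    det = EwmaDetector()
    for _ in range(20):                       # ~16 Gbps of normal traffic
        det.observe(5_000_000, 4096, 10)
    print(det.observe(60_000_000, 4096, 10))  # ~12x the baseline -> True
```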
Network operations vendor landscape
| Category | Vendor examples |
|---|---|
| Switch and router vendors | Cisco, Arista, Juniper, Nokia, NVIDIA (ex-Mellanox), HPE (ex-Aruba) |
| ODM hardware (white-box) | Edgecore, Ufispace, Wistron, Inventec, Foxconn (for hyperscalers) |
| Open NOS | SONiC (Linux Foundation), DENT, OpenSwitch |
| Network observability | Kentik, ThousandEyes (Cisco), Arista CloudVision, NetScout, Datadog |
| DDoS mitigation | Cloudflare, Arbor (NetScout), Akamai, Radware, AWS Shield |
| NetDevOps and automation | Ansible, Cisco NSO, Nautobot, Terraform, Batfish |
| Source-of-truth and IPAM | NetBox / Nautobot, Infoblox, BlueCat, EfficientIP |
| Hyperscaler internal | Google, Meta, AWS, Microsoft custom platforms |
Where this fits
Network operations runs the fabric that Stack:Networking and Fabrics architects and that WAN & Comms Ingress physically delivers at the edge. Network telemetry feeds Telemetry & Observability and AIOps. DDoS mitigation and BGP security cross-reference Security and Cybersecurity. Hardware lifecycle for switches and optical equipment cross-references Hardware Fleet Management. Configuration management and audit trail flow to GRC:Auditability.
Related coverage
Compute Ops | Hardware Fleet Management | WAN & Comms Ingress | Stack: Networking & Fabrics | Telemetry & Observability | AIOps | Security | Cybersecurity | AI Training Superclusters