DC Network Operations
Network operations is the discipline that runs the data center network fabric - the switches and routers, the protocols, the configurations, the traffic, and the relationships with carriers and exchanges. The discipline is operationally distinct from Stack:Networking and Fabrics, which covers the architectural design layer (fabric topology, protocol choices, hardware selection), and from WAN & Comms Ingress, which covers the physical infrastructure where external connectivity enters the facility. Stack:Networking decides what fabric to build; WAN Ingress builds the physical entry point; Network Operations runs the fabric and the WAN edge day-to-day.
Operational scope
| Domain | What it covers |
|---|---|
| Fabric operation | Spine, leaf, and ToR switch operation; configuration management; firmware lifecycle |
| Routing protocol management | BGP peering, IGP (OSPF, IS-IS), route policy, BGP communities |
| Traffic engineering | Path optimization, ECMP balancing, segment routing, traffic shaping |
| Capacity management | Link utilization tracking, congestion identification, growth planning |
| Configuration management | Source-of-truth network config; change validation; rollback |
| Network telemetry | SNMP, streaming telemetry, sFlow/NetFlow/IPFIX collection and analysis |
| Incident response | Network outage triage, fault isolation, restoration, post-incident analysis |
| Carrier and peering | Transit relationships, peering with exchanges and bilateral peers, capacity orders |
| Security operations | DDoS mitigation, BGP hygiene (RPKI, ROV), east-west traffic inspection |
Data center fabric architecture
Modern data center fabrics use a spine-leaf (Clos) topology with all-IP routing rather than the legacy three-tier (core-aggregation-access) Ethernet design. Routing on the host using BGP is increasingly standard at hyperscale: every server runs a BGP speaker that peers with its leaf switch, eliminating spanning tree and providing fast convergence on link failures. Hyperscaler fabrics typically run flat IP routing across thousands of switches with BGP as the primary protocol; some operators use IS-IS as the IGP and reserve BGP for inter-region routing. EVPN-VXLAN provides multi-tenancy on a shared underlay, enabling cloud-style network virtualization on the physical fabric. The architecture is mature; the operational discipline is what distinguishes well-run fabrics from problematic ones.
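To make the BGP-to-the-host pattern concrete, the sketch below derives per-leaf eBGP underlay stanzas for a small pod. The private-ASN scheme, the 100.64.0.0/24 point-to-point pool, and the FRR-style output are assumptions made for this example, not a prescribed design.

```python
"""Sketch: derive an eBGP underlay peering plan for a small leaf-spine pod.

The ASN scheme, address pool, and FRR-style output are illustrative
assumptions for this example, not a specific operator's design.
"""
from ipaddress import ip_network

SPINE_ASN = 4200000000          # shared private ASN for the spine tier (assumption)
LEAF_ASN_BASE = 4200000100      # one private ASN per leaf (assumption)
P2P_POOL = ip_network("100.64.0.0/24")  # pool carved into /31 point-to-point links


def leaf_bgp_config(num_spines: int, num_leaves: int) -> dict[str, str]:
    """Return an FRR-style 'router bgp' stanza for each leaf switch."""
    links = P2P_POOL.subnets(new_prefix=31)  # one /31 per leaf-spine link
    configs = {}
    for leaf in range(num_leaves):
        asn = LEAF_ASN_BASE + leaf
        lines = [f"router bgp {asn}", " bgp bestpath as-path multipath-relax"]
        for spine in range(num_spines):
            link = next(links)
            spine_ip, leaf_ip = tuple(link)  # spine gets the low address, leaf the high
            lines.append(f" ! link {link}: spine {spine_ip}, leaf {leaf_ip}")
            lines.append(f" neighbor {spine_ip} remote-as {SPINE_ASN}")
        configs[f"leaf{leaf}"] = "\n".join(lines)
    return configs


if __name__ == "__main__":
    print(leaf_bgp_config(num_spines=4, num_leaves=2)["leaf0"])
```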
AI fabric operations
AI training fabrics have requirements that traditional data center networking did not address. Per-GPU bandwidth (typical training clusters provision 400G or 800G per H100, with 1.6T per Rubin GPU) drives fabric scale beyond what general cloud workloads need. Synchronized collective operations (all-reduce across thousands of GPUs) require carefully tuned ECMP behavior to avoid hash collisions that strand bandwidth; a toy illustration follows the table below. The choice of InfiniBand (mature at HPC scale, vendor-locked to NVIDIA Quantum/Mellanox) versus Spectrum-X Ethernet (newer, standards-based, used at xAI Colossus) versus traditional Ethernet (cheaper but lower training efficiency) is one of the primary architectural decisions in AI cluster design.
| Fabric type | Where used | Operational concern |
|---|---|---|
| InfiniBand (Quantum-2 400G, Quantum-X800 800G) | Most major H100 superclusters; OpenAI Stargate; Microsoft Azure AI; Meta RSC | NVIDIA-specific tooling; SHARP for in-network reduction; UFM for fabric management |
| Spectrum-X Ethernet | xAI Colossus 100K-200K H100 fabric; growing deployments | Standards-based but NVIDIA-tuned; achieves 95% data throughput vs ~60% for standard Ethernet |
| NVLink Switch fabric (NVL72) | GB200 NVL72 reference designs; Rubin reference designs | Memory-coherent across 72 GPUs at rack scale; new operational paradigm |
| Traditional Ethernet RoCE | Cost-optimized AI deployments; some hyperscaler internal training | RDMA over converged Ethernet; PFC and ECN tuning critical for performance |
| Google ICI | Google TPU pods (all generations) | Google-internal proprietary; not commercially available |
| AWS EFA | AWS Trainium clusters; some H100 capacity on AWS | AWS-internal RDMA-equivalent; pairs with Trainium2 for Anthropic Project Rainier |
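The hash-collision problem called out above is easy to demonstrate with a toy simulation: a handful of long-lived elephant flows hashed onto equal-cost uplinks routinely leaves some links idle while others carry several flows. The flow and link counts below are arbitrary assumptions.

```python
"""Toy illustration of ECMP hash collisions stranding bandwidth.

A few long-lived elephant flows (as in an all-reduce) hashed onto a small
set of uplinks can leave some links idle while others carry two or more
flows. Flow counts and link counts are arbitrary assumptions.
"""
import random
from collections import Counter


def ecmp_link_load(num_flows: int, num_links: int, seed: int = 0) -> Counter:
    """Hash each flow's 5-tuple onto one of `num_links` equal-cost uplinks."""
    rng = random.Random(seed)
    loads = Counter()
    for _ in range(num_flows):
        # Stand-in for a 5-tuple: random src/dst IPs and ports, TCP protocol.
        five_tuple = (rng.getrandbits(32), rng.getrandbits(32),
                      rng.getrandbits(16), rng.getrandbits(16), 6)
        loads[hash(five_tuple) % num_links] += 1
    return loads


if __name__ == "__main__":
    links = 8
    loads = ecmp_link_load(num_flows=8, num_links=links)
    collided = sum(1 for v in loads.values() if v > 1)
    print(f"links carrying 2+ flows: {collided}, idle links: {links - len(loads)}")
```

With eight flows over eight links, roughly a third of the links sit idle on average, which is one reason AI fabrics lean toward adaptive routing or packet-level spraying rather than plain 5-tuple ECMP.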
Switch silicon and operating systems
| Class | Examples | Notes |
|---|---|---|
| Merchant silicon | Broadcom Tomahawk, Trident, Jericho; NVIDIA Spectrum (ex-Mellanox); Cisco Silicon One | Dominates spine and high-radix leaf; enables disaggregated NOS approach |
| Vendor-integrated silicon | Cisco proprietary ASICs, Juniper Trio/Express | Some vendors maintain proprietary silicon; declining outside specific use cases |
| NOS (vendor) | Cisco IOS-XR/NX-OS, Juniper Junos, Arista EOS, Nokia SR Linux | Mature operational tooling; vendor support and certification |
| NOS (open) | SONiC (Microsoft-originated, Linux Foundation), DENT, Cumulus (NVIDIA) | Disaggregated NOS on merchant silicon; hyperscaler-driven adoption |
| P4-programmable | Intel Tofino (legacy; Intel exited), AMD Pensando, NVIDIA BlueField DPUs | Specialized; mostly hyperscaler internal use; programmable data planes |
NetDevOps and configuration management
NetDevOps applies software engineering practice to network operations: configuration is held in version-controlled repositories, changes go through pull-request review, deployment is automated rather than driven from the CLI, and testing runs against simulated topologies before production rollout. The discipline reduces the human-error category of network outages (consistently one of the dominant causes in failure-mode analyses) and provides the audit trail that GRC requires. Major NetDevOps tools include Ansible (Red Hat), Cisco NSO, Nautobot (a NetBox fork), Terraform (HashiCorp) for infrastructure, and Batfish for configuration validation. The discipline overlaps with broader infrastructure-as-code and GitOps practice in the cloud-native ecosystem; a sketch of a pre-merge check appears below.
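A minimal sketch of the kind of pre-merge check such a pipeline might run against a YAML source of truth follows. The device schema and the two rules (duplicate point-to-point IPs, inconsistent fabric MTU) are illustrative assumptions, not the behavior of any tool named above; real pipelines typically add Batfish or lab validation on top.

```python
"""Sketch of a pre-merge sanity check against a YAML source of truth.

Requires PyYAML. The schema and the two rules shown are assumptions made
for this example.
"""
import sys
from collections import Counter

import yaml


def validate(source_of_truth: dict) -> list[str]:
    """Run two illustrative rules over the parsed source of truth."""
    errors: list[str] = []
    ip_counts = Counter()
    mtus = set()
    for device in source_of_truth["devices"].values():
        for iface in device.get("interfaces", {}).values():
            if "ip" in iface:
                ip_counts[iface["ip"]] += 1
            if "mtu" in iface:
                mtus.add(iface["mtu"])
    errors += [f"duplicate IP {ip}" for ip, n in ip_counts.items() if n > 1]
    if len(mtus) > 1:
        errors.append(f"inconsistent fabric MTU values: {sorted(mtus)}")
    return errors


if __name__ == "__main__":
    with open(sys.argv[1]) as fh:
        problems = validate(yaml.safe_load(fh))
    for p in problems:
        print(f"FAIL: {p}")
    sys.exit(1 if problems else 0)
```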
BGP hygiene and security
BGP route security has become an operational concern as route hijacks and route leaks have caused major internet incidents. The standard mitigations are RPKI (Resource Public Key Infrastructure) and ROV (Route Origin Validation): cryptographically signed ROAs attest that an AS is authorized to originate a prefix, and route origin validation drops announcements whose origin is not authorized by a covering ROA. Major content providers and carriers have deployed RPKI invalid-drop policies; data center operators with BGP peering to carriers and exchanges deploy similar policies. BGPsec extends the cryptographic protection to the AS path itself but has limited deployment due to performance concerns. The discipline lives in network operations but cross-references Security and Cybersecurity for the broader internet routing security context.
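The origin-validation logic itself is small. The sketch below classifies an announcement as valid, invalid, or not-found against a hypothetical set of validated ROA payloads (VRPs), following the RFC 6811 procedure; the prefixes and ASNs are documentation examples.

```python
"""Sketch of RFC 6811 route-origin validation against a set of VRPs.

The VRP list stands in for an export from an RPKI validator; the
classification (valid / invalid / not-found) follows RFC 6811.
"""
from ipaddress import ip_network

# (prefix, max_length, origin_asn) validated ROA payloads (illustrative values)
VRPS = [
    (ip_network("192.0.2.0/24"), 24, 64500),
    (ip_network("198.51.100.0/22"), 24, 64501),
]


def rov_state(prefix: str, origin_asn: int) -> str:
    """Classify a BGP announcement as 'valid', 'invalid', or 'not-found'."""
    ann = ip_network(prefix)
    covering = [v for v in VRPS if ann.subnet_of(v[0])]
    if not covering:
        return "not-found"          # no ROA covers this prefix
    for _net, max_len, asn in covering:
        if ann.prefixlen <= max_len and asn == origin_asn:
            return "valid"          # authorized origin, within maxLength
    return "invalid"                # covered, but wrong origin or too specific


if __name__ == "__main__":
    print(rov_state("192.0.2.0/24", 64500))     # valid
    print(rov_state("198.51.100.0/25", 64501))  # invalid: exceeds maxLength
    print(rov_state("203.0.113.0/24", 64502))   # not-found
```

An invalid-drop policy then simply filters announcements classified as invalid while still accepting not-found routes.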
Telemetry and observability
Network telemetry has shifted from SNMP polling (every few minutes, via the management plane) to streaming telemetry (sub-second push from the data plane) over the past decade. gNMI and OpenConfig provide standardized streaming telemetry; vendor-specific implementations remain common. sFlow, NetFlow, and IPFIX continue to provide flow-level traffic analysis. The aggregate telemetry from a hyperscale fabric (tens of thousands of switches generating sub-second telemetry) is its own data engineering challenge - hyperscalers run dedicated time-series and analytics platforms (Microsoft's Athena, Meta's Scuba, Google's internal monitoring) for network telemetry that's separate from their general observability platforms. Major commercial network observability platforms include Kentik, ThousandEyes (Cisco), Arista CloudVision, NetScout, and Cisco AppDynamics with network insights.
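Whatever the transport, the first derived metric is usually link utilization computed from octet-counter deltas. The sketch below shows that arithmetic, including 64-bit counter wrap; the counter names and sample values are assumptions chosen for the example.

```python
"""Sketch: turn raw interface octet counters (from SNMP polling or streaming
telemetry) into link-utilization percentages, handling 64-bit counter wrap.
"""
COUNTER_MAX = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, link_bps: float) -> float:
    """Percent utilization of a link over one polling/streaming interval."""
    delta = curr_octets - prev_octets
    if delta < 0:              # counter wrapped between samples
        delta += COUNTER_MAX
    bits = delta * 8
    return 100.0 * bits / (link_bps * interval_s)


if __name__ == "__main__":
    # 400G link, 30-second interval, 1.2 TB transferred in the interval -> 80%.
    print(f"{utilization_pct(0, 1_200_000_000_000, 30, 400e9):.1f}%")
```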
DDoS mitigation
DDoS mitigation operates at the WAN edge to detect and absorb volumetric and application-layer attacks before they reach customer infrastructure. The discipline includes upstream scrubbing (carrier-level mitigation), in-line appliances (Arbor Sightline / TMS, Radware DefensePro, F5 Silverline), cloud-based mitigation services (Cloudflare Magic Transit, AWS Shield, Akamai, Imperva), and the operational playbooks for activating mitigation during attacks. Modern attacks regularly exceed 1 Tbps; some have crossed 5 Tbps. The mitigation infrastructure has scaled accordingly. The discipline overlaps with Security for the cybersecurity attack response and with WAN Ingress for the physical infrastructure where mitigation happens.
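Detection precedes mitigation, and it often starts with a baseline over sampled flow data. The sketch below uses a simple EWMA baseline and a fixed multiplier, both arbitrary assumptions, to flag a volumetric spike toward one destination; production detection is far richer than this.

```python
"""Sketch: flag a volumetric anomaly from sampled flow records using an EWMA
baseline. Sampling rate, alpha, and the 3x threshold are arbitrary assumptions.
"""


class EwmaDetector:
    def __init__(self, alpha: float = 0.1, threshold: float = 3.0):
        self.alpha = alpha           # smoothing factor for the baseline
        self.threshold = threshold   # alert when rate exceeds threshold * baseline
        self.baseline_bps: float | None = None

    def observe(self, sampled_bytes: int, sampling_rate: int, interval_s: float) -> bool:
        """Return True if this interval's estimated rate looks anomalous."""
        est_bps = sampled_bytes * sampling_rate * 8 / interval_s
        if self.baseline_bps is None:
            self.baseline_bps = est_bps
            return False
        anomalous = est_bps > self.threshold * max(self.baseline_bps, 1.0)
        if not anomalous:   # only fold normal intervals into the baseline
            self.baseline_bps = (1 - self.alpha) * self.baseline_bps + self.alpha * est_bps
        return anomalous


if __name__ == "__main__":
    det = EwmaDetector()
    for _ in range(20):                       # ~16 Gbps of normal traffic
        det.observe(5_000_000, 4096, 10)
    print(det.observe(60_000_000, 4096, 10))  # ~12x the baseline -> True
```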
Network operations vendor landscape
| Category | Vendor examples |
|---|---|
| Switch and router vendors | Cisco, Arista, Juniper, Nokia, NVIDIA (ex-Mellanox), HPE (ex-Aruba) |
| ODM hardware (white-box) | Edgecore, Ufispace, Wistron, Inventec, Foxconn (for hyperscalers) |
| Open NOS | SONiC (Linux Foundation), DENT, OpenSwitch |
| Network observability | Kentik, ThousandEyes (Cisco), Arista CloudVision, NetScout, Datadog |
| DDoS mitigation | Cloudflare, Arbor (NetScout), Akamai, Radware, AWS Shield |
| NetDevOps and automation | Ansible, Cisco NSO, Nautobot, Terraform, Batfish |
| Source-of-truth and IPAM | NetBox / Nautobot, Infoblox, BlueCat, EfficientIP |
| Hyperscaler internal | Google, Meta, AWS, Microsoft custom platforms |
Where this fits
Network operations runs the fabric that Stack:Networking and Fabrics architects and that WAN & Comms Ingress physically delivers at the edge. Network telemetry feeds Telemetry & Observability and AIOps. DDoS mitigation and BGP security cross-reference Security and Cybersecurity. Hardware lifecycle for switches and optical equipment cross-references Hardware Fleet Management. Configuration management and audit trail flow to GRC:Auditability.
Related coverage
Compute Ops | Hardware Fleet Management | WAN & Comms Ingress | Stack: Networking & Fabrics | Telemetry & Observability | AIOps | Security | Cybersecurity | AI Training Superclusters