Highly Customized Optical Networking Critical for Google’s Tensor Processing Units (TPUs)

Google’s system leverages optical circuit switching (OCS) to create direct, low-latency optical paths between TPU chips, avoiding repeated optical-electrical-optical conversions and the signal losses they introduce. This enables efficient data sharing across thousands of chips in a single pod, with signals remaining in the optical domain for most inter-chip communication.

Ironwood, Google’s seventh-generation TPU, is purpose-built for the most demanding workloads: from large-scale model training and complex reinforcement learning (RL) to high-volume, low-latency AI inference and model serving. It offers a 10x peak performance improvement over TPU v5p and more than 4x better performance per chip for both training and inference compared to TPU v6e (Trillium), making Ironwood Google’s most powerful and energy-efficient custom silicon to date.

Google also announced new Arm®-based Axion instances: N4A, the most cost-effective N-series virtual machine to date, offers up to 2x better price-performance than comparable current-generation x86-based VMs, and C4A metal is Google’s first Arm-based bare-metal instance.

Key Components of TPU Optical Networking

Inter-Chip Interconnect (ICI): The foundational high-speed network within and between TPUs. ICI uses a 3D torus topology (4x4x4 cubes of 64 TPUs) for low-diameter connections, supporting bidirectional bandwidth up to 1.2 TBps per chip in recent generations. Intra-cube links use direct-attach copper (DAC) cables for short distances, while inter-cube and pod-scale links transition to optical transceivers (1.5 optical transceivers per TPU).
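As a rough illustration (not Google’s actual routing code), a short Python sketch shows why a 4x4x4 torus is "low-diameter": with wraparound links on each axis, no chip in a cube is more than 6 hops from any other.

```python
# Sketch: neighbors and worst-case hop count in a 4x4x4 3D torus,
# the per-cube ICI topology described above.

DIM = 4  # chips per axis in one cube (4x4x4 = 64 chips)

def neighbors(x, y, z, dim=DIM):
    """Each chip links to 6 neighbors; edges wrap around (torus links)."""
    return [
        ((x + 1) % dim, y, z), ((x - 1) % dim, y, z),
        (x, (y + 1) % dim, z), (x, (y - 1) % dim, z),
        (x, y, (z + 1) % dim), (x, y, (z - 1) % dim),
    ]

def hop_distance(a, b, dim=DIM):
    """Minimal hops between two chips: shortest wrap distance per axis."""
    return sum(min(abs(p - q), dim - abs(p - q)) for p, q in zip(a, b))

# Worst case is 2 hops per axis, so the cube's diameter is 6.
diameter = max(hop_distance((0, 0, 0), (x, y, z))
               for x in range(DIM) for y in range(DIM) for z in range(DIM))
print(diameter)  # -> 6
```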

Optical Circuit Switches (OCS): Custom switches based on micro-electro-mechanical systems (MEMS), using 2D mirror arrays, lenses, and cameras for beam steering. These dynamically reconfigure topologies (e.g., a twisted 3D torus) without electrical packet switches, reducing overhead, power (40% less), and cost (30% less). A single OCS handles 144×144 ports, enabling fault-tolerant routing around failures.
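A minimal sketch of the idea, assuming a simplified one-to-one port-mapping model rather than Google’s actual control software: an OCS holds no packet state, so "rerouting" is just rewriting which input port’s beam lands on which output port.

```python
# Hypothetical model (not Google's implementation): an OCS as a
# one-to-one input->output port mapping that software rewrites to
# steer circuits around a failed link -- pure mirror reassignment.

class OpticalCircuitSwitch:
    def __init__(self, ports=144):
        self.ports = ports
        self.circuits = {}  # input port -> output port (one-to-one)

    def connect(self, in_port, out_port):
        if out_port in self.circuits.values():
            raise ValueError(f"output port {out_port} already in use")
        self.circuits[in_port] = out_port

    def fail_port(self, out_port, spare_out):
        """Re-steer any circuit using a failed output port to a spare."""
        for in_port, out in list(self.circuits.items()):
            if out == out_port:
                self.circuits[in_port] = spare_out

ocs = OpticalCircuitSwitch()
ocs.connect(0, 10)
ocs.fail_port(10, spare_out=143)  # route around the failure
print(ocs.circuits)  # -> {0: 143}
```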

Wavelength Division Multiplexing (WDM) Transceivers: Integrated optical circulators allow full-duplex communication over single fiber strands, cutting fiber needs by 50%. This supports software-defined networking (SDN) for flexible topologies, improving all-to-all collective throughput for distributed AI workloads.

Scale and Resiliency: Pods connect via Google’s Jupiter data center network (multi-petabit-per-second), scaling to hundreds of thousands of chips. ICI resiliency routes around OCS/optical faults, boosting availability (50x less downtime) with only temporary performance trade-offs.
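The 50% fiber reduction from circulators is simple arithmetic; a quick sketch (the link count here is hypothetical):

```python
# Back-of-envelope check of the 50% fiber saving claimed above:
# a circulator lets one fiber carry both directions of a full-duplex link.

links = 1000  # hypothetical number of inter-chip optical links

fibers_without_circulators = links * 2  # separate tx and rx strands
fibers_with_circulators = links * 1     # single bidirectional strand

savings = 1 - fibers_with_circulators / fibers_without_circulators
print(f"{savings:.0%}")  # -> 50%
```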

This setup powers Google’s AI Hypercomputer, integrating compute, networking, and storage for workloads like Gemini model training.

Networking costs are less than 5% of total TPU pod CapEx and less than 3% of power draw.

TPU v7 (Ironwood) leverages better networking for superior scaling.

It raises ICI bandwidth to 1.2 TBps bidirectional per chip (1.5x over TPU v6e/Trillium), allowing synchronous communication across massive clusters with minimal latency.

It supports pods of up to 9,216 chips (144 4x4x4 cubes, requiring 48 OCS units and 13,824 optical ports); a full pod delivers 42.5 exaFLOPS of FP8 compute, which Google claims is over 24x the world’s largest supercomputer. This dwarfs prior generations (TPU v4’s 4,096 chips) and enables multislice configurations of 100,000+ chips via data-center networks.
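These pod-scale figures are internally consistent, assuming each 144×144 OCS exposes 288 physical ports (144 in + 144 out):

```python
# Verifying the pod-sizing arithmetic quoted above.

chips_per_cube = 4 * 4 * 4           # 64 chips per cube
cubes = 144
chips = cubes * chips_per_cube       # 9,216 chips per pod
transceivers = int(chips * 1.5)      # 1.5 optical transceivers per TPU
ocs_units = 48
ocs_ports = ocs_units * (144 + 144)  # assumed 288 ports per OCS

print(chips, transceivers, ocs_ports)  # -> 9216 13824 13824
```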

Enhanced OCS and the 3D torus allow dynamic re-slicing for diverse workloads (mixing data, tensor, and pipeline parallelism). Optical density is optimized (1.5 transceivers per TPU), with liquid cooling for ~10 MW pods. This yields 10x peak performance over TPU v5p, 4x per-chip performance vs. TPU v6e, and 2x performance-per-watt efficiency.
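A toy sketch of what "dynamic re-slicing" means in practice (hypothetical scheduler logic, not Google’s): the same 144-cube pod can be carved into differently sized whole-cube slices per job, since the OCS layer rewires cube-to-cube links in software.

```python
# Hypothetical re-slicing sketch: assign whole cubes to job requests.

def slice_pod(total_cubes, requests):
    """Greedily grant whole-cube slices; returns allocations and leftovers."""
    allocations, free = {}, total_cubes
    for job, cubes_needed in requests.items():
        if cubes_needed <= free:
            allocations[job] = cubes_needed
            free -= cubes_needed
    return allocations, free

# Example: one pod split across training, RL, and serving slices.
allocations, free = slice_pod(144, {"train": 96, "rl": 32, "serve": 16})
print(allocations, free)  # -> {'train': 96, 'rl': 32, 'serve': 16} 0
```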

These networking gains reduce data-movement overhead for inference-heavy tasks (Midjourney reports 65% cost savings). Anthropic’s deal for up to 1 million TPUs highlights real-world scale for Claude models, with roughly 50% cheaper effective FLOPs vs. Nvidia equivalents thanks to the integrated stack.

Ironwood’s networking makes TPUs ideal for hyperscale AI, where GPUs struggle with optical transceiver costs and topology limits.

TPUs are a key component of the Google AI Hypercomputer, the integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency. At the macro level, according to a recent IDC report, AI Hypercomputer customers achieved on average 353% three-year ROI, 28% lower IT costs, and 55% more efficient IT teams.