Tristan Manchester • 12th January 2026

From finger to photon and back: the complete journey of a keystroke through an LLM

What actually happens when you press a key and an AI responds?

I mean physics-wise. The version where we trace individual electrons through transistor channels, watch photons bounce through 3,000 kilometres of glass fibre, and count the floating-point operations that separate your question from the model's answer.

I decided to go deeper than the usual "your request goes to a server" abstraction. The result is a journey spanning roughly 15 orders of magnitude in time, from nanosecond transistor switches to month-long training runs, involving so many electron-to-photon conversions that I lost count somewhere around a dozen.

Strap in. The number of coordinated state changes is astronomical. (If you want a back-of-envelope: ~10¹¹ transistors toggling at GHz rates for a second of GPU time already gives you 10²⁰-10²¹ switching events, before you count memory refreshes, network hops, or the billions of operations per token. And that's just inference: training the model was 10²³-10²⁵ FLOPs.)

Part I: The mechanical-to-electrical boundary

Your finger closes a switch

Your finger descends over approximately 10 milliseconds, an eternity in computer time, but constrained by the biomechanics of muscle contraction and the travel distance of a typical key (around 1-2mm for a laptop keyboard, 4mm for a mechanical switch).

During this press, you're closing a circuit in a keyboard matrix. Most keyboards don't give each key its own dedicated wire. That would require 100+ individual connections. Instead, keys sit at intersections of a row-column matrix. The keyboard controller rapidly scans columns (driving them low one at a time) while reading rows to detect which intersections are closed.

This scan happens at the keyboard's polling rate: 125 Hz for basic USB keyboards, 1,000 Hz for gaming peripherals, sometimes 8,000 Hz for the truly obsessive. At 1,000 Hz, the controller completes a full matrix scan every millisecond.
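Here's roughly what that scan loop looks like, sketched in Python rather than the C it would really be written in on a microcontroller; the GPIO access is faked so the logic can run standalone, and the matrix dimensions are illustrative:

```python
# Sketch of a keyboard matrix scan (illustrative Python; real firmware is C on a
# microcontroller). The GPIO helpers are faked so the logic runs standalone.

NUM_COLS, NUM_ROWS = 16, 8                     # a 16 x 8 matrix covers ~128 key positions
HELD = {(2, 5)}                                # pretend the switch at row 2, column 5 is closed

_active_col = 0

def drive_column_low(col):                     # stand-in for writing a GPIO register
    global _active_col
    _active_col = col

def read_rows():                               # rows idle high (pulled up); a pressed key reads 0
    bits = 0xFF
    for row in range(NUM_ROWS):
        if (row, _active_col) in HELD:
            bits &= ~(1 << row)
    return bits

def scan_matrix():
    """One full scan: return the set of (row, col) intersections currently closed."""
    pressed = set()
    for col in range(NUM_COLS):
        drive_column_low(col)
        row_bits = read_rows()
        for row in range(NUM_ROWS):
            if not (row_bits >> row) & 1:
                pressed.add((row, col))
    return pressed

print(scan_matrix())                           # {(2, 5)}; at 1 kHz this loop runs every millisecond
```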

When your key closes the circuit, roughly 10¹¹ to 10¹² electrons flow through the pull-up resistors and scanning circuitry during the scan window. A typical pull-up might draw a few microamps when pulled low, integrated over the milliseconds of matrix scanning. The exact count depends heavily on the keyboard's internal resistor values and how long the row line is driven, but the order of magnitude is enough current to reliably distinguish "pressed" from "floating" while staying low enough to run on USB bus power.

On cheaper keyboards, pressing three specific keys simultaneously can cause "ghosting", where the controller thinks a fourth key is pressed because current flows backward through the grid. High-end mechanical keyboards solder a diode to every single switch (100+ diodes total) to act as a one-way valve, ensuring electricity flows only in the intended direction. It's a simple fix for an annoying physics problem.

Hold up: where does keyboard debouncing fit in?

Mechanical switches don't close cleanly. The metal contacts bounce against each other for 1-5 milliseconds, producing a noisy signal that could register as dozens of keypresses. The keyboard firmware handles this through debouncing: either waiting until the signal has been stable for several scan cycles before reporting it (adding a millisecond or two of latency), or reporting the first edge immediately and then ignoring further transitions for a short lockout window.
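Here's what the "wait for it to stabilize" flavour looks like as a minimal sketch: a counter that only accepts a new state after several consecutive identical samples. The scan count and class shape are illustrative, not any particular firmware's code.

```python
# Counter-based debounce: report a new key state only after the raw reading has been
# stable for DEBOUNCE_SCANS consecutive matrix scans (~5 ms at a 1 kHz scan rate).

DEBOUNCE_SCANS = 5

class DebouncedKey:
    def __init__(self):
        self.reported = False          # debounced state exposed to the rest of the firmware
        self.last_raw = False
        self.stable_count = 0

    def update(self, raw_pressed):
        """Feed one raw sample per scan; return the debounced state."""
        if raw_pressed == self.last_raw:
            self.stable_count += 1
            if self.stable_count >= DEBOUNCE_SCANS:
                self.reported = raw_pressed
        else:
            self.stable_count = 0      # the contacts bounced; restart the stability timer
        self.last_raw = raw_pressed
        return self.reported

key = DebouncedKey()
samples = [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]     # noisy bounce settling into a solid press
print([key.update(bool(s)) for s in samples])   # stays False until the signal has been stable
```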

This is your first encounter with a pattern that repeats throughout the stack: raw physics is messy, so every layer adds processing to clean up the layer below it.

The USB HID report

Once the firmware confirms a genuine keypress, it constructs a Human Interface Device (HID) report, a standardized packet format defined by the USB-IF back in 1996 and still going strong.

A typical keyboard HID report is 8 bytes: one byte of modifier-key flags (Ctrl, Shift, Alt, and GUI, with left and right variants as separate bits), one reserved byte, and six bytes of key codes for up to six simultaneously held keys.

The keyboard doesn't send ASCII characters. It sends scan codes, position-based identifiers that mean "the key in position 0x04" (which happens to be 'A' on a US QWERTY layout). Your operating system handles the mapping from scan codes to characters, which is why you can switch keyboard layouts without reflashing your keyboard firmware.
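Here's a sketch of unpacking that report on the host side. The layout is the standard boot-protocol format; the example bytes correspond to Shift held plus the key at usage 0x04:

```python
import struct

# The standard 8-byte boot-protocol report: modifier bitmask, reserved byte,
# then up to six key usage codes (0x00 = empty slot).
def parse_hid_report(report):
    modifiers, _reserved, *keys = struct.unpack("8B", report)
    shift_held = bool(modifiers & 0x22)            # bits 1 and 5: left and right Shift
    return shift_held, [k for k in keys if k]

# Example report: left Shift (0x02) plus the key at usage 0x04 ('A' on US QWERTY).
shift, keys = parse_hid_report(bytes([0x02, 0x00, 0x04, 0, 0, 0, 0, 0]))
print(shift, [hex(k) for k in keys])               # True ['0x4']
```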

For USB connections, this report travels over the D+ and D- differential pair at either 1.5 Mbps (Low Speed), 12 Mbps (Full Speed), or 480 Mbps (High Speed). Most keyboards use Low Speed or Full Speed. 12 Mbps is absurd overkill for 8 bytes every millisecond, but the overhead buys you reliable delivery.

Here's a naming quirk worth knowing: USB "interrupt transfers" aren't actually interrupts in the hardware sense. The host controller polls the device at an interval the device advertises (commonly 1-10 ms for keyboards, though the spec allows up to 255 ms for full-speed devices; low-speed is limited to 10-255 ms). When the poll arrives, the device either responds with new data or NAKs. So when people say "the keyboard interrupts the CPU," they're describing the end result, not the bus mechanism. The USB host schedules all traffic; the device waits to be asked.

(Bluetooth keyboards use the same HID report structure but wrap it in L2CAP packets over a 2.4 GHz radio link, adding another 10-30ms of latency that gamers complain about endlessly.)

Part II: Operating system event handling

The transfer completes

Your USB host controller, a dedicated chip on your motherboard, receives the HID report when it completes a scheduled poll. This triggers a hardware interrupt to the CPU, a physical electrical signal that says: "the transfer you scheduled is done."

The CPU responds in microseconds (the exact timing varies with power states, interrupt moderation, and kernel configuration), saving its current state to the stack and jumping to the interrupt handler registered for this interrupt vector. On Linux, this is the USB HID driver; on Windows, hidclass.sys; on macOS, IOHIDFamily.

The interrupt handler is intentionally minimal. It runs with interrupts disabled, so it needs to finish fast. It copies the HID report into a kernel buffer, acknowledges the interrupt, and schedules a deferred work item for the slower processing.

From scan code to key event

The deferred work happens in kernel context but with interrupts re-enabled. Here, the driver:

  1. Parses the HID report to extract the scan code (say, 0x04 for 'A')

  2. Translates HID usage codes to kernel-internal key codes

  3. Generates a key event structure with timestamps, key code, and modifier state

  4. Pushes this event into the input subsystem's event queue

On Linux, this produces an evdev event that userspace programs can read from /dev/input/eventX. Critically, evdev delivers keycodes, not characters. The translation from "key code 30 with Shift held" to the Unicode character 'A' happens in userspace, typically via XKB (X Keyboard Extension) in the display server or a library like libxkbcommon. This is where keyboard layouts, dead keys, compose sequences, and input method editors (IMEs for CJK languages) live.
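If you want to watch these events yourself, the raw evdev stream can be read directly. This is a Linux-only sketch that usually needs root; the struct layout assumes a 64-bit system, and the event3 device node is just an example, yours will differ:

```python
import struct

# Raw evdev events from a keyboard device node (Linux only, usually needs root).
# struct input_event on a 64-bit system: two native longs for the timestamp,
# then unsigned short type, unsigned short code, signed int value.
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)
EV_KEY, KEY_A = 0x01, 30                           # event type 1 = key; keycode 30 = KEY_A

with open("/dev/input/event3", "rb") as dev:       # the event number varies per machine
    while True:
        sec, usec, etype, code, value = struct.unpack(EVENT_FORMAT, dev.read(EVENT_SIZE))
        if etype == EV_KEY and value == 1:          # 1 = press, 0 = release, 2 = auto-repeat
            label = "KEY_A" if code == KEY_A else f"keycode {code}"
            print(f"{label} pressed at {sec}.{usec:06d}")
```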

On macOS and Windows, similar splits exist between kernel-level key events and userspace character translation, though the boundaries differ.

Memory: where electrons become stored charge

The key event now exists as data in DRAM (Dynamic Random Access Memory), where your computer stores everything that needs fast access but not persistence.

DRAM stores each bit as charge on a tiny capacitor: charged = 1, discharged = 0. Modern DRAM cells have capacitances of a few tens of femtofarads. At typical operating voltages, this translates to roughly 30,000 to 50,000 electrons per bit when fully charged. But here's the thing about capacitors: they leak. The charge decays due to junction leakage currents in the access transistors, thermal noise, and even cosmic ray strikes. (Yes, really. That's why ECC memory exists in servers.)

There's a catch: reading a DRAM bit is a destructive act. The sense amplifier drains the tiny capacitor to measure its charge. Every time your CPU reads a memory address, the DRAM chip has to instantly write that data back into the cell, or it would vanish forever. Observing the data literally destroys it, requiring immediate reconstruction.

To prevent data loss, DRAM must be refreshed every 64 milliseconds (or 32ms at higher temperatures). During refresh, the memory controller issues refresh commands to the DRAM chip, which internally activates rows and uses sense amplifiers to restore the charge. (The controller doesn't read and rewrite data itself; the refresh circuitry is inside the DRAM.)

Your key event, maybe 100-1,000 bytes of state including timestamps, modifiers, and event metadata, occupies on the order of 10⁷ to 10⁸ electrons' worth of stored charge (tens of thousands per bit × thousands of bits). These cells get refreshed about 16 times per second whether your application reads them or not.
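For the sake of the accounting, here's the arithmetic behind those numbers as a quick sketch; the event size and the electrons-per-bit value are the assumed round figures from above:

```python
# Back-of-envelope: the stored charge behind a ~500-byte key event in DRAM,
# using the round numbers above (electrons per bit and event size are assumptions).
ELECTRONS_PER_BIT = 40_000          # midpoint of the 30,000-50,000 range
EVENT_BYTES = 500                   # key event plus timestamps and metadata
REFRESH_INTERVAL_S = 0.064          # the 64 ms refresh window

total_electrons = ELECTRONS_PER_BIT * EVENT_BYTES * 8
print(f"~{total_electrons:.1e} electrons of stored charge")          # ~1.6e8
print(f"refreshed ~{1 / REFRESH_INTERVAL_S:.0f} times per second")   # ~16
```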

Part III: The application layer and network transmission

Browser event handling

The window manager delivers the key event to your browser, which runs its own event loop. The browser determines which DOM element has focus, fires the appropriate JavaScript events (keydown, keypress, keyup), and updates the text field.

If you're typing into a chat interface connected to an LLM, each keystroke (or the final Enter key) triggers an API call. The browser constructs an HTTPS request, typically 1-5 kilobytes containing your prompt text, any prior conversation context, model and sampling parameters, and authentication headers.
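A sketch of that request from the client's point of view, translated to Python. The endpoint, field names, and auth header are hypothetical, not any particular provider's API:

```python
import json
import urllib.request

# Illustrative only: the endpoint, field names, and auth header are hypothetical,
# not any particular provider's real API.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "What is the speed of light?"}],
    "stream": True,
}
body = json.dumps(payload).encode()

request = urllib.request.Request(
    "https://api.example.com/v1/chat",
    data=body,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <your-api-key>",
    },
)
print(f"{len(body)} bytes of JSON before TLS framing and TCP segmentation")
# urllib.request.urlopen(request) would kick off DNS, TCP, TLS, and the POST itself.
```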

The TCP/IP stack

This is where things get layered in ways that would make an onion jealous.

Your 1-5 kB HTTPS request gets:

  1. TLS encrypted: The payload is encrypted using the session key negotiated during the TLS handshake. This adds ~20-40 bytes of overhead per record.

  2. Segmented into TCP packets: The maximum payload per packet is the Maximum Segment Size (MSS), typically 1,460 bytes on Ethernet (derived from the 1,500-byte MTU minus 20 bytes for IP header and 20 bytes for TCP header).

  3. Wrapped in IP packets: Each TCP segment gets an IP header with source and destination addresses.

  4. Framed for Ethernet: Each IP packet gets an Ethernet frame header with MAC addresses and a CRC checksum.

The kernel's TCP/IP stack handles this transformation, maintaining state for the connection (sequence numbers, acknowledgment tracking, congestion window). Processing a packet through the full kernel network stack takes 5-50 microseconds depending on CPU speed and kernel configuration, a number that becomes significant when you're moving millions of packets per second.
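The layering arithmetic for a typical request, as a sketch with round numbers; actual overheads vary with TLS record sizing, IP and TCP options, and VLAN tags:

```python
import math

# How a ~4 kB request becomes Ethernet frames, with typical sizes (actual overheads
# vary with TLS record sizing, IP/TCP options, and VLAN tags).
payload = 4000                                    # HTTP headers + JSON body, bytes
tls_overhead = 30                                 # one TLS record's framing and auth tag
mss = 1460                                        # 1500-byte MTU minus 20 (IP) and 20 (TCP)
per_frame = 20 + 20 + 14 + 4 + 8 + 12             # IP + TCP + Ethernet header + CRC + preamble + gap

segments = math.ceil((payload + tls_overhead) / mss)
on_wire = payload + tls_overhead + segments * per_frame
print(f"{segments} TCP segments, ~{on_wire} bytes on the wire "
      f"({on_wire / payload:.2f}x the original payload)")
```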

Why not skip some of these layers?

Datacenters handling extreme packet rates sometimes do. Techniques like DPDK (Data Plane Development Kit) and AF_XDP bypass the kernel entirely, letting userspace applications talk directly to the NIC. This can reduce latency from 50+ microseconds to under 5 microseconds per packet, but it requires dedicating CPU cores to packet processing and rewriting your network stack from scratch.

For a single LLM request, the kernel path is fine. The added latency is invisible compared to the 500-2000ms you'll wait for inference.

Part IV: The physical network layer

Electrons on copper

Your packet leaves the kernel as bytes in a DMA (Direct Memory Access) buffer. The network interface card (NIC) reads this buffer directly, with no CPU involvement, and begins serialization.

The NIC's SerDes (Serializer/Deserializer) converts parallel data into a serial bit stream. For 10 Gigabit Ethernet, this means 10 billion bits per second, encoded using 64b/66b line coding (64 bits of data wrapped in 66 bits to ensure enough signal transitions for clock recovery).

Each bit transition drives 10⁶ to 10⁸ electrons through the differential pair. A quick aside: the signal propagates at close to the speed of light in the medium (roughly 0.6c in copper traces). Individual electrons, however, drift far more slowly, maybe millimeters per second. What matters for signaling is the electromagnetic field, not the identity of any particular electron. But I'll keep the "electron narrative" because it's useful for accounting.

The signal travels across a few centimeters of PCB trace to an SFP+ optical transceiver module plugged into the NIC.

The first electrical-to-optical conversion

Inside the SFP+ module, a laser driver circuit modulates a semiconductor laser (typically a VCSEL, or Vertical Cavity Surface Emitting Laser, for short-range, or a DFB laser for longer distances).

When electrons jump across the semiconductor bandgap, they emit photons. For telecommunications, this happens at one of several standard wavelengths: 850 nm for short-reach multimode links, 1310 nm (the O-band) for shorter single-mode spans, and around 1550 nm (the C-band) for long-haul routes, where fiber attenuation is lowest.

Each bit drives 10⁶ to 10⁸ photons into the fiber core. Laser light is highly coherent (narrow linewidth, long coherence length compared to LEDs), which is what makes the phase and amplitude modulation in coherent systems possible. These photons propagate via total internal reflection down the fiber's core: a 9-micrometer-wide strand of ultra-pure silica glass for single-mode fiber, or 50-62.5 micrometers for multimode.

(I'm simplifying here. Modern coherent optical systems use phase, amplitude, AND polarization modulation to pack more bits per symbol. A single 400 Gbps coherent transceiver might use 16QAM modulation on two polarizations, achieving 8 bits per symbol. But the fundamental physics is the same: modulated light in glass.)
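Where the photons-per-bit figure above comes from, as a quick sketch assuming about a milliwatt of launch power at 10 Gbps:

```python
# Photons per bit for a 1550 nm transmitter at 10 Gbps, assuming ~1 mW of launch power.
h = 6.626e-34              # Planck constant, J*s
c = 2.998e8                # speed of light in vacuum, m/s
wavelength = 1550e-9       # metres
launch_power = 1e-3        # watts (assumed)
bitrate = 10e9             # bits per second

photon_energy = h * c / wavelength            # ~1.3e-19 J per photon
photons_per_bit = (launch_power / bitrate) / photon_energy
print(f"~{photons_per_bit:.1e} photons per bit")   # ~8e5, the low end of the range above
```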

Part V: The backbone network

Traversing continents

Your photons now enter the optical network infrastructure, a hierarchy of access, metro, and backbone networks spanning continents and oceans.

A typical path from a user in New York to a datacenter in Iowa might traverse the ISP's local access network, a metro aggregation ring, one or more long-haul backbone routes across the Midwest, and finally the datacenter operator's edge interconnect: a couple of thousand kilometres of glass in total.

A common misconception: fiber is "faster" because it uses light. In reality, light travels about 30% slower in glass than in a vacuum (refractive index ~1.5), roughly the same speed as signals in a high-quality copper cable. The advantage of fiber isn't speed, it's stamina. Copper loses signal integrity after a few meters at high frequencies; the electromagnetic interference and resistive losses become overwhelming. Glass carries a clean signal for kilometers. The magic isn't velocity, it's attenuation.

The photons don't make this journey in one go. Even the best fiber attenuates signal at about 0.2 dB per kilometer at 1550 nm. After 100 kilometers, you've lost 20 dB, meaning 99% of your optical power. The signal would be unreadable.

Erbium-doped fiber amplifiers

This is where EDFAs (Erbium-Doped Fiber Amplifiers) come in. Every 80-100 kilometers, the optical signal passes through a section of fiber doped with erbium ions.

A separate pump laser (typically at 980 nm or 1480 nm) excites the erbium ions to a higher energy state. When signal photons at 1530-1565 nm pass through, they stimulate emission from the excited erbium ions, producing additional photons at the exact same wavelength and phase. This is optical amplification without electrical conversion. The signal stays in the photonic domain.

EDFAs can provide 20-30 dB of gain (100x to 1000x amplification) with noise figures around 4-6 dB. A transcontinental link might use 20-40 amplifier sites, each boosting all wavelength channels simultaneously.
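A quick link-budget sketch for a 2,000 km route, using the attenuation and amplifier-spacing figures above:

```python
# Link budget for a 2,000 km route with an EDFA every 80 km, using the figures above.
route_km = 2000
loss_db_per_km = 0.2
amp_spacing_km = 80
amp_gain_db = amp_spacing_km * loss_db_per_km      # each EDFA set to offset one span: 16 dB

n_amps = route_km // amp_spacing_km
total_loss_db = route_km * loss_db_per_km
total_gain_db = n_amps * amp_gain_db

print(f"{n_amps} amplifier sites, {total_loss_db:.0f} dB of fiber loss, "
      f"{total_gain_db:.0f} dB of EDFA gain")
print(f"unamplified, the optical power would fall by a factor of 10^{total_loss_db / 10:.0f}")
```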

Wavelength division multiplexing

Modern backbone fibers don't carry just one wavelength. Dense Wavelength Division Multiplexing (DWDM) packs 40-96 separate wavelength channels into the C-band (1528-1564 nm), each carrying 100-400 Gbps.

The channels are spaced 50-100 GHz apart, about 0.4-0.8 nm in wavelength. At 400G per channel with 96 channels, a single fiber pair can carry 38.4 Tbps. This is how undersea cables connecting continents achieve their staggering capacities.

Your packet, a few thousand bytes among the petabytes flowing through the fiber, rides one of these wavelength channels, amplified by the same EDFAs boosting all the others, adding essentially zero marginal cost to the infrastructure.

ROADMs and optical switching

At major junction points, Reconfigurable Optical Add-Drop Multiplexers (ROADMs) can redirect individual wavelength channels without converting them to electrical signals. A ROADM uses wavelength-selective switches (typically based on liquid crystal on silicon or MEMS mirrors) to drop specific wavelengths to local equipment while passing others through.

Google's Jupiter network famously includes MEMS-based optical circuit switches that can reconfigure the datacenter network topology in milliseconds, moving entire wavelengths between different parts of the fabric. Your photons might bounce off microscopic mirrors a few times on their way to the right server rack.

Optical-to-electrical at the datacenter

At the datacenter edge, your wavelength channel hits a coherent optical receiver. A local oscillator laser beats against the incoming signal in a set of photodiodes (the optical equivalent of radio's superheterodyne receiver), recovering both amplitude and phase information across both polarizations.

The photodiodes convert photons back to electrons: 10⁶ to 10⁸ photons in, producing a proportional electrical current. This current feeds a transimpedance amplifier and analog-to-digital converter sampling at 60+ billion samples per second.

Digital signal processing recovers the original bit stream, correcting for chromatic dispersion (different wavelengths traveling at slightly different speeds), polarization mode dispersion, and phase noise. Modern coherent DSP chips burn 20-50 watts just to undo what physics did to your photons over 2,000 kilometers.

Part VI: Datacenter networking

At these speeds, the physics is so hostile that bit errors are inevitable. Modern protocols don't just detect errors; they use Forward Error Correction (FEC) to mathematically reconstruct damaged bits on the fly. This adds a rigid "latency tax" (often ~100 nanoseconds) to every hop, the price we pay for pushing copper and glass to their physical limits.

The journey to the GPU

Your packet enters the datacenter as electrical signals on a 100/400G Ethernet link. It passes through:

  1. Border routers: Verify the packet, apply access control lists, decrement TTL

  2. Load balancers: Direct your request to an available inference server

  3. Spine switches: The core of the datacenter fabric, often using cut-through switching (beginning to forward before receiving the complete packet) for minimum latency

  4. Leaf switches: Top-of-rack switches connected to individual servers

  5. NIC: Back to electrical signals on the server motherboard

Each hop adds 200-500 nanoseconds of latency in a well-designed datacenter fabric. The total network latency from edge to server might be 2-5 microseconds, negligible compared to what's about to happen in inference.

Packet reassembly

The NIC receives frames and DMAs them into host memory, often performing checksum verification and receive-side coalescing (GRO/LRO) to reduce per-packet overhead. But TCP reassembly proper (reordering out-of-order packets, managing the byte stream, handling retransmissions) happens in the kernel's TCP stack. Once the stream is reassembled and decrypted, the userspace inference server reads your deserialized request.

You've now moved from keystroke to "prompt string ready for tokenization" in maybe 30-100 milliseconds for a same-continent request (dominated by speed-of-light latency in fiber), or 100-200 milliseconds for intercontinental.

Part VII: The inference engine

Tokenization

Your text prompt gets chopped into tokens, subword units that the model understands. The tokenizer (typically Byte-Pair Encoding or SentencePiece) maps "What is the speed of light?" into maybe 7-10 token IDs: [1, 1724, 374, 279, 4732, 315, 3177, 30].
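You can poke at a real tokenizer directly. This sketch uses OpenAI's open-source tiktoken library as one concrete example; other models ship their own vocabularies, so the IDs you get won't match the illustrative ones above:

```python
# One concrete tokenizer: OpenAI's open-source tiktoken BPE (pip install tiktoken).
# Other models ship their own vocabularies, so these IDs won't match the ones above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("What is the speed of light?")
print(ids)                                  # a handful of integer token IDs
print([enc.decode([i]) for i in ids])       # the subword pieces they correspond to
```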

These token IDs index into an embedding table: each ID maps to a 4,096 or 8,192-dimensional vector of floating-point numbers. Your 10-token prompt becomes a (10 × 4,096) matrix of embeddings.

Prefill: processing your prompt

Inference happens in two distinct phases with very different computational profiles.

First is prefill: the model processes your entire prompt in parallel, computing the internal representations for each token. This is compute-bound, with large matrix-matrix multiplications that keep the Tensor Cores busy. For your 10-token prompt, prefill takes milliseconds.

During prefill, the attention mechanism has O(n²) complexity in sequence length. Each token attends to all tokens that precede it, requiring a quadratic number of operations. The model builds a KV cache: the key and value vectors for each layer and each token, stored in GPU memory for later reuse.

Decode: generating the response

Once prefill completes, the model enters decode mode: generating tokens one at a time, autoregressively.

Here the compute profile flips. Each new token only needs to attend to the cached keys and values, not recompute them. The attention work per token scales as O(n) in context length, not O(n²). But each token still requires reading all the model weights from HBM, making decode memory-bandwidth-bound rather than compute-bound.
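To make that concrete, here's a toy single-head attention decode step against a KV cache in numpy, with made-up dimensions: just the shape of the computation, not a real model.

```python
import numpy as np

# Toy single-head attention for one decode step, reusing a KV cache (made-up sizes).
d = 64                                       # head dimension
rng = np.random.default_rng(0)
k_cache = rng.standard_normal((10, d))       # keys cached during prefill (10 prompt tokens)
v_cache = rng.standard_normal((10, d))       # values cached during prefill

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """Attend the new token's query over all cached keys/values, extending the cache."""
    k_cache = np.vstack([k_cache, k_new])    # append, don't recompute: O(n) work per token
    v_cache = np.vstack([v_cache, v_new])
    scores = k_cache @ q_new / np.sqrt(d)    # one score per past token plus the new one
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax
    return weights @ v_cache, k_cache, v_cache

q, k, v = (rng.standard_normal(d) for _ in range(3))
out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
print(out.shape, k_cache.shape)              # (64,) (11, 64): the cache grew by one row
```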

A modern LLM contains tens to hundreds of billions of parameters (public/open models like LLaMA are in this range; frontier systems may be larger, sometimes using mixture-of-experts architectures that make "parameter count" slippery). These weights are stored in GPU memory (HBM, or High Bandwidth Memory) as 16-bit or 8-bit values. The model consumes 200-800 GB of memory just for its weights, spread across 4-8 GPUs in a tensor-parallel configuration.

For each token you generate, the model performs:

  1. Attention: Computing which prior tokens the current token should "attend to." With KV caching, this means reading cached keys/values and computing attention scores against them, rather than recomputing everything.

  2. Feed-forward networks: Two dense matrix multiplications with a nonlinearity in between. These typically expand the hidden dimension 4x (to 16,384) and back down.

  3. Normalization: Layer normalization across the hidden dimension.

  4. Residual connections: Adding the layer's output to its input.

Each layer repeats this pattern. A 100-layer model processes your token through 100 attention layers and 100 feed-forward layers.

The total: roughly 10¹¹ to 10¹² floating-point operations per token for very large dense models, or 10⁹ to 10¹¹ for smaller ones. (The rule of thumb is "a few times the parameter count" in FLOPs per token, plus attention overhead that grows with context length.)

What's happening in the silicon?

An NVIDIA H100 GPU contains 528 Tensor Cores across 132 Streaming Multiprocessors. Each Tensor Core can sustain on the order of 500 FP16 multiply-accumulate operations per clock cycle. At its roughly 1.8-2 GHz boost clock, that works out to roughly 1 PFLOP of dense FP16/BF16 throughput, or about 2 PFLOPS if you can exploit the structured sparsity features (which require 2:4 sparsity patterns in the weights).

During inference, these Tensor Cores are multiplying weight matrices against activation vectors. The transistors, about 8 × 10¹⁰ of them on an H100, toggle at roughly 10¹⁶ to 10¹⁷ times per second across the die. Each toggle moves maybe 10³ to 10⁴ electrons through a transistor channel, consuming the power that makes GPUs run hot enough to need liquid cooling.

The memory system struggles to keep up. HBM3 provides about 3.35 TB/s of bandwidth per GPU. A 100B-parameter model in FP16 needs to read 200 GB of weights for each token. On a single GPU, that's 3350/200 ≈ 17 tokens per second if you're purely memory-bandwidth-bound. In practice, large models are tensor-parallel across 4-8 GPUs, aggregating bandwidth to reach 60+ tokens per second, though real throughput is lower due to communication overhead, KV cache traffic, and kernel inefficiencies.
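The arithmetic behind that bottleneck, as a quick roofline-style sketch using the round figures quoted above:

```python
# Roofline sanity check for decode on a single H100-class GPU, using the figures above.
params = 100e9                     # parameters
bytes_per_param = 2                # FP16
flops_per_token = 2 * params       # "a few times the parameter count" rule of thumb

hbm_bandwidth = 3.35e12            # bytes per second
peak_flops = 1e15                  # dense FP16, order of magnitude

t_mem = params * bytes_per_param / hbm_bandwidth        # time to stream the weights once
t_compute = flops_per_token / peak_flops

print(f"memory-bound ceiling : {1 / t_mem:6.1f} tokens/s")      # ~16.8
print(f"compute-bound ceiling: {1 / t_compute:6.0f} tokens/s")  # ~5000: compute mostly idles
```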

This is why speculative decoding, quantization (reducing weights from FP16 to INT8 or INT4), and KV-cache optimization matter so much. The compute is there; the bottleneck is moving data to the compute.

It's also why algorithmic breakthroughs like FlashAttention matter as much as hardware. FlashAttention restructures the attention computation to keep intermediate values inside the GPU's tiny, ultra-fast SRAM caches rather than round-tripping to HBM. Same math, same result, but with far fewer memory accesses. It's software designed entirely around the physics of data movement: a reminder that the boundary between "hardware" and "software" optimization has become vanishingly thin.

Part VIII: The context of training

Where did those weights come from?

The parameters in the model weren't invented. They were learned from data through backpropagation over weeks to months.

Training a frontier LLM requires tens of thousands of GPUs running in lockstep for weeks to months, trillions of tokens of training data, and tens of megawatts of sustained power, plus the storage, networking, and cooling to keep all of it fed.

The training process runs 10²³ to 10²⁵ floating-point operations total. At 10³ to 10⁴ electrons per transistor toggle, across 10¹⁶ to 10¹⁷ toggles per second per GPU, across 10⁴ GPUs, for 10⁷ seconds... you're pushing something like 10³⁰ to 10³² electrons through transistor channels just to learn the weights that will process your prompt.

That's not counting the electrons that flow through the power grid, the data center cooling systems, the disk drives storing training data, or the countless auxiliary systems keeping everything running.

The physics of the cluster

Training isn't just inference repeated. It's a synchronized dance of 20,000 chips. If inference is a solo violin, training is a marching band where one tuba player tripping stops the entire performance.

The core problem is this: you can't fit a 100-billion-parameter model on one GPU. You have to slice the mathematics across silicon in three dimensions simultaneously.

Tensor parallelism splits individual matrix multiplications. Rows of a weight matrix live on GPU 0, columns on GPU 1. For every single token processed, these GPUs must exchange partial sums over NVLink: nanosecond-scale communication where nanoseconds actually matter. Pipeline parallelism splits the layers: GPU cluster A runs layers 1-10, cluster B runs 11-20. The activation electrons physically travel over InfiniBand cables from rack to rack between layers. Data parallelism means different GPUs read different chunks of the training data but share the same weights. A single training step involves all three slicing strategies operating simultaneously.

At the end of every backward pass (every single gradient computation) all 20,000 GPUs hold a local fragment of the gradient. They must all agree on the global average before taking an optimization step. This is the Ring All-Reduce algorithm: data flows in a logical ring through every chip in the cluster, each one adding its contribution and passing the result onward.
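Here's a toy numpy simulation of that ring, small enough to check by eye: each "worker" starts with its own gradient, and after a reduce-scatter pass and an all-gather pass every worker holds the same sum. Real clusters do this with NCCL over NVLink and InfiniBand rather than Python loops:

```python
import numpy as np

# Toy ring all-reduce over N "GPUs": each worker starts with its own gradient, split
# into N chunks. A reduce-scatter pass and an all-gather pass (2*(N-1) steps total)
# leave every worker holding the element-wise sum.
N, CHUNK = 4, 3
rng = np.random.default_rng(0)
grads = [rng.standard_normal(N * CHUNK) for _ in range(N)]
bufs = [g.reshape(N, CHUNK).copy() for g in grads]

# Reduce-scatter: at step t, worker w sends chunk (w - t) to its right-hand neighbour,
# which adds it into its own copy of that chunk.
for t in range(N - 1):
    sent = [bufs[w][(w - t) % N].copy() for w in range(N)]
    for w in range(N):
        bufs[w][(w - 1 - t) % N] += sent[(w - 1) % N]

# All-gather: circulate the now-complete chunks so every worker ends up with all of them.
for t in range(N - 1):
    sent = [bufs[w][(w + 1 - t) % N].copy() for w in range(N)]
    for w in range(N):
        bufs[w][(w - t) % N] = sent[(w - 1) % N]

expected = sum(grads)
assert all(np.allclose(b.ravel(), expected) for b in bufs)
print("all", N, "workers agree on the summed gradient")
```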

Here's the brutal physics: this is a global barrier. If one GPU is 5 milliseconds slower than the others due to thermal throttling, a slightly slower silicon bin, or an OS scheduling hiccup, every other GPU waits. You have a 25-megawatt machine held hostage by its slowest transistor.

The network topology is engineered specifically for this. Standard tree topologies create bottlenecks; training clusters use "rail-optimized" fabrics where the Rail 1 NIC of every server connects to the same switch, Rail 2 to another, and so on. This creates 8 distinct physical networks overlaid on top of each other, each dedicated to a slice of the All-Reduce traffic. It's the gradient synchronization algorithm hardwired into copper and glass.

When hardware fails (and it will)

With 20,000 GPUs, the Mean Time Between Failure is measured in hours. Hardware will fail during your training run. It's not a question of if, but when.

The defence is checkpointing: every 10-30 minutes, the entire cluster pauses to dump full model state (weights, optimizer momentum, learning rate schedules) to persistent storage. This is 50-100 terabytes of data written to NVMe drives in seconds. The power consumption profile of the datacenter literally changes shape during a checkpoint. If this I/O takes too long, you're burning megawatts for zero FLOPs.
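The I/O arithmetic is worth a quick sketch; the checkpoint size, time budget, and per-drive bandwidth here are assumed round numbers, not measurements from any particular cluster:

```python
# Rough I/O arithmetic for a checkpoint: dump 75 TB of weights and optimizer state
# in ~30 seconds (size, time budget, and per-drive bandwidth are assumed round numbers).
checkpoint_tb = 75
budget_s = 30
per_drive_gb_s = 5                         # sustained sequential write per NVMe drive

needed_gb_s = checkpoint_tb * 1000 / budget_s
print(f"~{needed_gb_s:,.0f} GB/s of aggregate write bandwidth")       # 2,500 GB/s
print(f"~{needed_gb_s / per_drive_gb_s:,.0f} NVMe drives writing flat out in parallel")
```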

But some failures are worse than crashes. At this scale, silent data corruption becomes real. A bit flips in the mantissa of a floating-point number during a matrix multiply, not in memory where ECC would catch it, but in an ALU during computation. The math is wrong, but the hardware doesn't know. The loss curve spikes. You've spent $2 million on electricity over the last week, and suddenly the model has "diverged" because a neutron from a supernova hit a transistor in Iowa.

You roll back to the last checkpoint and pray it wasn't corrupted too.

Heat as the enemy

A training cluster is a machine that converts 20 megawatts of electricity into 20 megawatts of heat. The cooling infrastructure is as complex as the compute infrastructure.

No two H100s are identical. Manufacturing variance means some chips run hotter than others at the same clock speed. To keep the cluster synchronous, you often have to underclock the "fast" GPUs to match the "slow" ones, otherwise the fast chips idle while waiting at the All-Reduce barrier, wasting power on empty cycles. If a specific rack develops a localised hot spot, those GPUs throttle their clocks to avoid damage, and the entire cluster slows to match. Twenty thousand chips, constrained by the thermal management of the worst-cooled corner of the datacenter.

A brief acknowledgment of the energy reality

A single inference request might consume 0.001-0.01 kWh. Training the model that serves that request consumed 10-100 GWh. The ratio is roughly 10⁹ to 10¹¹, meaning your individual query is essentially free-riding on a massive fixed cost.

(Whether this energy equation makes sense at scale is a question for infrastructure economists and climate scientists. But the physics doesn't care about economics.)

Part IX: The response journey

Serializing the output

The model generates tokens autoregressively, one at a time, each conditioned on all previous tokens. At 30-60 tokens per second, a 200-token response takes 3-7 seconds of GPU time.

Each generated token is detokenized back to text, serialized into a JSON response, and written to a response buffer. The inference server might stream these tokens incrementally via Server-Sent Events or WebSocket, or batch them into a single response.
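A sketch of what the streaming consumer looks like on the client side, assuming a hypothetical SSE endpoint and a hypothetical "token" field; real APIs differ in naming and framing:

```python
import json
import requests

# Client side of a token stream over Server-Sent Events. The endpoint and the
# "token" field are hypothetical; real APIs differ in naming and framing.
response = requests.post(
    "https://api.example.com/v1/chat",
    json={"messages": [{"role": "user", "content": "What is the speed of light?"}],
          "stream": True},
    headers={"Authorization": "Bearer <your-api-key>"},
    stream=True,                               # hand back the socket instead of buffering
)
for line in response.iter_lines():             # each SSE event arrives as a "data: {...}" line
    if line.startswith(b"data: "):
        event = json.loads(line[len(b"data: "):])
        print(event.get("token", ""), end="", flush=True)
```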

The return path

The response bytes follow the same path in reverse:

  1. Through the datacenter fabric (electrical)

  2. Into the coherent optical transceiver (electrical→optical)

  3. Across thousands of kilometers of backbone fiber (optical, amplified by EDFAs)

  4. Through metro and access networks (optical, with O-E-O conversions at some points)

  5. Into your home router or cellular base station (electrical/radio)

  6. To your device's NIC (electrical)

  7. Up the TCP/IP stack, decrypted by TLS, parsed by your browser

Each segment repeats the pattern: modulation, transmission, amplification or regeneration, demodulation. Your response photons might undergo 6-12 complete optical-electrical-optical conversions between the GPU and your screen.

Part X: Display and perception

Pixels as photon sources

Your browser parses the response, updates the DOM, triggers a re-layout, and issues GPU draw calls to render the text. The rendering pipeline rasterizes glyphs into a framebuffer, a 2D array of pixel colors at your display resolution.

At 4K (3840 × 2160), that's 8.3 million pixels updated 60-120 times per second. The display interface (DisplayPort or HDMI) streams this data at 10-40 Gbps to your monitor.

If you're using an OLED display, each pixel is its own light source. Organic semiconductors emit photons when current flows through them via carrier recombination, the same basic mechanism as the laser in a fiber transceiver, but without the optical cavity and stimulated emission that make a laser coherent. OLED emission is spontaneous and incoherent, spreading across a range of wavelengths rather than a single precise frequency.

Each pixel emits roughly 10⁸ to 10¹¹ photons per frame, driven by microamp currents switched over microsecond intervals. The pixel response time, how fast the OLED can switch luminance, is under 0.1 milliseconds, far faster than the 8-16 ms frame interval. PWM (Pulse Width Modulation) dimming at 480-2160 Hz controls brightness by varying duty cycle rather than current amplitude.

(LCD screens work differently. They modulate a backlight through liquid crystal polarization filters. But the end result is the same: photons at visible wavelengths heading toward your eyes.)

The final optical link

Those photons travel about 50 cm from your display to your cornea in 1.7 nanoseconds. Your cornea and lens focus them onto your retina, where they're absorbed by photoreceptor cells.

And when an OLED pixel displays pure black? It's physically off. No current flows. No photons are born. This is the inverse of everything else in this article: instead of electrons becoming photons, you get true absence. The contrast ratio is theoretically infinite because the denominator is zero. An LCD showing black is still blocking a full-power backlight, wasting energy to show you nothing. An OLED showing black is genuinely dark, consuming almost nothing.

Phototransduction

A single photon absorbed by a rod cell triggers one of nature's most sensitive signal cascades.

The photon is absorbed by rhodopsin, a G-protein-coupled receptor embedded in the photoreceptor's membrane. The rhodopsin changes conformation, activating a G-protein (transducin), which activates a phosphodiesterase enzyme, which hydrolyzes cyclic GMP (cGMP), which causes cGMP-gated ion channels to close.

This cascade amplifies massively at each stage: one rhodopsin activates ~100 transducin molecules, each transducin activates one PDE, and each PDE can hydrolyze ~1,000 cGMP molecules per second. The signal gain is 10⁵ to 10⁶, enough to detect single photons in dark-adapted conditions.

The resulting hyperpolarization of the photoreceptor modulates neurotransmitter release, passing the signal to bipolar cells, then to ganglion cells, whose axons form the optic nerve.

Visual processing

Your optic nerve carries about 1 million parallel channels to the lateral geniculate nucleus, then to the primary visual cortex. The hierarchy of visual areas (V1, V2, V4, IT) extracts increasingly abstract features: edges, shapes, objects, words.

Reading text engages the Visual Word Form Area (VWFA) in the left fusiform gyrus, a region that didn't evolve for reading (writing is too recent for that) but has been repurposed through learning. Pattern recognition, semantic decoding, and language comprehension happen across distributed cortical networks, all running on a 20-watt biological substrate.

The whole process, from photon absorption to word comprehension, takes about 150-300 milliseconds. You perceive it as instantaneous.

What you now understand

From keypress to perception, you've traced a mechanical switch closure debounced into a USB HID report; kernel interrupt handlers, scan-code translation, and DRAM refresh cycles; TLS, TCP/IP, and a NIC's SerDes turning bytes into modulated light; thousands of kilometres of amplified, wavelength-multiplexed fiber; a datacenter fabric delivering the prompt to a multi-GPU inference server; tokenization, prefill, and memory-bandwidth-bound decode; and the return trip through rendering, photon emission, and the phototransduction cascade in your retina.

The magnitudes involved: tens of thousands of electrons per DRAM bit, 10⁶ to 10⁸ photons per transmitted bit, 10¹¹ to 10¹² floating-point operations per generated token, and 10²³ to 10²⁵ operations to train the model in the first place.

End-to-end latency: 300 milliseconds to 3 seconds for the first token, depending on network path and model size. Time for you to read the complete response: 10-30 seconds. The ratio of computation to perception is staggering.

The view from here

What’s incredible isn't just the complexity. It's the coherence. Every layer of this stack was designed by different people, in different decades, solving different problems. USB was designed in the 90s. Ethernet in the 70s. TLS in the 90s. Transformers in 2017. OLED physics was characterised in the 80s.

None of these designers were optimising for "let a language model answer a question." They were building reliable data transfer, or efficient optical amplification, or better display technology. Yet the whole thing composes into a system where keystrokes become photons become weights become photons become thoughts.

The infrastructure for AI isn't an AI system. It's a fossil record of every communications and computing breakthrough since the invention of the transistor, all layered on top of each other, all still running, all still necessary.

And remember: none of this was planned. It's a cathedral built by people who thought they were building houses.

Anyway. That's what happened while you were waiting for the spinner to stop.