Travis Head — The Expert Layer You Activate When the World Is On Fire

Growing up in India in the ’90s, everybody had one religion: Tendulkar. And yet there I was, the lone kid in a room full of India fans, cheering for Australia. Not because I wanted India to lose, but because something about that Aussie side felt like pure, ruthless excellence:

  • Warne’s sorcery

  • Gilchrist’s audacity

  • McGrath’s metronome

  • Ponting’s cold authority

It felt like watching a team that had found a higher gear. A different plane of performance. A machine tuned to borderline perfection.

And then I grew up. Life got heavier, with career, kids, responsibilities. India got strong.
Australia’s golden generation retired. And that inner fan in me went quiet.

Until a man named Travis Head walked into a World Cup final and played like he was batting in a video game. Head doesn’t “settle in.” He doesn’t “get his eye in.” He arrives fully loaded. He assesses nothing. He just destroys. And when he does it in big matches, something inside me wakes up again. The boy who loved the Aussie bulldozer style opens his eyes. He recently destroyed England in the first Test of the Ashes, and that is the real inspiration for this post.

That feeling of a sudden ignition is exactly what helped me understand something in AI called Mixture of Experts.

MoE = Travis Head

In a Mixture of Experts model, you don’t fire the whole network for every input.
You only activate the right expert for the right situation. Most experts sit idle.
Most neurons rest. Only the specialist steps forward.

That’s Travis Head. He’s the “expert layer” you only activate when you need surgical violence.

He’s not the batter for all conditions.
But when the model (or the team) faces chaos, the router, the gate, says: “Load Head.exe. Activate the destruction module.”

And then he does things no algorithm can predict.

AI and Sports. Sports and AI

People ask me why I mix sports with AI. Why I write with this blend of emotion, humor, nostalgia, and curiosity. It’s because this is how I learn. This is how I make sense of the world.
This is how I connect the engineer in me with the kid I once was.

AI research gives me the intellectual spark. Sports gives me the emotional spark. Writing ties both sparks together into a voice that feels like mine.

And somewhere between Transformers, MoE, and Travis Head, I’m rediscovering what it feels like to be awake, curious, and fully alive again.

Mixture of Experts — The Batting Order Inside the Transformer

After learning about the Transformer architecture, here’s the part that really made something click for me: Not every part of the model needs to think for every token.

Sometimes, you only need the right specialist at the right moment, running at full tilt. Enter Mixture of Experts (MoE). If the Transformer is the stadium and the pitch, MoE is the batting order inside it.

What Is Mixture of Experts (In Human Words)?

An MoE layer is basically:

  • A bunch of experts (tiny specialized neural networks)

  • A gate (a router that chooses which experts to activate)

  • Sparse activation (only 1–2 experts run for each token)

Instead of a dense model where every neuron fires for every word, MoE works like the famous cricket cliché: “horses for courses.”

Some experts learn math.
Some learn jokes.
Some learn Tamil lyrics.
Some learn how to code.
Some learn how to reason.

And the gate learns, over millions of training steps, which specialist to call upon at what moment.

It’s specialization, efficiency and ruthless matching of talent to situation.
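If you like seeing the cricket in code, here’s roughly what that looks like. This is a minimal, illustrative sketch, assuming PyTorch; the expert count, layer sizes, and top-k value are made-up numbers, not any real model’s config:

```python
# A minimal Mixture-of-Experts layer sketch (illustrative, not any specific model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is just a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        # The gate (router): scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                    # x: (n_tokens, d_model)
        scores = self.gate(x)                                # (n_tokens, n_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)    # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                 # how much to trust each picked expert
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return out                                           # most experts never ran for most tokens

moe = TinyMoE()
tokens = torch.randn(5, 64)        # five tokens walk into the layer...
print(moe(tokens).shape)           # ...and only 2 of the 8 experts run for each of them
```

The key line is the `topk` call: for every token, six of the eight experts simply don’t get padded up.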

The Cricket Version (because my brain only learns with sports)

In a dense neural network: Every player on the team walks out for every delivery.
Chaos. Heat. No runs. Nobody wins.

In MoE:
When the ball is spinning, you send your best spin player.
When the ball is short, you send the pull-shot specialist.
When the bowler is rattled, you send someone who can finish the job.

That gate, the router, becomes the biggest difference between brilliance and burnout.

Why MoE Is a Breakthrough

MoE gives you:

  • Bigger capacity (many experts = huge brain)

  • Lower compute cost (only a few experts activate)

  • Faster inference (because most of the model is sleeping)

  • Smarter specialization (each expert gets extremely good at one slice of the world)

This is how modern large models scale without melting the electricity grid. GPUs already run hot enough; MoE prevents them from becoming volcanic.
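A quick back-of-the-envelope, with made-up numbers, to show why the math works out in MoE’s favour: you store the parameters of all the experts, but you only pay compute for the ones the gate picks.

```python
# Back-of-the-envelope math with hypothetical numbers: capacity vs. compute in MoE.
d_model, d_hidden = 4096, 16384
n_experts, top_k = 8, 2

dense_ffn_params = 2 * d_model * d_hidden          # one big feed-forward block
moe_total_params = n_experts * dense_ffn_params    # the "brain" you get to store
moe_active_params = top_k * dense_ffn_params       # the compute you actually pay per token

print(f"dense FFN params:     {dense_ffn_params:,}")
print(f"MoE total params:     {moe_total_params:,}")    # ~8x the capacity
print(f"MoE active per token: {moe_active_params:,}")   # only ~2x the compute
```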

The Emotional Part

The idea that intelligence emerges not from everyone doing everything, but from the right expert stepping in at the right moment, hit me personally.

It reminded me of something from my childhood. Growing up in India, everyone supported Tendulkar. I was the boy openly cheering for the Australians. Years later, Travis Head would walk into the opening game of the Ashes and do exactly what MoE does: activate only when needed and then unleash absolute destruction.

But that’s for the next post.

MoE woke up the engineer in me. Travis Head woke up the boy in me. And somewhere between those two, I found the voice that’s writing these posts today.

Next up:
Travis Head — The Expert Layer You Activate When Everything Is On Fire.

RNN, LSTM, CNN — The Models That Ruled Before Transformers Took Over

Before Transformers became the Rajinikanth of AI architectures (before anyone learns AI, they should first know the man who said, “If I say something once, it is like me saying it a hundred times.” That’s clarity, no hallucination), the field ran on three big families of models. Each one tried to solve a different piece of the “how do we make machines understand patterns?” puzzle.

Let’s break them down in the simplest possible way.

RNN (Recurrent Neural Network)

What it is:

A model that processes sequences one step at a time, remembering what came before.

How it thinks:

“I saw this word earlier… let me keep that in mind.”

What it was used for:

  • Early language models

  • Time-series prediction

  • Simple speech tasks

Why it struggled:

Memory fades fast. RNNs forget long sentences. If a sentence was 20 words long, it remembered… maybe 5. Think of it like a student who remembers “And miles to go before I sleep” but has already forgotten “The woods are lovely, dark and deep.”
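Here’s the whole idea in a few lines, a hand-rolled sketch with random placeholder weights rather than a trained model: one hidden vector gets overwritten word after word, which is exactly why the start of the sentence fades.

```python
# A hand-rolled recurrent step (illustrative): one hidden state carried word to word.
import numpy as np

d_in, d_hidden = 8, 16
W_xh = np.random.randn(d_hidden, d_in) * 0.1      # current word -> hidden
W_hh = np.random.randn(d_hidden, d_hidden) * 0.1  # previous hidden -> hidden
b = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    # New memory is a squashed mix of the current word and everything remembered so far.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

sentence = [np.random.randn(d_in) for _ in range(20)]   # a 20-"word" sentence
h = np.zeros(d_hidden)
for x_t in sentence:
    h = rnn_step(x_t, h)   # early words get squashed again and again, so they fade
print(h.shape)             # one small vector has to summarize the whole sentence
```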

LSTM (Long Short-Term Memory)

What it is:

An advanced RNN that uses “gates” to decide what to remember and what to forget.

How it thinks:

“I’ll store important stuff and throw away the junk.”

What it was used for:

  • Speech recognition

  • Machine translation

  • Predictive text

  • Music generation

Why it was better than RNNs:

LSTMs could remember longer sequences. Instead of forgetting after 5 words, they could recall 20, maybe 30. They were like the topper kid who remembered the whole poem and recited it proudly on stage.

Why they still fell short:

They processed everything sequentially. No parallelism → slow, expensive, and hard to scale.
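For the curious, here is a single LSTM cell step, sketched with random placeholder weights (biases left out for brevity): the sigmoids are the famous gates, and the loop at the bottom is the sequential bottleneck that kept LSTMs slow.

```python
# One LSTM cell step (illustrative): gates deciding what to keep and what to forget.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hidden = 8, 16
rng = np.random.default_rng(0)
# One weight matrix per gate, acting on [previous hidden state, current input].
W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (d_hidden, d_hidden + d_in)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)            # forget gate: which old memories to throw away
    i = sigmoid(W_i @ z)            # input gate: which new info to store
    o = sigmoid(W_o @ z)            # output gate: what to reveal this step
    c_tilde = np.tanh(W_c @ z)      # candidate new memory
    c = f * c_prev + i * c_tilde    # long-term cell state survives many steps
    h = o * np.tanh(c)              # short-term output for this step
    return h, c

h = c = np.zeros(d_hidden)
for x_t in rng.normal(size=(30, d_in)):   # a 30-word "sentence"
    h, c = lstm_step(x_t, h, c)           # strictly one word after another: no parallelism
```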

CNN (Convolutional Neural Network)

What it is:

A network that looks for patterns in small patches of an image or signal and stitches them into bigger patterns.

How it thinks:

“I’ll check every small window, find edges, curves, textures, and build up the picture.”

What it’s used for:

  • Image classification (cat/dog)

  • Object detection

  • Facial recognition

  • Early medical imaging tasks

Why CNNs were kings for 10 years:

They are insanely good at visual patterns:

  • they’re fast

  • they’re parallel

  • they reuse filters efficiently

They were the Aussie cricket team of computer vision – dominant, ruthless, unbeatable.

Why they didn’t become LLMs:

CNNs don’t have a natural way to handle long-range relationships in text. CNNs are great for “here’s a face,” but bad at “here’s a paragraph with meaning spread across 80 words.”
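Here’s that “small window” worldview in a few lines: a toy vertical-edge detector with a hand-written 3×3 filter, sliding patch by patch over a tiny made-up image.

```python
# A tiny convolution sketch (illustrative): one 3x3 filter scanning small patches.
import numpy as np

image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # left half dark, right half bright: a vertical edge

edge_filter = np.array([[-1, 0, 1],     # classic vertical-edge detector
                        [-1, 0, 1],
                        [-1, 0, 1]])

out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        patch = image[i:i+3, j:j+3]     # the filter only ever sees this 3x3 window
        out[i, j] = (patch * edge_filter).sum()

print(out)   # big responses only where the edge sits; no single window sees the whole picture
```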

Why Transformers Replaced All Three

Transformers brought:

  • Attention (focus on what matters)

  • Parallelism (look at everything at once)

  • Long-context handling

  • Scalability

  • MoE compatibility

  • Cleaner training dynamics

RNNs forgot too easily. LSTMs remembered but were slow. CNNs saw patterns but not meaning. Transformers merged the best of all worlds. They became universal: text, vision, audio, protein folding, coding, reasoning… all with one architecture.

CNNs and Harari: Why Machines Saw Patterns but Couldn’t Gossip

CNNs were brilliant at one thing: detecting patterns in small windows.
Edges. Corners. Curves. Textures.

They could look at a tiny 3×3 patch of pixels and say, “Oh, this looks like an eyebrow” or “This is definitely a wheel.” Then they stitched those findings together: first edges, then shapes, then whole objects.

But here’s the catch: CNNs only look locally. Their entire worldview is “small patch → next small patch → next small patch.” They missed the long-range connections, the thing Harari talks about in Sapiens as what made humans dominate the planet: our ability to connect abstract ideas and gossip across tribes.

We’re not powerful because we see edges and textures.
We’re powerful because we can connect:

  • stories

  • beliefs

  • rumors

  • meanings

  • relationships

  • consequences that unfold over time

We can remember something someone said three weeks ago and use it to interpret something said today.

CNNs?
Nope. They’re the quiet, hardworking kid who studies hard but has no tea to spill.

Why This Leads Naturally to Transformers

Transformers introduced attention, which gives a model the ability to say: “This word from the beginning matters to this word at the end, so let me connect them.”

That’s gossip.
That’s meaning-spreading.
That’s long-range dependency.
That’s human-style cognition.
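And here is gossip in about fifteen lines: a minimal scaled dot-product attention sketch with random placeholder vectors and no training, where nothing stops the last word in the sentence from caring most about the first one.

```python
# Minimal scaled dot-product attention (illustrative): every word can look at every other word.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_words, d = 10, 16
rng = np.random.default_rng(1)
Q = rng.normal(size=(n_words, d))   # queries: "what am I looking for?"
K = rng.normal(size=(n_words, d))   # keys:    "what do I offer?"
V = rng.normal(size=(n_words, d))   # values:  "what do I actually pass along?"

weights = softmax(Q @ K.T / np.sqrt(d))   # (n_words, n_words): who listens to whom
context = weights @ V                     # each word becomes a blend of the words it attends to

# The last word's attention over the whole sentence, first word included:
print(weights[-1])   # position 10 is free to care most about position 1
```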

The Architecture of AI — The Blueprint of the Brain

When I started writing about AI, I thought the “model” was whatever lived on the GPU and spat out answers.
Simple.
Done.
Onward.

Turns out, not quite. There’s something that sits before training, before inference, and before any cleverness shows up. It’s called architecture, and it’s the part we almost never talk about outside research circles.

Architecture = The Brain’s Blueprint

Architecture is just the design of how the model thinks.

Not the data. Not the training. Not the GPUs. Not the math.

Just the layout.
The wiring diagram.

For decades, this wiring diagram came in different shapes:

  • RNNs (models that remembered yesterday)

  • LSTMs (models that remembered yesterday slightly better)

  • CNNs (models that inspected images like bouncers checking IDs)

One architecture changed everything: The Transformer.

What Makes a Transformer a Transformer?

It has two big superpowers:

  1. Attention: the ability to look at every part of a sentence and decide what matters

  2. Parallelism: the ability to think about many things at once, without step-by-step bottlenecks

And inside each Transformer block, you mostly have two components:

  • Multi-Head Attention

  • Feed Forward Networks (FFN)

Stack a number of those and suddenly you’ve built a modern LLM. This stack of blocks is what we call the architecture (a minimal code sketch follows the analogy below). Think of it as the playing conditions before the cricket match even starts:

  • the pitch

  • the boundaries

  • the field layout

  • the weather

The players (your neurons) and the coach (the optimizer) are important, but the ground determines how the match will flow.
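Here is the promised sketch of one block: multi-head attention plus a feed-forward network, held together by residual connections and layer norms. This assumes PyTorch, the sizes are made up, and real models add masking, dropout, and a few trillion tokens of training.

```python
# A minimal Transformer block sketch (illustrative), assuming PyTorch.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)               # every token attends to every token
        x = self.norm1(x + a)                   # residual + norm
        x = self.norm2(x + self.ffn(x))         # feed-forward (this is where MoE can slot in)
        return x

# "Stack a number of those and suddenly you've built a modern LLM" (minus the training run):
model = nn.Sequential(*[TinyBlock() for _ in range(6)])
print(model(torch.randn(1, 12, 64)).shape)      # (1, 12, 64)
```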

Why This Matters for My Writing Journey

I realized that if I wanted to understand AI deeply enough to explain it to my kids one day, I needed to understand the blueprint first. Everything else (training, inference, MoE, GPUs, CUDA, Randy Johnson and Curt Schilling, and yes, Travis Head) sits on top of this architecture. This is the moment in the story where the camera pans out and you finally see the whole cricket ground.

Next up:
Mixture of Experts: the specialist lineup inside the Transformer.

And after that?
A man named Travis Head walks into a game and becomes an expert layer all by himself.

Stay tuned.