Connecting the Dots.
An intuition-first introduction to modern AI — focused on understanding.
Hey, I'm Yogesh. This project is an attempt to understand how modern AI systems work — not just what they do, but why the ideas behind them are structured the way they are.
It grew out of months spent working through lectures, books, and research papers, often revisiting the same derivation several times before the intuition became clear. This document tries to record that process — the order in which the understanding actually arrived.
The chapter on autoregressive models begins with a simple question: how can we model a joint distribution over many variables without storing an exponential number of parameters? From there the discussion moves through the chain rule, Bayesian networks, FVSBN, NADE, and MADE, eventually reaching modern sequence models and the Transformer. Each idea appears as a response to a limitation in the previous one.
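The counting argument behind that opening question can be made explicit. As a sketch in standard notation (the symbols below are conventional, not taken from the chapter itself):

```latex
% Exact factorization of the joint via the chain rule:
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
% A full table over n binary variables needs 2^n - 1 parameters.
% The chain rule alone does not reduce this: the i-th conditional
% still has 2^{i-1} free parameters, so the total is unchanged.
% A structural assumption does the work. For example, assuming each
% x_i depends only on x_{i-1} (a first-order Markov chain) cuts the
% count to 1 + 2(n-1) -- linear in n.
```

This is exactly the "each idea as a response to a limitation" pattern: the chain rule makes the factorization exact, and structural assumptions make it tractable.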
The aim is to follow this thread through modern AI: large language models, diffusion models, latent variable methods, state space models, world models, and whatever architectures come next.
The goal is that a reader finishing a chapter should feel like they could have invented the ideas themselves if they had started from the same questions. At the moment it’s just me. Collaboration is welcome.
Whenever we want to "get a task done," we are ultimately constructing a function that maps inputs to outputs. Neural networks are universal function approximators — but a naive, fully connected model is inefficient, data-hungry, and impractical under real-world constraints.
The field has progressed by imposing increasingly sophisticated internal structure on these networks: convolutions for spatial structure, recurrence and gating for temporal dependencies, attention for relational modeling, residual connections for optimization stability, diffusion and latent-variable frameworks for generative modeling. Each architectural advance reshapes the underlying function class, injecting inductive bias that makes learning more efficient and generalization more robust. Scaling laws showed that scale matters, but architectural structure determines how efficiently data and compute are used.
In this sense, the central work of modern AI is not merely scaling parameters — it is systematically improving the architecture of the function itself. That is what this project is trying to trace, one chapter at a time.
Every concept is introduced as the answer to a specific failure. Exponential parameters → chain rule. Chain rule alone isn't enough → structural assumptions. And so on.
Equations are written out, not waved at. But each one is preceded by the question it answers and followed by what it still can't do.
Nothing is assumed. Ideas are explained from scratch. Terms like “attention” and “residual stream” are introduced only after the underlying idea is clear.
NADE's incremental pre-activation sum is the seed of the RNN hidden state. The LSTM's additive cell state echoes NADE again. These threads are made explicit.
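The incremental pre-activation the text mentions can be shown in a few lines. This is a minimal NumPy sketch, not a trained model: the weight names (`W`, `V`, `b`, `c`) follow NADE's conventional notation, but the values are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, hidden = 5, 8
W = rng.normal(size=(hidden, n))   # shared input-to-hidden weights
V = rng.normal(size=(n, hidden))   # hidden-to-output weights
b = rng.normal(size=n)             # per-dimension output biases
c = rng.normal(size=hidden)       # shared hidden bias
x = rng.integers(0, 2, size=n).astype(float)  # one binary observation

a = c.copy()       # running pre-activation: the "seed" of an RNN hidden state
probs = np.empty(n)
for i in range(n):
    h = sigmoid(a)                 # hidden state for dimension i sees only x_<i
    probs[i] = sigmoid(b[i] + V[i] @ h)
    a += x[i] * W[:, i]            # incremental update: O(hidden) work per step

print(probs)
```

The state `a` is never recomputed from scratch; each step adds one term, which is precisely the additive carry-forward that the RNN hidden state and, later, the LSTM cell state generalize.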
Current Progress
ResNets and Transformers look different on the surface. The connection runs deeper than it appears.
Neither architecture overwrites its representations; both accumulate them. In ResNet, the residual function F_l learns a correction. In the Transformer, that correction is split into two structured parts: Δ_attn, which gathers information across positions, and Δ_ffn, which transforms it locally. The final representation is the original embedding plus every increment every layer ever wrote. Depth, in both cases, is accumulation — not replacement.
The Transformer does not overwrite representations; it refines them, layer by layer, through addition. The name is apt in a way the original paper never quite spells out.
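The accumulation claim can be checked mechanically. The update functions below are toy stand-ins, not real attention or MLP sublayers; the point is only the additive bookkeeping of the residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)          # a token's initial embedding

def attn_update(s):             # stand-in for an attention sublayer's output
    return 0.1 * np.tanh(s)

def ffn_update(s):              # stand-in for a position-wise MLP's output
    return 0.1 * np.tanh(s[::-1])

stream = x.copy()
increments = []
for layer in range(4):          # each sublayer only *adds* to the stream
    da = attn_update(stream)
    stream = stream + da
    increments.append(da)
    df = ffn_update(stream)
    stream = stream + df
    increments.append(df)

# The final representation is the embedding plus every increment ever written.
assert np.allclose(stream, x + sum(increments))
```

Nothing in the loop ever replaces `stream`; deleting any single increment from the sum changes the output, which is the sense in which depth is accumulation rather than overwriting.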
Right now, I am drafting this manuscript in Overleaf. I love the typographic rigor of LaTeX—it feels clean, serious, and easy to read. But a static PDF is fundamentally the wrong container for a field that moves this fast.
I want to build something that preserves that typographic elegance but evolves alongside my thoughts. A dynamic digital space where intuition is guided by interactive animations, where new research naturally triggers new conceptual branches, and where the text is a living artifact rather than a frozen export. If you are someone who is excited by building interactive technical publications, I would love to build this together.
I'm not looking for people to write sections for me. I'm looking for thinking partners — people who will read a draft and tell me where the intuition breaks, where I'm hand-waving, where the thread I'm following actually leads somewhere I haven't seen yet.
- Technical Reviewers — ML researchers, practitioners, or careful readers who can catch when an explanation is almost right but not quite, or where a mathematical derivation quietly skips a step.
- Visual Collaborators — someone who thinks visually about how to represent attention, gradient flow, or the residual stream in ways that are clear without being decorative.
- Platform Builders — someone who can build the pipeline for a dynamic, interactive, book-like experience, bridging LaTeX/Markdown with modern web frameworks.
Interested in collaborating?
If something in Connecting the Dots sparked an idea or raised a question, I'd love to hear from you. Feel free to reach out — whether it's feedback, discussion, or exploring an experiment together.