Yogesh · A Living Document · Open Collaboration

Connecting
the Dots.

An intuition-first introduction to modern AI — focused on understanding.

About the project

Hey, I'm Yogesh. This project is an attempt to understand how modern AI systems work — not just what they do, but why the ideas behind them are structured the way they are.

It grew out of months spent working through lectures, books, and research papers, often revisiting the same derivation several times before the intuition became clear. This document tries to record that process — the order in which the understanding actually arrived.

The chapter on autoregressive models begins with a simple question: how can we model a joint distribution over many variables without storing an exponential number of parameters? From there the discussion moves through the chain rule, Bayesian networks, FVSBN, NADE, and MADE, eventually reaching modern sequence models and the Transformer. Each idea appears as a response to a limitation in the previous one.
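The parameter-counting argument behind that opening question can be made concrete. The sketch below compares the full joint table over n binary variables against a chain-rule factorization with a first-order Markov assumption; the function names are mine, not from the chapter.

```python
def full_joint_params(n: int) -> int:
    """A full joint table over n binary variables: 2^n outcomes, 2^n - 1 free parameters."""
    return 2**n - 1

def markov_chain_params(n: int) -> int:
    """Chain rule + first-order Markov: p(x_1) needs 1 parameter,
    each p(x_i | x_{i-1}) needs 2 (one per value of x_{i-1})."""
    return 1 + 2 * (n - 1)

for n in (10, 30):
    print(n, full_joint_params(n), markov_chain_params(n))
```

At n = 30 the full table already needs over a billion parameters while the Markov factorization needs 59 — which is exactly why the chapter then asks what the Markov assumption throws away.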

The aim is to follow this thread through modern AI: large language models, diffusion models, latent variable methods, state space models, world models, and whatever architectures come next.

The goal is that a reader finishing a chapter should feel like they could have invented the ideas themselves if they had started from the same questions. At the moment it’s just me. Collaboration is welcome.

A Working Philosophy

Whenever we want to "get a task done," we are ultimately constructing a function that maps inputs to outputs. Neural networks are universal function approximators — but a naive, fully connected model is inefficient, data-hungry, and impractical under real-world constraints.

The field has progressed by imposing increasingly sophisticated internal structure on these networks: convolutions for spatial structure, recurrence and gating for temporal dependencies, attention for relational modeling, residual connections for optimization stability, diffusion and latent-variable frameworks for generative modeling. Each architectural advance reshapes the underlying function class, injecting inductive bias that makes learning more efficient and generalization more robust. Scaling laws made clear that raw data and compute matter, but architectural structure determines how efficiently they are used.

In this sense, the central work of modern AI is not merely scaling parameters — it is systematically improving the architecture of the function itself. That is what this project is trying to trace, one chapter at a time.

How the Dots Connect
01
Problem-first structure

Every concept is introduced as the answer to a specific failure. Exponential parameters → chain rule. Chain rule alone isn't enough → structural assumptions. And so on.

02
Honest mathematics

Equations are written out, not waved at. But each one is preceded by the question it answers and followed by what it still can't do.

03
No assumed intuition

Nothing is assumed. Ideas are explained from scratch. Terms like “attention” and “residual stream” are introduced only after the underlying idea is clear.

04
Cumulative narrative

NADE's incremental pre-activation sum is the seed of the RNN hidden state. The LSTM's additive cell state echoes NADE again. These threads are made explicit.
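One of those threads can be shown in a few lines. NADE's hidden pre-activation for position i is c + W[:, :i] @ x[:i]; recomputing it from scratch at every position is quadratic in the sequence length, but carrying a running sum makes each step a single additive update — the same shape of update an RNN hidden state performs. A minimal NumPy sketch, with hypothetical shapes and names:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 4                        # visible dims, hidden units (toy sizes)
W = rng.normal(size=(H, D))        # shared encoder weights
c = np.zeros(H)                    # hidden bias
x = rng.integers(0, 2, size=D)     # one binary input vector

# Naive: recompute the pre-activation for each position -> O(D^2 * H) total
naive = [c + W[:, :i] @ x[:i] for i in range(D)]

# NADE's trick: one running sum, one added column per step -> O(D * H) total
a = c.copy()
incremental = []
for i in range(D):
    incremental.append(a.copy())   # pre-activation at position i uses only x_{<i}
    a = a + W[:, i] * x[i]         # additive state update, the seed of the RNN idea

assert all(np.allclose(n, m) for n, m in zip(naive, incremental))
```

The final assert checks that the running sum reproduces the naive computation exactly; the only change is that state is carried forward instead of rebuilt.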

Current Progress

Updated March 2026
Ch 1 · AR Models
70%
Chapter 1 — Autoregressive Models · Experiments ↗
Fixed-Length Models
Modeling joint distributions
Chain rule & exponential cost
Bayesian networks
Markov assumption
Parametric conditionals
FVSBN
NADE & weight sharing
NADE extensions (discrete, RNADE)
Autoencoders & the cheating problem
MADE & autoregressive masking
Variable-Length Models
Words to vectors — one-hot failures
Distributional hypothesis
Word2Vec — Skipgram & CBOW, GloVe
Geometry of word vectors
From NADE accumulation → RNN hidden state
RNNs & backprop through time
Vanishing & exploding gradients
LSTM & GRU
Bidirectional RNNs · Stacked RNNs
Encoder–decoder seq2seq
Attention — additive, dot-product, Q/K/V
Self-attention & cross-attention
Multi-head attention
Transformer — full decoder walkthrough
Ch 2 · LLMs
0%
Ch 3 · Latent Var.
40%
Read current draft
Taxonomy — Generative Models
Generative Models
I. Explicit Likelihood-Based Models
maximize log-likelihood · pθ(x) explicitly defined
A. No Extra Variables (Direct Models)
A1. Fixed-Length Autoregressive Models
FVSBN · NADE · MADE · PixelRNN / PixelCNN · Image AR Transformers
A2. Variable-Length Autoregressive Models (Sequence Models)
RNN · LSTM / GRU · GPT
B. Extra Variables, Not Hidden (Invertible · Exact Likelihood)
Normalizing Flows
NICE · RealNVP · Glow · MAF · IAF · Parallel WaveNet · Flow++ · FFJORD
C. Extra Variables + Hidden (Latent-Variable · Approx. Likelihood)
C1. Amortized Inference (VAE Family)
VAE · β-VAE · CVAE · Hierarchical VAEs
C2. Non-Amortized Inference (Classical EM-Based)
GMM · HMM · Factor Analysis · Probabilistic PCA · Kalman Filters
II. Implicit Likelihood Models
divergence minimization · hidden latent variable
GAN Family
GAN · WGAN · f-GAN · StyleGAN · BigGAN
III. Score-Based Models
learn ∇ log p(x) · no tractable likelihood
A. Energy-Based Models
Classical EBMs · Contrastive Divergence · Persistent CD / NCE-based EBMs
B. Score Matching
Hyvärinen Score Matching · Denoising Score Matching · Sliced Score Matching
C. Diffusion / Score-SDE Models
DDPM · Improved DDPM · Score-SDE · Latent Diffusion (Stable Diffusion) · Consistency Models
Excerpt — Chapter 1 · The Residual Stream

ResNets and Transformers look different on the surface. The connection runs deeper than it appears.

// ResNet — each block adds a correction
//   x_{ℓ+1} = x_ℓ + F(x_ℓ)

// Transformer — depth accumulates refinements, not replacements
//   X^{(ℓ)} = X^{(ℓ−1)} + Δ_attn^{(ℓ)} + Δ_ffn^{(ℓ)}

// Unrolled to layer L
//   X^{(L)} = X^{(0)} + Σ_{ℓ=1}^{L} ( Δ_attn^{(ℓ)} + Δ_ffn^{(ℓ)} )

Neither architecture overwrites its representations — they accumulate them. In a ResNet, the residual function F_ℓ learns a correction. In the Transformer, that correction is split into two structured parts: Δ_attn, which gathers information across positions, and Δ_ffn, which transforms it locally. The final representation is the original embedding plus every increment every layer ever wrote. Depth, in both cases, is accumulation — not replacement.

The Transformer does not overwrite representations; it refines them, layer by layer, through addition. The name "residual stream" is apt in a way the original papers never quite spell out.
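The unrolled identity above can be checked mechanically. In the toy sketch below, one random linear map stands in for a layer's combined attention-plus-FFN increment (the per-layer split is collapsed for brevity, and all names and sizes are hypothetical); the point is only that the final state equals the embedding plus the sum of every increment.

```python
import numpy as np

d, L = 8, 6                            # model width and depth (toy sizes)

def delta(x: np.ndarray, layer: int) -> np.ndarray:
    """Stand-in for one layer's update Δ^{(ℓ)} — here just a small random linear map."""
    layer_rng = np.random.default_rng(layer)   # seeded per layer, so deterministic
    return layer_rng.normal(scale=0.1, size=(d, d)) @ x

x0 = np.random.default_rng(42).normal(size=d)  # the original embedding X^{(0)}

# Forward pass: each layer *adds* its increment to the stream
x = x0.copy()
increments = []
for layer in range(L):
    inc = delta(x, layer)
    increments.append(inc)
    x = x + inc

# Unrolled view: X^{(L)} = X^{(0)} + sum of all increments
assert np.allclose(x, x0 + sum(increments))
```

The assert holds by construction — which is exactly the point: addition is the only way information enters the stream, so the unrolled sum is not an approximation but an identity.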
The Medium: Beyond the PDF

Right now, I am drafting this manuscript in Overleaf. I love the typographic rigor of LaTeX — it feels clean, serious, and easy to read. But a static PDF is fundamentally the wrong container for a field that moves this fast.

I want to build something that preserves that typographic elegance but evolves alongside my thoughts. A dynamic digital space where intuition is guided by interactive animations, where new research naturally triggers new conceptual branches, and where the text is a living artifact rather than a frozen export. If you are someone who is excited by building interactive technical publications, I would love to build this together.

Who I'm looking for

I'm not looking for people to write sections for me. I'm looking for thinking partners — people who will read a draft and tell me where the intuition breaks, where I'm hand-waving, where the thread I'm following actually leads somewhere I haven't seen yet.

Interested in collaborating?

If something in Connecting the Dots sparked an idea or raised a question, I'd love to hear from you. Feel free to reach out — whether it's feedback, discussion, or exploring an experiment together.

Send Email