Yogesh · A Living Document · Open Collaboration

Connecting
the Dots.

An intuition-first introduction to modern AI — focused on understanding.

About the project

Hey, I'm Yogesh. This project is an attempt to understand how modern AI systems work — not just what they do, but why the ideas behind them are structured the way they are.

It grew out of months spent working through lectures, books, and research papers, often revisiting the same derivation several times before the intuition became clear. This document tries to record that process — the order in which the understanding actually arrived.

The chapter on autoregressive models begins with a simple question: how can we model a joint distribution over many variables without storing an exponential number of parameters? From there the discussion moves through the chain rule, Bayesian networks, FVSBN, NADE, and MADE, eventually reaching modern sequence models and the Transformer. Each idea appears as a response to a limitation in the previous one.
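The parameter-counting argument behind that opening question can be made concrete. The sketch below compares the full joint table over n binary variables against a chain-rule factorization with a first-order Markov assumption; the function names are mine, not from the chapter.

```python
def full_joint_params(n: int) -> int:
    """A full joint table over n binary variables: 2^n outcomes, 2^n - 1 free parameters."""
    return 2**n - 1

def markov_chain_params(n: int) -> int:
    """Chain rule + first-order Markov: p(x_1) needs 1 parameter,
    each p(x_i | x_{i-1}) needs 2 (one per value of x_{i-1})."""
    return 1 + 2 * (n - 1)

for n in (10, 30):
    print(n, full_joint_params(n), markov_chain_params(n))
```

At n = 30 the full table already needs over a billion parameters while the Markov factorization needs 59 — which is exactly why the chapter then asks what the Markov assumption throws away.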

The aim is to follow this thread through modern AI: large language models, diffusion models, latent variable methods, state space models, world models, and whatever architectures come next.

The goal is that a reader finishing a chapter should feel like they could have invented the ideas themselves if they had started from the same questions. At the moment it’s just me. Collaboration is welcome.

A Working Philosophy

Whenever we want to "get a task done," we are ultimately constructing a function that maps inputs to outputs. Neural networks are universal function approximators — but a naive, fully connected model is inefficient, data-hungry, and impractical under real-world constraints.

The field has progressed by imposing increasingly sophisticated internal structure on these networks: convolutions for spatial structure, recurrence and gating for temporal dependencies, attention for relational modeling, residual connections for optimization stability, diffusion and latent-variable frameworks for generative modeling. Each architectural advance reshapes the underlying function class, injecting inductive bias that makes learning more efficient and generalization more robust. Scaling laws made clear that raw data and compute matter, but architectural structure determines how efficiently they are used.

In this sense, the central work of modern AI is not merely scaling parameters — it is systematically improving the architecture of the function itself. That is what this project is trying to trace, one chapter at a time.

How the Dots Connect
01
Problem-first structure

Every concept is introduced as the answer to a specific failure. Exponential parameters → chain rule. Chain rule alone isn't enough → structural assumptions. And so on.

02
Honest mathematics

Equations are written out, not waved at. But each one is preceded by the question it answers and followed by what it still can't do.

03
No assumed intuition

Nothing is assumed. Ideas are explained from scratch. Terms like “attention” and “residual stream” are introduced only after the underlying idea is clear.

04
Cumulative narrative

NADE's incremental pre-activation sum is the seed of the RNN hidden state. The LSTM's additive cell state echoes NADE again. These threads are made explicit.
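One of those threads can be shown in a few lines. NADE's hidden pre-activation for position i is c + W[:, :i] @ x[:i]; recomputing it from scratch at every position is quadratic in the sequence length, but carrying a running sum makes each step a single additive update — the same shape of update an RNN hidden state performs. A minimal NumPy sketch, with hypothetical shapes and names:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 4                        # visible dims, hidden units (toy sizes)
W = rng.normal(size=(H, D))        # shared encoder weights
c = np.zeros(H)                    # hidden bias
x = rng.integers(0, 2, size=D)     # one binary input vector

# Naive: recompute the pre-activation for each position -> O(D^2 * H) total
naive = [c + W[:, :i] @ x[:i] for i in range(D)]

# NADE's trick: one running sum, one added column per step -> O(D * H) total
a = c.copy()
incremental = []
for i in range(D):
    incremental.append(a.copy())   # pre-activation at position i uses only x_{<i}
    a = a + W[:, i] * x[i]         # additive state update, the seed of the RNN idea

assert all(np.allclose(n, m) for n, m in zip(naive, incremental))
```

The final assert checks that the running sum reproduces the naive computation exactly; the only change is that state is carried forward instead of rebuilt.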

Current Progress

Updated March 2026
Ch 1 · AR Models
70%
Chapter 1 — Autoregressive Models · Experiments ↗
Fixed-Length Models
Modeling joint distributions
Chain rule & exponential cost
Bayesian networks
Markov assumption
Parametric conditionals
FVSBN
NADE & weight sharing
NADE extensions (discrete, RNADE)
Autoencoders & the cheating problem
MADE & autoregressive masking
Variable-Length Models
Words to vectors — one-hot failures
Distributional hypothesis
Word2Vec — Skipgram & CBOW, GloVe
Geometry of word vectors
From NADE accumulation → RNN hidden state
RNNs & backprop through time
Vanishing & exploding gradients
LSTM & GRU
Bidirectional RNNs · Stacked RNNs
Encoder–decoder seq2seq
Attention — additive, dot-product, Q/K/V
Self-attention & cross-attention
Multi-head attention
Transformer — full decoder walkthrough
Ch 2 · LLMs
0%
Ch 3 · Latent Var.
40%
Read current draft
Taxonomy — Generative Models
Generative Models
I. Explicit Likelihood-Based Models
maximize log-likelihood · pθ(x) explicitly defined
A. No Extra Variables (Direct Models)
A1. Fixed-Length Autoregressive Models
FVSBN · NADE · MADE · PixelRNN / PixelCNN · Image AR Transformers
A2. Variable-Length Autoregressive Models (Sequence Models)
RNN · LSTM / GRU · GPT
B. Extra Variables, Not Hidden (Invertible · Exact Likelihood)
Normalizing Flows
NICE · RealNVP · Glow · MAF · IAF · Parallel WaveNet · Flow++ · FFJORD
C. Extra Variables + Hidden (Latent-Variable · Approx. Likelihood)
C1. Amortized Inference (VAE Family)
VAE · β-VAE · CVAE · Hierarchical VAEs
C2. Non-Amortized Inference (Classical EM-Based)
GMM · HMM · Factor Analysis · Probabilistic PCA · Kalman Filters
II. Implicit Likelihood Models
divergence minimization · hidden latent variable
GAN Family
GAN · WGAN · f-GAN · StyleGAN · BigGAN
III. Score-Based Models
learn ∇ log p(x) · no tractable likelihood
A. Energy-Based Models
Classical EBMs · Contrastive Divergence · Persistent CD / NCE-based EBMs
B. Score Matching
Hyvärinen Score Matching · Denoising Score Matching · Sliced Score Matching
C. Diffusion / Score-SDE Models
DDPM · Improved DDPM · Score-SDE · Latent Diffusion (Stable Diffusion) · Consistency Models
Excerpt — Chapter 1 · The Residual Stream

ResNets and Transformers look different on the surface. The connection runs deeper than it appears.

// ResNet — each block adds a correction
//   x_{ℓ+1} = x_ℓ + F(x_ℓ)

// Transformer — depth accumulates refinements, not replacements
//   X^{(ℓ)} = X^{(ℓ−1)} + Δ_attn^{(ℓ)} + Δ_ffn^{(ℓ)}

// Unrolled to layer L
//   X^{(L)} = X^{(0)} + Σ_{ℓ=1}^{L} ( Δ_attn^{(ℓ)} + Δ_ffn^{(ℓ)} )

Neither architecture overwrites its representations — they accumulate them. In a ResNet, the residual function F_ℓ learns a correction. In the Transformer, that correction is split into two structured parts: Δ_attn, which gathers information across positions, and Δ_ffn, which transforms it locally. The final representation is the original embedding plus every increment every layer ever wrote. Depth, in both cases, is accumulation — not replacement.

The Transformer does not overwrite representations; it refines them, layer by layer, through addition. The name "residual stream" is apt in a way the original papers never quite spell out.
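The unrolled identity above can be checked mechanically. In the toy sketch below, one random linear map stands in for a layer's combined attention-plus-FFN increment (the per-layer split is collapsed for brevity, and all names and sizes are hypothetical); the point is only that the final state equals the embedding plus the sum of every increment.

```python
import numpy as np

d, L = 8, 6                            # model width and depth (toy sizes)

def delta(x: np.ndarray, layer: int) -> np.ndarray:
    """Stand-in for one layer's update Δ^{(ℓ)} — here just a small random linear map."""
    layer_rng = np.random.default_rng(layer)   # seeded per layer, so deterministic
    return layer_rng.normal(scale=0.1, size=(d, d)) @ x

x0 = np.random.default_rng(42).normal(size=d)  # the original embedding X^{(0)}

# Forward pass: each layer *adds* its increment to the stream
x = x0.copy()
increments = []
for layer in range(L):
    inc = delta(x, layer)
    increments.append(inc)
    x = x + inc

# Unrolled view: X^{(L)} = X^{(0)} + sum of all increments
assert np.allclose(x, x0 + sum(increments))
```

The assert holds by construction — which is exactly the point: addition is the only way information enters the stream, so the unrolled sum is not an approximation but an identity.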
The Medium: Beyond the PDF

Right now, I am drafting this manuscript in Overleaf. I love the typographic rigor of LaTeX — it feels clean, serious, and easy to read. But a static PDF is fundamentally the wrong container for a field that moves this fast.

I want to build something that preserves that typographic elegance but evolves alongside my thoughts. A dynamic digital space where intuition is guided by interactive animations, where new research naturally triggers new conceptual branches, and where the text is a living artifact rather than a frozen export. If you are someone who is excited by building interactive technical publications, I would love to build this together.

Who I'm looking for

I'm not looking for people to write sections for me. I'm looking for thinking partners — people who will read a draft and tell me where the intuition breaks, where I'm hand-waving, where the thread I'm following actually leads somewhere I haven't seen yet.

Interested in collaborating?

If something in Connecting the Dots sparked an idea or raised a question, I'd love to hear from you. Feel free to reach out — whether it's feedback, discussion, or exploring an experiment together.

Send Email