After a five-day conference, your notebook looks less like a record of learning and more like the diary of a madman. There are arrows pointing nowhere. There are heavily underlined acronyms you no longer recognize. There is a doodle of a remarkably stressed-looking cat.

P.S. Why do we always draw cats during deep learning talks?

The truth about conferences is that not all notes are created equal. The vast majority of what you write down is just atmosphere. It is the feeling of being in a room with smart people. But once the caffeine wears off and you are sitting at your desk, you have to separate the signal from the noise. This is not a recap. This is a thoughtful sorting exercise. These are the notes that survived.


The Elegance of Tandem Transformers

Speaker: Praneeth Netrapalli

We are all currently obsessed with making models exponentially bigger, but the real elegance lies in making them smarter about how they compute. Netrapalli opened with a deceptively clean framing: LLMs are conditional generative models. You give them a prompt, they generate from a conditional distribution. That is it. Everything else is engineering.

The architecture itself is four stacked components: an embedding layer that maps input tokens to vectors, transformer blocks with attention layers, transformer blocks with feed-forward layers, and a loss function that produces probabilistic output. The attention mechanism is where things get interesting. Given an input token, three parameter matrices (Query, Key, Value) work together so that each token attends to all the tokens before it and updates its representation accordingly. It is, at its core, a mechanism for letting every part of a sentence gossip with every other part.
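To convince myself I understood the Query, Key, Value description, I wrote a minimal sketch of single-head causal self-attention in plain Python. The toy inputs, the identity weight matrices, and the helper names (`matvec`, `causal_attention`) are my own assumptions for illustration, not anything shown in the talk.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(x, W):
    """Multiply a row vector x (length d_in) by a matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def causal_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention: token t attends to tokens 0..t."""
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    d_k = len(K[0])
    outputs, weights = [], []
    for t in range(len(X)):
        # Scaled dot-product scores against the current and all earlier tokens.
        scores = [
            sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        # The new representation is a weighted mix of the visible value vectors.
        out = [sum(w[s] * V[s][j] for s in range(t + 1)) for j in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical toy input: three tokens with 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned Wq, Wk, Wv
outputs, weights = causal_attention(X, IDENTITY, IDENTITY, IDENTITY)
```

The first token can only attend to itself, so its attention weight is 1 and its output is its own value vector. Later tokens spread their weight over everything that came before.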

But the note that survived from this talk was not the architecture. It was the economics. Training is a one-time cost. Inference is a continuous cost. And autoregressive generation, the way these models actually produce text, forces you into sequential matrix-vector multiplications. You cannot parallelise your way out of it. That is the real bottleneck.

Netrapalli introduced two approaches to this problem. The first, HiRE, attacks activation sparsity. The key insight is that latency is not compute-bound but memory-bound: the bottleneck is transferring weights from RAM to cache. So instead of transferring the entire weight matrix, you transfer only the parts needed for the top-k most relevant values. They apply this approximation to both the softmax layer and the feed-forward layers, and the results held up.

The second approach was Tandem Transformers, and this is where it got genuinely elegant. Decoding (generation) is far more expensive than encoding (understanding). So you take a large language model and a small language model, project the encodings of the LLM into the SLM, and let the smaller model handle the expensive autoregressive generation. In their experiments with PaLM2 (using the Gecko, Otter, and Bison size variants), tandem distillation made the pipeline both fast and, crucially, accurate enough to be usable.

Then comes speculative decoding: the SLM drafts a few tokens ahead, and the primary LLM acts as a verifier. If the SLM gets a token wrong, the LLM rejects it, substitutes its own token, throws away the rest of the draft, and drafting restarts from there. It is basically the architectural equivalent of having an intern draft your emails while you just approve them.

P.S. Delegation is a true art form.
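The draft-and-verify loop is easier to see in code. Below is a toy sketch of the greedy variant of speculative decoding. Both "models" are hypothetical stand-in functions of my own (`target_next`, `draft_next`) rather than real networks, and the sampling-based acceptance rule used in practice is left out.

```python
SENTENCE = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]

def target_next(prefix):
    """Hypothetical large model: always continues the fixed sentence."""
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else "<eos>"

def draft_next(prefix):
    """Hypothetical small model: agrees with the target except at position 3."""
    return "under" if len(prefix) == 3 else target_next(prefix)

def speculative_decode(k=3, max_len=10):
    """Greedy speculative decoding: draft k tokens, verify, keep the agreed prefix."""
    tokens, target_passes = [], 0
    while len(tokens) < max_len and (not tokens or tokens[-1] != "<eos>"):
        # 1. The small model drafts k tokens one after another (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The large model checks every drafted position. In a real system
        #    this is a single parallel forward pass, counted once here.
        target_passes += 1
        accepted = []
        for drafted in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # the verifier's token is what gets kept
            if drafted != expected or expected == "<eos>":
                break  # first mismatch: discard the rest of the draft
        tokens.extend(accepted)
    return tokens, target_passes

tokens, target_passes = speculative_decode()
```

In this toy run the output is identical to what the large model would have produced on its own, but it took 3 verification passes instead of 7 sequential steps. The saving shrinks as the intern gets more tokens wrong.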


The Limits of Local Reasoning

Speaker: Seema Nagar

When you represent data as a graph, the first question is deceptively simple: how do you learn from the structure? Nagar walked through this systematically, starting from the adjacency matrix (just zeros and ones for an unweighted graph) and building up.

Graph Convolutional Networks (GCNs) aggregate information from immediate neighbours using shared weights, with normalisation to prevent high-degree nodes from dominating. They work well when nearby structure is the main signal. Graph Attention Networks (GATs) take this further by learning which local neighbours actually matter using attention weights. Instead of treating all neighbours equally, important neighbours contribute more strongly to the node update. It is honestly a lot like knowing which of your friends to actually listen to when they give advice.
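Here is a minimal sketch of the neighbour-averaging step, assuming the simplest possible normalisation (a plain mean over a node and its neighbours) and leaving out the learned weight matrix and nonlinearity a real GCN layer would have. The star graph and feature values are made up for illustration.

```python
def gcn_mean_layer(adj, features):
    """One propagation step: each node averages itself and its neighbours."""
    n = len(adj)
    updated = []
    for i in range(n):
        # Neighbours from the adjacency matrix, plus a self-loop.
        group = [j for j in range(n) if adj[i][j] == 1] + [i]
        dim = len(features[i])
        updated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dim)]
        )
    return updated

# Hypothetical star graph: node 0 is a hub connected to nodes 1, 2 and 3.
adj = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
features = [[4.0], [0.0], [0.0], [8.0]]
updated = gcn_mean_layer(adj, features)
```

Dividing by the group size is the normalisation: the hub has three neighbours but its update is still an average, so it does not blow up relative to the leaves. A GAT would replace the uniform `1 / len(group)` weights with learned attention weights.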

Graph Transformers go further still: global attention so that distant nodes can interact directly, bypassing the local neighbourhood entirely.

But the limitation that stuck with me was oversmoothing. Deep GNNs make all nodes look similar. Information from distant nodes gets diluted. You are effectively limited to three to five layers before everything collapses into a mathematical echo chamber. Traditional GNNs simply do not have global receptive fields.
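Oversmoothing is easy to reproduce. The sketch below stacks the same mean-aggregation step many times on a five-node path graph and tracks how far apart the node values are. The graph, the starting values, and the absence of learned weights are all simplifying assumptions of mine.

```python
def mean_aggregate(adj_list, values):
    """Replace each node's value with the mean of itself and its neighbours."""
    return [
        (values[i] + sum(values[j] for j in adj_list[i])) / (len(adj_list[i]) + 1)
        for i in range(len(values))
    ]

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with clearly distinct node values.
adj_list = [[1], [0, 2], [1, 3], [2, 4], [3]]
values = [0.0, 1.0, 2.0, 3.0, 4.0]

# spreads[n] is (max - min) over the nodes after n stacked layers.
spreads = [max(values) - min(values)]
for _ in range(200):
    values = mean_aggregate(adj_list, values)
    spreads.append(max(values) - min(values))
```

Every layer is an average, so the gap between the largest and smallest node value can only shrink. After enough layers the nodes are numerically indistinguishable, which is the echo chamber in miniature.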

Graphormer addresses this with three clever additions: centrality encoding to capture each node's importance in the graph, spatial encoding using shortest-path distances added as a bias in attention, and edge encoding that handles multiple edge types by averaging edge features along shortest paths.
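The spatial encoding is the piece I found easiest to sketch: compute shortest-path distances, then add a per-distance bias to the attention scores. In the sketch below the bias values are hypothetical stand-ins for scalars that would be learned, and the raw scores are all zero so that only the bias is visible.

```python
from collections import deque

def shortest_path_lengths(adj_list):
    """All-pairs shortest-path distances via breadth-first search (-1 if unreachable)."""
    n = len(adj_list)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj_list[u]:
                if dist[src][v] == -1:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

# Hypothetical path graph 0 - 1 - 2 - 3.
adj_list = [[1], [0, 2], [1, 3], [2]]
dist = shortest_path_lengths(adj_list)

# Stand-ins for learned scalars, one per shortest-path distance (assumed values).
bias_per_distance = {0: 0.0, 1: -0.5, 2: -1.0, 3: -1.5}

n = len(adj_list)
raw_scores = [[0.0] * n for _ in range(n)]  # placeholder for the usual query-key scores
biased_scores = [
    [raw_scores[i][j] + bias_per_distance[dist[i][j]] for j in range(n)]
    for i in range(n)
]
```

Every node can now attend to every other node, but the attention score knows how far apart they are in the graph.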

The discussion on dynamic graphs was where things got genuinely forward-looking. Nodes and edges can change over time. When the topology is fixed and only signals change, you combine a static GNN with a sequence model: the spatial component (GCN) extracts structural features at each time step, while the temporal component (RNN, LSTM) processes the sequence to capture trends. When the topology itself evolves and new nodes or edges appear, you move to either discrete-time snapshots or continuous-time streams with memory-based GNNs that maintain state for each node.

The foundational models section was a glimpse of the field's ambitions. UniGraph, a cross-domain model for text-attributed graphs, combines graph structure with text semantics using a Text Encoder and Graph Encoder fused into a unified representation. MolFM goes further, combining molecular structure, biomedical text, and knowledge graphs, pretrained using structure-text contrastive loss, cross-modal matching, and masked language modelling, and then fine-tuned for specific drug discovery tasks with limited data.

You cannot just look at your immediate neighbours if you want to understand the whole network. That was the surviving note.


The Human Problem of AI Context

Speaker: Prateeti Mohapatra

Prateeti Mohapatra mapped the entire trajectory from basic prompt engineering to full multi-agent systems in a single talk, and the clarity of the progression was the most valuable part. The pipeline is: LLM, then prompt engineering, then RAG, then Agentic AI, then Multi-agent.

Prompt engineering is, by her own framing, a process of trial and error with no guarantee of consistent outputs. Zero-shot prompting, few-shot prompting with in-context learning. It works. But it is fundamentally unpredictable. RAG (Retrieval Augmented Generation) adds external knowledge retrieval to ground the model. Then you move into workflows: predictable execution with linear sequences and deterministic tasks (easy debugging), versus flexible decision making where the LLM decides next steps for open-ended tasks (complex debugging).

The ReAct framework (Reason and Act) and the Agent2Agent protocol got mentions, along with the distinction between MCP (used internally within an agent) and A2A (used for inter-agent communication). But the most fascinating takeaway was not any of these protocols. It was the terminology around context management.

Long context windows lead to context poisoning, context distraction, context confusion, and context clash. Four failure modes. One slide. When your breakage has its own taxonomy, you have moved past denial and into interior design. It turns out that giving an AI too much information causes exactly the same problems it causes in humans.

[Image: the confused math lady meme]

Knowing the Boundary of the Tool

Speaker: Joseph Ellaway

We tend to treat structural biology tools like AlphaFold as magic wands. Joseph Ellaway was brilliantly precise about what they actually do, and more importantly, what they cannot.

The pipeline itself is impressive: MSA plus structure databases feed into 48 Evoformer blocks of deep learning, producing atomic coordinates and a suite of confidence scores. pLDDT for local accuracy, rewarding locally correct structures. PAE (predicted aligned error) for pairwise confidence, giving you a value for every residue pair, where colour indicates the expected position error at one residue if the predicted and true structures were aligned on the other. ipTM for confidence in the interfaces of a predicted complex. pDockQ for docking quality. Clash scores. The confidence infrastructure is almost as sophisticated as the prediction itself.

The 3D-Beacons Network is an open collaboration for programmatic access to both experimentally determined and predicted structure models. OpenFold is the trainable PyTorch reproduction of AlphaFold 2, putting the architecture into the hands of researchers who want to actually modify it rather than just use it.

But the surviving note from this talk was the three clean limitations. AlphaFold can only accept the 20 standard amino acids. It cannot predict conformational variability. And it fails to predict the effects of point mutations. Three sentences that define the exact boundary of the tool. Knowing that boundary is precisely what makes it useful.

P.S. You cannot be disappointed by a model if you perfectly understand its limitations.


The Anxiety of the Unknown

Speaker: Arnab Mukherjee

This was, without exaggeration, one of the most enjoyable talks of the entire conference, and Prof. Mukherjee's background is not even in reinforcement learning. His infectious enthusiasm and incredibly creative outlook turned what could have been a textbook lecture into something genuinely brilliant.

Reinforcement Learning is fundamentally about sequential decision making under uncertainty. Data comes as state-action pairs. The goal is to maximise future reward over many steps. You take an action, you alter the state of the world, and you might not see the reward until much later. The Bellman equation is just a very rigorous way of dealing with the anxiety of the unknown.

The Markov Decision Process framework makes this concrete. Given any state, the agent takes an action, transitions to a new state, and receives a reward. The Markov property says the future depends only on the current state and action, not on the entire history. A trajectory is a sequence of states, actions, and rewards stretching into the future. Tasks can be episodic (with a clear end) or non-episodic (continuing indefinitely).

The fundamental challenges are clean and sobering. Sequential feedback: a decision now can impact outcomes much later. Evaluative feedback: you do not know if something works unless you try it. Sampled feedback: you cannot observe everything. This leads directly to the exploration versus exploitation trade-off, handled practically with epsilon-greedy strategies.
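Epsilon-greedy fits in a few lines. This is a minimal sketch with made-up action values; the function name and the seeded random generator are my own choices.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any action, uniformly
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical value estimates for three actions

greedy_choice = epsilon_greedy(q_values, 0.0, rng)  # epsilon = 0 never explores

counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_values, 0.1, rng)] += 1
```

With epsilon at 0.1 the agent picks its current favourite most of the time but still occasionally tries the others, which is the only way to find out its estimates were wrong.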

The taxonomy splits into model-free and model-based approaches. Policy-based RL learns a function that directly tells the agent what action to take. Deterministic policies always return the same action for a given state. Stochastic policies output probability distributions. Value-based RL learns a value function that maps states to expected discounted returns, dependent on the chosen policy. Q-Learning learns the value of taking a particular action in a given state.
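For the Q-Learning part, the standard tabular update rule is short enough to write out. The sketch below assumes a tiny three-state table and my own choice of learning rate (`alpha`) and discount (`gamma`); none of these numbers come from the talk.

```python
def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.5, gamma=0.9, done=False):
    """Tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done else max(Q[next_state])  # no future value after the end
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]

# Hypothetical table: 3 states, 2 actions, all estimates start at zero.
Q = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

# Reaching the goal from state 1 pays a reward of 1 and ends the episode.
first = q_learning_update(Q, state=1, action=0, reward=1.0, next_state=2, done=True)

# A later step from state 0 into state 1 earns nothing immediately,
# but inherits part of state 1's value through the max term.
second = q_learning_update(Q, state=0, action=1, reward=0.0, next_state=1)
```

The second update is the interesting one: the reward is zero, yet the estimate still rises because the value of the next state leaks backwards. That is delayed reward being handled one step at a time.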

The frozen lake MDP example was where everything clicked. The pseudocode: start with all state values at zero, iterate through each state, calculate the expected return for every action, update the state's value with the best one, and repeat until nothing changes. Simple enough to fit on a slide. Deep enough to actually feel the uncertainty. When you do not know the model at all, you fall back to Monte Carlo methods or temporal difference learning.
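I tried turning that slide into running code afterwards. What follows is a sketch of value iteration on a 3x3 lake of my own invention, assuming deterministic moves (no slipping), a reward of 1 for reaching the goal, and a discount of 0.9. The real frozen lake environment is slippery, so treat this as the simplest possible version.

```python
GRID = ["SFF",
        "FHF",
        "FFG"]  # S start, F frozen, H hole, G goal (hypothetical layout)
N_ROWS, N_COLS = 3, 3
ACTIONS = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}
GAMMA = 0.9

def step(state, action):
    """Deterministic transition: returns (next_state, reward). Walls block movement."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    reward = 1.0 if GRID[new_row][new_col] == "G" else 0.0
    return new_row * N_COLS + new_col, reward

def is_terminal(state):
    """Holes and the goal end the episode."""
    row, col = divmod(state, N_COLS)
    return GRID[row][col] in "HG"

def value_iteration(tol=1e-9):
    values = [0.0] * (N_ROWS * N_COLS)  # start with all zeros
    while True:
        delta = 0.0
        for state in range(len(values)):  # iterate through each state
            if is_terminal(state):
                continue
            # Expected return of every action, keep the best one.
            best = max(
                reward + GAMMA * values[next_state]
                for next_state, reward in (step(state, a) for a in ACTIONS)
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best  # update
        if delta < tol:
            return values

values = value_iteration()
```

The squares next to the goal end up worth 1, and every extra step away multiplies that by the discount, so the start square is worth 0.9 cubed. The hole and the goal stay at zero because nothing happens after you reach them.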

Deep Q-Learning generalises across similar states, where one update improves many states at once. Policy gradient approaches optimise the expected return directly.

For a hot second, I genuinely believed I could navigate a frozen lake using a stochastic policy. I cannot. But the fact that I felt like I could is a testament to how well this was taught.


The Quiet IBM Note

Speaker: Amith Singhee

A concise but grounding talk on AI at IBM, touching on foundation models trained with self-supervision that can do multiple things well, and CogMol for controlled generation of molecules for drug design. Sometimes the most valuable conference notes are the ones that connect theory to industry, reminding you that all of this has to survive contact with real problems.

P.S. For a guy who has been in the tech field for a decade, it was quite impressive seeing him answer every question, even the ones that were outside of his domain expertise. His homework must be solid. Or he is just that smart. Either way, solid.


Hallucinating Biochemistry

Speaker: Alisa Khramushin

This is where the conference shifted from understanding biology to designing it. Alisa Khramushin walked through the full arc of computational protein design, starting from classical sequence alignment, PSSMs, and HMMs, and arriving at something that genuinely feels like science fiction.

Inverse protein folding is the core problem: given a desired structure, design a sequence that folds into it. ProteinMPNN handles this by letting you control sequence diversity through temperature. Increasing temperature increases diversity. The probability is a simple softmax with temperature scaling. But the critical insight was that sequence recovery does not correlate with in-silico refolding. Optimising for one does not guarantee the other. That is a subtlety you can easily miss.
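The temperature knob is just a division inside the softmax. Here is a small sketch with made-up logits for four amino acids at a single position; the numbers and the entropy helper are mine, purely to show the effect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: higher means a more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for four residues (A, G, L, W) at one position.
logits = [2.0, 1.0, 0.5, -1.0]

cold = softmax_with_temperature(logits, 0.1)  # low T: nearly always the top residue
hot = softmax_with_temperature(logits, 2.0)   # high T: probability spread out
```

At low temperature almost all the probability sits on the highest-scoring residue, so sampled sequences barely vary. At high temperature the distribution flattens and the designs diversify.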

Top7 got a mention as the first fully de novo protein, a milestone worth pausing on. We actually designed a protein from scratch. The current pipeline for function-conditioned design is: conditioning, then backbone generation, then sequence design, then structure prediction. You can encode partial information when you have a desired functional motif, binder, or active site.

De novo binder design was where the tools started to feel powerful. MaSIF uses surface-based search: extract fingerprints, build a patch database, find complementary patches, seed candidates, and graft with Rosetta to produce binders. The diffusion model approach (backbone generation, sequence optimisation, structure prediction) was cleaner. RFdiffusion handles conformational landscape optimisation. Hallucination methods go even further: no target backbone, just optimise a sequence until the model is confident. BindCraft does exactly that, though creating surface structures remains difficult.

The metamorphic proteins section was straight out of a sci-fi novel. These are multi-state proteins with two distinct structures, marginal thermostability that allows spontaneous unfolding, and structural dissimilarity between the populated states. Modelling transition feasibility uses a transforming potential (RMSD from the final state) and an unfolding potential that defines a differentiable function to compute the number of contacts within a protein and then minimises them. We are not just predicting structures anymore. We are hallucinating biochemistry.


Finding the Right Projection

Speaker: Jagannath Mondal

Prof. Mondal's talks (he appeared twice across the day) were about the fundamental problem underneath all computational structural biology: how do you represent complex, high-dimensional molecular dynamics data in a way that is actually useful?

The MD pipeline is deceptively simple: take an initial configuration, define an interaction potential, compute forces, propagate the system over time using Newton's laws, and compute observables. The hard part is finding the projection that best represents ensemble data.
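The five steps of that pipeline map almost line for line onto code. Below is a toy sketch for a single particle on a spring, assuming a harmonic potential, unit mass, and velocity Verlet as the integrator. Real MD engines use far richer force fields, and the integrator is my choice, not something named in the talk.

```python
def force(x, k=1.0):
    """Force from the interaction potential U(x) = 0.5 * k * x^2."""
    return -k * x

def total_energy(x, v, mass=1.0, k=1.0):
    """Observable: kinetic plus potential energy."""
    return 0.5 * mass * v * v + 0.5 * k * x * x

def velocity_verlet(x, v, dt, n_steps, mass=1.0):
    """Propagate Newton's equations of motion; returns the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity update
        x = x + dt * v_half               # full-step position update
        f = force(x)                      # recompute forces at the new position
        v = v_half + 0.5 * dt * f / mass  # finish the velocity update
        trajectory.append((x, v))
    return trajectory

# Initial configuration: displaced by 1.0, at rest.
trajectory = velocity_verlet(x=1.0, v=0.0, dt=0.01, n_steps=1000)
energies = [total_energy(x, v) for x, v in trajectory]
final_x = trajectory[-1][0]
```

The particle should oscillate as a cosine while the total energy stays very close to its starting value of 0.5, which is the usual sanity check that the time step is small enough.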

Dimensionality reduction splits into linear and nonlinear methods. PCA finds linear combinations of input coordinates that maximise variance. Time-lagged PCA is the more interesting variant: the slowest degrees of freedom are those whose autocorrelation functions decay the slowest. On the nonlinear side, you have UMAP, autoencoders, and VAEs.

The progression from autoencoders to VAEs is clean. An autoencoder learns a compressed representation. A VAE adds structure to the latent space: Z equals mu plus sigma times epsilon, where epsilon is drawn from a standard normal distribution. That single modification transforms a compression tool into a generative model.
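That one-line formula is usually called the reparameterisation trick, and it is short enough to sketch directly. The `mu` and `sigma` values below are invented; in a real VAE they would come out of the encoder network.

```python
import random

def reparameterise(mu, sigma, rng):
    """Sample z = mu + sigma * epsilon, with epsilon ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(42)

# Hypothetical encoder outputs for a 2-dimensional latent space.
mu = [1.0, -2.0]
sigma = [0.5, 0.1]

samples = [reparameterise(mu, sigma, rng) for _ in range(5000)]
mean_dim0 = sum(z[0] for z in samples) / len(samples)

# With sigma set to zero the sample collapses to the mean, i.e. a plain autoencoder code.
collapsed = reparameterise(mu, [0.0, 0.0], rng)
```

All the randomness lives in epsilon, so gradients can flow through mu and sigma during training. Setting sigma to zero recovers the deterministic autoencoder, which is a nice way to see how small the modification really is.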

His later talk on simulations with AI connected this to diffusion models (noising forward process, denoising reverse process) and causal attention transformers, tying the representation learning to actual generation.


The Grammar of Nature

Speaker: Shruthi Viswanath

The closing talk tied the entire conference together in a way that felt almost inevitable. We have spent years treating protein sequences as standard biological data. Treating them as language changes everything.

Dr. Viswanath started from statistical language models that explicitly compute the probability of the next word. The problems are immediate: combinatorial explosion and data sparsity. The Markov assumption simplifies things into N-gram models, but you lose context. That is the fundamental trade-off that transformers were built to solve.
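A bigram model makes both the idea and its problems concrete. This is a minimal sketch on a toy corpus of my own; the function names are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_prob(counts, prev, nxt):
    """P(next | prev) under the Markov assumption: only the previous word matters."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)

p_cat = next_word_prob(model, "the", "cat")  # seen twice out of three times after "the"
p_dog = next_word_prob(model, "the", "dog")  # never seen, so probability zero
```

The zero for "dog" is data sparsity in action: anything absent from the training corpus is declared impossible. And because the model only looks one word back, it has no idea what the sentence was about two words ago.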

The mapping from natural language to proteins produces three families of protein language models: encoder-only, decoder-only, and encoder-decoder. The training objectives mirror NLP directly. Autoregressive modelling (given a sequence, predict the next token) is used by ProtGPT2 and ProGen. Masked language modelling (randomly mask some amino acid residues and predict them) is used by ESM. Span corruption (mask a stretch of amino acids) is used by ProtT5.

ESM2 is the current powerhouse. Transformer architecture, trained on UniRef50 data, using Rotary Position Embedding (RoPE) and pre-norm instead of post-norm. These are the same architectural decisions that matter in natural language, applied to biology.

The question of what pLM embeddings actually encode is genuinely fascinating. Embedding norms quantify the magnitude of vectors in N-dimensional space, and fine-tuning these models for specific tasks means we are learning to read the grammar of amino acids. Nature has a vocabulary. These models are learning to parse it.


Conclusion

If you look closely at the surviving notes, a pattern emerges. Whether it is making language models more efficient through tandem architectures, teaching agents to navigate the minefield of context management, mapping the hidden structure of graphs, designing entirely new proteins through hallucination, or parsing the vocabulary of amino acids, we are all just trying to find the underlying grammar of complex systems.

The rest is just atmosphere.