Five Months in Munich: Revisiting 1991, Without Erasing the Decades That Made It Scale

Jun 26, 2026

A new joint essay from Jürgen Schmidhuber and Sakana AI’s David Ha argues that the modern AI stack was effectively sketched in a single Bavarian spring — but the honest reading also has to honour the labs that turned those sketches into systems that actually run.

On 18 June 2026, Jürgen Schmidhuber (KAUST, IDSIA) published a long retrospective on his IDSIA homepage titled “Munich 1991: the Roots of the Current AI Boom,” with a preface by David Ha (Sakana AI, formerly Google Brain). The piece is unusual in form — a personal historical timeline rather than a peer-reviewed paper — and unusual in ambition. Ha’s framing line is the one that will travel: “virtually every core building block of these modern systems was published in a span of just a few months back in 1991.” Schmidhuber, for his part, writes that he is “proud of the work my team did in 1991 in my home city when compute was millions of times more expensive than today.” The essay is licensed CC BY-NC-SA 4.0 and is anchored by more than seventy reference codes (FWP1, UN1, VAN1, LSTM1, HW1, HW2, TR1, GAN14, DS1, and others) that map 1991 technical reports to today’s headline architectures.

For readers who follow Schmidhuber’s running commentary on AI history, the structure is familiar: a precise calendar of contributions from his group at the Technical University of Munich (TUM), followed by an argument that subsequent re-publications did not adequately cite the originals. What is genuinely new this time is the framing partner. Ha is not a historian relitigating credit; he is a working researcher who built “World Models” (2018) and now leads recursive self-improvement (RSI) work at Sakana AI. His endorsement, “Jürgen’s contributions have deeply shaped my own thinking over the years,” ties the 1991 timeline to present-day research at Sakana AI and suggests that Schmidhuber’s work has shaped the field and will keep doing so.

A five-month calendar that reads like a deep-learning syllabus

The dates Schmidhuber pins down are specific. On 26 March 1991, Technical Report FKI-147-91 introduced Fast Weight Programmers, in which a slow network learns to compute the weight changes of a fast network. The report included an outer-product variant that later work identified as mathematically equivalent to what the 2020s literature came to call linearised self-attention or linear Transformers. This is not merely a retrospective analogy: Schlag, Irie, and Schmidhuber’s ICML 2021 paper, “Linear Transformers Are Secretly Fast Weight Programmers,” gives the formal connection between linearised self-attention and early-1990s fast weight controllers. On 30 April 1991, FKI-148-91 introduced two ideas in one document. The first was unsupervised pre-training for deep RNNs, including a hierarchy in which each RNN tries to predict its next input and passes only unexpected inputs upward. The second was neural network distillation, described as compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills. On 15 June 1991, Sepp Hochreiter, then Schmidhuber’s diploma student at TUM, submitted the thesis that analysed the vanishing-gradient problem and proposed residual connections with weight 1.0 to keep gradients alive. And on 31 August 1991, the first peer-reviewed paper on a GAN-style adversarial system appeared in an MIT Press / Bradford Books volume, following a precursor technical report, FKI-126-90, dated February 1990 and revised in November 1990.
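The vanishing-gradient diagnosis can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not code from the thesis: it assumes a single scalar recurrent unit, h_t = tanh(w * h_{t-1}), and compares it with a linear self-loop of weight 1.0. The function names, the weight 0.9, and the step counts are illustrative assumptions.

```python
import math


def backprop_factor_tanh(w, h0, steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


def backprop_factor_linear(steps, w=1.0):
    """Return d h_T / d h_0 for the linear self-loop h_t = w * h_{t-1}."""
    return w ** steps


# Assumed toy settings: a squashing unit over 100 steps,
# and linear self-loops over 1200 steps.
vanishing = backprop_factor_tanh(w=0.9, h0=0.5, steps=100)
preserved = backprop_factor_linear(steps=1200)           # weight exactly 1.0
shrinking = backprop_factor_linear(steps=1200, w=0.99)   # weight slightly below 1.0

print(f"tanh unit, 100 steps:      {vanishing:.3e}")
print(f"self-loop w=1.0, 1200:     {preserved:.3e}")
print(f"self-loop w=0.99, 1200:    {shrinking:.3e}")
```

Under these assumptions every tanh step multiplies the backpropagated error by at most 0.9, so after 100 steps the factor is bounded by 0.9^100, on the order of 10^-5. The weight-1.0 loop passes the error back unchanged over 1200 steps, and even a weight of 0.99 is enough to make it collapse.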

Schmidhuber’s bookkeeping points map cleanly onto today’s reference architectures: FWP1 to Vaswani et al.’s 2017 Transformer (TR1); UN1 to “the P in ChatGPT” and to the distillation pipeline that DeepSeek-R1 (DS1, 2025) leaned on; VAN1 to LSTM1 (Hochreiter and Schmidhuber, Neural Computation, 1997) and onward to Highway Networks (HW1, May 2015, with Rupesh Kumar Srivastava and Klaus Greff) and ResNet (HW2, He et al., December 2015); and the August 1991 paper to GAN14 (Goodfellow et al., 2014). Schmidhuber also notes that as of January 2026, “the two most frequently cited papers of all time (with the most citations within 3 years — manuals excluded) are directly based on the work of 1991.”

The priority claim, stated plainly

Read as a priority argument, the essay is straightforward. The “first kind of Transformer,” in Schmidhuber’s telling, predates Vaswani et al. by 26 years. Unsupervised pre-training and distillation predate their canonical citations by decades. Residual learning predates ResNet by 24 years and was operational inside LSTM long before the feed-forward generalisation arrived. The adversarial generator-versus-predictor minimax game predates Goodfellow et al. by 23 years. By 1993, Schmidhuber writes, the lineage was already “solving problems of depth > 1000” — “1200 time steps or virtual layers.” He also reminds readers that Munich at the time was simultaneously the cradle of Ernst Dickmanns’s self-driving cars, which were already operating in highway traffic at 175–180 km/h, making the city “the epicenter of AI” in the early 1990s.

None of these specific claims are new in Schmidhuber’s catalogue; what is new is the consolidation, the calendar, and the framing through Ha.

What the priority claim leaves out — and what the field owes the engineers

Here the AI Morning Newsletter wants to be careful, because a one-sided reading of this essay does a disservice to the field. Having an idea in a 1991 technical report, even a beautifully prescient one, is not the same thing as making that idea work at the scale and reliability the world now depends on. At the same time, it would be unfair to read the early reports as if they had access to today’s hardware regime: in 1991, compute was vastly more expensive, experiments were slower, and many demonstrations that now look natural were practically out of reach. The honest story is therefore not “idea versus implementation,” but a longer scientific arc in which early architectural principles, later scaling work, and the collapse in computation cost all mattered.

Consider the Transformer. The Fast Weight Programmer line was not just loosely reminiscent of later linear attention: the outer-product fast-weight update and linearised self-attention are formally the same computational pattern under the correspondence made explicit in the ICML 2021 paper by Schlag, Irie, and Schmidhuber. It is therefore fair to say that the unnormalised linear-attention mechanism of linear Transformers was already present in the early-1990s Fast Weight Programmer work; the later literature changed the framing, terminology, and surrounding architecture more than the core computation. At the same time, the 2017 Transformer paper from Vaswani and colleagues did something different and also important: it identified a particular combination of scaled dot-product attention, multi-head decomposition, positional information, normalisation, training recipe, and hardware-friendly tensor shapes that turned attention into a general-purpose workhorse on real corpora. The terminology also has a layered history. The specific phrases “linear attention” and “linear Transformer” became popular only much later, but Schmidhuber’s 1993 recurrent fast-weight extension already used the language of learned internal spotlights of attention. Even the positional-information story is not cleanly confined to 2017, since Schmidhuber points to the 1991 chunker work as an earlier, differently situated precedent. The fairest reading is therefore neither “Transformers were simply invented in 1991” nor “1991 was merely a vague precursor,” but that the mathematical genealogy is deeper than the standard citation trail suggests.
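The correspondence described above can be checked on a toy example. The sketch below is illustrative only and is not taken from either paper: it assumes causal, unnormalised attention with no softmax and no feature map, and all function names and numbers are hypothetical. One function accumulates a fast-weight matrix from outer products; the other computes attention as a weighted sum over past values.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def matvec(matrix, x):
    """Matrix-vector product."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fast_weight_outputs(keys, values, queries):
    """Fast-weight view: W_t = W_{t-1} + v_t k_t^T, then y_t = W_t q_t."""
    d_v, d_k = len(values[0]), len(keys[0])
    fast_weights = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        update = outer(v, k)
        fast_weights = [
            [w + u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(fast_weights, update)
        ]
        outputs.append(matvec(fast_weights, q))
    return outputs


def linear_attention_outputs(keys, values, queries):
    """Attention view: y_t = sum over i <= t of (k_i . q_t) * v_i."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = dot(k, q)
            y = [yj + score * vj for yj, vj in zip(y, v)]
        outputs.append(y)
    return outputs


# Hypothetical toy sequence: three steps, 2-d keys/queries, 3-d values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0], [2.0, 0.0, -1.0]]
queries = [[1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

fwp = fast_weight_outputs(keys, values, queries)
attn = linear_attention_outputs(keys, values, queries)
max_gap = max(abs(a - b) for ya, yb in zip(fwp, attn) for a, b in zip(ya, yb))
print("largest difference between the two views:", max_gap)
```

Both functions compute the same quantity because (sum of v_i k_i^T) q equals the sum of v_i (k_i . q). The practical difference is bookkeeping: the fast-weight version carries a fixed-size matrix forward, while the attention version revisits every past key and value at each step.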

The same applies to residual learning. Hochreiter’s 1991 thesis correctly diagnosed the vanishing-gradient problem and proposed identity-weight self-loops as a fix, a foundational insight that Schmidhuber rightly highlights. Highway Networks by Srivastava, Greff, and Schmidhuber appeared in May 2015, roughly half a year before the December 2015 ResNet preprint, and showed empirically that very deep feed-forward networks could be trained. ResNets then popularised a simplified special case of this idea in computer vision: Highway Networks use gates, but when these gates are initialised at or driven toward 1.0, the result is precisely the plain residual connection with weight 1.0. The relationship should therefore not be flattened into a story in which Highway Networks were merely “concurrent” background and ResNets supplied the decisive idea. The more accurate reading is that residual learning passed through several stages: Hochreiter’s analysis, LSTM-style gradient-preserving paths, Highway Networks, and then the ResNet popularisation that became extraordinarily influential in computer vision.
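The special-case relationship between gated and plain residual layers can be illustrated numerically. This is a minimal sketch under stated assumptions, not the published implementation: it uses the two-gate form y = H(x) * T + x * C, replaces the learned, input-dependent gates with fixed biases, and uses a made-up transform H. All names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def transform(x):
    """Stand-in for the layer's nonlinear transform H(x); purely illustrative."""
    return [math.tanh(0.5 * xi + 0.1) for xi in x]


def highway_layer(x, transform_gate_bias, carry_gate_bias):
    """Gated layer y = H(x) * T + x * C.

    Assumption: T and C depend only on a bias here, whereas real
    highway gates are learned functions of the input.
    """
    t = sigmoid(transform_gate_bias)
    c = sigmoid(carry_gate_bias)
    return [h * t + xi * c for h, xi in zip(transform(x), x)]


def residual_layer(x):
    """Plain residual connection with weight 1.0: y = H(x) + x."""
    return [h + xi for h, xi in zip(transform(x), x)]


x = [0.3, -1.2, 2.0]

# Both gates saturated open (sigmoid of a large bias is 1.0 in float64).
open_gates = highway_layer(x, transform_gate_bias=40.0, carry_gate_bias=40.0)
residual = residual_layer(x)
gap_to_residual = max(abs(a - b) for a, b in zip(open_gates, residual))

# Transform gate shut, carry gate open: the layer simply copies its input.
carried = highway_layer(x, transform_gate_bias=-40.0, carry_gate_bias=40.0)
gap_to_identity = max(abs(a - b) for a, b in zip(carried, x))

print("open gates vs residual:", gap_to_residual)
print("carry only vs identity:", gap_to_identity)
```

With both gates held open, the gated layer and the residual layer agree to within floating-point precision. With the transform gate shut, the layer reduces to the identity path, which is the gradient-preserving route the paragraph above traces back to the 1991 analysis.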

The same is true for GANs and distillation. Goodfellow et al.’s 2014 paper did not invent the broad adversarial generator-versus-predictor idea, but it specified a formulation and demonstration regime that made the principle reproducible and widely adoptable by researchers with GPUs. Likewise, the distillation lineage that runs through to systems such as DeepSeek-R1 in 2025 should not be described as something no 1991 report could have anticipated in principle. A fairer formulation is that the 1991 chunker–automatizer work belongs in the ancestry of distillation, while later systems added many further ingredients: scaling, data pipelines, optimisation practice, reinforcement-learning variants, and engineering tricks from the intervening decades, including work Schmidhuber’s group continued to publish after 1991.

This is not a defence of under-citation. In science, plagiarism and systematic failure to credit prior work are not minor etiquette problems; they distort the record of discovery. Schmidhuber’s own formulation of the credit problem is useful here: the inventor of an important method should receive credit for inventing it, while the populariser should receive credit for popularising it, but not for inventing it. That distinction also protects the engineers and experimental scientists who made these ideas work at scale. The people who normalised, debugged, benchmarked, simplified, parallelised, and deployed these architectures (at Google, OpenAI, Anthropic, Meta, DeepSeek, and in academic groups around the world) are not mere technicians implementing old blueprints. But their contribution is best described as scaling, stabilising, popularising, and extending methods whose deeper ancestry should be cited honestly.

The forward-looking argument: LLMs are not enough

The essay’s most interesting passages are not the priority claims; they are the forward-looking ones. Schmidhuber writes: “In 1991, however, it was already totally obvious that LLM-like NNs alone are not enough to achieve Artificial General Intelligence (AGI). No AGI without mastery of the real world!” He points to the 1990 world-model work (FKI-126-90), to artificial curiosity (maximising prediction errors and learning progress), to meta-learning (his own 1987 diploma thesis on “learning how to learn”), and to recursive self-improvement as the path forward. Ha’s preface signals that Sakana AI is actively pushing on RSI, which gives the historical argument a present-tense edge. There is also a sharper geopolitical note. Schmidhuber observes that in 1995, the combined GDP of Germany and Japan was roughly 1:1 with that of the USA and China; “only 3 decades later, this ratio is now down to 1:5!” He suggests “self-replicating AI-driven all-purpose robots may be the answer” for German and Japanese economic recovery — a characteristically provocative line from a researcher who has long argued that physical-world AI, not chat, is the real prize.

How to read this essay well

For practitioners, the most useful posture is to take the 1991 calendar seriously as intellectual history and as a reading list, while also taking seriously the experimental work that made the calendar matter. The fact that Fast Weight Programmers, deep residual learning, unsupervised pre-training, distillation, and adversarial training were all on the table in Munich in a single five-month window is genuinely remarkable, and the citation record should reflect it. The fact that it took twenty-five years, several hardware generations, and the patient empirical work of thousands of researchers to make those ideas robust is equally remarkable, and the citation record should reflect that too.

Schmidhuber’s closing emphasis — that adaptive recurrent world models “suggest a simple explanation of consciousness & self-awareness,” and that the path to AGI runs through curiosity, meta-learning, and world models rather than larger language models alone — is the part of the essay most worth arguing with on technical merits. It is also the part that connects 1991 Munich most directly to 2026 frontier work.

Readers who want to engage with the original timeline can find it at people.idsia.ch. The AI Morning Newsletter would encourage anyone who cites a Transformer, a ResNet, a GAN, or a distilled student model this week to glance at the 1991 references — and to thank the engineers who made them scale.

If you liked this blog post, I recommend having a look at our free deep learning resources or my YouTube Channel.

Text and images of this article are licensed under Creative Commons License 4.0 Attribution. Feel free to reuse and share any part of this work. AI was used to support the creation of this article.

Andreas' AI Morning Read
