I’ve been passionate about machine learning for as long as I can remember. I built my first robot more than 20 years ago out of Lego Mindstorms bricks. It’s been a wild ride since then, and the field keeps evolving. Running a business that is closely tied to the progress of state-of-the-art machine learning means I try to stay up to date with what is going on. In this post, we will go through what I consider the most interesting recent breakthroughs. We will cover embeddings, attention, transformers, and multi-modal models.

At the end of the post, I will share some thoughts on what this means for society and business. If you don’t want all the technical details, you can skip to the end now. But I encourage you to actually try to understand the recent breakthroughs. Otherwise, you will have a hard time determining what these breakthroughs mean for you and your business.

DALLE2. Image generated by an algorithm based on caption provided by a human. Caption: Teddy bears working on new AI research on the moon in the 1980s

Learning requires efficient abstractions

Machine learning, in my opinion, is about finding efficient abstractions that enable the robust interpretation of diverse data. The “problem” with reality is that most things are rare. If you look too closely at each data point it appears unique. You have to “squint” to cope with all the information we are exposed to. Humans are great at this. We half-ass most data processing and make heavy use of prejudice in the name of efficiency. This ability has served us well in our effort to process diverse data in an energy-efficient way. We are optimized to minimize learning time and energy consumption while maximizing procreation. It is less good if you want a consistent and fair evaluation of data. Humans are bad at consistency and fairness.

Think of it this way: if we always considered each situation as a completely new situation based on minor changes we would get exhausted. Instead, we remember previous situations we’ve been in and abstract away the specific details. “Oh, I’m approaching an intersection now, I know there can be cars coming from different directions. Sure, it is a slightly different intersection than before but I still have some idea of what might happen.”

The challenge for machine learning researchers is to figure out how to extract the most efficient abstractions. If they are too crude, the quality of output goes down. If they are too exact, learning becomes very, very expensive. I equate these abstractions to “common sense”.


Embeddings were my first real machine learning “mind-fuck” experience. I still remember reading Mikolov’s paper for the first time and realizing what a huge thing the embedding concept was going to be. It’s so simple yet so elegant. I won’t spend much time on it here since it’s old news, but it still forms the foundation for so many things, so you had better make sure you understand this concept. A good way to learn is to watch this presentation from Google.

Let’s say we have 10,000 words and want to represent a word as a vector. One option would be to replace the word with a vector of dimension 10,000 with all cells set to zero except the one representing our word. The problem is that you would get a lot of very large vectors with mostly zeros in them. Instead, we can create lower-dimensional vectors representing each word. If you create these vectors based on the context in which words occur, it turns out they preserve semantic properties and open up for using linear algebra on them in cool ways (like in the picture below).
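To make the one-hot versus dense distinction concrete, here is a minimal Python sketch. The five-word vocabulary and the 3-dimensional vectors are made up by hand purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions. The helper names (`one_hot`, `cosine`, `analogy`) are hypothetical and not taken from any library.

```python
import math

VOCAB = ["king", "queen", "man", "woman", "apple"]

def one_hot(word):
    """Sparse representation: one cell per vocabulary word, a single 1."""
    return [1.0 if w == word else 0.0 for w in VOCAB]

# Hypothetical dense embeddings, hand-picked so the analogy works.
# These numbers are an assumption for illustration, not learned values.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    # Exclude the query words themselves, as is commonly done.
    candidates = [w for w in VOCAB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))

print(analogy("king", "man", "woman"))
```

With these toy numbers, `analogy("king", "man", "woman")` returns `queen`, while any two distinct one-hot vectors have a cosine similarity of exactly zero, so they carry no notion of similarity at all.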

Attention and Transformers

In recent years, the most influential paper in my opinion is “Attention Is All You Need” from Google Brain. This paper demonstrated that attention mechanisms can replace both recurrent and convolutional neural networks, while both improving results and lowering the cost of training. After years of just pushing the limits of RNNs and CNNs, a new paradigm emerged. RNNs had the downside that they did not allow for parallelized training, as each new prediction was predicated on the previous prediction. Some tried to overcome this limitation by turning sequences into “images” and then applying CNNs instead. The problem with CNNs is that they have a hard time learning long-distance relationships, as “signal strength” diminishes with distance. Enter the Transformer.

The Transformer follows the overall encoder/decoder architecture. Transformers process sentences in the form of a sequence of embeddings that they learn during training. Each token in the sentence is represented by a vector (in the original paper they use 512-dimensional vectors) with positional encoding added, i.e. the vector also indicates the position of the word in the sentence. The Encoder learns a representation of the input data (encoding) based on which the Decoder produces the target output. This pattern provides a dimensionality reduction step, which goes back to the idea of finding abstractions. Transformers only pass on the information necessary for decoding from the input to the encoding.
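The positional encoding mentioned above can be sketched in a few lines. This follows the sinusoidal scheme from the original paper, where even dimensions use a sine and odd dimensions a cosine of the same frequency. The function names `positional_encoding` and `add_position` are hypothetical, and the tiny 8-dimensional example stands in for the 512 dimensions used in the paper.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal position table with one row per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            even_index = i - (i % 2)
            angle = pos / (10000 ** (even_index / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(token_embeddings):
    """Add the position vector to each token embedding, element by element."""
    d_model = len(token_embeddings[0])
    table = positional_encoding(len(token_embeddings), d_model)
    return [[e + p for e, p in zip(emb, pos_row)]
            for emb, pos_row in zip(token_embeddings, table)]

# Toy input: 4 tokens with made-up 8-dimensional embeddings (all zeros here,
# so the output shows the positional signal on its own).
encoded = add_position([[0.0] * 8 for _ in range(4)])
```

Because every position gets a different pattern of values, two identical words at different places in the sentence end up with different input vectors.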

Both the Encoder and Decoder make use of attention. So what is attention? The idea behind attention is to replace embeddings with better embeddings that also contain information about the context in which the word appears. Understanding how it works requires linear algebra. The easiest variant to understand is scaled dot-product attention. Let’s assume we have an input sentence X = [x_1…x_n] where each word x_i is an embedding. Our goal is to replace each word x_i with a new vector y_i. We do this by summing over all the words in the sentence, with each word weighted by some number. The self-attention layer introduces two matrices, W_Q and W_K, by which we transform the words before we combine them. The interpretation of these matrices is not obvious. We want to put numbers into these matrices so that the model considers the right context words when interpreting a specific word.

Prof. Lennart Svensson has a great lecture explaining this. We know that related words cluster with respect to the cosine distance of their word embeddings. One way to build intuition for the W_K matrix is to imagine that it is the identity matrix, in which case the key vectors are identical to the original word embeddings. The keys would then preserve the original similarity between words at face value. If we then select W_Q so that the query for a word points in the same direction as the specific words we want the model to care about, their inner product will be large.
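Here is a minimal sketch of scaled dot-product self-attention in plain Python, using the simplified two-matrix picture from the text: queries come from W_Q, keys from W_K, and the new vector for each word is a weighted sum of the original embeddings. Note that the full Transformer also applies a third matrix, W_V, to the vectors being summed; that is left out here to stay close to the explanation above. The 2-dimensional inputs and the function names are assumptions for illustration.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K):
    """Return (new embeddings, attention weights) for the sentence X."""
    queries = [matvec(W_Q, x) for x in X]
    keys = [matvec(W_K, x) for x in X]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Inner product of this word's query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New embedding: weighted sum over all words in the sentence.
        y = [sum(w * x[j] for w, x in zip(weights, X)) for j in range(len(X[0]))]
        outputs.append(y)
        all_weights.append(weights)
    return outputs, all_weights

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # words 0 and 1 are similar
Y, A = self_attention(X, IDENTITY, IDENTITY)
```

With both matrices set to the identity, the first word puts more weight on the similar second word than on the unrelated third one, which is the “keys preserve the original similarity” intuition. In a trained model the matrices are learned instead.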

I think of attention this way: attention means learning parameters that are used to create new embeddings for input words that contain information about the context. All of the parameters that fill these matrices are learned simultaneously. That means that embeddings and attention, along with all other parameters, are selected so that they together maximize the quality of the network. This gives the model enormous expressive capacity, at the price of a huge number of parameters and a huge cost of computation.

Transformers use Multi-Head Attention, which means that they use multiple queries per word rather than just one. The reasoning is that words could have different meanings depending on context. The model uses these multi-headed attention mechanisms in multiple ways. One of these is referred to as “self-attention”. The idea behind self-attention is that the word itself impacts its own meaning and that the most appropriate word embedding depends on the context. I won’t go through all of the details. The original Transformers paper has a nice illustration of the output weights of these attention heads for various sentences. As you can see from the first attention head, the weights resemble dependency parsing, which I guess is exactly the point.
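A rough sketch of the multi-head idea: each head has its own query, key and value matrices, runs attention independently, and the per-head results are concatenated for every token. This is a simplification. The final output projection used in the original paper is omitted, and the projection matrices below are hand-written selectors rather than learned weights. All names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_Q, W_K, W_V):
    """One head of scaled dot-product attention over the sentence X."""
    Q = [matvec(W_Q, x) for x in X]
    K = [matvec(W_K, x) for x in X]
    V = [matvec(W_V, x) for x in X]
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(X, heads):
    """Run every head and concatenate the results token by token."""
    per_head = [attention_head(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return [sum((head_out[t] for head_out in per_head), [])
            for t in range(len(X))]

# Two hypothetical heads on 4-dimensional inputs: one looks only at the
# first two dimensions, the other only at the last two.
FIRST_HALF = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
SECOND_HALF = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
HEADS = [(FIRST_HALF,) * 3, (SECOND_HALF,) * 3]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
Y = multi_head_attention(X, HEADS)
```

Each head can weigh the context differently because it compares words in its own subspace, which is one way to think about a word having several context-dependent readings.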

Anyway, this is just the first step of the Encoder block. We feed the input sentence as a set of vectors with positional encoding, and we get a set of weights back that describe how each token should be regarded based on the entire sentence.

We add residual connections that carry over previous embeddings to subsequent layers, i.e. we mix together the original embeddings with the information learned from the multi-head attention mechanism. Finally, we add some layer normalization, and voilà, we have our basic Encoder block. The Transformer then uses six of these stacked, with the output from the last block serving as input to the Decoder.
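The residual connection plus layer normalization step is small enough to write out. This is a sketch under simplifying assumptions: the learned gain and bias parameters that normally accompany layer normalization are left out, and the input numbers are made up. The names `layer_norm` and `add_and_norm` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def add_and_norm(embedding, sublayer_output):
    """Residual connection: add the original embedding back, then normalize."""
    mixed = [a + b for a, b in zip(embedding, sublayer_output)]
    return layer_norm(mixed)

# A made-up token embedding and a made-up attention output for that token.
token = [1.0, 2.0, 3.0, 4.0]
attended = [0.5, 0.0, -0.5, 1.0]
result = add_and_norm(token, attended)
```

The addition is what lets the original embedding “carry over”: even if the attention output is close to zero, the token’s own information still reaches the next layer.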

The Decoder is similar to the Encoder. One major difference is that the attention mechanism is “masked”, which means it gets a gradually increased visibility of the input sentence. This makes sense since we cannot time travel. Another difference is that it treats its most recent output as the last token of its input.
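The masking can be sketched as a lower-triangular table of booleans: position i may look at positions 0 to i and nothing later. Masked positions get zero weight in the softmax. The scores below are made-up numbers and the function names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to see position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; hidden positions get weight 0."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    m = max(visible)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up attention scores for a 3-token sequence.
scores = [
    [0.1, 0.2, 0.3],
    [0.5, 0.1, 0.9],
    [0.3, 0.3, 0.3],
]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

The first position can only attend to itself, the second to the first two, and so on. Combined with feeding the most recent output back in as the last input token, the mask makes the Decoder generate one token at a time.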

This kind of processing is called auto-regressive. A nice consequence of this is that the output can have a different length than the input. The Decoder ultimately outputs a vector the size of our known vocabulary, and the softmax layer converts that into a vector of probabilities. It might seem simple to then just pick the most probable word, but it turns out using something called beam search produces even better results. Anyway, the predicted word is fed back to the decoder as the last part of the next decoder input. This process continues until we predict the end-of-sentence token.
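To see why beam search can beat simply picking the most probable word at each step, here is a toy sketch. The next-token probabilities are a hand-written lookup table that depends only on the previous token, a hypothetical stand-in for the Decoder plus softmax layer. A beam width of 1 is the same as greedy decoding.

```python
import math

# Hypothetical next-token probabilities given the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width, max_len=10):
    """Keep the beam_width best partial sentences at every step."""
    beams = [(0.0, ["<s>"])]  # (log probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for candidate in candidates[:beam_width]:
            # Sentences that reached the end-of-sentence token stop growing.
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    best_score, best_tokens = max(finished + beams, key=lambda c: c[0])
    words = [t for t in best_tokens if t not in ("<s>", "</s>")]
    return words, math.exp(best_score)

greedy_words, greedy_prob = beam_search(beam_width=1)
beam_words, beam_prob = beam_search(beam_width=2)
```

In this table the greedy choice starts with “the” because it is the most probable first word, but that path ends at a total probability of about 0.30. Keeping two candidates alive finds “a cat” at about 0.36.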

In the end, we have a construction that assumes no recurrence or convolutions when processing the input data. As long as we can express our input as sequence data, we can apply this approach even in computer vision or reinforcement learning. The core idea is to enrich the embeddings with features from the global context.

If you want an even better explanation of all of this, I highly recommend Prof. Lennart Svensson’s lecture.

Few-Shot Learners such as GPT-3

In May of 2020 researchers from OpenAI described the development of GPT-3 in the paper “Language Models are Few-Shot Learners”. The big breakthrough in GPT-3 is that it removes the need for huge task-specific labeled datasets in order to learn specific tasks, assuming analogous tasks are already represented in available language datasets (e.g. all of the internet). To quote from the paper:

…humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

In order to achieve this, they build on the capacity of transformers. There is a race going on where the number of parameters in language models is growing almost as fast as transistor counts. Each increase has brought improvements to all sorts of NLP tasks, and there is evidence that the models keep getting better the larger they get.

In the paper, the authors describe four different settings in which they want to evaluate their model. This is relevant for understanding the title of the paper and, more importantly, when the approach works and when it does not.

The title of the paper refers to “Few-Shot Learners” as described in the image above. The model architecture they use is basically the same as in several earlier papers such as “Language Models are Unsupervised Multitask Learners” and “Improving Language Understanding by Generative Pre-Training” (the latter is the paper that coins the term GPT, which is short for Generative Pre-Training). The following figure describes the basic idea:

GPT uses a generative pre-training stage in which a language model learns language, followed by a supervised fine-tuning stage in which the pre-trained model is adapted to a target task. The number of examples provided defines the setting (zero-shot, one-shot, few-shot, etc.); in GPT-3 these examples are given as part of the prompt rather than through updates to the model weights. The Transformer architecture serves as the backbone of the pre-training stage. Here is how the authors describe the setup:

So, they are learning to predict words using Transformers. Notice the use of “unsupervised corpus”. Personally, I think this is misleading. The corpus is unsupervised in the sense that no one explicitly labeled the meaning of words in the corpus. But it is supervised in the sense that words have been put into sentences by humans, i.e. someone always needs to provide structure somehow.
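The point that the text supplies its own labels can be illustrated with a few lines of Python. Every position in a sentence yields a training example: the preceding words are the input and the next word is the target. The whitespace tokenization and the function name `next_token_examples` are simplifying assumptions; GPT models use a subword tokenizer and much longer context windows.

```python
def next_token_examples(text, context_size):
    """Turn raw text into (context, next word) training pairs."""
    tokens = text.split()  # assumption: naive whitespace tokenization
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples

examples = next_token_examples("the cat sat on the mat", context_size=3)
for context, target in examples:
    print(context, "->", target)
```

No one had to annotate anything, yet each pair is a perfectly ordinary supervised example. The “label” is simply the word a human chose to write next.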

Anyway, since the model is trained on continuous text, and not for example questions and answers, they apply some tricks to get data into shape.

This setup is improved on a bit for later versions of GPT, but GPT-3 follows the same pattern. They then feed these learning beasts huge amounts of text:

That’s a lot of human knowledge right there. This is where the supervised learning part comes in. In order for GPT-3 to work, we need to have access to enormous amounts of documented human knowledge. Whatever the model knows is a result of what is in these 300 billion tokens of knowledge. Fun fact: a challenge for the authors of the paper was to filter out all of the test data from this corpus since most data is somewhere on the internet. The crazy thing that happens when you put this gigantic model to work is that it can generate long sequences of text that read as if written by a human. And all it needs is a few words to get started. For example, news article generation:

To evaluate this, they asked humans to determine if a news article was written by a human or by GPT-3. Results show that humans have a pretty hard time distinguishing real from robot.

So how is this possible? By now you should have some idea. By training a huge language model on a huge corpus of text we get a model that learns how to interpret words based on their context. Using that model, and a few tokens to give the model a starting point, it can then predict suitable next words. The attention mechanism gives the model a strong “memory” that lets it keep track of context.
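The “few tokens to give the model a starting point” can be as simple as a text prompt with a task description and a handful of demonstrations. Below is a sketch of how such a prompt might be assembled. The `=>` separator and the translation example are illustrative assumptions, and `build_prompt` is a hypothetical helper, not part of any API. The model itself is not included; it would simply be asked to continue the returned string.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt: description, solved examples, then the open query."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    # The last line is left unfinished so the model predicts the answer.
    lines.append(f"{query} =>")
    return "\n".join(lines)

task = "Translate English to French:"

zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, [("sea otter", "loutre de mer")], "cheese")
few_shot = build_prompt(
    task,
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(few_shot)
```

The only difference between the zero-shot, one-shot and few-shot settings in this sketch is the number of demonstrations placed in the prompt. The model weights stay the same in all three cases.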

Multi-Modal Models

A modality is a type of channel used to communicate, such as images or sound. Humans rely on multi-modal input in order to navigate the world. We hear, see and feel. This is an obvious source of inspiration for researchers. Before we jump into state-of-the-art multi-modal models, we will review how the pre-training concepts from GPT have been applied to visual data.

The first paper I recommend is “Learning Transferable Visual Models From Natural Language Supervision”. This paper builds on the breakthroughs with autoregressive and masked language models and describes what the authors call CLIP (Contrastive Language-Image Pre-training).

The idea behind the paper is to learn directly from raw text about images, rather than learning predefined object classes. Typically, when training object detection models, we define a set of classes and then proceed to label each object with one of these classes. Besides being very tedious, this can also be limiting as not all objects fit into obvious classes. So instead, the paper describes how to learn the connection between an image and its caption. CLIP is an efficient method of learning this: in a way that is similar in spirit to GPT’s pre-training, it learns to predict which text snippet belongs to a given image.
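The core of this image-caption matching can be sketched without any neural network. Assume we already have an embedding for each image and each caption in a shared space (the 3-dimensional vectors below are made up by hand). We compute the cosine similarity of every image with every caption, and a contrastive loss rewards the matching pairs on the diagonal. The temperature value of 0.07 and all function names are assumptions for illustration; in CLIP the embeddings come from trained image and text encoders.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """sim[i][j] is the cosine similarity of image i and caption j."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]

def cross_entropy(logits, target_index):
    """Negative log of the softmax probability of the target."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric loss: image i should pick caption i, and the reverse."""
    n = len(sim)
    image_to_text = sum(
        cross_entropy([s / temperature for s in sim[i]], i) for i in range(n)
    ) / n
    text_to_image = sum(
        cross_entropy([sim[i][j] / temperature for i in range(n)], j)
        for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Hypothetical embeddings: image i and caption i describe the same thing.
IMAGE_EMBS = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
TEXT_EMBS = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]

sim = similarity_matrix(IMAGE_EMBS, TEXT_EMBS)
best_caption = [max(range(len(row)), key=lambda j: row[j]) for row in sim]
```

Picking the most similar caption for an image is also how such a model can classify images without a fixed label set: the candidate “classes” are just pieces of text.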

More recently, models like DALLE2 have become the darling of the internet for their ability to generate realistic images and art from a natural language description. The mechanics of DALLE2 are described in the paper “Hierarchical Text-Conditional Image Generation with CLIP Latents”. The technique relies on a CLIP latent space but learns an inverted version of CLIP, i.e. a decoder (or “unCLIP”). The resulting decoder is a non-deterministic function that can generate images given a caption. Similar to GANs, encoding an image and then decoding it again produces semantically similar images. An even cooler aspect is that you can semantically modify images by moving in the direction of any encoded text vector. This image gives a high-level overview of the unCLIP process.

The results are very, very cool. For example, you can naturally blend styles by interpolating the CLIP image embedding of different images and decoding the vectors.

The reason this sort of continuous interpolation is possible is that the images and text are both embedded in the same latent space. That allows us to apply language-guided image manipulation. You can also write a text prompt and get an image back. The decoder uses diffusion models to produce images conditioned on CLIP image embeddings and, optionally, text captions. In order to get high-resolution images, they train diffusion upsampler models. A few more tricks are applied to make results even better, but essentially it’s CLIP encoding and then diffusion-based unCLIP. You can find the details in the paper.
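Interpolating between two embeddings can be sketched as follows. This uses spherical interpolation, which moves along the arc between two normalized vectors instead of cutting straight through the space between them. The 2-dimensional vectors are placeholders for real CLIP image embeddings, and the decoder that would turn each intermediate vector into an image is not included. The function names are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(a, b, t):
    """Spherical interpolation from a (t=0) to b (t=1) on the unit sphere."""
    a, b = normalize(a), normalize(b)
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # The vectors point in (almost) the same direction; nothing to blend.
        return list(a)
    weight_a = math.sin((1.0 - t) * theta) / math.sin(theta)
    weight_b = math.sin(t * theta) / math.sin(theta)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Placeholder embeddings for two images.
embedding_a = [1.0, 0.0]
embedding_b = [0.0, 1.0]

# Five evenly spaced points between the two embeddings.
path = [slerp(embedding_a, embedding_b, step / 4) for step in range(5)]
```

Every point on the path keeps unit length, so each intermediate vector looks like a plausible embedding that a decoder could turn into a blended image. The sketch does not handle vectors pointing in exactly opposite directions.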


So what does all this cool progress mean for humans? Or for businesses that support teams training machine learning models? I think a few things are clear: