Gato, GPT-3 and DALL-E: What are Generalist Agents and how close are we to AGI?

What is AGI/What is a Generalist Agent?

Before defining Artificial General Intelligence (AGI), it helps to first understand the distinction between ‘Strong’ and ‘Weak’ AI.

Strong AI would necessarily possess all the versatility and capability that comes with human-level intelligence. It would be able to:

Generate spontaneous solutions to novel problems
Express and adjust its ‘personal’ preferences with respect to its existential conditions
Possess advanced linguistic and communicative abilities
Exhibit creative and artistic capacity
Learn from its experiences without any human intervention or algorithmic tampering
Have some degree of self-awareness and consciousness
Build representations of abstract concepts and logical reasoning processes

Essentially, the capabilities of a ‘Strong’ AI would be virtually indistinguishable from those of human intelligence, and its raw processing power (note: hardware like this does not exist yet) would likely allow it to outperform humans in all of the aforementioned domains. Importantly, ‘Strong’ AI is not a replication or simulation of human intelligence, but rather, the birth of a computational intelligence that equals or surpasses our own.

On the other hand, ‘Weak’ AI (also referred to as narrow AI) may be able to perform a variety of tasks as well as or better than humans, but only within a set of predetermined computational boundaries. Even though a highly advanced version of such an AI might be able to outwit nearly any human in a Turing Test, at its core it is simply simulating intelligence within a task-specific framework. Even our most advanced AI systems currently fall within this category.

AGI might fall short of ‘Strong’ AI because it may lack consciousness and phenomenological or abstract reasoning, but it would be capable of performing almost all of the intellectual functions humans can carry out, without human intervention. Such a system would represent a historic development in AI innovation, and would very likely lay the computational foundation for ‘Strong’ AI.

Generalist Agents, however, are less sophisticated than AGI and are more accurately characterized as highly advanced versions of ‘Weak’ AI. A generalist agent can perform a variety of unrelated functions, but the accuracy with which it executes them can vary dramatically, especially across unrelated domains.

For instance, a generalist agent could draw a picture, learn to play a video game, and drive a car, but the quality with which it executes these tasks would be highly inconsistent, even when given enough time and data during training.

The following sections explore a few of the most advanced Generalist Agents to date.

Gato

The Google subsidiary DeepMind recently unveiled its Generalist Agent, Gato. Gato can perform more than 600 tasks, ranging from image classification and captioning to playing video games, controlling robotic arms, navigating 3-dimensional environments, and chatting with humans.

The model is built on a transformer, a kind of neural network that processes sequential data by weighing the importance of each input relative to the others. Importantly, this type of network can process a series of inputs jointly: whereas recurrent networks have to process a sentence one word at a time, a transformer such as Gato can process the entire sentence at once.

Transformers rely on a mechanism called self-attention, which builds a representation of each input that reflects the context in which it occurs, thereby improving the model's ability to produce meaningful outputs. They also allow for parallelization, in which many computations are carried out simultaneously, resulting in shorter training times.
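To make self-attention a little more concrete, here is a minimal sketch in plain Python. It is not Gato's implementation: it assumes that queries, keys and values are simply the input vectors themselves (a real transformer learns a separate projection for each), and the three two-dimensional "token" vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys and values are the inputs themselves.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence, itself included.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted mix of all the input vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a three-token sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of weights describes how much one token draws on every other token, which is what it means for a representation to reflect its context. The loop over queries is written sequentially for readability, but no iteration depends on another, which is why this computation parallelizes well.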

Finally, Gato retains all the observations it makes within its context window, meaning that it can draw on this information to complete a novel task without having to start from scratch.
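The idea of a context window can be illustrated with a small fixed-size buffer. This is only a sketch: the ContextWindow class, the window size of three, and the observation and action labels are all hypothetical, and a real model stores tokens rather than strings.

```python
from collections import deque

class ContextWindow:
    """Fixed-size memory of the most recent items (hypothetical helper)."""

    def __init__(self, max_items):
        # A deque with maxlen drops the oldest entry once it is full.
        self.items = deque(maxlen=max_items)

    def add(self, item):
        self.items.append(item)

    def contents(self):
        return list(self.items)

window = ContextWindow(max_items=3)
for step in ["obs_1", "act_1", "obs_2", "act_2"]:
    window.add(step)

print(window.contents())  # ['act_1', 'obs_2', 'act_2']
```

Anything still inside the window is available when the model makes its next prediction; anything that has fallen out of it is forgotten.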

GPT-3

OpenAI’s Generalist Agent, GPT-3, like Gato, is a transformer-based model, though one built for natural language processing (NLP). It is currently one of the largest known neural networks, with 175 billion parameters (compared to Gato’s 1.18 billion), and its most impressive feature is its capacity for meta-learning, also known as in-context learning. In simpler terms, a researcher can describe a novel task to GPT-3 in a prompt, optionally with a few examples, and GPT-3 will often be able to perform that task without any retraining.

Moreover, GPT-3 is a pre-trained model: it hones its linguistic abilities on vast quantities of internet text within an unsupervised framework, by learning to predict what comes next. It can perform complex question-answering tasks as well as Cloze tasks, which require filling in the words that are missing from a passage, a feat that depends on the capacity to perceive context.
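For readers unfamiliar with Cloze tasks, the sketch below shows how one such test item could be constructed. The make_cloze helper, the blank marker and the example sentence are all hypothetical, and the code only builds the task; it does not call GPT-3.

```python
def make_cloze(sentence, target, blank="____"):
    """Return (prompt, answer) with the first occurrence of target blanked out."""
    words = sentence.split()
    if target not in words:
        raise ValueError(f"{target!r} is not a word in the sentence")
    index = words.index(target)
    words[index] = blank
    return " ".join(words), target

prompt, answer = make_cloze("The cat sat on the mat", "sat")
print(prompt)  # The cat ____ on the mat
print(answer)  # sat
```

A model is scored on whether it can recover the answer from the prompt alone, which it can only do by using the surrounding words as context.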

Furthermore, GPT-3’s API works with prompts to streamline task specialization. Prompts, in this case, are examples given as inputs in which the description of the task is implicit in the examples themselves. This has allowed GPT-3 to perform impressive creative functions, such as generating impersonations, novel dialogue, and various forms of fiction, from poetry to memes, not to mention philosophical reflections on the meaning of life.
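A prompt of this kind can be assembled with a few lines of code. The sketch below is an illustration only: the build_few_shot_prompt helper and the "Input:/Output:" layout are assumptions rather than a format required by OpenAI, and no API call is made.

```python
def build_few_shot_prompt(examples, query):
    """Join worked examples and a new query into a single text prompt."""
    blocks = [f"Input: {source}\nOutput: {target}" for source, target in examples]
    # The final block is left open so the model supplies the missing output.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Two English-to-French pairs; the task itself is never named in the prompt.
examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(examples, "book")
print(prompt)
```

Nothing in the prompt says "translate into French". The task is implied by the pattern in the examples, and the model is expected to continue that pattern for the final input.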

DALL-E

Also designed by OpenAI, DALL-E is a 12-billion-parameter version of GPT-3 that has been trained to produce images from text descriptions. Some may even be familiar with the relatively famous image of an astronaut riding a horse in space, which was created by its successor, DALL-E 2, earlier this year.

Essentially, DALL-E receives a text prompt and generates an image from it. What is remarkable about this model is its ability to create plausible images from seemingly absurd textual prompts and, moreover, to manipulate existing images, for example by giving their subjects anthropomorphic attributes.

DALL-E displays an especially strong grasp of context and language, which is unsurprising given its basis in GPT-3. What makes DALL-E special, however, is that it uses far fewer parameters than GPT-3 (12 billion versus 175 billion) while generating comparably impressive results in a different medium.
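To put the parameter counts quoted in this article side by side, the short calculation below computes how the three models compare in size. The figures are the ones given above; the size_ratio helper is just an illustrative name.

```python
# Parameter counts quoted in this article, in billions.
PARAMS_BILLIONS = {"Gato": 1.18, "DALL-E": 12.0, "GPT-3": 175.0}

def size_ratio(larger, smaller):
    """How many times more parameters `larger` has than `smaller`."""
    return PARAMS_BILLIONS[larger] / PARAMS_BILLIONS[smaller]

for larger, smaller in [("GPT-3", "DALL-E"), ("GPT-3", "Gato"), ("DALL-E", "Gato")]:
    print(f"{larger} is about {size_ratio(larger, smaller):.1f}x the size of {smaller}")
```

By this measure GPT-3 is roughly 15 times the size of DALL-E and roughly 150 times the size of Gato, although parameter count alone says little about how capable a model is.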

How Close are we to building a Successful AGI?

While the advancements made by DeepMind and OpenAI in the creation of sophisticated Generalist Agents certainly point to a promising future for the development of AGI, we are nonetheless at least a few decades away from building this kind of technology. There are a few reasons why.

First, even if Moore’s law continues to hold well into this century, it is extremely difficult to estimate whether we will ever possess hardware and processing power sufficient to support an AGI algorithm. Such an algorithm might consume so much energy that we could not run it sustainably or efficiently.

Second, it is worth considering whether an AGI would require some kind of embodiment, that is, the ability to exist ‘in the world’. An AGI would necessarily have to possess empathetic capacity and abstract, context-based reasoning, both of which humans develop through lived experience.

For example, I might see someone get bitten by a dog; even though I cannot fully know what that feels like, since I have never experienced it, I can imagine that the sensation is unpleasant. Certainly, one could obtain data on a vast number of lived human experiences and train an AGI on them, yet it would be very difficult to determine whether the conclusions it derives are merely the products of logical deduction or are in fact motivated by a distinct type of computational empathetic reasoning.

Finally, there is the problem of how knowledge is represented within an AGI. It may be impossible for us to tell whether the AGI in question displays a true understanding of the knowledge it expresses or merely a replication of understanding that is good enough to persuade humans it is legitimate. Telling the two apart would require those analyzing the AGI to gain phenomenological insight into its thought processes, a feat that can only be achieved by acquiring a first-person perspective. This problem is illustrated by John Searle's thought experiment known as the Chinese Room, and it raises the question of whether we can ever know what it is like to be something else.