Before we jump into this exciting singularity thing, I’d like to mention that this is an essay, a more personal and less formal piece of writing, sharing my perspective on the evolution of Natural Language Understanding and highlighting some ideas that look important in that context.
This is not a comprehensive industry report, nor was it meant to be one, but I hope it will be an interesting read both for Machine Learning Engineers and for a broader audience interested in the current AI surge.
There are three parts to the story:
- The history part briefly reminds us how we got to our current AGI state from a multilayer perceptron in just twelve years.
- The present day section focuses on the latest achievements of LLMs and current industry trends. If you are deep in context and looking for some fresh ideas, skip to that part.
- The mystery part presents some ideas on what could follow the current AGI stage.
The history
So, first of all, Machine Learning has been around for a while, about a decade or a dozen years, depending on whether you count from Tomas Mikolov’s word2vec publication or from Andrew Ng’s Machine Learning course on Coursera. Kaggle was launched in 2010, and Fei-Fei Li gathered ImageNet in 2009. Not that long ago, you’d probably agree if you’re over 30.
Some people would argue that machine learning has been around much longer, but I am speaking here about the industry adoption of deep learning algorithms, aka the technology momentum, not about pure research. And we are not touching things like the classic ML algorithms covered in scikit-learn: all the regression, clustering, and time series forecasting kinds of things. They are silently doing their important job, but people do not call them AI; no hype around them, you know.
Why did that AI spring happen 12 years ago? Deep learning (training a multiple-layer neural network with error backpropagation) finally became feasible on an average GPU. In 2010 the simplest neural network architecture, a multi-layer perceptron, beat other algorithms in handwritten digit recognition (the famous MNIST dataset), a result achieved by Juergen Schmidhuber et al.
Since that point around 2010, the technology has become more and more robust. There have been a few game-changing moments: the aforementioned word2vec release, which brought semantic understanding to the world of Natural Language Processing (NLP); the public release of the TensorFlow and Keras deep learning frameworks a little later; and, of course, the invention of the Transformer in 2017, which is still the SOTA neural network architecture and has expanded beyond the world of NLP. Why is that? First, the Transformer has attention: it handles sequences such as texts in O(n²) time, computing interactions between every pair of positions via matrix multiplication, which lets the model look at the whole input sequence at once. The second reason for the Transformer’s success, in my opinion, is the flexible Encoder-Decoder architecture, allowing us to train and use the two parts jointly or separately (sequence-to-sequence or sequence-to-vector).
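To make the attention part concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer (a single head, no masking, no learned projection matrices), so treat it as an illustration rather than a reference implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: Q, K, V are (n, d) matrices
    for a sequence of n tokens with embedding size d."""
    d = Q.shape[-1]
    # (n, n) score matrix: every token attends to every other token,
    # which is where the O(n^2) cost in sequence length comes from
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # (n, d) contextualized token representations

# toy example: 5 tokens, 8-dimensional embeddings, self-attention
x = np.random.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (5, 8)
```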
The OpenAI GPT family of models (the Transformer Decoder part) made some noise beyond the tech industry, since GPT-3 could already produce fairly humanlike texts and was capable of few-shot and some zero-shot learning. The latter part is the more important one; the GPT-3 paper is even named “Language Models are Few-Shot Learners”. This ability of Large Language Models to quickly learn from examples was first stated by OpenAI in 2020.
But bang!
ChatGPT’s release came with hype we had never seen before, finally drawing huge public attention. And now GPT-4 is going beyond that.
Why is that? For the last 7 years, since neural networks started showing decent results, what we’ve been calling AI was actually narrow artificial intelligence: our models were trained to solve a specific set of tasks — recognize objects, perform classification, or predict the next tokens in a sequence. And people have only been dreaming of AGI — artificial general intelligence, capable of completing multiple tasks at a human level.
Present day
LLMs’ reasoning abilities are game changers
In fact, here is what happened with instruction-based LLM tuning, or, as they call it at OpenAI, reinforcement learning from human feedback: GPT-3.5+ models finally learned to reason over the provided information. And that changes things — before, LLMs were closer to a reasonably good statistical parrot, though still very useful for a lot of applications such as text embeddings, vector search, and chatbots. But with instruction-based training, they effectively learn reasoning from humans.
What exactly is reasoning?
The ability to use the provided information to derive conclusions through logical operations. Say A is connected to B and B is connected to C; is A then connected to C? The official GPT-4 product page features a much more complex reasoning example. The model’s ability to reason is so strong and flexible that it can produce a structured sequence of instructions or logical operations to follow in order to achieve a given goal, using “common knowledge” or “common sense” along the way, not just the information provided in the prompt.
Before LLMs gained such reasoning abilities, the other tool well designed for reasoning was a knowledge graph, with nodes containing entities and edges as predicates, or relations between entities. It is a form of information storage that provides explicit reasoning abilities. At some point, I was involved in building a question-answering system which, among other things, used a knowledge graph to find the information asked for: you just had to detect the intent, check whether the graph contains that kind of relation, check for the particular entities mentioned, and, if they existed, query that subgraph. In effect, this pipeline translated a query in natural language into a SPARQL query.
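As a toy illustration of that last translation step (the question, the ontology prefix, and the predicate names below are hypothetical, not taken from any real graph schema):

```python
# Toy illustration of the "natural language -> SPARQL" translation step.
# The entity, predicate, and prefix names are hypothetical placeholders.
question = "Which city is the company Acme headquartered in?"

# After intent detection and entity linking, the pipeline would emit
# something along these lines:
sparql_query = """
PREFIX ex: <http://example.org/ontology/>
SELECT ?city WHERE {
  ex:Acme ex:headquarteredIn ?city .
}
"""
# The query is then executed against the graph, and the bound ?city
# value is rendered back into a natural-language answer.
```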
Now you can provide this factual information to the model in plain text as the context part of your prompt, and it will “learn” it zero-shot and be able to reason over it. Wow, right?
And you are no longer limited by the number of entities and relation types contained in the graph. Plus, you get that “common sense”, the general understanding of the concepts of our world and their relations, which has been the trickiest part separating machine learning models from human cognition. We did not even notice the moment when we became able to give instructions in natural language and have them carried out correctly without overly explicit explanations.
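For contrast with the knowledge-graph pipeline, here is roughly what the same kind of factual lookup looks like when the facts are simply stated as context in the prompt; the facts and wording are illustrative, not a prescribed prompt format:

```python
# Illustrative prompt: the facts that used to live in a knowledge graph
# are simply stated in plain text, and the question is asked on top.
prompt = """Context:
- Acme is headquartered in Zurich.
- Zurich is a city in Switzerland.

Question: In which country is Acme headquartered?
Answer:"""
# A GPT-3.5+ class model is expected to chain the two facts and answer
# "Switzerland" without any graph schema, entity linking, or SPARQL.
```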
Reasoning plus knowledge are the two crucial components of intelligence. For the last 20 years, we’ve put roughly all human knowledge onto the Internet in the form of Wikipedia, scientific publications, service descriptions, blogs, billions of lines of code and Stack Overflow answers, and billions of opinions in social media.
Now we can reason with that knowledge.
GPT-4 is the AGI
These reasoning abilities are well demonstrated in the official OpenAI tech report on GPT-4:
GPT-4 exhibits human-level performance on the majority of these professional and academic exams. Notably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers.
According to the GPT-4 results on a number of human tests, we are somewhere around AGI — OpenAI even uses these words on their webpage, and a recent 150+ page Microsoft paper with an in-depth study of GPT-4 capabilities across different domains, named “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, carefully but explicitly claims that AGI is here:
Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
and later:
The combination of the generality of GPT-4’s capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.
The reason for that claim is:
Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more.
And to nail it:
Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely <…>. Our primary goal in composing this paper is to share our exploration of GPT-4’s capabilities and limitations in support of our assessment that a technological leap has been achieved. We believe that GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond.
I highly recommend spending some time with this study: behind these loud claims there is a very interesting analysis of how these models work and an extensive comparison of GPT-4 and ChatGPT results on a variety of non-trivial tasks from different domains.
LLMs plus search
If we need to apply an LLM’s reasoning abilities to draw conclusions over some specific information the model is not expected to have learned during training, we can use any kind of search (a retrieval plus ranking mechanism), no matter whether you store your data as vector embeddings in an ANN index like Faiss or in an old-school full-text index like Elasticsearch, and then feed these search results to the LLM as context, injecting them into the prompt. That’s roughly what Bing 2.0 and Bard (now powered by PaLM 2) do now.
I have implemented this search + LLM system both with a DPR architecture, where ChatGPT replaced the Reader model, and with full-text Elasticsearch. In both cases, the overall quality of the system depends on the quality of the data in your index: if it is specific and complete, you can count on better answers than vanilla ChatGPT provides.
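Here is a minimal sketch of that retrieval-plus-prompt pattern, assuming sentence-transformers for embeddings, Faiss for the index, and the pre-1.0 openai SDK for the LLM call; the model names, toy documents, and prompt wording are my illustrative choices, not the exact setup described above:

```python
# Minimal retrieval-augmented generation sketch: embed documents, search an
# ANN index, and inject the hits into the prompt. Assumes OPENAI_API_KEY is
# set in the environment; models and prompt wording are illustrative.
import faiss
import numpy as np
import openai
from sentence_transformers import SentenceTransformer

docs = [
    "Our premium plan includes 24/7 support.",
    "Refunds are processed within 14 days of purchase.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product == cosine here
index.add(np.asarray(doc_vectors, dtype="float32"))

question = "How long do refunds take?"
q_vec = encoder.encode([question], normalize_embeddings=True)
_, ids = index.search(np.asarray(q_vec, dtype="float32"), 1)  # top-1 hit

context = "\n".join(docs[i] for i in ids[0])
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response["choices"][0]["message"]["content"])
```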
Some even managed to build a Swiss Army knife library around GPT, call it a vector database, and raise a good round on that — my hat goes off to them!
But due to the textual interface of GPT models, you can build anything around it with any tools you are familiar with, no adapters are needed.
Model analysis
One of the questions that could give a clue to further model advancements is how these large models actually learn and where those impressive reasoning abilities are stored in model weights.
This week OpenAI released a paper, “Language models can explain neurons in language models”, along with an open-source project aiming to answer these questions by peeling away the layers of LLMs. Here is how it works: they observe the activity of some part of the studied model’s network that is frequently activated on a particular domain of knowledge; then the more powerful GPT-4 writes an explanation of what this particular part, or neuron, of the studied LLM is responsible for; finally, GPT-4 simulates that neuron’s activations on a number of relevant text sequences, and the match between the simulated and real activations yields a score for each explanation.
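Here is a heavily simplified sketch of that explain-and-score loop; the three callables passed in are hypothetical placeholders standing in for the real pipeline (which OpenAI has open-sourced), not its actual code:

```python
import numpy as np

def explain_and_score_neuron(neuron_activations, gpt4_explain, gpt4_simulate, texts):
    """Simplified explain-and-score loop.

    neuron_activations(text) -> per-token activations of the studied neuron
    gpt4_explain(records)    -> short natural-language explanation (string)
    gpt4_simulate(explanation, text) -> activations GPT-4 predicts from the
                                        explanation alone
    All three callables are hypothetical placeholders.
    """
    # 1. collect (text, activations) records for the neuron on sample texts
    records = [(t, np.asarray(neuron_activations(t))) for t in texts]
    # 2. ask GPT-4 to summarize what the neuron seems to respond to
    explanation = gpt4_explain(records)
    # 3. simulate the neuron's activations from the explanation alone
    real = np.concatenate([a for _, a in records])
    simulated = np.concatenate([np.asarray(gpt4_simulate(explanation, t)) for t in texts])
    # 4. score = how well the simulated activations track the real ones
    score = np.corrcoef(real, simulated)[0, 1]
    return explanation, score
```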
However, this technique has some drawbacks. First, as the authors state, their method gave good explanations for only about 1,000 neurons out of the roughly 300,000 studied.
Here is a quote from the paper:
However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic. This suggests we should change what we’re explaining.
The second point is that this technique currently does not provide insights into how the training process could be improved. But it is a good effort in terms of model interpretability research.
Maybe if the studied neurons were grouped into clusters based on their interdependencies, and those clusters demonstrated behavioral patterns that changed under different training procedures, that would give us some understanding of how certain model capabilities correlate with the training data and training policy. In a way, this clustering and differentiation could resemble the brain’s segmentation into areas responsible for particular skills. That could provide insights into how to efficiently fine-tune an LLM so that it gains a particular new skill.
Agents
Another trending idea is making an autonomous agent with a looped LLM — Twitter is full of experiments like AutoGPT, AgentGPT, BabyAGI, et al. The idea is to set a goal for such an agent and to provide it with some external tools such as other services’ APIs so it can deliver the desired result via a loop of iterations or chaining models.
Last week Hugging Face released Agents in their famous Transformers library to:
“easily build GenerativeAI applications and autonomous agents using LLMs like OpenAssistant, StarCoder, OpenAI, and more”. (c) Philipp Schmid
The library provides an interface to chain models and APIs capable of responding to complex queries in natural language and supporting multimodal data (text, images, video, audio). The prompt in this case includes the agent’s description, a set of tools (mostly other narrow-purpose neural networks), some examples, and a task. Agents will make model usage easier for non-engineers, but they are also a good starting point for building more complex systems on top of LLMs. And, by the way, here is the Natural Language API, a different kind of Internet from the one you know.
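For a taste of the interface, here is a minimal sketch following the release materials; the StarCoder inference endpoint and the example task are my reading of the announcement and may change over time:

```python
# Minimal sketch of the Transformers Agents interface as announced;
# treat the endpoint and task as illustrative assumptions.
from transformers import HfAgent

agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

# The agent prompts the LLM to write a small Python program that chains the
# relevant tools (here, a text-to-image model) and then executes it.
agent.run("Draw me a picture of rivers and lakes.")
```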
BTW, Twitter is going really crazy these days around AI; everybody is building something on top of LLMs and showing it to the world — I have never seen so much enthusiasm in the industry. If you want to investigate what’s up, I’d recommend starting that rabbit-hole dive with Andrej Karpathy’s recent tweet.
Coding co-pilots
Codex, powering GitHub Copilot, has been around for a while, and a few days ago, as a Colab Pro subscriber, I received a letter from Google saying that in June they would (citing the letter)
start gradually adding AI programming features to Colab. Among the first to appear:
- single and multi-line hints for code completion;
- natural language code generation, which allows you to send code generation requests to Google models and paste it into a notebook.
By the way, last week Google announced the PaLM 2 family of models, among which there is Codey, Google’s specialized model for coding and debugging, which will probably be powering these announced features.
To conclude this section, I’d like to say that my personal choice of NLP over CV around 2016 was made because language is the universal and ultimate way people transfer information. We even think in the concepts of our language, so the system is complex enough to define ourselves and the world around us. And that opens up the possibility of creating a language-driven system with reasoning abilities and consciousness that are humanlike, or even surpass that level. We only scratched the surface of that true reasoning about half a year ago. Imagine where we are and what will follow.
The mystery
If for any reason you are unfamiliar with Tim Urban, the author of the waitbutwhy blog, read his post on AGI, dated 2015. Check out how this looked from the past, roughly eight years ago, when there were NO LLMs around and no Transformer models either. I shall quote a few lines of his post here, just to remind you where we were back then.
Make AI that can beat any human in chess? Done. Make one that can read a paragraph from a six-year-old’s picture book and not just recognize the words but understand the meaning of them? Google is currently spending billions of dollars trying to do it.
But after we achieve AGI, things will start moving at a much faster pace, he promises. This is due to the Law of Accelerating Returns, formulated by Ray Kurzweil:
…Ray Kurzweil calls human history’s Law of Accelerating Returns. This happens because more advanced societies have the ability to progress at a faster rate than less advanced societies — because they’re more advanced.
Applying this law to current LLMs, it is easy to go further and say that the ability to learn and reason over all the data stored on the Internet couples superhuman memory with human-level reasoning, and that soon the smartest people around will be outsmarted by the machine, the same way chess champion Garry Kasparov was beaten by the Deep Blue computer in 1997.
This would bring us to Artificial Super Intelligence (ASI), but we do not know what it looks like yet. Maybe we would need another feedback loop for training it, as GPT-4’s learning from human feedback provides just human-level reasoning. It is quite possible that better models would teach the weaker ones, and that this would be an iterative process. Just speculating — we’ll see.
The thing Tim really emphasizes in the second part of his post on AGI is that, due to this Law of Accelerating Returns, we might not even notice the point when our systems surpass AGI, and things would then be a little beyond our understanding.
For now, just a small percentage of people who work in tech understand the real pace of progress and the astonishing potential that instruction-based LLM tuning brings. Geoffrey Hinton is one of them, publicly speaking about such risks as job market pressure, fake content production, and malicious usage. What I find even more important is that he points out that current systems, capable of zero-shot learning of complex skills, might have a better learning algorithm than humans do.
The concern with modern LLMs comes from the fact that while they provide huge leverage in a lot of tasks, the ability to work with these models — pre-train, fine-tune, do meaningful prompting, or incorporate them into digital products — is obviously unevenly distributed across society, both in terms of training/usage costs and skills. Some people from the Twitter or Hugging Face community would argue that we now have quite capable open-source LLMs as an alternative to the OpenAI hegemony, but they are still following the trend and are less powerful, plus they require certain skills to handle. And while OpenAI models are such a success, Microsoft and Google will invest even more into that research, trying to stop them. Oh, Meta too, if they finally let the Metaverse go.
One of the most in-demand skills nowadays is writing code: software engineering has dominated the tech scene and salaries for the last 20 years. With the current state of coding co-pilots, it looks like a good chunk of boilerplate code will soon be either generated or efficiently fetched and adapted (which looks the same from the user’s perspective), saving developers lots of time and maybe taking some job opportunities off the market.
There is another idea in that very good post on AGI and beyond: that an AGI would be capable of autonomous self-improvement. For now, vanilla LLMs are still not autonomous agents and by no means incorporate any willpower — the two ideas that scare people. Just in case: do not confuse the model’s training process, which involves reinforcement learning from human feedback (the RL algorithm used being OpenAI’s Proximal Policy Optimization), with the final model, which is just the Decoder part of a Transformer predicting token sequences.
You’ve probably noticed that a few of the papers I’ve cited were released just last week — I am sure the following weeks will bring new releases and ideas that I will wish I had covered in this post, but that’s a sign of the times.
It seems like we are rapidly entering a new era of software and have made a few steps towards the singularity point, as innovations in the machine learning industry are already happening at an unprecedented pace — several a month, whereas last year we saw just a few big releases. Enjoy the ride!
This article was originally published by Ivan Ilin on Hackernoon.