Readings on Large Language Models
In a conversation on Bluesky, I commented that I have never found an explanation of the neural net part of the LLMs like ChatGPT that makes sense. “It works like your brain” is clearly not an explanation, since we know so little about how our brains work, but it’s a go-to for reporters who find all this heavy going.
I specifically asked for links, because a good explanation obviously is going to be longer than six 200-character posts, but one reply guy gave it his best, which was very bad. But I did get several good links back.
This post is far from an explanation of neural nets or ChatGPT. It’s some of the things I learned and links to resources that at least look useful. My understanding of these models is a work in progress.
Large language models, explained with a minimum of math and jargon. Also at Ars Technica. I found this article particularly useful for its definition of “word vector.” I’m not sure that definition will be as useful to someone who doesn’t know linear algebra, but the concept of “word vector” gives me a tool to understand how LLMs predict the next word and how “training” can go badly wrong. The word vector is a description of where a word sits in a multidimensional space whose dimensions are various kinds of meanings.
Because a word vector is expressed in numbers of a type that computers commonly manipulate, it’s possible to do mathematical operations on words. Like doctor – man + woman = nurse. Oops! That result has to do with the training, which helps to define the position of the word vector in that multidimensional space. The qualities chosen as the dimensions of that space will also influence how the LLMs string words together.
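To make the word arithmetic concrete, here is a minimal sketch in Python. The three-dimensional vectors and the dimension labels (medical, male, female) are invented for illustration; real models learn hundreds or thousands of dimensions from text, and nobody labels them by hand. The stereotype is baked into my toy numbers on purpose, the way it gets baked into a real model by its training text.

```python
import math

# Toy word vectors. The numbers are made up for this example (an assumption,
# not taken from any real model). Hypothetical dimensions: (medical, male, female).
vectors = {
    "doctor":  [0.9, 0.6, 0.1],
    "nurse":   [0.9, 0.1, 0.6],
    "man":     [0.0, 0.9, 0.0],
    "woman":   [0.0, 0.0, 0.9],
    "teacher": [0.1, 0.4, 0.4],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# doctor - man + woman, computed one dimension at a time
query = [d - m + w for d, m, w in
         zip(vectors["doctor"], vectors["man"], vectors["woman"])]

# Find the closest word that was not part of the arithmetic itself
candidates = [w for w in vectors if w not in ("doctor", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(nearest)
```

With these particular numbers the nearest word is “nurse,” because I gave “doctor” a high score on the male dimension and “nurse” a high score on the female one. Change those numbers and the answer changes.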
This article talks about layers of “transformers,” 96 of them in GPT-3. I think these layers are each a neural net. Obscurantism begins here, as the authors say that we don’t know what is happening in these layers, but researchers are working on finding out. This is where the religious aspect discussed by Henry Farrell is shoehorned in. We don’t know how the model keeps track of certain information, but it seems to be carrying it along in a way we can’t see and then
- In the attention step, words “look around” for other words that have relevant context and share information with one another.
- In the feed-forward step, each word “thinks about” information gathered in previous attention steps and tries to predict the next word.
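Stripped of the “look around” language, the attention step is arithmetic on word vectors. Here is a minimal sketch, assuming a much-simplified version: each word’s vector is replaced by a weighted average of all the vectors in the sentence, with larger weights for vectors that point in similar directions. Real transformers also multiply the vectors by learned weight matrices before comparing them, and they only let a word use the words that come before it; both are left out here. The three toy vectors and the function name `attention_step` are my own inventions.

```python
import math

def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_step(vectors):
    """Simplified attention: no learned matrices, no masking (assumptions)."""
    dim = len(vectors[0])
    mixed_vectors = []
    for current in vectors:
        # Score every word against the current word, then normalize
        scores = [dot(current, other) / math.sqrt(dim) for other in vectors]
        weights = softmax(scores)
        # New vector = weighted average of all the word vectors
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        mixed_vectors.append(mixed)
    return mixed_vectors

# Three made-up two-dimensional word vectors standing in for a short sentence
sentence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
result = attention_step(sentence)
for before, after in zip(sentence, result):
    print(before, "->", [round(x, 3) for x in after])
```

Nothing in this is “looking” or “thinking.” It is dot products and averages, repeated many times.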
And now, by my lights, we are much too far into anthropomorphism to bother going much further.
Quite a bit further on, we get to, I think, a description of the layers as neural nets.
So in the largest version of GPT-3, there are 49,152 neurons in the hidden layer with 12,288 inputs (and hence 12,288 weight parameters) for each neuron. And there are 12,288 output neurons with 49,152 input values (and hence 49,152 weight parameters) for each neuron. This means that each feed-forward layer has 49,152 * 12,288 + 12,288 * 49,152 = 1.2 billion weight parameters. And there are 96 feed-forward layers, for a total of 1.2 billion * 96 = 116 billion parameters! This accounts for almost two-thirds of GPT-3’s overall total of 175 billion parameters.
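The arithmetic in that passage is easy to check. A short sketch using only the numbers quoted above; the variable names are mine, and like the quoted passage it counts weights only.

```python
# Numbers quoted from the article for the largest version of GPT-3
inputs_per_neuron = 12_288   # also the number of output neurons
hidden_neurons = 49_152
layers = 96
total_parameters = 175_000_000_000

# Weights into the hidden layer plus weights out of it
per_layer = hidden_neurons * inputs_per_neuron + inputs_per_neuron * hidden_neurons
all_layers = per_layer * layers
share = all_layers / total_parameters

print(f"{per_layer:,} weights per feed-forward layer")     # 1,207,959,552
print(f"{all_layers:,} weights in all 96 layers")          # 115,964,116,992
print(f"{share:.1%} of the 175 billion total")             # 66.3%
```

So the quoted figures hold up: about 1.2 billion per layer, about 116 billion in all, and just under two-thirds of the total.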
There’s only a short bit about training the models, but it involves trial and error on the part of the neural net to place the word vector in multidimensional space. This kind of thing, called “successive approximations,” has been done in many areas of mathematics, but, like the rest of an LLM, it involves many, many variables and calculations. It looks to me like the difference from standard methods of successive approximations is that one can’t necessarily expect the results to converge on a single answer.
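One way to see how successive approximations can fail to settle on a single answer is a toy example. The sketch below uses gradient descent, the standard family of methods for training neural nets, on a made-up “error” function of one variable, (w² − 1)², which has two equally good minima at w = 1 and w = −1. The function, the step size, and the starting points are all my assumptions, chosen only to show the behavior; a real model adjusts billions of weights at once.

```python
def error(w):
    """Made-up error function with two equally good minima, at w = 1 and w = -1."""
    return (w * w - 1.0) ** 2

def slope(w):
    """Derivative of the error function: d/dw (w^2 - 1)^2 = 4w(w^2 - 1)."""
    return 4.0 * w * (w * w - 1.0)

def descend(start, step_size=0.05, steps=200):
    """Repeatedly nudge w a little way downhill from the starting guess."""
    w = start
    for _ in range(steps):
        w = w - step_size * slope(w)
    return w

from_right = descend(0.5)    # starting guess on the positive side
from_left = descend(-0.5)    # starting guess on the negative side
print(round(from_right, 6), round(from_left, 6))
print(error(from_right), error(from_left))
```

Both runs drive the error to essentially zero, but they land on different values of w. Which answer you get depends on where you start.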
Then there’s the usual paean to how wonderfully the LLMs work.
What Is ChatGPT Doing … and Why Does It Work? This is by Stephen Wolfram, which immediately raised red flags for me. What I’ve read from Wolfram before has been overcomplex. But this, while long, is accessible if you’re willing to put in the time. All these explanations are long, because the structure of LLMs is complex.
I won’t go into the same level of detail on this article as I did on the first. I found the first easier to learn some concepts from, but Wolfram goes into some of the structure that I’ve been wondering about. At the same time, I found that practically every one of his paragraphs prompted more questions from me.
- He doesn’t use the phrase “word vector.” Why? Is this one of the difficulties in finding a good explanation – that different people use different terminology?
- In the sixth paragraph, he talks about “temperature” and an arbitrarily chosen parameter that represents it. Boltzmann? Why Boltzmann? asks the statistical thermodynamicist.
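On the Boltzmann question, a partial answer as I understand it: language models commonly turn their raw scores for each candidate next word into probabilities with a “softmax” formula, p proportional to exp(score / T). That has the same mathematical form as the Boltzmann distribution, with the score standing in for negative energy, which is presumably where the name “temperature” comes from. Here is a minimal sketch; the candidate words and their scores are invented, and this is the generic formula, not code from ChatGPT.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities: p_i proportional to exp(score_i / T)."""
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)                      # subtract the max for numerical safety
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next words with made-up raw scores
words = ["cat", "dog", "carburetor"]
scores = [2.0, 1.5, 0.1]

cold = softmax_with_temperature(scores, 0.2)   # low temperature
hot = softmax_with_temperature(scores, 2.0)    # high temperature

for word, p_cold, p_hot in zip(words, cold, hot):
    print(f"{word:12s} T=0.2: {p_cold:.3f}   T=2.0: {p_hot:.3f}")
```

At low temperature nearly all the probability piles onto the top-scoring word; at high temperature the probabilities flatten out and unlikely words get picked more often. It does not answer why a particular value of the temperature is chosen, which, as far as I can tell, is a matter of what gives output people like.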
Later in the piece, there are comments about adjusting parameters. This is something that is done as models get more complex, but there need to be better mechanistic reasons than “it makes the results come out right,” which seems to be the criterion here. And how do we know what is right?
I am also told by several commenters that the “neural net” is modeled on an obsolete understanding of how vision works. It’s not how vision works, and it’s not how your brain works, but the words “neural net” give the impression that it’s something like that. So the developers are bullshitting themselves along with the rest of us.
The Wolfram piece is helpful, but I don’t want to go into all the questions it provokes in this post, which is getting long already. So I’ll stop here and hopefully get back to it.
Also recommended by commenters, but I haven’t checked them out, so I can’t recommend or not recommend them.
Photo by Joakim Honkasalo on Unsplash
Cross-posted to Nuclear Diner