I used my one free online article to read Ted Chiang’s February 9, 2023 article in The New Yorker, “ChatGPT Is a Blurry JPEG of the Web.” It’s well worth the price of an issue (if they include it in one), one of those mind-expanding explanations of technology that ignites the imagination and shifts how one looks at a problem.
Ted Chiang, my favorite science fiction author, is a master of the pithy analogy. He describes the current crop of large generative models (GPT-3/ChatGPT, DALL-E, and the like) in terms of data compression and decompression. These may well be the same terms in which AI practitioners understand these problems, but Chiang articulates the connection so well that it’s hard not to see learning, both machine and human, in these terms. It feels revelatory.
If you’re already versed in compression algorithms, do read Ted Chiang’s article. It’s wonderful. But for my friends and family who aren’t in technology, with whom I’ve been discussing these text-and-art generating systems, I have a simplified walkthrough—my own stab at pithy analogies.
Compression and Decompression
Imagine you have a multiplication table, a larger version of the one kids memorize. It is written out in a nice grid: the numbers 0 to 99 across the top row, and the numbers 0 to 99 down the first column. Each row/column intersection (“cell,” to use spreadsheet terms) holds the product of its row and column headers. The raw representation of this table contains 10,200 chunks of information: a 100 x 100 matrix of cells, plus 100 row headers and 100 column headers.
But this table of numbers isn’t arbitrary. The row and column headers contain the same sequential numbers, and each cell is the product of its row and column headers. Given those three assumptions, you could represent this multiplication table as two chunks of information: 100 and “*” (the multiplication operator).
A decompressor program that knows what kind of “thing” it’s supposed to build (a square number grid where each cell is some math operation on the row and column header values) can perfectly reconstruct the original grid knowing only the values 100 and *. This combination of data and assumptions gives us lossless compression. We can take our grid of 10,200 values, compress it down to 2 values, and perfectly reassemble it later using the compressed data set plus our assumptions.
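If it helps to see that round trip spelled out, here is a minimal Python sketch of the idea (a toy illustration of my own, not anything from Chiang’s article): the compressed form is just the grid size and the operation, and the decompressor’s built-in assumptions supply everything else.

```python
# Toy sketch of the lossless round trip described above.

def compress(table):
    """The entire compressed representation: two chunks of information."""
    return (len(table), "*")

def decompress(size, op):
    """Rebuild the full grid from the compressed form plus our assumptions:
    a square grid whose headers run 0..size-1 and whose cells are row op column."""
    assert op == "*"
    return [[row * col for col in range(size)] for row in range(size)]

# Round trip: the headers come along for free (they're implicit in the indices),
# and every cell comes back exactly as it was -- lossless compression.
original = [[row * col for col in range(100)] for row in range(100)]
assert decompress(*compress(original)) == original
```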
What does this have to do with learning? We can think of learning as a means by which we transform a whole lot of data (like pictures of every car, every article on Wikipedia, or everything in the perceivable world) into a mental model that we can use to retrieve interesting information about the thing being modeled. A model, in this case, is a compressed version of the external data, plus some assumptions on how to use it.
Now consider another example: the predicted words your phone offers as you type on its keyboard. The simplest version of predictive text can be envisioned as another grid, with every word you’ve ever typed as both the row and column headers. The value in each cell is the percentage chance, across all the text you’ve ever entered (or that came pre-loaded), that the word in the column header immediately follows the word in the row header. So when you type “Thank” on your phone’s keyboard, the model knows that the most likely next word is “you.” It may list a few options, ranked by frequency. “Your” might be suggested if you frequently text your kid to “Thank your grandma.”
Reducing all the text messages you’ve sent and received, along with whatever text came pre-loaded, down to this table of next-word frequencies is an example of lossy compression. You cannot recreate all the original texts that went into the model, but if you start with a recognized word, you can probably generate a message similar to ones that are frequently typed. The results may not always make sense, but predictive keyboards are good at generating short, common messages.
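Here is a toy Python sketch of that kind of model (the message history is invented, and real keyboards are far more sophisticated): it simply counts which words follow which, then suggests the most frequent followers.

```python
from collections import Counter, defaultdict

def train(messages):
    """Count how often each word immediately follows each other word."""
    counts = defaultdict(Counter)
    for message in messages:
        words = message.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def suggest(counts, word, n=3):
    """Return up to n candidate next words, ranked purely by frequency."""
    return [w for w, _ in counts[word.lower()].most_common(n)]

history = ["thank you so much", "thank your grandma", "thank you again"]
model = train(history)
print(suggest(model, "Thank"))  # ['you', 'your'] -- frequency, not grammar, does the ranking
```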
Model Goals
It’s worth noting that the two models listed above are designed with different goals. The multiplication table’s goal is accuracy. It should be able to produce accurate results for any multiplication question within its boundaries. It’s also an entirely artificial example. Humans are faster at memory-retrieval than calculation, so we learn multiplication by memorizing multiplication tables. But computers can perform multiplication at lightning speeds. There’s no need to use a model. Humans might model multiplication this way, but computers don’t.
The goal of the model comes into play if you modify the multiplication table example so that, in a randomly distributed 15% of the cells, the number is actually 1 + the product of the row and column headers. Now you have two choices for compression/decompression. If you want to optimize for accuracy, your model will contain additional data, namely the “exceptions”: the cell coordinates where the value needs to be incremented by 1. But if the compressed model needs to remain small, and it’s “good enough” that roughly 15% of the cells, any cells, are one higher, then the decompression algorithm just needs to increment a random 15% of the cells it generates. The (+1) rule is built into the assumptions, and compression remains tight. Which method should be used? It’s a tradeoff. It depends on the goals of the model.
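Sketched as toy Python (my own illustration, not any real compression scheme), the two choices might look like this:

```python
import random

def decompress_exact(size, exceptions):
    """Lossless: rebuild the products, then apply the stored list of exception cells.
    The model is bigger because it must carry every exception coordinate."""
    grid = [[row * col for col in range(size)] for row in range(size)]
    for row, col in exceptions:
        grid[row][col] += 1
    return grid

def decompress_good_enough(size, exception_rate=0.15):
    """Lossy: increment a random 15% of cells -- the right proportion, not the right cells.
    The model stays tiny because the (+1) rule lives in the assumptions."""
    grid = [[row * col for col in range(size)] for row in range(size)]
    for row in range(size):
        for col in range(size):
            if random.random() < exception_rate:
                grid[row][col] += 1
    return grid
```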
Similarly, the goal of a predictive keyboard is to increase the speed of typing common phrases. The goal isn’t linguistic coherence or the accuracy of the information represented by the generated text. In fact, there is no knowledge of grammar or meaning in the model, or in the decompression algorithm that suggests the most frequent next word. Any coherence or meaning is purely coincidental. Coherence and meaning aren’t the goal. If they were, the system would need a very different set of data and assumptions.
This is important because large language models like GPT-3/ChatGPT have a goal of linguistic coherence. The model and assumptions are sophisticated, and directed toward generating text that looks like text that can be found across the internet, even in different categories of style that the user can specify. Our minds seize upon the familiar patterns these models generate, and ascribe meaning to the text we read (or the images we see). This is the exercise of all art: the audience looks at a work, and exercises empathy to understand the artist’s intent. When we do this successfully, the feeling can be profound, like we understand, and feel understood.
The problem is, with AI-generated artifacts, there is no intentional meaning behind what is created. Our “understanding,” if we achieve any, is pure projection. We trick ourselves into seeing meaning. If the generated text is presented as something authoritative, that can be dangerous. Nothing in the model requires the text to be truthful or accurate.
As human beings, we are susceptible to assuming that an articulate speaker is also a knowledgeable and honest one. We’ve all encountered situations where that assumption turned out to be wrong. We must be extra careful with AIs that are designed with the goal of being articulate (forming coherent texts, generating realistic images) but without any concern for being truthful or accurate. Now, more than ever, we cannot allow style to vouch for content.
An emerging business is automatic content generation for websites and blogs, because search algorithms reward frequent, topical updates. In my opinion, this is short-sighted. Without tight controls, this could be the equivalent of positioning an articulate, highly productive idiot as the voice of your brand. Ultimately people, not search algorithms, are your customers, and if you give them bad information, they will stop coming to you no matter how frequently you publish. Reputations are easy to lose.
Learning: Assimilation and Accommodation
Under this conception, learning is the process of updating the model to better represent the world in the context of our goals. Cognitive scientists describe two learning processes that involve our models and assumptions: assimilation and accommodation.
Assimilation is the process of taking in new data (a new object, person, experience, whatever) and fitting it into the existing model and assumptions. This is an “easy” process. Stereotypes are well-established models and assumptions that tend to assimilate a lot of data (rightfully or not). Confirmation bias exists because mental assimilation is the path of least resistance. While we talk about stereotypes and confirmation bias as bad things, they exist because, for the goals of the learning system, they work. The consequences of getting things wrong are usually low, and when they aren’t (we have a “learning experience”), we move on to the next process: accommodation.
Accommodation occurs when assimilation fails. It’s a “harder” process because it requires the system to change the model and add new assumptions. Expanding the multiplication table example so that 15% of the cells are incremented by 1 was an exercise in accommodation, no matter which direction we optimized: both the model and the assumptions were open for revision. Adding exceptions to a rule is an accommodation. Add enough exceptions, and a good learning system tries to find a model that wraps them in an easier generalization.
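To tie this back to the earlier example, here is a toy sketch (entirely my own invention, and far simpler than any real learning system) of a model that assimilates observations that fit its rule, notes the ones that don’t, and accommodates by revising the rule once the exceptions pile up:

```python
class TableModel:
    """A toy learner for the multiplication-table-with-exceptions example.
    The 'revision' below is hard-coded rather than inferred, purely for illustration."""

    def __init__(self):
        self.rule = lambda row, col: row * col  # current model: plain multiplication
        self.exceptions = {}                    # misfits remembered so far

    def observe(self, row, col, value, tolerance=3):
        if self.rule(row, col) == value:
            return "assimilated"                 # fits the existing model; nothing changes
        self.exceptions[(row, col)] = value      # doesn't fit; remember the exception
        if len(self.exceptions) <= tolerance:
            return "exception noted"
        # Too many exceptions: accommodate by revising the model to a broader
        # generalization (here, crudely, "every cell is product + 1").
        self.rule = lambda row, col: row * col + 1
        self.exceptions.clear()
        return "accommodated"
```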
These two processes call out another feature that a learning system must have if we’re to trust it: a mechanism to recognize failure, and accommodate appropriately. The word “appropriately” is doing some heavy lifting. Is there an acceptable error rate, under which the model doesn’t need revision? How severe does the error need to be to propel change? What are the consequences of errors? It’s not easy to answer these questions for third-party AI tools. The answers lie at the intersection of how they’re made and how they’re applied, and that’s doubly obscured to the eventual consumer. The current generation of AI contains a lot of obscurity. In the cases where it seems to work, this obscurity makes AI feel like magic. When it fails, it raises questions that even those who maintain the AI can’t answer. We should be cautious about chasing the shininess of the “magic” if we’re not prepared to deal with the inevitable failure.