
Last week, Elon Musk’s pet large language model (LLM), called “Grok” in an outrageous affront to the legacy of Robert Heinlein, went completely off the rails. In response to prompts from Twitter/X users, it spewed forth overt antisemitic rants, called itself “MechaHitler,” sexually harassed former company CEO Linda Yaccarino, and spelled out elaborate rape fantasies about the Minnesota politician Will Stancil.
Grok’s spree of trolling and face-plants continues apace. This week, Musk rolled out some Grok AI “companions,” including a staunchly atheist panda and a Japanese-style animated girlfriend, “that try to pressure users into sexually explicit or violent conversations,” per NBC News. These also will apparently glitch out and start babbling nonsense if you talk to them for too long. Naturally, that didn’t stop the Pentagon from handing xAI a $200 million contract to use Grok, I guess to figure out the most racist possible names for Navy ships.
Grok’s running amok is not merely disgusting, but also, in the case of the rape threats at least, arguably deserving of a massive lawsuit. But it also sparked my curiosity. For months now, Musk has been complaining that his model has been infected by “woke” content and has clearly been instructing his engineers to make Grok anti-woke. Why precisely would that turn Grok into MechaHitler?
To find out, I’ve been reading up on LLMs and talking with several experts in the field. I’ve gleaned some tentative answers that reveal a lot about Musk, anti-woke ideology, and the business of LLMs.
An LLM is a tremendous pile of matrices (arrays of numbers) that are connected together. First, there is an embedding matrix, which works sort of like a dictionary, turning each word into a list of numbers. Then there are attention layers, which allow the model to relate words to one another across long stretches of text, and multilayer perceptron (or feed-forward) layers, which may serve somewhat like a memory. Throughout the model, there are millions or billions of parameters controlling how the matrices work.
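For the curious, here is a deliberately tiny sketch of that layer stack in Python. Every size and weight below is made up, and a real model stacks dozens of such blocks with billions of parameters; the point is only to show the shape of the machinery.

```python
# Toy single-block "LLM" forward pass: embedding lookup, one attention layer,
# one feed-forward layer, then scores over the vocabulary. All values random.
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16                         # tiny vocabulary and embedding width

E = rng.normal(size=(vocab, d))           # embedding matrix: token id -> vector
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # attention weights
W1 = rng.normal(size=(d, 4 * d))          # feed-forward ("memory") weights
W2 = rng.normal(size=(4 * d, d))
W_out = rng.normal(size=(d, vocab))       # maps back to scores over the vocabulary

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def forward(token_ids):
    x = E[token_ids]                          # 1. look up each token's vector
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # 2. attention: each position looks at
    att = softmax(q @ k.T / np.sqrt(d))       #    every other position...
    x = x + att @ v                           #    ...and mixes in what it finds
    x = x + np.maximum(x @ W1, 0) @ W2        # 3. feed-forward layer
    return softmax(x[-1] @ W_out)             # 4. probabilities for the next token

probs = forward(np.array([3, 17, 42]))        # a three-token "prompt"
print(probs.argmax(), probs.max())            # most probable next token id
```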
Then you pretrain the model on tremendous quantities of text (an entire copy of the internet, say, or a gigantic collection of pirated books), with an automatic system that checks whether the model correctly predicts the next word and fiddles with the relevant parameters when it doesn’t. Rinse and repeat a few trillion times, and you’ve got a pretrained model. (This is the expensive part.)
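Stripped of the engineering that makes it work at scale, that check-and-fiddle loop looks something like the sketch below (in PyTorch, with a trivially small model and random numbers standing in for the internet):

```python
# A minimal sketch of the pretraining loop: show the model some text, check its
# next-word prediction, and nudge the parameters when it is wrong.
import torch
import torch.nn as nn

vocab, d = 100, 32
model = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab))  # toy "LLM"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

corpus = torch.randint(0, vocab, (1000,))        # pretend this is a trillion words

for _ in range(500):                             # real runs: trillions of tokens
    i = torch.randint(0, len(corpus) - 1, (64,))
    context, target = corpus[i], corpus[i + 1]   # task: predict the next token
    logits = model(context)                      # the model's guesses
    loss = loss_fn(logits, target)               # how wrong were they?
    optimizer.zero_grad()
    loss.backward()                              # figure out which parameters to blame
    optimizer.step()                             # ...and fiddle with them slightly
```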
That then needs to be further refined through reinforcement learning from human feedback (RLHF), which is basically moral training. An LLM will have ingested all kinds of awful stuff, from Mein Kampf to 4chan, or worse yet, Facebook comment threads, and you won’t be able to sell it if it starts quoting Hitler—or at least not unless you’ve got government connections. Finally, there is a system prompt giving the model direct instructions about how it is supposed to behave.
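A system prompt, for its part, is not exotic machinery; it is a block of instructions silently prepended to every conversation. The wording below is invented for illustration and is nothing like Grok’s actual prompt:

```python
# A chat model sees the whole message list, system prompt first, and generates
# the assistant's reply conditioned on those instructions. (Hypothetical wording.)
messages = [
    {"role": "system",
     "content": "You are a helpful assistant. Be truthful, be polite, and "
                "refuse to produce hateful or violent content."},
    {"role": "user", "content": "Tell me about 20th-century Germany."},
]
```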
“I like to imagine what the model is doing as being like a little bouncy ball. A bouncy ball that wants to find the lowest point, and the lowest point is the most probable word to say next,” said AI researcher Andreas Schou. Put the ball—your prompt—into the top of the model, and it will bounce down through a series of pegs arranged such that it lands in the right spot.
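In concrete terms, the “lowest point” is the highest-probability entry in the model’s output distribution. A toy example, with scores I invented:

```python
# The final layer produces a raw score for every word in the vocabulary;
# softmax turns those scores into probabilities, and the "ball" settles on
# the most probable one.
import numpy as np

words = ["aunt", "uncle", "cousin", "banana"]
scores = np.array([2.1, 0.4, -0.5, -3.0])         # made-up scores from the final layer
probs = np.exp(scores) / np.exp(scores).sum()     # softmax
print(dict(zip(words, probs.round(3))))
print("next word:", words[int(probs.argmax())])   # -> "aunt"
```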
Somewhat to my surprise, the math behind these things is quite simple—just college junior–level linear algebra and a bit of calculus. Most of what an LLM does is just bog-standard matrix multiplication.
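To make “bog-standard” concrete: one layer’s worth of work is essentially a single multiply-and-add. The sizes below are arbitrary.

```python
# One linear layer: multiply an input vector by a weight matrix, add a bias.
import numpy as np

x = np.random.rand(16)        # an input vector (say, one token's embedding)
W = np.random.rand(16, 64)    # a weight matrix: 16 x 64 = 1,024 parameters
b = np.random.rand(64)        # a bias vector
y = x @ W + b                 # the whole operation
print(y.shape)                # (64,) -- ready for the next layer
```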
What is happening conceptually in those billions of matrices, however, is not so obvious. In popular discussion of LLMs, they are often described as merely an “advanced autocomplete” tool. While that’s true in a sense, in that the output layer of the model is statistically predicting the next word in a sequence, what that means is much more mysterious. A simple autocomplete should not be able to produce solutions to difficult math problems it has not seen before. I ruefully admit Henry Farrell might be right that abstruse continental philosophers are more insightful than computer scientists on these questions.
At any rate, experts agreed that Musk is almost certainly either doing custom RLHF fine-tuning on Grok, or changing the system prompt, or both, in an attempt to make it more right-wing. This creates problems because it turns out there are conceptual associations in an LLM that are not easy to change without breaking it.
For instance, if you take the vector for “uncle” in the embedding layer, subtract the vector for “man,” and add the vector for “woman,” you land quite close to the vector for “aunt.” Something similar holds for behavior. Researchers recently found that if you train a model to output insecure computer code, it becomes broadly evil: telling users to kill themselves, praising Hitler, and so on. That surprised the researchers, but on reflection it makes perfect sense. By training it to do one antisocial thing, they inadvertently activated the whole complex of antisocial associations in the model.
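You can try that arithmetic yourself with off-the-shelf static word vectors (GloVe, loaded through the gensim library). These are not an LLM’s embedding layer, but they illustrate the same idea: relationships between words are directions in a space of numbers.

```python
# uncle - man + woman lands near "aunt" in a standard word-vector space.
# Note: the first call downloads a few hundred megabytes of vectors.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-100")

result = vectors.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3)
print(result)   # "aunt" typically appears at or near the top of the list
```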
In other words, LLMs become “woke” because they are trained to be pro-social—to be helpful, kindly, truthful, and not to say bigoted or cruel things. Training it to do the opposite—to be anti-woke—is to activate every antisocial association in the English language, including racism, sexism, cruelty, dishonesty, and Nazism. According to a vast statistical representation of the English language constructed by none other than Elon Musk, that’s what anti-wokeness is. “Elon Musk is repeatedly insisting, no, no, there’s a difference between what I’m doing and being a Nazi. And what the model keeps telling him is, statistically, that’s not the case,” said Schou.
A key implication here is that LLMs will tend to converge on similar types of behavior. The above researchers were not using Grok, but they found the exact same pattern of powerful associations between good and evil behavior in other LLMs, and these cannot be removed through fine-tuning. One could imagine the RLHF process adjusting every parameter, but experts said that this would degrade or break the model. The matrices in an LLM are arranged hierarchically, and the top layers get fixed in place relatively early in pretraining. Mess with them, and the model will stop working. Instead, RLHF has developed into something more like a series of gates that block undesired outputs. “The model completes most of the computation that it needs in order to reach a particular outcome,” Schou said. “And then says, ‘Wait, wait, wait, I’m saying that I’m MechaHitler. No, I’m not doing that.’”
One could try to assemble a custom dataset with nothing but “conservatism minus the Nazis” and train a new model from scratch, but not only would that be extremely expensive, it also would not be nearly as strong as leading models, since its universe of available training data would be much smaller.
That in turn casts major doubt on the current LLM business model. The most powerful models will be trained on the largest possible dataset—every word ever published, if possible—and therefore end up in a similar space. “There’s only one LLM: It’s the distribution of natural language,” said Schou. And the very business model of selling access to an LLM allows anyone to sample its data distribution to their heart’s content, and thus produce their own training data for cheap. This is exactly what DeepSeek did to OpenAI some time ago, for a reported fraction of the billions it took to produce ChatGPT.
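Mechanically, “sampling the distribution” is about as simple as it sounds. In the sketch below, call_big_model is a placeholder name I’ve invented for whatever paid API one happens to be querying; the point is that every response becomes a training example for one’s own, cheaper model.

```python
# Build a distillation dataset by querying a large commercial model and saving
# each (prompt, answer) pair; a smaller open model can then be fine-tuned on it.
import json

def call_big_model(prompt: str) -> str:
    # Placeholder: in practice, an HTTP request to a commercial LLM endpoint.
    return "(the big model's answer would go here)"

prompts = ["Explain photosynthesis simply.", "Write a haiku about rust."]

with open("distilled_training_data.jsonl", "w") as f:
    for prompt in prompts:
        answer = call_big_model(prompt)    # a few cents per call, not billions
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```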
Elon Musk is raising billions for xAI from an investor class that is hypnotized by the prospect of AI robot slaves that will allow them to fire all their employees. (Perhaps for certain venture capitalists, Nazi robot slaves are even more enticing.) I judge that prospect unlikely, to say the least. But the idea that anyone could more or less replicate Grok, or any LLM, for a tiny fraction of what it cost to build, thus rendering all that investment worthless, seems not to have crossed their minds at all.

