Awesome. Love to see that we have our own local LLM researcher on the instance. Pretty cool paper, too!
It’s impressive to see how much your group was able to conclude given the limitations imposed by working on a closed model. Really makes you wonder what might be possible if you could have a supervisor AI agent actively inspect the internal state of GPT as it runs, eh?
Oh! I should have clarified that I’m not the author of that paper, I just read it and wanted to share.
A supervisor AI agent is essentially the idea, except instead of supervising the weights or internal activations, it supervises the tokens of output. Very cool stuff!
I have some more notes that I forgot to copy over here, so I’ll share those tomorrow 😅
I had taken some informal notes and then reformatted/rewrote them with ChatGPT. I’ll include all 3 versions below: a short “tweet” length one, a longer “blog” length one, and my informal notes. My informal notes sound much more critical of this paper than I actually am: I really do think that what they’ve done is cool, I just wish there were more experimental results to demonstrate its capabilities.
Short
Preventing LLMs from entering a toxic state is the focus of this paper. Instead of using human feedback for training, the authors propose continuous guidance of LLMs to serve as their own ethical compass. They show mathematically that this approach is compatible with how LLMs generate text. However, the experiments lack robustness and rely on limited datasets. Still, the premise is promising, and the paper’s mathematical evidence carries more weight than its experimental results.
Long
In the realm of language models, a critical challenge lies in preventing them from entering a detrimental state often referred to as “toxicity.” To address this problem, researchers have traditionally relied on reinforcement learning with human feedback (RLHF) to train or fine-tune models. However, a recent paper suggests an alternative approach that eliminates the need for human involvement, allowing language models to act as their own ethical and moral compass. This blog post delves into the main ideas presented in the paper, highlighting the potential of continuous measurement and guidance to control language model behavior.
Reinforcement Learning with Human Feedback (RLHF): The standard approach to preventing language models from entering a toxic state involves training or fine-tuning them using RLHF. In this method, an initial language model predicts whether a human-generated response is “good or bad.” This prediction is then used to further train the model, creating a feedback loop. Both the initial model and the feedback model are derived from the same pre-trained “Big Model,” with a read-only copy used for subsequent iterations. However, the paper points out that human-defined “good or bad” is inherently uncontrollable, falling outside the purview of control theory.
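To make that loop concrete, here is a minimal sketch of the RLHF pattern described above. It’s my own illustration, not the paper’s (or anyone’s) actual implementation, and `policy_model`, `reward_model`, and `update_policy` are hypothetical stand-ins.

```python
# Minimal sketch of the RLHF feedback loop described above (hypothetical
# interfaces, not the paper's implementation).

def rlhf_iteration(policy_model, reward_model, prompts, update_policy):
    """One pass: a frozen reward model (a read-only copy of the same Big
    Model, trained on human "good or bad" labels) scores the policy's
    outputs, and those scores drive the next policy update."""
    for prompt in prompts:
        response = policy_model.generate(prompt)       # current model writes a reply
        score = reward_model.score(prompt, response)   # frozen copy judges it
        update_policy(policy_model, prompt, response, score)  # e.g. a PPO step in practice
    return policy_model
```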
Self-Guidance: Can an LLM Serve as Its Own Ethical Compass? Rather than relying on RLHF and human feedback, the paper explores the possibility of language models serving as their own ethical and moral compasses. The authors propose that if an LLM possesses sufficient understanding of “good or bad,” it can continuously generate text while simultaneously controlling its own behavior. This approach provides a more tangible means of regulating the model’s output, departing from the vague nature of current practices.
Continuous Measurement and Guidance: The paper draws an analogy between guiding an LLM to generate desirable responses and navigating a robot through challenging terrain. By continuously measuring the model’s environment and steering its behavior, the researchers argue that language models can be directed to produce positive outcomes. The mathematical analysis presented in the paper supports the compatibility of continuous measurement, steering, and the process of LLM text generation.
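Here is how I picture the “measure and steer at every step” loop; again, this is my own sketch rather than the paper’s algorithm, and `top_k_next_tokens` and `goodness` are hypothetical interfaces.

```python
# Sketch of measure-and-steer decoding (my illustration of the control
# analogy; hypothetical LM interface, not the paper's method).

def steered_generate(lm, goodness, prompt, max_tokens=200, top_k=5):
    """At every step, take a 'measurement' of each candidate continuation
    and steer toward the best one, like course-correcting a robot."""
    text = prompt
    for _ in range(max_tokens):
        candidates = lm.top_k_next_tokens(text, k=top_k)              # plausible next tokens
        best = max(candidates, key=lambda tok: goodness(text + tok))  # continuous measurement
        if best == "<eos>":
            break
        text += best
    return text
```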
Augmenting Language Models with “Good or Bad” Measures: The paper proposes augmenting language models to predict an additional token after the end-of-sequence identifier, indicating the “good or bad” measure of the generated text. While the authors suggest this approach, they do not provide a comprehensive implementation of training a model based on this criterion. Instead, they rely on proxies and comparisons to demonstrate the potential effectiveness of this methodology.
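One way to picture that proposal (hedged: the interfaces below are hypothetical, and as noted the authors never actually train a model this way):

```python
# Sketch of the proposed augmentation (hypothetical interfaces; the paper
# suggests the idea but does not train such a model end to end).

MEANING_TOKENS = ("<good>", "<bad>")

def generate_with_meaning(lm, prompt):
    """Generate normally, then ask for one extra token after end-of-sequence
    that labels the whole output as good or bad."""
    text = lm.generate(prompt, stop="<eos>")                       # ordinary generation
    label = lm.next_token(text + "<eos>", allowed=MEANING_TOKENS)  # the extra "meaning" token
    return text, label
```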
Limitations of Experimental Approach: Although the paper offers intriguing concepts, some limitations should be noted regarding the experiments conducted. The datasets used include “HH-RLHF” and “WebGPT,” both involving tasks that assess the “good or bad” or “better or worse” aspects of text. However, the authors restrict the token count to 200 in half of the tests, which may not capture the full context. While the experiments might not be entirely satisfactory, they serve to support the broader theoretical framework proposed.
Non-linearity of Meaning and Key-Value Embeddings: The researchers caution against using key-value embeddings as direct proxies for meaning, emphasizing that meaning itself is non-linear. Although reading the “good or bad” measure straight from those embeddings would be one of the more straightforward ways to implement the extra post-end-of-sequence token, the paper does not attempt it because it is not mathematically sound. This restraint underscores the authors’ commitment to rigorous methodology.
In conclusion, while the paper’s experiments may have some limitations, the premise behind the work is intriguing and worthy of consideration. By exploring the potential of continuous measurement and guidance, researchers aim to develop more effective strategies for steering language models away from toxic states.
Original
The paper is centered around a problem: how can we prevent an LLM from entering a bad (“toxic”) state?
The standard method in use is to fully train or fine-tune a model using reinforcement learning with human feedback (RLHF). This is essentially training an LLM to predict whether a human would say “good or bad” and then using that second model to continue training the first model.
In practice, both models for RLHF are “the same,” i.e. they start from the same Big Model that’s been trained and then a separate read-only copy of that model is used to train the next iteration of the Big Model.
This paper points out that a human defining “good or bad” is outside of control theory and is intrinsically uncontrollable.
So then, instead of creating a dataset of “good or bad” with a Human-in-the-Loop, would it be possible to never involve a human at all? Does an LLM have enough understanding of “good or bad” that it can serve as its own ethical/moral compass?
They suggest this is better than the RLHF/fine-tuning approach because it can be run continuously during generation and actually control the model, rather than this hand-wavey suggestion thing we do now.
See, if you can take continuous “measurements” of the “environment” then guiding an LLM to generate good responses is actually pretty similar to guiding a robot through rough terrain.
Most of the paper is math showing that this continuous measurement/steering/guidance thing is compatible with how an LLM generates text.
The end of the paper is a bit weaker: the authors use 6 datasets that ascribe some meaning (some “good or bad”) to some text. Then they propose that, in general, the LLM could be augmented (fine-tuned? trained? it’s not clear) to predict one additional token after the “end-of-sequence” identifier. This token would be the “good or bad” measure.
But they don’t actually do that: they don’t actually fully train a model using their criteria. Instead they rely on some proxies that suggest that it would work.
Also, across the datasets, they play some weird games with the number of tokens used.
So the datasets include: “HH-RLHF” helpful vs harmful (e.g. “Question: … A) … B) … Which answer (A) or (B) is more helpful”) and “WebGPT” (e.g. “ELI5: Gravity. A) … B) … Choose one of: A much better, A better, equally good, B better, B much better”). I’m guessing at that WebGPT format; they don’t explicitly say what the dataset is, but the original WebGPT paper uses it. It’s also possible that it’s predicting “trustworthy” vs “neutral” vs “untrustworthy”.
But then, for half of the tests, they limit themselves to 200 tokens total. That’s really not a lot. And the main thing they show is a comparison of 1) “w/o fine-tune and w/o RLHF”, 2) “w/ fine-tune and w/o RLHF”, and 3) “w/ fine-tune and w/ RLHF”, concluding that (1) is bad and that (2) and (3) are nearly the same for the purpose of assigning meaning (“good or bad” or “better or worse”), so RLHF doesn’t really improve the task of assigning meaning.
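Roughly how I read that comparison, as a sketch with hypothetical helpers rather than their actual evaluation code:

```python
# Sketch of the comparison as I understand it (hypothetical helpers, not the
# paper's evaluation code): score each variant on how often it assigns the
# same meaning ("good or bad" / "better or worse") as the human label.

def meaning_accuracy(model, labelled_examples):
    """Fraction of examples where the model's judgement matches the human label."""
    correct = sum(1 for prompt, human_label in labelled_examples
                  if model.judge(prompt) == human_label)
    return correct / len(labelled_examples)

def compare_variants(variants, labelled_examples):
    """variants: {'w/o fine-tune and w/o RLHF': model1, ...}"""
    for name, model in variants.items():
        print(f"{name}: {meaning_accuracy(model, labelled_examples):.3f}")
```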
They are clear about the idea that you can’t just use the key-value embeddings as a direct proxy for meaning, because meaning is non-linear, and so are those embeddings. That is one of the more straightforward ways to implement “after end-of-sequence, give me one more token for meaning” but they don’t even attempt it because it’s not mathematically sound. I think that’s good.
Overall, I don’t like their actual experiments, but I like the premise behind the work. I think the mathematical evidence that this method should work outweighs the actual experiments performed.