22

Traditional non-AI software, built from if-then-else and loop statements, can be fully controlled. In contrast, the behaviour of machine learning software is unpredictable, since the developer cannot control what the software learns from the data. In particular, large language models (LLMs), which are a form of neural network, are black boxes.

How can AI developers debug a misbehaving LLM when they cannot control what it learns, or even fully understand how it learns, given that an LLM is effectively a black box?

tripleee
  • 177
curious
  • 331

4 Answers

22

machine learning software behaviour is unpredictable since the developer cannot control what the software learns from the data.

If you control the model, the data, the input, and the random seed, then the output will be the same every time.
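For example (a minimal sketch, assuming the Hugging Face transformers library and using a small public model as a stand-in; the prompt is arbitrary):

    # Same model, same data, same input, same seed -> same output, every run.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any small causal LM works here
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The developer cannot control", return_tensors="pt")
    torch.manual_seed(0)                                 # pin "the random"
    output = model.generate(**inputs, do_sample=True, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    # Re-running this script prints the identical continuation each time, because
    # the weights, the input, and the random seed are all fixed.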

What you don’t have are feature flags. If you tell your code monkey “This LLM is racist! Fix it!” they won’t find a racist flag to set to false, or any racist code to remove.

Training data that reflects racism, or whatever problem you don’t like, will spread to all the nodes in a way that keeps the code monkey from reaching in and tweaking it. That’s because this kind of programming isn’t optimized for manual tweaking. It’s optimized to reflect the training data. You fix it with better data.
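In code terms, that “fix” looks less like patching logic and more like curating the corpus before retraining. A rough sketch, where the filter criterion is just a stand-in for whatever actually counts as bad data:

    # Hypothetical sketch: the fix is data curation, not a flag you can flip.
    BLOCKLIST = {"bad_term_1", "bad_term_2"}   # placeholder; real systems use classifiers

    def is_acceptable(example: str) -> bool:
        return not any(term in example.lower() for term in BLOCKLIST)

    raw_corpus = ["a fine sentence", "a sentence containing bad_term_1", "more fine text"]
    clean_corpus = [text for text in raw_corpus if is_acceptable(text)]
    # The model is then retrained (or fine-tuned) on clean_corpus; there is no single
    # weight you could edit by hand to get the same effect.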

If you don’t want your kids to swear, don’t swear in front of your kids.

Or you can teach it what swearing is and when it’s inappropriate. It still won’t show up as a feature flag. It’s just more data.

They can also massage your input and censor the output in an attempt to sanitize it. But once the problem behavior is learned, it’s always there, waiting for a new way to sneak out.

candied_orange
  • 119,268
17

The answers so far have not mentioned reinforcement learning from human feedback (RLHF). I'm not an expert, but my understanding is that at least in the case of ChatGPT, this is the main way in which the model's behaviour is controlled.

Basically, as I understand it, this consists of a large number of low-paid humans testing the model and ranking its responses according to various criteria. These rankings are then used to adjust its behaviour using reinforcement learning. This is still a black-box type of algorithm, but it allows the behaviour to be fine-tuned in a specific way. This is why ChatGPT behaves like a "helpful assistant", for example, even though most of the data it's trained on doesn't consist of people interacting with helpful assistants.
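Very roughly, those human rankings are first used to train a reward model, and the reward model is then used to steer the LLM with reinforcement learning. A toy sketch of that first step (the random embeddings and the tiny reward model are placeholders; in practice the reward model is usually the LLM itself with a scalar output head):

    # Toy sketch: learn a reward model from human preference pairs
    # (the response a human preferred vs. the one they rejected).
    import torch
    import torch.nn.functional as F

    class RewardModel(torch.nn.Module):
        def __init__(self, dim: int = 16):
            super().__init__()
            self.score = torch.nn.Linear(dim, 1)   # response embedding -> scalar reward

        def forward(self, embedding):
            return self.score(embedding)

    reward_model = RewardModel()
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    # Each pair: (embedding of the preferred response, embedding of the rejected one).
    chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)

    # Pairwise loss: push the preferred response's score above the rejected one's.
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    loss.backward()
    optimizer.step()
    # A reinforcement learning step (e.g. PPO) then nudges the LLM towards
    # responses that this reward model scores highly.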

If the model is found to misbehave in a particular way then, I imagine, these human testers will be asked to try to trigger the undesirable behaviour, in order to train it not to do it.

This is probably only one of many techniques used to fine-tune and control the behaviour of language models. But I would imagine most of the available techniques are of this general nature - they aim to nudge the model in the right direction through training, or to bias the selection of possible responses, rather than 'debugging' in any traditional sense.

N. Virgo
  • 894
5

The natural way would be tuning the model itself, which is more of a long-term solution, since it can take quite a while to reprocess the input data (this time cleaned of a few unwanted sources). This is what candied_orange describes.

However, I would assume the major language models you can use over the web or in apps have a layer of software around them that employs heuristics to find "bad searches" (queries that will normally lead to undesirable results). They can then tweak the query, deny it, modify the answer, or do some combination thereof. The same can be applied to answers that look undesirable.
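If such a layer exists, it might look something like this sketch (the keyword heuristic and the canned refusal are invented for illustration; a real system would presumably use trained classifiers rather than a word list):

    # Hypothetical filtering layer wrapped around the model itself.
    REFUSAL = "Sorry, I can't help with that."

    def looks_undesirable(text: str) -> bool:
        # Placeholder heuristic; real deployments likely use dedicated classifiers.
        return any(term in text.lower() for term in ("bad_topic_1", "bad_topic_2"))

    def answer(query: str, llm) -> str:
        if looks_undesirable(query):       # deny (or rewrite) the query before the model sees it
            return REFUSAL
        response = llm(query)              # the actual black-box model call
        if looks_undesirable(response):    # censor or replace undesirable answers on the way out
            return REFUSAL
        return response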

I'm only speculating here - someone working on one of those projects might confirm or deny this approach - but for legal reasons alone I would consider it very likely.

tripleee
  • 177
1

In addition to controlling what the LLM is trained on, as other answers have said, it's also possible to add instructions on how to behave when queried. These instructions, formulated in natural language, can include prohibitions on the form or content of answers, and the bot will then try to follow them when answering queries. Examples of such instructions:
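(The wording below is only illustrative, and the message structure simply mirrors the common chat-style format; the real instructions used by any given chatbot are not public in full.)

    # Illustrative only: behavioural instructions prepended as a "system" message
    # before the user's first question.
    system_instructions = (
        "You are a helpful assistant. "
        "Do not use profanity. "
        "Refuse to help with anything illegal or harmful. "
        "Keep answers concise and polite."
    )

    conversation = [
        {"role": "system", "content": system_instructions},   # processed silently at session start
        {"role": "user", "content": "the user's first question goes here"},
    ]
    # The model generates its reply conditioned on both messages, so the system
    # instructions shape the form and content of every answer in the session.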

So the chatbot can start a session by silently processing this predefined ("preprogrammed"?) baggage of instructions before giving the user a chance to ask their first question. This will influence the content and the form of the responses given by the chatbot.

Nimloth
  • 111