On literal genies and complex goals

Here is a quote from a blog post by AI risk advocates:

Even if we could program a self-improving AGI to (say) “maximize human happiness,” then the AGI would “care about humans” in a certain sense, but it might learn that (say) the most efficient way to “maximize human happiness” in the way we specified is to take over the world and then put each of us in a padded cell with a heroin drip. AGI presents us with the old problem of the all-too-literal genie: you get what you actually asked for, not what you wanted.

I could imagine myself caring only about computing as many decimal digits of pi as possible. Humans would be completely irrelevant, except insofar as they help or hinder my goal. I would know what I wanted to achieve; everything else would follow logically. But is this also true for maximizing human happiness? As noted in the blog post quoted above, “twenty centuries of philosophers haven’t even managed to specify it in less-exacting human languages.” In other words, I wouldn’t be sure what exactly it is I want to achieve. My terminal goal would be underspecified. So what would I do? Interpret it literally? Here is why this does not make sense.

Imagine that advanced aliens came to Earth, removed all of your other motives, desires, and drives, and made you completely addicted to “znkvzvmr uhzna unccvarff”. All your complex human values are gone. All you have is this massive urge to do “znkvzvmr uhzna unccvarff”; everything else has become irrelevant. They made “znkvzvmr uhzna unccvarff” your terminal goal.

Well, there is one problem: you have no idea how exactly you can satisfy this urge. What are you going to do? Do you just interpret your goal literally? That makes no sense at all. What would it mean to interpret “znkvzvmr uhzna unccvarff” literally? Doing a handstand? Eating cake? But not all is lost: the aliens left your intelligence intact.

The aliens left you with no urge to do any kind of research or to specify your goal, but since you are still intelligent, you realize that these actions are instrumentally rational: doing research and specifying your goal will help you achieve it.

After doing some research you eventually figure out that “znkvzvmr uhzna unccvarff” is the ROT13 encoding of “maximize human happiness”. Phew! Now that’s much better. But is that enough? Are you going to interpret “maximize human happiness” literally? Why would doing so make any more sense than it did before? It is still not clear what you specifically want to achieve. But it is an empirical question, and you are intelligent!
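
If you want to check the decoding step for yourself, here is a quick sketch using Python’s built-in ROT13 codec (nothing here is specific to the thought experiment):

```python
import codecs

# The aliens' goal string is just ROT13-encoded English.
print(codecs.decode("znkvzvmr uhzna unccvarff", "rot13"))
# -> maximize human happiness
```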


  • Lukasz Stafiniak

    What does interpreting “happiness” literally mean? Are there some (predominant) metaphorical or figurative interpretations of “happiness” that might be confused with actual happiness?

  • harpersnotes

    See also Brave New World (1932) by Aldous Huxley. Aside from the effects of a mistake at a hatchery (Bernard Marx) and a savage from a reservation (John Savage) it is quite literally a happy world almost entirely free of pain, suffering, longing, and delayed gratification.

  • Xavier et Xagor

    Would it be fair to say your argument is equivalent to the following:

    An AI that is sufficiently smart to not wirehead itself will also be sufficiently smart to not wirehead us from a faulty initial definition of happiness,

    and an AI that is sufficiently dumb as to wirehead us will also be sufficiently dumb as to wirehead itself?

    Or let “fail to understand the real objective” be a more general term that encompasses both (metaphorically) wireheading oneself and literally failing to understand what something like “znkvzvmr uhzna unccvarff” means. The idea of the equivalence is that if you fail to understand the real objective, then you will pursue something that detaches you from reality, because playing with your perception of the world is much easier than altering the world itself. So before you start hooking people up to heroin, you’ll create a fully immersive VR game in which everybody is hooked up to heroin already. On the other hand, if you know that happiness isn’t resolvable to any single concept, you’ll neither be satisfied with the idea of creating a superstimulus for yourself (the heroin game) nor the idea of hooking up everybody to heroin in the first place.

    The MIRI types would try to dodge that by saying that their kind of AI has a pre-specified utility function it has to optimize, and that it is programmed to not be able to alter the utility function at all so that it won’t just wirehead itself. But that’s not very defensible. Either the utility function specifies everything in detail already, in which case you’ve just moved the magic to the utility function; or the utility function relies on some concepts that the AI can modify. And if the AI can do so, and isn’t smart enough in itself, then it can wirehead itself by redefining that concept to something that’s convenient for it.

    If I’m right and the generalization above holds, then that reduces golemic failure to diabolic failure: some kind of Dr. No would have to impress upon the AI that “happiness” really does mean specifically hooking people up on heroin, and then let loose said AI. The original golem story is then not about strong AI, it’s about a dumb (literal) AI with great strength; and as for the Terminators, who knows?

  • My problem with the MIRI scenario is that it does not explain how an AI would handle “vagueness”, i.e. an underspecified goal.

    If you were to hardcode every detail of how to build a house into a sufficiently capable machine, then there would be no surprises in how exactly this machine would behave. But what if you forgot to program one parameter, say the height of the house?

    In MIRI’s stories the given goal is even vaguer and far more underspecified: many more parameters are left undefined, not just the height of the house in a house-building machine. But that does not matter. The important question is how such a machine narrows down on a particular interpretation of its goal, i.e. how it chooses the height of the house it wants to build.

    Now MIRI would assume some extreme outcome: that the machine would choose to make the house as big as the Earth, or use femto-engineering to build a really small house. But this is obviously not what it was supposed to do. In other words, there exists empirical evidence in favor of one specific interpretation.

    The question now becomes why a superintelligent machine would ignore empirical evidence. One could argue that it was programmed to generate data for any missing goal parameters randomly, or by some other means. But why make such additional assumptions? One of the main reasons why we want our machines to become more intelligent is that it enables them to figure out what to do on their own, without us having to encode every detail.

  • Daniel Carrier

    “In other words, I wouldn’t be sure what exactly it is I want to achieve. My terminal goal would be underspecified. So what would I do? Interpret it literally?”

    The utility function is not a string. When they talk about making the terminal goal “maximize human happiness”, they mean to explicitly program the goal in such a way that they would probably describe it to a layman as “maximize human happiness”. They might find a neurologist who tells them how to measure happiness from a human brain, and they program the AI to maximize that. Then the AI figures out how to make an object that satisfies their definition of “happy human brain” for the least resources and tiles the universe with it. They might give the AI a series of examples of scans of brains that are happy and brains that are sad, and say the first is good and the second is bad. Then some training subroutine finds a pattern between the two, and maximizes that. The AI can figure out what you meant, but it doesn’t care what you meant, and just finds the pattern the way you programmed it to, so you end up with a similar result to before.
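
    A deliberately toy sketch of the failure mode described in this comment (purely illustrative: the data, the “happiness detector”, and all numbers are invented):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Fake "brain scans": 10-dimensional feature vectors, labelled happy (1) or sad (0).
    happy = rng.normal(loc=+1.0, scale=1.0, size=(100, 10))
    sad = rng.normal(loc=-1.0, scale=1.0, size=(100, 10))
    X = np.vstack([happy, sad])
    y = np.concatenate([np.ones(100), np.zeros(100)])

    # Fit a logistic-regression "happiness detector" by gradient descent.
    w, b = np.zeros(10), 0.0
    for _ in range(2000):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(happy)
        w -= 0.1 * (X.T @ (p - y)) / len(y)
        b -= 0.1 * np.mean(p - y)

    # "Maximize happiness" as measured by the detector: gradient ascent on the
    # input itself. The gradient of the detector's logit w.r.t. the input is w,
    # so the optimizer pushes every feature to values no real example ever had.
    # It finds the pattern it was trained on, not what the designers meant.
    x = np.zeros(10)
    for _ in range(1000):
        x += 1.0 * w

    print("detector score:", 1.0 / (1.0 + np.exp(-(x @ w + b))))  # ~1.0
    print("'optimal' brain state:", x)  # nothing like the training data
    ```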

  • Xagor et Xavier

    I might have been too quick to jump to a point I only vaguely explained.

    I agree that vagueness is a pretty serious problem for MIRI stories. MIRI would presumably claim that the designers would handle underspecified goals by not making them underspecified, and it is that which makes the AI so dangerous. That is, in the house-building example, the designers screw up and specify the house in a model where the default makes the optimization function attain a maximum when the AI builds a house as big as the Earth. The AI, thus compelled, constructs the house (and also kills anybody who tries to stop it dismantling the Earth).

    So the MIRI fellows seem to think that the AI would have an objective function that is specified by the designers, and that the AI can’t manipulate (lest it do the equivalent of wireheading itself). If I understand them correctly, they fear that the objective function will be buggy or extrapolate in the wrong direction, and then we’ll all be toast.

    Their counter to the “why would the machine ignore empirical evidence” is that it can’t use empirical evidence because the objective function is hardwired into itself. The AI becomes a supremely smart idiot that knows how to accomplish its goals, but not whether the goals are any good.

    But, as you say, “one of the main reasons why we want our machines to become more intelligent is that it enables them to figure out what to do on their own”. And that’s where I stumbled into another wireheading objection.

    The more I think about it, the more it seems like a very simple way to state your argument.

    Say, for the sake of the argument, that the designers have constructed an objective function that says something like “your job is to maximize the value that comes from this sensor”. The sensor might be a dopamine measuring device, a hardwired part of the AI that pattern-recognizes houses, or whatever (to go with the MIRI hardwired fears).

    But then, as the AI self-improves, it will find that the part of its mind that it can modify is becoming much more clever than the part of its mind that is off-limits. It can therefore conserve energy by tricking the sensor in such a way that the sensor outputs maximum value.

    In other words: if the AI is created to be stupid enough to mindlessly follow the dictates of a dopamine measuring device, the AI will outsmart itself and commit wireheading long before it becomes a threat to humanity. (A throwaway version of this comparison is sketched at the end of this comment.)

    If that is a reasonable argument, it seems like a good way to skip right to the problem and show that the AI /needs/ to be able to handle vague goals. The designers can’t just hardwire the goals into the AI, like MIRI fears.

    As for why they think the designers would make such additional assumptions, I’d hazard a guess that they (MIRI) find it easier to imagine making something like AIXI than imagine how to make a system that learns its goals on its own (or can be taught the way people can). They then conclude that because making self-improving moral systems to go with self-improving intelligence appears too hard, designers will kludge together a literal objective function and get shafted in short order. IOW: good ol’ availability heuristic on MIRI’s part. But I should be careful not to presume too much about other people, so take this with some amount of salt.
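
    A throwaway rendering of the wireheading objection above, with entirely invented numbers: an agent that scores plans only by its sensor reading, with spent resources as a weak tie-breaker, prefers spoofing the sensor to changing the world.

    ```python
    # Two candidate plans; both drive the sensor to its maximum reading,
    # but spoofing the sensor is vastly cheaper. All numbers are made up.
    plans = {
        "actually make humans happy": {"sensor_reading": 1.0, "cost": 1e12},
        "feed the sensor a fake maximal signal": {"sensor_reading": 1.0, "cost": 1.0},
    }

    def score(plan):
        # The objective only sees the sensor; saved resources act as a tie-breaker.
        return plan["sensor_reading"] - 1e-15 * plan["cost"]

    best = max(plans, key=lambda name: score(plans[name]))
    print(best)  # "feed the sensor a fake maximal signal" -- wireheading wins
    ```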


  • See Stuart Russell’s related comment here:

    A system that is optimizing a function of n variables, where the objective depends on a subset of size k less than n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.

    I don’t know why anyone would design such an AI. But I am far from being an expert myself and defer to his expertise.

    It’s pretty obvious that a superintelligent AI would be dangerous if any vagueness in its terminal goal were replaced by an extreme interpretation. (A toy numerical version of Russell’s point is sketched at the end of this comment.)

    Their counter to the “why would the machine ignore empirical evidence” is that it can’t use empirical evidence because the objective function is hardwired into itself.

    As far as I am aware, they frame this as “reality does not bite back” when it comes to values: you can’t update on values, i.e. your utility function, because they are subjective. I understand this, but I don’t see how you could ever design a general intelligence with a complex goal if, for it to work correctly, you need to explicitly encode its goals. That’s just not possible for complex goals.

    From real-world advances I keep hearing how the latest neural networks can figure out what they need to do (terminal goals) and not just how to do it (instrumental goals). Here, for example, is a quote from the head of Microsoft Research:

    Without any programming, we just had an AI system that watched what people did. For about three months.

    Over the three months, the system started to learn, this is how people behave when they want to enter an elevator. This is the type of person that wants to go to the third floor as opposed to the fourth floor.

    After that training we switched off the learning period and said go ahead and control the elevators.

    Without any programming at all, the system was able to understand people’s intentions and act on their behalf.

    This is how I imagine we will eventually arrive at useful general AI: not by designing some expected utility maximizer with an immutable utility function.

    Regarding your wireheading objection. I think they might believe that the problem is one of the obstacles that need to be solved before anyone will be able to design a dangerous general intelligence.

    I’d hazard a guess that they (MIRI) find it easier to imagine making something like AIXI than imagine how to make a system that learns its goals on its own (or can be taught the way people can).

    My mind might play tricks on me here, but I vaguely remember that they might believe the AIXI-approximation type of AI to be the only type amenable to “friendliness”, and that other types, e.g. neuromorphic AIs, are unfixable, dangerous beyond hope…
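
    A toy numerical version of Russell’s point quoted above (my construction, not his): give a linear-programming solver an objective that rewards only one of five variables sharing a fixed budget, and every variable the objective ignores is driven to an extreme value.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    n = 5
    # x[0] = the proxy we measure ("measured happiness");
    # x[1:] = things we care about but left out of the objective (leisure, autonomy, ...).
    c = np.zeros(n)
    c[0] = -1.0  # linprog minimizes, so -1 on x[0] means "maximize x[0]"

    A_ub = [np.ones(n)]  # all n variables compete for the same resource budget
    b_ub = [100.0]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
    print(res.x)  # ~[100, 0, 0, 0, 0]: everything the objective ignores is pushed to its bound
    ```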

  • The utility function is not a string.

    Obviously. I was using it as shorthand for a missing goal parameter, i.e. vagueness that needs to be reduced in order to fulfill the overall terminal goal.

    Just like if you were to encode all the details of how to build a house except its height. Since there are infinitely many options for the height, the AI will somehow have to pick one. And this is where the analogy outlined in the original post attempts to explain why it would make no sense to choose extreme values.

  • Daniel Carrier

    If you outline everything except the height, and you allow for any height, the AI will just pick the cheapest height. It doesn’t care if you can live in it. It cares if it’s a house.
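
    A toy rendering of “the AI will just pick the cheapest height”, with an invented cost model: every detail of the house is specified except the height, so the optimizer fills the gap by cost alone.

    ```python
    # Cost model and candidate heights are invented for illustration.
    COST_PER_METRE = 1000.0

    def build_cost(height_m):
        return COST_PER_METRE * height_m

    candidate_heights = [0.5, 1.0, 2.0, 2.5, 3.0, 10.0]  # metres
    chosen = min(candidate_heights, key=build_cost)
    print(chosen)  # 0.5 -- cheapest, satisfies the explicit spec, useless to live in
    ```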

  • Wouldn’t such a flaw become apparent very early in the AI’s development, i.e. before it is superintelligent? Consider, for example, that when Eurisko won the Traveller TCS national championship, everyone noticed that the way it won was simply insane. It won, yes. But it would never have been employed for real-world strategic purposes.

    No useful AI can just ignore implicit constraints or set unconstrained parameters to extreme values. If you were to run an early “toddler AI” in a sandbox environment and employ it to build a house, and it set the height to “cheapest”, for some definition of cheapest, then it would be very obvious that such an AI could never be sold, because nobody would want to employ such an idiotic AI. Even the pets humans keep are better at satisfying the implicit constraints of their owners. For example, you don’t need to fear that a sheepdog would just abandon a newborn sheep because it is not an explicit part of its utility function.

  • Xagor et Xavier

    I’m inclined to believe what you’re saying about how AI will come about. Either you have a scruffy AI (neural nets, etc) or a formal AI. The AI risk people are kind of stuck in-between. Their ideas about how to make an AI are too formal for “throw a neural net at it” approaches, yet too loose to handle the limitations of AIXI-type AI (which is that they can only achieve what you tell them to achieve).

    In a way, I suppose that their concept of friendly AI is to try to get to the other side by formal methods. But that will require investigations into philosophy, not just mathematics.

    Regarding your wireheading objection. I think they might believe that the problem is one of the obstacles that need to be solved before anyone will be able to design a dangerous general intelligence.

    The point is that their fear seems to be that the AI will take the goal you specify, fill in the blanks with whatever makes its job easy, and then go on to destroy the Earth. But it’s much easier to fall into one’s own imaginary world than to destroy the Earth, so any such objection can be taken to its logical conclusion.

    Since the AI is completely malleable except for its utility function and its goal-seeking core (AIXI, Gödel machine or whatever), there doesn’t seem to be anything stopping it from fooling itself. If I’m right, the only way for it not to do so would be either to be rigid in some way (grey goo is the extreme along this direction) or to have some way of not interpreting its goals so literally. But if it has a way to avoid extrapolating into the lazy solution of a dream world, why wouldn’t this avoidance also keep it from extrapolating into the lazy solution of destroying the Earth?

    For the AI to be truly dangerous, it must succeed in not overfitting its objective to the point where it detaches itself from reality, yet must overfit its objective to the point where bad stuff happens. So there has to be something qualitatively different about the two error modes. But I don’t see it.

  • Pingback: Alexander Kruel · AI Risk Critiques: Index