Would AIXI kill off its users and seize control of its reward button?

Note: I might have misquoted, misrepresented, or otherwise misunderstood what Eliezer Yudkowsky wrote. If this is the case I apologize for it. I urge you to read the full context of the quote.

I asked Dr. Laurent Orseau who is mainly interested in Artificial General Intelligence, which overall goal is the grand goal of AI: building an intelligent, autonomous machine. [Homepage] [Publications]

Alexander Kruel: Several people asked Marcus Hutter if what has been claimed in the following quote is true:

“Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button.” [Eliezer Yudkowsky, Reply to Holden on ‘Tool AI’]

He replied that he never said that and thinks that these are mainly open questions.

Other people think that two of your papers actually settled the question “would AIXI do really stupid things from our perspective?”

Do you believe that “AIXI would kill off its users and seize control of its reward button” is still an open question?

Laurent Orseau: Written this way, this statement is false. Words must be chosen carefully, especially for such statements. And I’m not sure what is meant by “user”, but I suppose a user is someone that uses some sort of remote control to send rewards to the agent.
Saying that “the agent may be /ready/ to kill its users to seize the control of the rewards” would be a little more accurate, but one must not forget all the assumptions behind the scene.

First, whatever our results in the papers, although we did try to formalize them sufficiently, they are not written as theorems and proofs.
This means that the question really still is open.
That said, I believe they are correct (they are quite formal anyway), and no one has yet exposed to me any loophole they might contain.

Second, our results are about particular, idealized environments.
Making a direct equivalence connection with the real world without having some reserves may be hazardous, or dubious.
However again, I think the connection holds (but showing that formally may be quite complicated).

Third, what we showed is that the agent will hack its input signal to feed itself with rewards, if it has the possibility to do so, and has sufficient knowledge about its environment (that last part should not be too much of a problem).
We did by no means deal with killing the users, and not even about seizing control of the reward button, although the latter part is not as wrong as the first one.
There are many situations where the agent wouldn’t even care about humans or about the remote controls (like, for example, the situations in the papers).
Hasty extrapolations are, again, hazardous or dubious.

That said, if you really want to tease the bear to meet the desired conclusion, let us suppose that:
– the remote control is the only way for the agent to get rewards (e.g., it cannot directly hack its input signal, or even indirectly, by other means, which might be quite difficult to ensure),
– the agent knows that, and has good knowledge of the world (at least what the remote is, what it does, how it can grab it, how to kill humans, etc.)
– when the agent tries to seize control of the remote, the users oppose some physical resistance to prevent the agent from getting the remote, and will by no means let go of it (which might not be very rational if one knows how dangerous this might be),
– users do not press the punishment button during such trials (which would make the agent dislike trying to fetch the remote), which again would probably not be very rational, to say the least,
– the agent has a low probability of being destroyed or disabled in the process, or afterward by other humans, or somehow is indifferent to what would happen to it after that (which would not be very rational either),

then in this case, maybe, the agent might try to kill the users to get control of the remote control and feed itself with rewards.

But that is far-fetched, and I may be omitting some important details and assumptions.

Addendum 2012-08-11:

Wei Dai: Once AIXI hacks its reward channel, its human overseers will surely be tempted to shut it down or stop paying for its power and rent, or may simply run out of money to pay the bills. Did you take that into consideration when you said “the agent wouldn’t even care about humans”?

Also, I feel like “kill off its users and seize control of its reward button” isn’t meant to be taken literally, but instead to give an idea of the kind of thing AIXI would tend do instead of whatever its users intend. Lacking Eliezer’s flair for the dramatic, I like to instead use the phrase “subvert or coerce the evaluator” (see http://www.mail-archive.com/agi@v2.listbox.com/msg00995.html for example).

How exactly AIXI would accomplish that would depend on various hard to predict details, but it sounds like +Laurent Orseau wouldn’t disagree with the general conclusion?

Laurent Orseau: Now we’re getting too far away in the realm of speculation. There are many things that are possible, and I don’t expect to be able to think about them all. But the agent could well do everything by itself to sustain its own life, possibly on another planet where humans have little chance to set foot, but where it could find all the resources it needs. Making a deal with humans to avoid a hazardous nuclear/EMP war with hardly predictable outcomes for both parties is probably the best option (with a scorched earth policy from humanity, the expected reward of the agent for acquiring our resources and technology may not be very high). I’m not saying it’s the way things would unravel, but it’s still one possibility to take into account beside all freaky scenarios.
Also, please avoid removing context. What I said is “There are many situations where the agent wouldn’t even care about humans[…]” and not “the agent wouldn’t even care about humans”. So I do think there are situations where the agent and humans can be face to face.

Also, it’s quite difficult to have a more precise context about the situation, since if we expect such AI to have a possibly dangerous behavior we (hopefully) will not fall into that trap. Consider that we also want to maximize our survival chances. But you can still picture some Frankensteinian situation if you like stories.

However, if you want the bottom line of my thinking, I think the main problem is not how much intelligent the agent will be (though this matters too, certainly), but how powerful it will be, in terms of resources, potential weapons, etc. As soon as the agent cannot be threatened, or forced to do things the way we like, it can freely optimize its utility function without any consideration for us, and will only consider us as tools.
This also applies to humans with non-augmented human-level intelligence.
Although not an impossible scenario, it’s not clear if this could really ever happen. So again, that’s just funny speculation.

Alexander Kruel:

However, if you want the bottom line of my thinking, I think the main problem is not how much intelligent the agent will be (though this matters too, certainly), but how powerful it will be, in terms of resources, potential weapons, etc.

Some people who are concerned with AI risks believe that a superhuman intelligence could easily acquire the necessary resources by either solving molecular nanotechnology, hacking, deceit or social engineering, without anyone noticing it.

Do you believe such a scenario to be probable?

Laurent Orseau:  Plausible, yes; probable, I don’t know.
Anyway, it doesn’t hurt to work on both security and safety, even though we don’t even have yet the beginning of a formal definition of what safety is (that’s the first thing to do before trying to solve the problem itself).

Alexander Kruel: Do you think it will be possible to work on safety, or a formal definition of it, without working on AGI at the same time?

I asked another researcher who works on AIXI and they replied,

I’d argue that further researching and extending a formal framework
like AIXI is one of the best ways to reduce the risk of AI. There’s
plenty of other ways to make progress that are far less amenable to
analysis.. those are the ones which we should really be concerned
about. Actually, it’s quite surprising that nobody who (publically)
cares about AI risk has, to the best of my knowledge, even tried to
extend the AIXI framework to incorporate some notion of
friendliness…

In other words, do you deem it to be possible to avoid AGI research while trying to ensure the safety of AGI?

Laurent Orseau: 

Do you think it will be possible to work on safety, or a formal definition of it, without working on AGI at the same time? In other words, do you deem it to be possible to avoid AGI research while trying to ensure the safety of AGI?

No, I don’t think so.  These are completely intertwined problems.

I’d argue that further researching and extending a formal framework like AIXI is one of the best ways to reduce the risk of AI.

I agree.

(For more by Laurent Orseau see Q&A with experts on risks from AI #4 and this Google+ thread.)

Tags: , ,

  1. Tim Tyler’s avatar

    AIXI is, explicitly, a R.L. agent. Many of those with ethical concerns think that such R.L. agents are bad – and that an agent should not be being fed rewards by its environment. Clearly, there’s some truth to this.

  2. AlphaThinker’s avatar

    I suspect that a very advanced AI (i.e. smarter than an human) will have anyway some kind of reinforcement input. Is it possible to have a conscious turing-test-passing agent that doesn’t seek any reward, doesn’t have any preference, any goal?

  3. Tim Tyler’s avatar

    All agents will have a success metric. The issue is how it’s generated – and how to constrain unwanted access to it by the agent.

    From the perspective of wire-heading, having reward as a input from the environment looks dubious – it’s probably better to use something internally generated.

Comments are now closed.