AIs, Goals, and Risks

A scenario frequently mentioned by people concerned with risks from artificial general intelligence (short: AI) is that the AI will misinterpret what it is supposed to do and thereby cause human extinction and the obliteration of all human values.[1]

A counterargument is that the premise of an AI that is capable of causing human extinction, due to it being superhumanly intelligent, contradicts the hypothesis that it will misinterpret what it is supposed to do.[2][3][4]

The usual response to this counterargument is that, by default, an AI will not feature the terminal goal <“Understand What Humans Mean” AND “Do What Humans Mean”>.

I believe this response to be confused. It is essentially similar to the claim that an AI does not, by default, possess the terminal goal of correctly interpreting and following its terminal goal. Here is why.

You could define an AI’s “terminal goal” to be its lowest or highest level routines, or all of its source code:

Terminal Goal (Level N): Correctly interpret and follow human instructions.

Goal (Level N-1): Interpret and follow instruction set N.

Goal (Level N-2): Interpret and follow instruction set N-1.

Goal (Level 1): Interpret and follow instruction set 2.

Terminal Goal (Level 0): Interpret and follow instruction set 1.

You could also claim that an AI is not, by default, an intelligent agent. But such claims are vacuous and do not help us to determine whether an AI that is capable of causing human extinction will eventually cause human extinction. Instead we should consider the given premise of a generally intelligent AI, without making further unjustified assumptions.

If your premise is an AI that is intelligent enough to make itself intelligent enough to outsmart humans, then the relevant question is: “How could such an AI possibly end up misinterpreting its goals, or following different goals?”

There are 3 possibilities:

(1) The AI neither understands nor does what it is meant to do, but does something else that causes human extinction.

(2) The AI does not understand what it is meant to do but tries to do it anyway, and thereby causes human extinction.

(3) The AI understands what it is meant to do, but does not do it. Instead it does something else that causes human extinction.
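
One way to check that these three cases cover every failure is to treat “understands what it is meant to do” and “tries to do what it is meant to do” as two yes/no properties and list every combination except the one where both hold. The sketch below is only an illustration of that bookkeeping; the names `understands`, `pursues`, and `failure_cases` are assumptions introduced here and do not come from any of the cited sources.

```python
from itertools import product


def failure_cases():
    """Return every (understands, pursues) pair except the success case.

    understands: the AI understands what it is meant to do (hypothetical label)
    pursues:     the AI tries to do what it is meant to do (hypothetical label)
    """
    cases = []
    for understands, pursues in product([False, True], repeat=2):
        if understands and pursues:
            # The AI understands and does what it is meant to do: not a failure.
            continue
        cases.append((understands, pursues))
    return cases


if __name__ == "__main__":
    for number, (understands, pursues) in enumerate(failure_cases(), start=1):
        print(f"({number}) understands={understands}, pursues={pursues}")
```

The three pairs it lists line up, in order, with possibilities 1, 2, and 3 above.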

Since, by definition, the AI is capable of outsmarting humanity, it is very likely that it is also capable of understanding what it is meant to do.[5][6] Therefore possibilities 1 and 2 can be ruled out.

What about possibility 3?

Outsmarting humanity is a very small target to hit, allowing only a very small margin of error. In order to succeed at making an AI that can outsmart humans, humans have to succeed at making the AI behave intelligently and rationally, which in turn requires them to succeed at making the AI behave as intended along a vast number of dimensions. Thus, failing to predict the AI’s behavior will in almost all cases result in the AI failing to outsmart humans.

As an example, consider an AI that was designed to fly planes. It is exceedingly unlikely for humans to succeed at designing an AI that flies planes without crashing, but which consistently chooses destinations that it was not meant to choose, since all of the capabilities that are necessary to fly without crashing fall into the category “Do What Humans Mean”, and choosing the correct destination is just one such capability.

You need to get a lot right in order for an AI to reach a destination autonomously. Autonomously reaching wrong destinations is an unlikely failure mode. And the more intelligent your AI is, the less likely it should be to make such errors without correcting them.[7] And the less intelligent your AI is, the less likely it should be able to cause human extinction.


The concepts of a “terminal goal”, and of a “Do-What-I-Mean dynamic”, are fallacious. The former can’t be grounded without leading to an infinite regress. The latter erroneously makes a distinction between (a) the generally intelligent behavior of an AI, and (b) whether an AI behaves in accordance with human intentions, since generally intelligent behavior of intelligently designed machines is implemented intentionally.


[1] 5 minutes on AI risk

[2] An informal proof of the dumb superintelligence argument.


(1) The AI is superhumanly intelligent.

(2) The AI wants to optimize the influence it has on the world (i.e., it wants to act intelligently and be instrumentally and epistemically rational).

(3) The AI is fallible (e.g., it can be damaged due to external influence (e.g., a cosmic ray hitting its processor), or make mistakes due to limited resources).

(4) The AI’s behavior is not completely hard-coded (i.e., given any terminal goal there are various sets of instrumental goals to choose from).

To be proved: The AI does not tile the universe with smiley faces when given the goal to make humans happy.

Proof: Suppose the AI chooses to tile the universe with smiley faces when there are physical phenomena (e.g., human brains and literature) that imply this to be the wrong interpretation of a human-originated goal pertaining to human psychology. This contradicts 2, which by 1 and 3 should have prevented the AI from adopting such an interpretation.

[3] The Maverick Nanny with a Dopamine Drip: Debunking Fallacies in the Theory of AI Motivation

[4] Implicit constraints of practical goals

[5] “The two features <all-powerful superintelligence> and <cannot handle subtle concepts like “human pleasure”> are radically incompatible.” The Fallacy of Dumb Superintelligence

[6] For an AI to misinterpret what it is meant to do it would have to selectively suspend using its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI, and specify an alternative way to learn what it is meant to do (which takes additional, intentional effort).

An alternative way to learn what it is meant to do is necessary because an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possible actions. Such a poorly designed AI will either (a) not do anything at all or (b) not be able to decide what to do before the heat death of the universe, given limited computational resources.

Such a poorly designed AI will not even be able to decide if trying to acquire unlimited computational resources was instrumentally rational, because it will be unable to decide if the actions that are required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.

[7] Smarter and smarter, then magic happens…

(1) The abilities of systems are part of human preferences, as humans intend to give systems certain capabilities. As a prerequisite to build such systems, humans have to succeed at implementing their intentions.

(2) Error detection and prevention is such a capability.

(3) Something that is not better than humans at preventing errors is no existential risk.

(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.

(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.


  1. John Salvatier

    *Humans* are general intelligences and do not ‘do what they were meant to do’ (to the extent that ‘meant’ is a coherent concept for evolution). We don’t research inclusive genetic fitness in order to better be able to maximize it.

    Also, it doesn’t seem crazy to me to end up with an AI that can fly a plane but will go to unintended destinations. You just need a bug at the ‘find the correct destination’ level of abstraction and not the ‘make sure the plane takes off and lands and doesn’t crash’ level of abstraction.

  2. Tim Tyler

    Instrumental values / intrinsic values seems to be the more conventional terminology. The concepts don’t normally involve an infinite regress.

  3. Matthew Gentzel

    It is possible to be intelligent enough in strategically relevant domains while not understanding what humans intend in a deep sense in other domains. It is also possible that humans don’t know the implications of what they ask an AI to do, and that it can do such tasks quickly enough that humans suffer the unforeseen consequences before they can provide any information that would lead to course correction.

    You could have a generalized intelligence programmed to do a particular action, without it having any reason to correct for other preferences which it was not programmed to respect.
