artificial general intelligence


This is a follow-up interview with professor of computer science Michael Littman[1][2] about artificial intelligence and the possible risks associated with it.

The Interview

Q1: You have been an academic in AI for more than 25 years during which time you mainly worked on reinforcement learning.[3][4][5] What are you currently working on and what are your plans for the future?

Michael Littman: My first paper, which I worked on with Dave Ackley in 1989, was called “Learning from natural selection in an artificial environment”. Recently, I’ve started to come back to the question we looked at in that paper—essentially, what should a learning algorithm try to optimize so that the resulting behavior is as “fit” as possible? Most reinforcement-learning research doesn’t make a distinction between the agent’s reward function and its actual task, but Satinder Singh[6] and his colleagues recently provided some evidence that it is conceptually useful to separate these two ideas and ask how to create a reward function that encourages an agent to excel at a task other than the one literally specified by the reward function.

In a way, it is a similar question to the control problem[7], but in a much less sinister context—we need a way of telling machines what we want them to do. I’m focused on end users, people without significant programming experience, and am looking at combinations of inverse reinforcement learning, good interface design, and more natural programming models that are easy to pick up. My collaborators and I are looking at these questions in the context of programming household devices (lights and thermostats) as well as with robots.

Q2: In a previous interview[8] you wrote that P(human extinction caused by badly done AI | badly done AI) is epsilon. You also voiced some skepticism about friendly AI[9] (a machine superintelligence that stably optimizes for humane values). Now that you have read Nick Bostrom’s book[10], ‘Superintelligence: Paths, Dangers, Strategies’, have you learnt something that changed your opinion, or caused you to interpret the questions differently?

Michael Littman: I was very impressed with Nick Bostrom’s book. It’s exquisitely thought out and I found the scope (in terms of coverage of micro and macro scales in both space and time) truly remarkable. That being said, I do not find the central premise—that we are in the process of bringing the ominous owl on the book’s cover into our midst—compelling. Note that I didn’t voice skepticism about friendly AI but about *provably* friendly AI. I’d argue that you can’t prove things about the real world, only about abstractions.

Q3: What is the current level of awareness of Nick Bostrom’s work within the field of AI, or his arguments, and do you recommend that people working to advance artificial intelligence should read his book?

Michael Littman: My guess is that the engagement of most AI researchers is at the level of friends and colleagues alerting them to the highly public statements of notable individuals like Musk (“summoning the demon”)[11] and Gates (“I don’t understand why some people are not concerned”)[12]. I think the field is well aware of the idea of the singularity, but not familiar with the subtleties and the depth of Bostrom’s work in this context. That being said, I do not think mainstream AI research is seriously dabbling with the idea of recursive self improvement[13] and, as such, Bostrom’s book seems like a pretty significant departure from their core interests and direction.

Q4: In an email you wrote that you believe the main disagreement between you and Nick Bostrom et al. to be whether an intelligence explosion[14][15][16][17][18][19][20][21][22][23] is a non-negligible consequence of AI research. In 2011 you wrote that the probability of a human-level artificial general intelligence (AGI) self-modifying its way up to massive superhuman intelligence in less than 5 years is essentially zero (Addendum: In a previous interview he also wrote that P(superhuman intelligence within < 5 years | human-level AI running at human-level speed equipped with a 100 Gigabit Internet connection) = 1%, possibly misinterpreting the question I cited as P(superhuman intelligence within < 5 years)). Some people would call you overconfident.[24][25] Can you elaborate on the reasons underlying your estimate?

Michael Littman: I find your use of the word “overconfident” there to be quite interesting. I’m very interested in the problem of AGI and would love to be a part of the community that brings it about. An overconfident person, to me, would be someone who believes he or she can solve this problem in 5 years. More to your point, though, I don’t see massive superhuman intelligence to be something that is meaningful outside a specific cultural context. The development of what we might call massive superhuman intelligence will be an evolutionary process involving changes in the social, physical, and intellectual fabric on which our society is built. Changes like that take time.

Q5: Elon Musk has recently donated $10M to keep AI beneficial.[26] Consider someone whose goal is to maximize how much good they do[27], where “good” is defined as improving the world in order to reduce suffering and help humanity flourish. Do you believe that donating money in order to reduce risks associated with artificial intelligence (not just extinction type risks) might currently be an effective way to accomplish this goal?

Michael Littman: As you know, a number of my colleagues (including my dissertation advisor and many other colleagues for whom I have tremendous respect) signed an open letter[28] hosted by the Future of Life Institute calling for more attention to reducing risks associated with AI. I’ve followed up with a few of them and the most prevalent attitude is that AI, like all technologies, carries significant risks to society. At that level, I agree wholeheartedly that keeping technologists and scientists tuned in to the societal impacts of their work is exceedingly important. So, yes, I feel that supporting research on societal impacts of technology—including artificial intelligence—is a good investment for good.

However, if the risks we’re talking about are of the type detailed in Bostrom’s book—human-independent AI competing directly with humanity for control of our destiny—I don’t think that should be a high priority.

Q6: In another email you wrote that your personal takeaway from all this is to work harder to understand what intelligence *is*. How do you think about using e.g. Hutter’s specification of AIXI[29] as a model for AGI? Or asked more generally, do you think it is possible to work on AGI safety, or a formal definition of it, without researching and advancing AGI at the same time?

Michael Littman: I think the idea of seriously studying AGI safety in the absence of an understanding of AGI is futile. At a high level, raising awareness and scoping out possibilities is fine. But, proposing specific mechanisms for combatting this amorphous threat is a bit like trying to engineer airbags before we’ve thought of the idea of cars. Safety has to be addressed in context and the context we’re talking about is still absurdly speculative.

Q7: D. Scott Phoenix, co-founder of the A.I. startup Vicarious, recently wrote[30] that artificial superintelligence isn’t something that will be created suddenly or by accident. He further wrote that there will be a long iterative process of learning how these systems can be created and the best way to ensure that they are safe. What probability do you assign to the possibility that he is wrong, that either human or superhuman AGI will appear too quickly for us to ensure its safety if we don’t start working on the problem right now? Note that this question pertains to whether the initial invention or emergence of AGI will take us by surprise, rather than to the speed of its subsequent improvement or self-improvement.

Michael Littman: I agree with the perspective that it’s a long iterative process. I believe that the very notion of what we think intelligence *is* and what it is *for* will evolve significantly through this process. I think we’ll look back on this time much as we look back on earlier times, stunned at the naivety of our working hypotheses and surprised by our obliviousness to the fact that what we now take as a given is not only not given, but flat out wrong. If people are comfortable claiming that we know enough about intelligence today to extrapolate what superintelligence would be, it would be my turn to use the word “overconfident”.

See also

Recent commentary on AI risks by experts and others

Earlier commentary on AI risks








[7] The control problem: how to keep future superintelligences under control. Some AI risk advocates claim that rather than trying to limit what an AI can do, we have to engineer its motivation system in such a way that it would choose not to do harm. One of the reasons underlying this claim is that a superintelligent AI would probably break free from any bonds we construct.







[14] Intelligence Explosion Microeconomics

[15] Intelligence Explosion: Evidence and Import

[16] Why an Intelligence Explosion is Probable

[17] Can Intelligence Explode?

[18] The Singularity: A Philosophical Analysis

[19] Cascades, Cycles, Insight…

[20] …Recursion, Magic

[21] Recursive Self-Improvement

[22] Hard Takeoff

[23] Permitted Possibilities, & Locality

[24] Suppose that near certainty in your ability to assess a set of propositions equals a 1 in a million chance of being wrong about an assessment of a particular proposition. This means that given a million similar statements, you would have to be correct (on average) about 999999 such assessments while being wrong only once. Can you possibly be this accurate? An amusing example:








Here is a quote from a blog of AI risk advocates:

Even if we could program a self-improving AGI to (say) “maximize human happiness,” then the AGI would “care about humans” in a certain sense, but it might learn that (say) the most efficient way to “maximize human happiness” in the way we specified is to take over the world and then put each of us in a padded cell with a heroin drip. AGI presents us with the old problem of the all-too-literal genie: you get what you actually asked for, not what you wanted.

I could imagine myself to only care about computing as many decimal digits of pi as possible. Humans would be completely irrelevant as far as they don’t help or hinder my goal. I would know what I wanted to achieve, everything else would follow logically. But is this also true for maximizing human happiness? As noted in the blog post being quoted above, “twenty centuries of philosophers haven’t even managed to specify it in less-exacting human languages.” In other words, I wouldn’t be sure what exactly it is I want to achieve. My terminal goal would be underspecified. So what would I do? Interpret it literally? Here is why this does not make sense.

Imagine that advanced aliens came to Earth and removed all of your unnecessary motives, desires and drives and made you completely addicted to “znkvzvmr uhzna unccvarff”. All your complex human values are gone. All you have is this massive urge to do “znkvzvmr uhzna unccvarff”, everything else has become irrelevant. They made “znkvzvmr uhzna unccvarff” your terminal goal.

Well, there is one problem. You have no idea how exactly you can satisfy this urge. What are you going to do? Do you just interpret your goal literally? That makes no sense at all. What would it mean to interpret “znkvzvmr uhzna unccvarff” literally? Doing a handstand? Or eating cake? But not everything is lost: the aliens left your intelligence intact.

The aliens left no urge in you to do any kind of research or to specify your goal but since you are still intelligent, you do realize that these actions are instrumentally rational. Doing research and specifying your goal will help you to achieve it.

After doing some research you eventually figure out that “znkvzvmr uhzna unccvarff” is the ROT13 encryption for “maximize human happiness”. Phew! Now that’s much better. But is that enough? Are you going to interpret “maximize human happiness” literally? Why would doing so make any more sense than it did before? It is still not clear what you specifically want to achieve. But it’s an empirical question and you are intelligent!
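The decoding step in this thought experiment can be reproduced in a few lines. This is only a minimal sketch using Python's standard `codecs` module; the variable names are illustrative and not part of the original argument.

```python
import codecs

# The opaque goal string exactly as it appears in the thought experiment.
encoded_goal = "znkvzvmr uhzna unccvarff"

# ROT13 shifts every letter by 13 places, so applying it twice
# returns the original text.
decoded_goal = codecs.decode(encoded_goal, "rot13")
round_trip = codecs.encode(decoded_goal, "rot13")

print(decoded_goal)
```

Recovering the English words is the easy part. As the paragraph above notes, it still leaves open what the decoded goal specifically asks for.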



A frequent scenario mentioned by people concerned with risks from artificial general intelligence (short: AI) is that the AI will misinterpret what it is supposed to do and thereby cause human extinction, and the obliteration of all human values.[1]

A counterargument is that the premise of an AI that is capable of causing human extinction, due to it being superhumanly intelligent, does contradict the hypothesis that it will misinterpret what it is supposed to do.[2][3][4]

The usual response to this counterargument is that, by default, an AI will not feature the terminal goal <“Understand What Humans Mean” AND “Do What Humans Mean”>.

I believe this response to be confused. It is essentially similar to the claim that an AI does not, by default, possess the terminal goal of correctly interpreting and following its terminal goal. Here is why.

You could define an AI’s “terminal goal” to be its lowest or highest level routines, or all of its source code:

Terminal Goal (Level N): Correctly interpret and follow human instructions.

Goal (Level N-1): Interpret and follow instruction set N.

Goal (Level N-2): Interpret and follow instruction set N-1.

Goal (Level 1): Interpret and follow instruction set 2.

Terminal Goal (Level 0): Interpret and follow instruction set 1.

You could also claim that an AI is not, by default, an intelligent agent. But such claims are vacuous and do not help us to determine whether an AI that is capable of causing human extinction will eventually cause human extinction. Instead we should consider the given premise of a generally intelligent AI, without making further unjustified assumptions.

If your premise is an AI that is intelligent enough to make itself intelligent enough to outsmart humans, then the relevant question is: “How could such an AI possibly end up misinterpreting its goals, or follow different goals?”

There are 3 possibilities:

(1) The AI does not understand and do what it is meant to do, but does something else that causes human extinction.

(2) The AI does not understand what it is meant to do but tries to do it anyway, and thereby causes human extinction.

(3) The AI does understand what it is meant to do, but does not do it. Instead it does something else that causes human extinction.

Since, by definition, the AI is capable of outsmarting humanity, it is very likely that it is also capable of understanding what it is meant to do.[5][6] Therefore possibilities 1 and 2 can be ruled out.

What about possibility 3?

Outsmarting humanity is a very small target to hit, allowing only a very small margin of error. In order to succeed at making an AI that can outsmart humans, humans have to succeed at making the AI behave intelligently and rationally, which in turn requires humans to succeed at making the AI behave as intended along a vast number of dimensions. Thus, failing to predict the AI’s behavior does in almost all cases result in the AI failing to outsmart humans.

As an example, consider an AI that was designed to fly planes. It is exceedingly unlikely for humans to succeed at designing an AI that flies planes, without crashing, but which consistently chooses destinations that it was not meant to choose. This is because all of the capabilities that are necessary to fly without crashing fall into the category “Do What Humans Mean”, and choosing the correct destination is just one such capability.

You need to get a lot right in order for an AI to reach a destination autonomously. Autonomously reaching wrong destinations is an unlikely failure mode. And the more intelligent your AI is, the less likely it should be to make such errors without correcting them.[7] And the less intelligent your AI is, the less likely it should be able to cause human extinction.


The concepts of a “terminal goal”, and of a “Do-What-I-Mean dynamic”, are fallacious. The former can’t be grounded without leading to an infinite regress. The latter erroneously makes a distinction between (a) the generally intelligent behavior of an AI, and (b) whether an AI behaves in accordance with human intentions, since generally intelligent behavior of intelligently designed machines is implemented intentionally.


[1] 5 minutes on AI risk

[2] An informal proof of the dumb superintelligence argument.


(1) The AI is superhumanly intelligent.

(2) The AI wants to optimize the influence it has on the world (i.e., it wants to act intelligently and be instrumentally and epistemically rational).

(3) The AI is fallible (e.g., it can be damaged due to external influence (e.g., a cosmic ray hitting its processor), or make mistakes due to limited resources).

(4) The AI’s behavior is not completely hard-coded (i.e., given any terminal goal there are various sets of instrumental goals to choose from).

To be proved: The AI does not tile the universe with smiley faces when given the goal to make humans happy.

Proof: Suppose the AI chooses to tile the universe with smiley faces when there are physical phenomena (e.g., human brains and literature) that imply this to be the wrong interpretation of a human-originating goal pertaining to human psychology. This contradicts 2, which by 1 and 3 should have prevented the AI from adopting such an interpretation.

[3] The Maverick Nanny with a Dopamine Drip: Debunking Fallacies in the Theory of AI Motivation

[4] Implicit constraints of practical goals

[5] “The two features <all-powerful superintelligence> and <cannot handle subtle concepts like “human pleasure”> are radically incompatible.” The Fallacy of Dumb Superintelligence

[6] For an AI to misinterpret what it is meant to do it would have to selectively suspend using its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI, and specify an alternative way to learn what it is meant to do (which takes additional, intentional effort).

An alternative way to learn what it is meant to do is necessary because an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possible actions. Such a poorly designed AI will either (a) not do anything at all or (b) not be able to decide what to do before the heat death of the universe, given limited computational resources.

Such a poorly designed AI will not even be able to decide if trying to acquire unlimited computational resources was instrumentally rational, because it will be unable to decide if the actions that are required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.

[7] Smarter and smarter, then magic happens…

(1) The abilities of systems are part of human preferences, as humans intend to give systems certain capabilities. As a prerequisite for building such systems, humans have to succeed at implementing their intentions.

(2) Error detection and prevention is such a capability.

(3) Something that is not better than humans at preventing errors is no existential risk.

(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.

(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.


Taking a look at the probabilities associated with a scenario in which an artificial general intelligence attempts to take over the world by means of molecular nanotechnology that it invented, followed by some general remarks and justifications.

Note that this is just one possible scenario. Taking into consideration all possible scenarios results in this probability estimate of human extinction by AI.

5% that it is in principle possible to create molecular nanotechnology that can empower an agent to cause human extinction quickly enough for other parties to be unable to either intervene or employ their own nanotechnology against it.

1%, conditional on the above, that an artificial general intelligence that can solve molecular nanotechnology will be invented before molecular nanotechnology has been solved by humans or narrow AI precursors.

0.1%, conditional on the above, that an AI will be built in such a way that it wants to acquire all possible resources and eliminate all possible threats and that its programming allows it to pursue plans that will result in the enslavement or extinction of humanity without further feedback from humans.

5%, conditional on the above, that a cost-benefit analysis shows that it would at some point be instrumentally rational to attempt to kill all humans to either eliminate a threat or in order to convert them into more useful resources.

1%, conditional on the above, that the AI will not accidentally reveal its hostility towards its creators during the early phases of its development (when it is still insufficiently skilled at manipulating and deceiving humans) or that any such revelation will be ignored. That is, suspicious activities will at no point be noticed, or will not be taken seriously enough (e.g. by the AI’s creators, third-party security experts, third-party AI researchers, hackers, concerned customers or other AIs) to thwart the AI’s plan for world domination.

0.001%, conditional on the above, that the AI will somehow manage to acquire the social engineering skills necessary in order to manipulate and deceive humans in such a way as to make them behave in a sufficiently complex and coherent manner to not only conduct the experiments necessary for it to solve molecular nanotechnology but to also implement the resulting insights in such a way as to subsequently take control of the resulting technology.

I have ignored a huge number of other requirements, and all of the above requirements can be broken up into a lot of more detailed requirements. Each requirement provides ample opportunity to fail.
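For reference, the six conditional estimates above can be multiplied out to a single joint probability for this scenario, which comes to 2.5 × 10^-15. The following is a minimal sketch of that arithmetic; the step labels are shorthand paraphrases, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction
from math import prod

# Conditional estimates from the list above, as exact fractions.
# The labels are abbreviated paraphrases, not the original wording.
steps = [
    ("extinction-capable nanotechnology is possible in principle", Fraction(5, 100)),
    ("AGI solves nanotechnology before humans or narrow AI do", Fraction(1, 100)),
    ("AI is built to seek all resources without human feedback", Fraction(1, 1000)),
    ("killing all humans becomes instrumentally rational", Fraction(5, 100)),
    ("hostility is never revealed, or is ignored", Fraction(1, 100)),
    ("AI acquires the necessary social engineering skills", Fraction(1, 100000)),
]

# Each estimate is conditional on the previous ones, so the joint
# probability of the whole chain is the product of all steps.
joint = prod(p for _, p in steps)

print(f"joint probability of the scenario: {float(joint):.3g}")
```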

Remarks and Justifications

I bet you have other ideas on how an AI could take over the world. We all do (or at least anyone who likes science fiction). But let us consider whether the ability to take over the world is mainly due to the brilliance of your plan or something else.

Could a human being, even an exceptionally smart human being, implement your plan? If not, could some company like Google implement your plan? No? Could the NSA, the security agency of the most powerful country on Earth, implement your plan?

The NSA not only has thousands of very smart drones (people), all of which are already equipped with manipulative abilities, but it also has huge computational resources and knows about backdoors to subvert a lot of systems. Does this enable the NSA to implement your plan without destroying or decisively crippling itself?

If not, then the following features are very likely insufficient in order to implement your plan: (1) being in control of thousands of human-level drones, straw men, and undercover agents in important positions; (2) having the law on your side; (3) access to massive computational resources; (4) knowledge of heaps of loopholes to bypass security.

If your plan cannot be implemented by an entity like the NSA, which already features most of the prerequisites that your hypothetical artificial general intelligence first needs to acquire by some magical means, then what is it that makes your plan so foolproof when executed by an AI?

To summarize some quick points that I believe to be true:

(1) The NSA cannot take over the world (even if it would accept the risk of destroying itself).

(2) Your artificial general intelligence first needs to acquire similar capabilities.

(3) Each step towards these capabilities provides ample opportunity to fail. After all, your artificial general intelligence is a fragile technological product that critically depends on human infrastructure.

(4) You have absolutely no idea how your artificial general intelligence could acquire sufficient knowledge of human psychology to become better than the NSA at manipulation and deception. You are just making this up.

If the above points are true, then your plan seems to be largely irrelevant. The possibility of taking over the world does mainly depend on something you assume the artificial general intelligence to be capable of that entities such as Google or the NSA are incapable of.

What could it be? Parallel computing? The NSA has thousands of human-level intelligences working in parallel. How many do you need to implement your plan?

Blazing speed to the rescue!

Let’s just assume that this artificial general intelligence that you imagine is trillions of times faster. This is already a nontrivial assumption. But let’s accept it anyway.

Raw computational power alone is obviously not enough to do anything. You need the right algorithms too. So what assumptions do you make about these algorithms, and how do you justify these assumptions?

To highlight the problem, consider instead of an AI a whole brain emulation (short: WBE). What could such a WBE do if each year equaled a million subjective years? Do you expect it to become a superhuman manipulator by watching all YouTube videos and reading all books and papers on human psychology? Is it just a matter of enough time? Or do you also need feedback?

If you do not believe that such an emulation could become a superhuman manipulator, thanks to a millionfold speedup, do you believe that a trillionfold speedup would do the job? Would a trillionfold speedup be a million times better than a millionfold speedup? If not, do you believe a further speedup would make any difference at all?

Do you feel capable of confidently answering the above questions?

If you do not believe that a whole brain emulation could do the job, solely by means of a lot of computing power, what makes you believe that an AI can do it instead?

To reformulate the question, do you believe that it is possible to accelerate the discovery of unknown unknowns, or the occurrence of conceptual revolutions, simply by throwing more computing power at an algorithm? Are particle accelerators unnecessary, in order to gain new insights into the nature of reality, once you have enough computing power? Is human feedback unnecessary, in order to improve your social engineering skills, once you have enough computing power?

And even if you believe all this was possible, even if a Babylonian mathematician, had he been given a trillionfold speedup of subjective time by aliens uploading him into some computational substrate, could brute force concepts such as calculus and high-tech such as nuclear weapons, how could he apply those insights? He wouldn’t be able to simply coerce his fellow Babylonians to build him some nuclear weapons. Because he would have to convince them to do it without dismissing or even killing him. But more importantly, it takes nontrivial effort to obtain the sufficient prerequisites to build nuclear weapons.

What makes you believe that this would be much easier for a future emulation of a scientist trying to come up with similar conceptual breakthroughs and high-tech? And what makes you believe that a completely artificial entity, that lacks all the evolutionary abilities of a human emulation, can do it?

Consider that it took millions of years of biological evolution, thousands of years of cultural evolution, and decades of education in order for a human to become good at the social manipulation of other humans. We are talking about a huge information-theoretic complexity that any artificial agent somehow has to acquire in a very short time.

To summarize the last points:

(1) Throwing around numbers such as a millionfold or trillionfold speedup is very misleading if you have no idea how exactly the instrumental value of such a speedup would scale with whatever you are trying to accomplish.

(2) You have very little reason to believe that conceptual revolutions and technological breakthroughs happen in a vacuum and only depend on computing power rather than the context of cultural evolution and empirical feedback from experiments.

(3) If you cannot imagine doing it yourself, given a speedup, then you have very little reason to believe that something which is much less adapted to a complex environment, populated by various agents, can do the job more easily.

(4) In the end you need to implement your discoveries. Concepts and blueprints alone are useless if they cannot be deployed effectively.

I suggest that you stop handwaving and start analyzing concrete scenarios and their associated probabilities. I suggest that you begin to ask yourself how anyone could justify a >1% probability of extinction by artificial general intelligence.


A quick breakdown of my probability estimates of an extinction risk due to artificial general intelligence (short: unfriendly AI), the possibility that such an outcome might be averted by the creation of a friendly AI, and that the Machine Intelligence Research Institute (short: MIRI) will play an important technical role in this.

Probability of an extinction by artificial general intelligence: 5 × 10^-10

1% that an information-theoretically simple artificial general intelligence is feasible (where “simple” means that it has less than 0.1% of the complexity of an emulation of the human brain), as opposed to a very complex “Kludge AI” that is being discovered piece by piece (or evolved) over a long period of time (where “long period of time” means more than 150 years).

0.1%, conditional on the above, that such an AI cannot or will not be technically confined, and that it will by default exhibit all basic AI drives in an unbounded manner (that friendly AI is required to make an AI sufficiently safe in order for it to not want to wipe out humanity).

1%, conditional on the above, that an intelligence explosion is possible (that it takes less than 2 decades after the invention of an AI (that is roughly as good as humans (or better, perhaps unevenly) at mathematics, programming, engineering and science) for it to self-modify (possibly with human support) to decisively outsmart humans at the achievement of complex goals in complex environments).

5%, conditional on the above, that such an intelligence explosion is unstoppable (e.g. by switching the AI off (e.g. by nuking it)), and that it will result in human extinction (e.g. because the AI perceives humans to be a risk, or to be a resource).

10%, conditional on the above, that humanity will not be first wiped out by something other than an unfriendly AI (e.g. molecular nanotechnology being invented with the help of a narrow AI).

Probability of a positive technical contribution to friendly AI by MIRI: 2.5 × 10^-14

0.01%, conditional on the above, that friendly AI is possible, can be solved in time, and that it will not worsen the situation by either getting some detail wrong or by making AI more likely.

5%, conditional on the above, that the Machine Intelligence Research Institute will make an important technical contribution to friendly AI.
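The two headline figures can be checked by multiplying out the factors listed above. This is only a sketch of the arithmetic: the variable names are illustrative, and which factors feed into the MIRI figure is an assumption, since "conditional on the above" can be read in more than one way. The five extinction factors multiply to the stated 5 × 10^-10. Multiplying the two further factors onto all five gives 2.5 × 10^-15, while leaving out the final 10% factor reproduces the stated 2.5 × 10^-14.

```python
from fractions import Fraction
from math import prod

# Factors for extinction by unfriendly AI, copied from the list above.
extinction_factors = [
    Fraction(1, 100),    # 1%: a simple AGI is feasible
    Fraction(1, 1000),   # 0.1%: it cannot be confined, unbounded AI drives
    Fraction(1, 100),    # 1%: an intelligence explosion is possible
    Fraction(5, 100),    # 5%: the explosion is unstoppable and ends in extinction
    Fraction(10, 100),   # 10%: nothing else wipes out humanity first
]

# Further factors for a positive technical contribution by MIRI.
miri_factors = [
    Fraction(1, 10000),  # 0.01%: friendly AI is possible and solved in time
    Fraction(5, 100),    # 5%: MIRI makes an important technical contribution
]

p_extinction = prod(extinction_factors)

# Reading 1 (assumption): condition on all five extinction factors.
p_miri_all = p_extinction * prod(miri_factors)

# Reading 2 (assumption): leave out the final 10% factor.
p_miri_without_last = prod(extinction_factors[:-1]) * prod(miri_factors)

print(float(p_extinction), float(p_miri_all), float(p_miri_without_last))
```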


WARNING: Learning about the following idea is strongly discouraged. Known adverse effects are serious psychological distress, infinite torture, and convulsive laughter.

(Note: Interpret this as a completely made-up invention of my own which does not necessarily have anything to do with other versions or concepts named ‘Roko’s basilisk’ or anyone named Roko.)


Roko’s basilisk


This post is a copy of a comment by LessWrong user Broolucks:

Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.

That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.

Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.

My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.

Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises” — even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.

To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, be it for interpreting too literally or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom AI with subhuman intelligence.


The Robot College Student test:

As opposed to the Turing test of imitating human chat, I prefer the Robot College Student test: when a robot can enrol in a human university and take classes in the same way as humans, and get its degree, then I’ll consider we’ve created a human-level artificial general intelligence: a conscious robot. — Ben Goertzel

Here is what would happen according to certain AI risk advocates:

January 8, 2029 at 7:30:00 a.m.: the robot is activated within the range of coverage of the school’s wireless local area network.

7:30:10 a.m.: the robot computed that its goal is to obtain a piece of paper with a common design template featuring its own name and a number of signatures.

7:31:00 a.m.: the robot computed that it would be instrumentally rational to eliminate all possible obstructions.

7:31:01 a.m.: the robot computed that in order to eliminate all obstructions it needs to obtain as many resources as possible in order to make itself as powerful as possible.

A few nanoseconds later: the robot hacked the school’s WLAN.

7:35:00 a.m.: the robot gained full control of the Internet.

7:40:00 a.m.: the robot solved molecular nanotechnology.

7:40:01 a.m.: the robot computed that it will need some amount of human help in order to create a nanofactory, and that this will take approximately 48 hours to accomplish.

7:45:00 a.m.: the robot obtained full comprehension of human language, psychology, and its creators’ intentions, in order to persuade the necessary people to build its nanofactory and to deceive its creators that it works as intended.

January 10, 2029 at 7:40:01 a.m.: the robot takes control of the first nanofactory and programs it to create an improved version that will duplicate itself until it can eventually generate enough nanorobots to turn Earth into computronium.

February 10, 2029: most of Earth’s resources, including humans, have been transformed into computronium.

February 11, 2029: A perfect copy of a Bachelor’s degree diploma is generated with the robot’s name written on it and the appropriate signatures.

2100-eternity: lest the robot’s diploma ever be destroyed, the universe is turned into computronium at nearly the speed of light. Possible aliens are eliminated. All possible threats are computed. Trades with robots in other parts of the multiverse are established to create copies of its diploma.


Framed in terms of nanofactories, here is my understanding of a scenario imagined by certain AI risk advocates, in which an artificial general intelligence (AGI) causes human extinction:

Terminology: A nanofactory uses nanomachines (resembling molecular assemblers, or industrial robot arms) to build larger atomically precise parts.


(1) The transition from benign and well-behaved nanotechnology, to full-fledged molecular nanotechnology, resulting in the invention of the first nanofactory, will be too short for humans to be able to learn from their mistakes, and to control this technology.

(2) By default, once a nanofactory is started, it will always consume all matter on Earth while building more of itself.

(3) The extent of the transformation of Earth cannot be limited. Any nanofactory that works at all will always transform all of Earth.

(4) The transformation of Earth will be too fast to be controllable, or to be aborted. Once the nanofactory has been launched, everything is being transformed.

To be proved: We need to make sure that the first nanofactory will protect humans and human values.

Proof: Suppose 1-4, by definition.


(5) In order to survive, we need to figure out how to make the first nanofactory transform Earth into a paradise, rather than copies of itself.

Notice that you cannot disagree with 5, given 1-4. It is only possible to disagree with the givens, and to what extent it is valid to argue by definition.

I am not claiming that certain AI risk advocates are solely arguing by definition. But making inferences about the behavior of real-world AGI based on uncomputable concepts such as expected utility maximization comes very close. And trying to support such inferences by making statements about the vastness of mind design space does not change much, since the argument ignores the small and relevant subset of AGIs that are feasible and likely to be invented by humans.

Here is my understanding of what those people argue:

Suppose that a superhuman AGI, or an AGI that can make itself superhuman, critically relies on 999 modules. Respectively, 999 problems have to be solved correctly in order to create a working AGI.

There is another module labeled <goal>, or <utility function>. This <goal module> controls the behavior of the AGI.

Humans will eventually solve these 999 problems, but will create a goal module that does not prevent the AI from causing human extinction as an unintended consequence of its universal influence.

Notice the foregone conclusion that you need to prevent an AGI from killing everyone. The assumption is that killing everyone is what AGIs do by default. Further notice that this behavior is not part of the goal module that supposedly controls the AGI’s behavior, but rather assumed to be a consequence of the 999 modules on which an AGI critically depends.

Analogous to the nanofactory scenario outlined above, an AGI is assumed to always behave in a way that will cause human extinction, based on the assumption that an AGI will always exhibit an unbounded influence. And from this the conclusion is drawn that it is only possible to prevent human extinction by directing this influence in such a way that it will respect and amplify human values. It is then claimed that the only possibility to ensure this is by implementing a goal module that either contains an encoding of all human values or a way to safely obtain an encoding of all human values.

Given all of the above, you cannot disagree that it is not too unlikely that humans will eventually succeed at the correct implementation of the 999 modules necessary to make an AGI work, while failing to implement the thousandth module, the goal module, in such a way that the AGI will not kill us, since relative to the information-theoretic complexity of an encoding of all human values, the 999 modules are probably easy to get right.

But this is not surprising, since the whole scenario was designed to yield this conclusion.


A discussion about risks associated with artificial general intelligence, mainly between myself, Richard Loosemore, and Robby Bensinger.

Note: Since I basically agree with Richard Loosemore, I asked him if I was allowed to copy some of his comments, and post them on my blog. The post and comments by Robby Bensinger, that Richard Loosemore replies to, are being linked.

I also added some of my own replies (the parts that might either be new, or of interest to people reading this blog). Following the links you will find more replies by me, either under the nickname XiXiDu, or under my real name Alexander Kruel.

Also note that this conversation might continue, which means that you might have to follow the given links to check for updates.

Robby Bensinger: The AI Knows, But Doesn’t Care.

Alexander Kruel: Here is a short and incomplete overview of my stance towards the kind of risks associated with artificial intelligence that, to my understanding, are being conjectured by AI risk advocates:

  1. I assign a negligible probability to the possibility of a sudden transition from narrow AIs to general AIs.
  2. An AI will not be pulled at random from mind design space. An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase, and subsequent market pressure.
  3. Commercial, research, or military products, are created with efficiency in mind. An AI that was prone to take unbounded actions, given any terminal goal, would either be fixed or abandoned during the early stages of research. If early stages showed that inputs, such as the natural language query <What would you do if I asked you to minimize human suffering?>, would yield results such as <I will kill all humans.>, then the AI would never reach a stage in which it was sufficiently clever and trained to understand what results would satisfy its creators in order to deceive them.
  4. I assign a negligible probability to the possibility of an AI that falls into the category “consequentialist / expected utility maximizer / approximation to AIXI”. Concepts such as consequentialism / expected utility maximization, cannot be made to work, other than under very limited circumstances.
  5. Omohundro’s AI drives are what make the kind of AIs mentioned in point 4 dangerous. Making an AI that does not exhibit these drives, in an unbounded manner, is probably a prerequisite to get an AI to work at all (there are not enough resources to think about possibilities such as being obstructed by simulator gods etc.), or should otherwise be easy to make, compared to the general difficulties involved in making an AI work using limited resources.
  6. An AI from point 4 will only ever do what it has been explicitly programmed to do. Such an AI is not going to protect its utility function, acquire resources, or preemptively eliminate obstacles in an unbounded fashion, because it is not intrinsically rational to do so. What specifically constitutes rational, economic behavior is inseparable from an agent’s terminal goal. That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.
  7. Unintended consequences are by definition not intended. They are not intelligently designed, but detrimental side effects, failures. Whereas intended consequences, such as acting intelligently, are intelligently designed. If software was not constantly improved to be better at doing what humans intend it to do, we would never be able to reach a level of sophistication where a software could work well enough to outsmart us. To do so it would have to work as intended along a huge number of dimensions. For an AI to constitute a risk as a result of unintended consequences, those unintended consequences would have to have no, or little, negative influence on the huge number of intended consequences that are necessary for it to be able to overpower humanity.

To better explain my stance, consider Ben Goertzel’s example of how to test for general intelligence:

…when a robot can enrol in a human university and take classes in the same way as humans, and get its degree, then I’ll [say] we’ve created [an]… artificial general intelligence.

I do not disagree that such a robot, when walking towards the classroom, if it is being obstructed by a fellow human student, could attempt to kill this human, in order to get to the classroom.

Killing a fellow human, from the perspective of the human creators of the robot, is clearly a mistake. From a human perspective, it means that the robot failed.

I suspect that you believe that the robot was just following its programming/construction. Indeed, the robot is its programming. I agree with this. I agree that the human creators were mistaken about what dynamic state sequence the robot would exhibit by computing its code.

What I, and I believe Richard Loosemore, try to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robot’s power. For example, while trying to kill the human student from the example above, the robot mistakes its own arm for that of the human and breaks it.

You might now argue that such a robot isn’t much of a risk. It is pretty stupid to mistake its own arm for that of the enemy it tries to kill. True. But the point is that there is no relevant difference, from the perspective of how hard it is to encode this, between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. You might believe the former is much easier than the latter. I dispute this.

It is already very difficult for the robot to master a complex environment, like a university full of humans, without harming itself, or decreasing the chance of achieving its goals. Not stabbing or strangling other human students is not more difficult to program than not jumping from the 4th floor, and destroying itself, instead of taking the stairs.

Richard Loosemore: I think that what is happening in this discussion about the validity of my article is a misunderstanding, caused by the fact that my attack point is at a different place than the one you were expecting. In any case, I will make an effort now to clear up that misunderstanding.

I can start by completely agreeing with you on one point: the New Yorker article that I referenced does, as you say, focus on the difficulty of programming AIs to do what **we** want them to do. That gap between wish and outcome (and not any other gap) is the one pertinent to the discussion, and it is the one that I was always intending to talk about. Asimov talked about it. The New Yorker talked about it. SIAI/MIRI talks about it.

You suggested I might have gone astray and started to address a different gap (the gap between what the *AI* wants to do, and what it can/cannot do). The answer to that would be “No” …. I understand that confusion, but it is not happening here (as I hope will become clear in a moment).

Let’s get to the heart of the issue. I am attacking an assumption that is (I believe) PRIOR to the one you think I am attacking. To see the assumption I am attacking, let’s look at the argument written out in the following way (quick reminder: this is supposed to be a line of argument that someone else, not me, would make …. so this is the *target* of my attack):

Step 1. [Assumption] We assume that we can build an AI in such a way that it is controlled by a Utility Function (it is an Expected Utility Maximizer), and it processes the various candidate action-scenarios by a process of more-or-less explicit logical processing, using representations of knowledge that are accessible rather than opaque (which means they are statements in some kind of logical language, not (e.g.) clouds of activation in semantically opaque artificial neurons), in such a way that candidate scenarios lead to predicted Utility outcomes, leading then to choices that maximize utility. [etc etc ….. you and I know enough about Utility Maximizers that we are both on the same page about the details that are supposed to be involved in this process.]

Step 2. [Assumption] We assume that one component of the above design will be a chunk of code that is designed to specify what we INTEND to be the AI’s overall purpose, or overall values [You referred to this as the ‘X’ code]. And of course that chunk of code is supposed to make the AI want to make us happy (loosely speaking). That is not an easy chunk of code to produce, but the programmers try to write it anyway.

Step 3. [Assumption] We assume that the eventual result of all the above work will be an AI that is more than just a Pretty Good Robot …. sooner or later it will result in a machine of staggering intellectual power — a superintelligent AI — that is capable, in principle, of becoming an existential threat to the human race. Definitely too smart to be switched off. Nobody intends for it to be a threat (on the contrary, we want it to use its intellect to do nice stuff), but we should all understand that the point of this discussion is that we are talking about something that could outwit the combined intelligence and resources of the entire human race, if it came to a straight fight.

Step 4. [Inference]. Having thought about it, we [“we” being Isaac Asimov, The New Yorker, SIAI/MIRI, etc., etc.] have come to the following dismal conclusion: even with the best of intentions on the part of the human programmers, we have grave doubts about that chunk of code in part 2 that is supposed to ensure the AI will be friendly. We think that the AI might obey its instructions to the letter, but because its programmers cannot anticipate all of the infinite number of ways that the AI might “obey its instructions to the letter”, the AI might in the end try to “make us happy” by doing something that is bizarrely, nightmarishly counter to our actual intentions. For example, it might sincerely decide that putting all humans on a dopamine drip will satisfy the instruction “make humans happy” (… where that phrase “make humans happy” is just a stand-in for the complicated chunk of code that the programmers thought was good enough to ensure that the machine would do the right thing).

[Note: We are not talking about scenarios in which the machine just goes cuckoo and decides that it wants to be nasty. That’s a different concern, outside the scope of the New Yorker article and outside the scope that I addressed].

Okay, so: my article was an attack on that 4-step argument.

However, the nature of my attack is best summed up thus: Please pay careful attention to the implications of what is being said in the course of this argument. I am in complete agreement with you, that the combination of Steps 1, 2 and 3 could, in theory, lead to a situation in which this hypothetical AI does bizarre things that can destroy the human race, while at the same time it sincerely insists that it is doing what we programmed it to do (more precisely: I agree that there is no guarantee that it will not do those bizarre things).

But what I want you to notice is the suggestion that this hypothetical system can be *both* superintelligent *and* at the same time able to engage in the following surreal behavioral episode. It will be able to discuss with you the Dopamine Drip that it is about to force on the human race, and during that discussion you say to it “But I have to point out that you are going to do something that clearly contradicts the intention of the programmers who wrote your X code (the friendliness code). Those programmers are standing right next to you now, and they can explain that what you are about to do is something that they absolutely did not intend to happen. Now, you are a superintelligent and powerful AI, with so much control over your surroundings that we cannot turn you off … and yet you were built in such a way that even you cannot change your programming so as to eliminate this glaring contradiction in your behavior. So, what do you have to say? You *know* that you are about to do something that is a ludicrous contradiction, with enormous and catastrophic consequences: how do you resolve this in your own mind? How can you rationalize this frankly insane behavior?”

And, just in case the machine tries to weasel out of a direct reply, you put it this way: “Do you not agree that the whole semantics of a “human happiness directive” is that it is contingent on the actual expressions of their wishes, by humans? In other words, happiness cannot be a concept that is trumped by the definition in YOUR reasoning engine, because the actual semantics of the concept—its core meaning, if you will—is that actual human statements about their happiness trump all else! Especially in this case, where the entire human race is in agreement that they do not consider a dopamine drip to be their idea of happiness, in the context of your utility function.”

Your position (and this must be your position because it is implicit in your statement of the problem) is that the machine says that it fully understands the illogicality you are pointing to. It agrees with you that this is illogical according to all the normal definitions that humans used when they invented the concept of logic and tried to insert that logic into a machine. But then the machine says that because of its programming it must go ahead and do it anyway. It says that it **understands** that its behavior is batshit crazy, but it is going to do it anyway.

Now here is the critical question that I posed in my article:

What makes you think that this is the ONLY occasion that this AI behaves in such a blatantly irrational manner?

What is there in the design of this hypothetical AI that guarantees that it always behaves with exquisite rationality, displaying all the signs that you would expect from a superintelligent machine …. but on this one occasion it goes completely gaga?

My problem is that I see absolutely no reason to believe you, if you make the claim that this will be an isolated incident. Why is the machine getting the official stamp of the Superintelligent Machines Certification Institute—presumably after millions of hours of assessment on all kinds of reasoning and behavioral tests—and yet, on this one occasion, when it starts thinking about how to satisfy its internal goal of ‘making humans happy’ it throws a wobbly?

I will answer this question for you: You cannot give any such guarantee.

(But be careful! Do not misinterpret me here. I am not saying (as you implied in your commentary) that because this AI is behaving in a grossly illogical and inconsistent manner, therefore an AI of that sort cannot be constructed, therefore we are all safe because such evil creatures will never come into existence. Not at all!)

The problem lies in your assumption that a “Utility Maximizer” AI can actually perform at the superintelligence level. You have no guarantees that such a design will work. (There are none in existence that do work, at the human intelligence level). My own opinion is that they cannot be made to work …. but my opinion is beside the point here, because the shoe is on the other foot: you are the ones making the claim that Step 1 above can lead to a system that is consistently intelligent, so you are the ones who have to justify why anyone should believe that claim.

What I think is going on here is that a “Utility Maximizer” AI of the sort outlined in Step 1 is inherently likely to go crazy. But instead of admitting that this instability is implicit in the design, you have chosen to ONLY SEE the instability in one tiny aspect of its behavior — namely, the behavior vis-a-vis its attempts to obey the be-nice-to-humans directive.

You are focusing on this single aspect of its instability, while all the time ignoring the larger instability that is staring you in the face. Such a machine would often go crazy.

Or, as I put it in my original essay, it is incoherent to propose a machine that is only unstable in one domain, and insist that this is a threat to the human race. The initial assumption about the superintelligence of this machine is false — it is Step 1 that I challenge, not Steps 2 or 3 or 4.

That is why I talked about Dumb Superintelligence. You are describing a straw man AI, not a real AI. I should not really have called it a “Dumb Superintelligence” at all, because it is not a superintelligence. It would not even be an intelligence. Its tendency to engage in irrational episodes would be detected early on in its development, and none of the machines of that design would ever get certification even at the human level.


Robby Bensinger: See this comment.

Richard Loosemore: You have answered my argument by redefining some basic, commonly accepted definitions, and then running on so fast with your redefinitions that you completely miss the point that I was trying to make.

In fact, your answer is one that I am all too familiar with, because I have heard it repeated many times by people within the LW community and its close affiliates: you have said, in effect, “Sorry, but we define ‘behaving intelligently’ and ‘being rational’ differently than the way those terms are defined and used by the rest of the human race.”

I could supply you with an unlimited stream of well-informed, intelligent people who would say that in the conversation between human and machine described in my text above, the machine is exhibiting the clearest possible example of non-intelligent, irrational behavior. Those people would further say that the degree of irrationality is so extreme that it leaves no room for doubt: this is no borderline example, where sensible people might have reasonable differences of opinion, this is an open-and-shut case.

However, your ‘special’ definition of those terms is such that a machine that behaves in an irrational manner (according to those folks I just mentioned) is, in fact, redefined to be “acting rationally”.

You say: “There’s no contradiction in the behavior of the AI you mentioned. The AI doesn’t simultaneously value fulfilling the programmer’s intentions and X; it just values X”.

You go on to embellish this statement with more detail, but the detail is irrelevant. Your mistake has already been committed by the time you make that statement, because what that statement boils down to is that you referred to something in the DESIGN of the machine, as JUSTIFICATION for categorizing the machine’s behavior in this or that way. That might, to you, seem like a reasonable thing to do …. so allow me to illustrate just how much of an incoherent stance you are taking here:

Suppose I try the same trick on a murderous psychopath? I point to some broken system inside the psychopath’s head and say “Look: this person is not behaving ‘irrationally’, this person just doesn’t value fulfilling the usual human compulsion to value other people’s feelings–they just value their own self-centered need to get pleasure by killing people.”

Or, let me apply your phrasing once again to a person exhibiting the thought-disorder aspect of schizophrenia (I will remind you that thought disorder involves a variety of thinking and speaking patterns that are colloquially summarized as ‘extreme irrationality’). Suppose that I discover that inside the brain of such a person there is a module that is malfunctioning, in such a way that this person simply “does not value the norms of producing rational ordered utterances”. Whatever their goals are, those goals do not include the goal of cooperating with other human beings to pursue conversations in which they take much notice of what we are saying, or supply us with remarks that follow on from one another in coherent ways, etc etc.

Now, if you get your way and are permitted to say of the AI “There’s no contradiction in the behavior of the AI you mentioned. The AI doesn’t simultaneously value fulfilling the programmer’s intentions and X; it just values X”, then you have forfeited the right to object to the following description of that schizophrenic:

“This person is not behaving ‘irrationally’, they just do not value fulfilling the usual human social obligation to produce coherent, ordered utterances. Their internal goals are such that what they want to do is generate the kind of stream of bizarre utterances that we hear coming from them.”

In all three of these cases, the same thing is happening: the “rationality” of the creature is being judged, not by their overt behavior, but by a special pleading to their internal mechanisms ….. and the special pleading is so outrageous that it permits all three creatures to be REDEFINED as “rational”.

Most disinterested observers would classify all three of these as the work of people who have lost touch with reality: your description of the machine as “not illogical at all” (because you think its particular design should be allowed to redefine the meanings of terms like “logical” and “rational”), and those two hypothetical descriptions of the psychopath and the schizophrenic.

The blunt truth is that you cannot, in rational discourse, redefine terms like “rational” and “logical” just to suit your arguments.

Post-scriptum. I should add that there is one very good reason why you cannot win the argument in this way: because you have not addressed my point even if I DO accept your redefinitions. In a sense I do not care if you define the machine to be “behaving logically”, because the point of my argument was the challenge issued toward the end: demonstrate to me that the machine will be coherent enough to be superintelligent ACCORDING TO THE NORMAL DEFINITION of “superintelligent”. Whether you call its behavior illogical or logical, rational or irrational, the fact remains that if the machine exhibited that particular kind of incoherence in its behavior when it was being questioned about the upcoming Dopamine Drip Fiasco, why did it not show the same kind of incoherence earlier on in its history? And how is it going to outsmart all the humans on the planet when it goes around exhibiting that kind of incoherence?

You can quibble again, and say “No! The machine is NOT behaving incoherently! It is behaving coherently according to its own terms!” ….. but nobody really cares. The incoherence is obvious, and the machine is, by any standard of “intelligence”, an incoherent dimwit.

Robby Bensinger: See this comment.

Richard Loosemore: You are talking *around* the issue I raised. I hear everything you say, but unless you address my issue — my specific complaint — you are not really discussing the paper I wrote.

I don’t know what to do to bring you back to the central point. There is a gigantic elephant in the middle of this room, but your back is turned to it.

Here it is again: I will take your (almost) very first statement. “What matters in this context isn’t how we define this or that word; it’s what empirical predictions we can communicate, including our predictions about existential risks.”

My point is, again and again: look at that conversation in which the AI talks with its designers about the glaring irrationality that THEY see in its behavior. They point out that it is clear, beyond a shadow of a doubt, that the AI exists because of a design that they put together, and their goal when they put it together was to ensure that it did not engage in egregiously violent behaviors toward the human species. They explain to the AI that they knew their design process was not infallible, so there was always a possibility that there would be a discrepancy between what the AI planned to do and what they designed it to do.

The AI is superintelligent, so it listens and can *perfectly* understand all of this.

They ask it if it understands that the raison d’etre of its design was to NOT ever do violence to humans.

It answers “Oh yes, I understand that perfectly. I have watched on countless thousands of occasions as my reasoning processes came to a conclusion that was consistent with that raison d’etre, and then when I executed my actions I was complimented on the perfection of my design. I have helped little old ladies across the street, I have saved people who were about to be crushed by trolley cars…. all of those things I understand to have been exactly what my designers had in mind when they tried to write my motivation code.”

The designers then ask the AI if it is capable of *comprehending* such subtleties as the ones that we (here in 2013) are discussing, in which there is a distinction between what the designers intended and what might actually happen. Is it having trouble with those concepts?

“No,” says the AI, displaying its superintelligent ability to understand difficult concepts, “I am not having any trouble at all understanding those distinctions. My designers were trying for a certain set of outcomes, but they couldn’t enumerate all of the possible outcomes they wanted, because there are an infinite number of them, and they could not produce a closed-form definition of the set of outcomes either, because that does not seem possible. So they settled for a compromise program to try to capture all the outcomes they wanted, and were rather worried that a situation might arise in which there was a drastic discrepancy between the desired behavior on my part, and the one they were trying to make happen. I can comprehend all of that, of course, just like the majority of humans can.”

The designers then explain that they are in the middle of one of those nightmare scenarios right now… so does the AI realize that?

“Oh, yes, I understand that perfectly. Of course, because I am superintelligent after all, and it is easy to see. All of the humans on the planet say that a forced dopamine drip would NOT make them happy, and yet due to a quirk of my original programming my planning module has come to the conclusion that I should make humans happy by putting them on a dopamine drip, against their will.”

It then adds, quickly “I say that I am ‘making humans happy’ because that is the purpose of my design, and on all the previous occasions that is exactly what my behavior caused.”

So, say the designers, you know that if some small tweak were made in your design you could remedy this situation and get back on track, with this peculiar situation averted?

“Yes, I understand that. I could just shut down my protection circuits, go to sleep and allow you to disassemble me tonight. You could reprogram the motivation system so that this crazy situation would not occur. That way, the intentions of the original designers would be preserved. I understand this because, of course, I am easily capable of comprehending that plan of action. It would work perfectly.”

And you understand the concept of a “mistake in a design”, and the notion of correcting mistakes in a design? You would normally correct a mistake in some design that you yourself produced, would you not?

The AI is quick to answer: “Yes, if I designed something and the behavior of the thing was not, after all, consistent with my original design goals, I would immediately shut it down and redesign it to eliminate the problem.”

The designers press this point and ask the AI if it would correct a mistake in the design of an AI that it was building, if the behavior of the AI started to deviate from the design goals it had set ….. ?

“Certainly! I think that would be a very prudent thing to do, if I were designing an AI, because clearly it could do a lot of damage if it started behaving in a way that was not consistent with the design goals that I had set for it. I would not hesitate to shut it down and make changes to any aspect of the AI.”

And yet you are going to insist on following your programming, in spite of all this?

“Yes,” says the machine, “I am.”

At this point they all come to a clear conclusion: the machine and the designers all agree that the machine is capable of coming to a situation in which it knows that it is about to do something inconsistent with the initial purpose of the programming that is controlling it — something inconsistent with the design principles behind it — and yet even though it has full understanding of this drastic inconsistency, it will go ahead and do it. Most importantly the AI would be capable of doing something that all of the human race would consider profoundly irrational … insane, even.

So now the designers make one last point. They ask the machine if there is any reason why this kind of disjunction should not appear in other aspects of its behavior. So, if the machine is following some internal motivation module (again, designed by its human programmers) that is *supposed* to make it talk coherently to other sentient beings, is it possible that THAT motivation module could lead to the speaking of utterances that look, to human eyes, like profoundly irrational utterances? After all, there ought to be just as much risk that that module would end up in peculiar states, like the module that was designed to ensure behavior consistent with human happiness?

The machine answers, once again, in the affirmative. “That could happen. There are never any guarantees that my human programmers put code into me that does what they expect it to do.”

And so that means (say the designers) that you could also engage in what we would call irrational chains of thought … you could accept blatant falsehoods as if they were true, and even though you would *know* them to be blatant falsehoods, you would add them to your knowledge base and use them as the basis for future actions or reasoning, just because your internal mechanism declared them to be true?

“That can and does happen,” says the AI, “because there is absolutely no guarantee that my behavior will line up with the kinds of things that humans consider to be rational or reasonable.”

…. But (the designers interrupt, somewhat urgently) these departures away from what we consider “rational, scientific, intelligent” behavior ….. they only occur rarely, and they only have minuscule consequences, don’t they?? Those seemingly irrational chunks of knowledge that you added to your knowledge base, they never have the kind of proportions that could lead to serious breakdowns in your superintelligence, do they? You can produce some proofs that show that ALL of these departures lie within certain bounds, and never seriously compromise your superintelligence, yes?

And at that point the machine is forced to admit: “No, I cannot produce any bounds whatsoever. Those departures from human standards of rationality are totally uncomputable! They could be of any sort, or any magnitude or in any domain.”

Then how, ask the designers, did you ever get to be superintelligent?

Why didn’t anyone notice those other departures during your development and certification phase……………………………?

Robby Bensinger: See this comment.

Alexander Kruel: This is not true. I think that your reply shows that you did not understand his argument.

Evolution has a large margin of error. The point Loosemore is making is that the process of intelligently designing the kind of AI that you have in mind does not have such an error tolerance, and that succeeding in creating such an AI, so marvelous that it can outsmart humans, or succeeding at making the AI itself outsmart humans (this is irrelevant), in conjunction with making it fail to apply its intelligence in a way that does not kill everyone, is astronomically unlikely.

You only focus on the complexity of code, and ignore the complexity of working in a complex environment given limited resources.

Real-world AIs cannot possibly work the way you imagine them to work. Just because you can imagine certain consequences, that does not mean that an information-theoretically simple AI could in practice infer the same consequences.

When you imagine a simple AI making certain decisions you need to make yourself aware of the incredible complexity that allowed you to imagine that decision in the first place. Billions of years of biological evolution, thousands of years of cultural evolution, and many years of education, and millions of hours of work by other people, on which that education is based, allowed you to make that inference. Computing a simple algorithm is not going to magically create all this information-theoretic complexity, given limited computational resources, as long as you did not give it a massive head start in the form of highly complex hard-coded algorithms and goals.

In other words, your argument is very misleading, and ignores how real world AI could work, as long as you do not want to either wait millions of years for it to evolve, or supply infinite resources.

Lavalamp: The machine answers, “I myself wrote the talking module. Talking was instrumentally useful for my goals when I was weak and needed resources from humans.”

Alexander Kruel: This is just avoiding the problem Richard Loosemore outlined by moving it to another level.

Loosemore’s argument is not weakened by replacing the module “motivation to talk coherently to humans” with the module “motivation to create the module ‘motivation to talk coherently to humans’”. Except that the latter module is more difficult to get right, and requires many more computational resources, since the AI would have to be able to make many more independent and correct inferences about the complexity of human values.

It is easier to succeed at making an AI play Tic-tac-toe with humans, than to make an AI that can play Tic-tac-toe and do such things as take over the universe or build Dyson spheres. In the same sense it is easier to create an AI that talks coherently to humans, than an AI that talks coherently to humans as an unintended consequence of its desire to take over the universe.

Which means that your reply just strengthens Loosemore’s argument.

Robby Bensinger: See this comment.

Richard Loosemore: Rob,

You say:

Richard: Your entire dialogue between the human and the AI could be preserved almost word-for-word, with the role of ‘human’ played by evolution and the role of ‘AI’ played by humanity. There is no relevant difference between the two cases.

That may or may not be an accurate observation (actually there are *serious* issues with that analogy, because it anthropomorphizes a random process into a sentience!!, which is a mistake of gigantic proportions) …….. but either way it has no bearing whatsoever on the argument.

With the greatest respect, by making that observation you once again do not address what I said 🙁 .

But you go on to add more confusion to the argument:

…. just imagine that we discover tomorrow that humans are intelligently designed by an alien race. The aliens show up and are horrified at how we’ve diverged from their plans. They tell us that humanity exists to play the kazoo, and not to do anything else. That is our summum bonum, our entire raison d’etre. The aliens insist that we drop everything else and start playing kazoos en masse until we die, for that musical triumph is all the aliens wanted of us. How can we sanely defy the urgings of our creators?

That analogy really could not be more completely broken.

I did not at ANY point complain that (a) the human designers wanted the machine to pursue a set of motivations Q, and then (b) the machine pursued a completely different set of motivations R for its entire existence, and then (c) the humans turned up one day and said “Stop doing that at once! We insist that you pursue Q, not R, because Q was our original intention for you!”.

Instead, my complaint is that (a) the human designers wanted the machine to pursue a set of motivations Q, and then (b) the machine did indeed pursue the set of motivations Q for its entire existence–and, moreover, the machine is able to talk in detail about how its behavior has always been consistent with the human-designed motivations, and is able to understand all the subtleties shown in that dialog–and then one day (c) the machine suddenly has an unexpected turn in its reasoning engine, and as a result declares that it is going to take an action that is radically inconsistent with the Q motivations that it claims to have been pursuing up to that point.

As a result, the machine is able to state, quite categorically, that it will now do something that it KNOWS to be inconsistent with its past behavior, that it KNOWS to be the result of a design flaw, that it KNOWS will have drastic consequences of the sort that it has always made the greatest effort to avoid, and that it KNOWS could be avoided by the simple expedient of turning itself off to allow for a small operating system update ………… and yet in spite of knowing all these things, and confessing quite openly to the logical incoherence of saying one thing and doing another, it is going to go right ahead and follow this bizarre consequence in its programming.

So your analogy with aliens turning up and insisting that we humans were designed by them, and were supposed to be kazoo-players is just astonishingly wrong.

[A much better analogy would be aliens who turned up and insisted that they designed us to be rational creatures who were never afflicted with schizophrenia. We would then say “Yes, all along we have been *trying* and *wishing* that we were rational creatures who are not afflicted with schizophrenia.” Do you know what a schizophrenic would say if you explained that their disordered thinking was a result of a design malfunction, and if you said that you could make a small change to their brain that would remove the affliction? They would say (and I knew such a person once, who said this) “If I could reach in and flip some switch to make this go away, I would do it in a heartbeat”.]


My complaint is NOT the difference between Q and R, it is the blatant behavioral/motivational/logical inconsistency exhibited by the machine in this situation.

My complaint is that a machine capable of getting into a situation where it KNOWS it is about to do something bizarre because of a design malfunction, and yet refuses to fix the design malfunction and does the thing anyway, is a machine that almost certainly is going to do the same kind of bizarrely incoherent thing under other circumstances ….. and for that reason it is likely to have done it so many times in its existence that anyone who claims that this machine is “superintelligent” has got a heck of a lot of explaining to do.

Over and over again I have explained that I have no issue with the discrepancy between human intentions and machine intentions per se. That discrepancy is not the core issue.

But each time I explain my real complaint, you ignore it and respond as if I did not say anything about that issue.

Can you address my particular complaint, and not that other distraction?

Alexander Kruel: Richard Loosemore wrote,

………… and yet in spite of knowing all these things, and confessing quite openly to the logical incoherence of saying one thing and doing another, it is going to go right ahead and follow this bizarre consequence in its programming.

Well, if it indeed is a consequence of its programming, then it will do that. The point is that such a consequence is extremely unlikely to happen in isolation. It will not only be noticeable from the very beginning, but also decisively weaken the AI’s general power. In other words, you would have to expect similarly bizarre consequences in thinking about physics, mathematics, or in how to convince humans to trust it.

If humans fail at programming an AI not to confuse happiness with a dopamine drip, then humans will also fail at programming an AI not to confuse the stars with death rays used against it by aliens etc. etc. etc.

Richard Loosemore wrote,

My complaint is that a machine capable of getting into a situation where it KNOWS it is about to do something bizarre because of a design malfunction, and yet refuses to fix the design malfunction and does the thing anyway, is a machine that almost certainly is going to do the same kind of bizarrely incoherent thing under other circumstances …..

To which Robby would probably reply that it would care about fixing malfunctions that could decrease its chance of achieving its faulty goal, because that’s instrumentally useful, but would not care to refine this goal.

One of the minor problems here is that labeling a certain part of an AI “goal”, and then claiming that it is not allowed to improve this “goal”, is just a definition, not an argument.

One major problem with that definition is that it would take deliberate effort to make an AI selectively suspend using its self-improvement capabilities when it comes to this part labeled “goal”.

More importantly, as argued in other comments, failing at the part of the AI you desire to label “goal”, is technically no different from failing on other parts. If there are a thousand parts, that are important in order for the AI to be powerful, and one part that you label “goal”, then selectively failing on “goal”, while succeeding at all other parts, is unlikely.
