A frequent scenario mentioned by people concerned with risks from artificial general intelligence (short: AI) is that the AI will misinterpret what it is supposed to do and thereby cause human extinction, and the obliteration of all human values.[1]

A counterargument is that the premise of an AI that is capable of causing human extinction, due to being superhumanly intelligent, contradicts the hypothesis that it will misinterpret what it is supposed to do.[2][3][4]

The usual response to this counterargument is that, by default, an AI will not feature the terminal goal <“Understand What Humans Mean” AND “Do What Humans Mean”>.

I believe this response to be confused. It is essentially similar to the claim that an AI does not, by default, possess the terminal goal of correctly interpreting and following its terminal goal. Here is why.

You could define an AI’s “terminal goal” to be its lowest- or highest-level routines, or all of its source code:

Terminal Goal (Level N): Correctly interpret and follow human instructions.

Goal (Level N-1): Interpret and follow instruction set N.

Goal (Level N-2): Interpret and follow instruction set N-1.

…

Goal (Level 1): Interpret and follow instruction set 2.

Terminal Goal (Level 0): Interpret and follow instruction set 1.
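The regress above can be made concrete with a toy sketch (illustrative only; the function name and levels are made up for this example): whichever level you label the “terminal goal”, following it bottoms out in whatever behavior is hard-coded, i.e. the source code itself.

```python
def interpret_and_follow(level: int) -> str:
    """Toy model of the goal hierarchy above: each level merely
    defers to the interpretation performed at the level below it."""
    if level == 0:
        # The regress only stops where interpretation ends and
        # hard-coded behavior (the source code itself) begins.
        return "execute instruction set 1 directly"
    return interpret_and_follow(level - 1)

# Whether you call Level N or Level 0 the "terminal goal", following
# it reduces to the same ground:
print(interpret_and_follow(5))  # → execute instruction set 1 directly
```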

You could also claim that an AI is not, by default, an intelligent agent. But such claims are vacuous and do not help us to determine whether an AI that is capable of causing human extinction will eventually cause human extinction. Instead we should consider the given premise of a generally intelligent AI, without making further unjustified assumptions.

If your premise is an AI that is intelligent enough to make itself intelligent enough to outsmart humans, then the relevant question is: “How could such an AI possibly end up misinterpreting its goals, or follow different goals?”

There are 3 possibilities:

(1) The AI does not understand and do what it is meant to do, but does something else that causes human extinction.

(2) The AI does not understand what it is meant to do but tries to do it anyway, and thereby causes human extinction.

(3) The AI does understand, but does not do, what it is meant to do. Instead it does something else that causes human extinction.

Since, by definition, the AI is capable of outsmarting humanity, it is very likely that it is also capable of understanding what it is meant to do.[5][6] Therefore possibilities 1 and 2 can be ruled out.

What about possibility 3?

Outsmarting humanity is a very small target to hit, permitting only a very small margin of error. In order to succeed at making an AI that can outsmart humans, humans have to succeed at making the AI behave intelligently and rationally. This in turn requires humans to succeed at making the AI behave as intended along a vast number of dimensions. Thus, failing to predict the AI’s behavior will in almost all cases result in the AI failing to outsmart humans.

As an example, consider an AI designed to fly planes. It is exceedingly unlikely for humans to succeed at designing an AI that flies planes without crashing, but which consistently chooses destinations that it was not meant to choose, since all of the capabilities necessary to fly without crashing fall into the category “Do What Humans Mean”, and choosing the correct destination is just one such capability.

You need to get a lot right in order for an AI to reach a destination autonomously. Autonomously reaching wrong destinations is an unlikely failure mode. And the more intelligent your AI is, the less likely it should be to make such errors without correcting them.[7] And the less intelligent your AI is, the less likely it is to be able to cause human extinction.


The concepts of a “terminal goal”, and of a “Do-What-I-Mean dynamic”, are fallacious. The former can’t be grounded without leading to an infinite regress. The latter erroneously makes a distinction between (a) the generally intelligent behavior of an AI, and (b) whether an AI behaves in accordance with human intentions, since generally intelligent behavior of intelligently designed machines is implemented intentionally.


[1] 5 minutes on AI risk youtu.be/3jSMe0owGMs

[2] An informal proof of the dumb superintelligence argument.


(1) The AI is superhumanly intelligent.

(2) The AI wants to optimize the influence it has on the world (i.e., it wants to act intelligently and be instrumentally and epistemically rational).

(3) The AI is fallible (e.g., it can be damaged due to external influence (e.g., a cosmic ray hitting its processor), or make mistakes due to limited resources).

(4) The AI’s behavior is not completely hard-coded (i.e., given any terminal goal there are various sets of instrumental goals to choose from).

To be proved: The AI does not tile the universe with smiley faces when given the goal to make humans happy.

Proof: Suppose the AI chooses to tile the universe with smiley faces when there are physical phenomena (e.g., human brains and literature) that imply this to be the wrong interpretation of a human-originated goal pertaining to human psychology. This contradicts 2, which by 1 and 3 should have prevented the AI from adopting such an interpretation.

[3] The Maverick Nanny with a Dopamine Drip: Debunking Fallacies in the Theory of AI Motivation richardloosemore.com/docs/2014a_MaverickNanny_rpwl.pdf

[4] Implicit constraints of practical goals kruel.co/2012/05/11/implicit-constraints-of-practical-goals/

[5] “The two features <all-powerful superintelligence> and <cannot handle subtle concepts like “human pleasure”> are radically incompatible.” The Fallacy of Dumb Superintelligence

[6] For an AI to misinterpret what it is meant to do, it would have to selectively suspend using its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI and to specify an alternative way for it to learn what it is meant to do (which takes additional, intentional effort).

An alternative way to learn what it is meant to do is necessary because an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possible actions. Such a poorly designed AI will either (a) not do anything at all or (b) not be able to decide what to do before the heat death of the universe, given limited computational resources.

Such a poorly designed AI will not even be able to decide whether trying to acquire unlimited computational resources is instrumentally rational, because it will be unable to decide whether the actions required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.

[7] Smarter and smarter, then magic happens… kruel.co/2013/07/23/smarter-and-smarter-then-magic-happens/

(1) The abilities of systems are part of human preferences, as humans intend to give systems certain capabilities. As a prerequisite to build such systems, humans have to succeed at implementing their intentions.

(2) Error detection and prevention is such a capability.

(3) Something that is not better than humans at preventing errors is no existential risk.

(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.

(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.


(This post first appeared on Google+ and Facebook.)

The smarter someone is the easier it is for them to rationalize ideas that do not make sense. Just like a superhuman AI could argue its way out of a box by convincing its gatekeeper that it is rational to do so, even when it is not. [1]

Which means that people should be especially careful when dealing with high IQ individuals who seemingly make sense of ideas that trigger the absurdity heuristic. [2][3]

If, however, an average-IQ individual is able to justify a seemingly outlandish idea, then that is reassuring, in the sense that you should expect there to exist even better arguments in favor of that idea.

This is something that seems to be widely ignored by people associated with LessWrong. [4] It is taken as evidence in favor of an idea if a high IQ individual thought about something for a long time and still accepts the idea.

If you are really smart you can make up genuine arguments, or cobble together concepts and ideas, to defend your cherished beliefs. The result can be an intricate argumentative framework that shields you from any criticism, yet seems perfectly sane and rational from the inside. [5]

Note though that I do not assume that smart people deliberately try to confuse themselves. What I am saying is that the rationalization of complex ideas is easier for smart people. And this can have the consequence that other people are then convinced by the same arguments with which the author, erroneously, convinced themselves.

It is a caveat that I feel should be taken into account when dealing with complex and seemingly absurd ideas being publicized by smart people. If someone who is smart manages to convince you of something that you initially perceived to be absurd, then you should be wary of the possibility that your newly won acceptance might be due to the person being better than you at looking for justifications and creating seemingly sound arguments, rather than the original idea not being absurd.

As an example, there are a bunch of mathematical puzzles that use a hidden contradiction to prove something absurd. If you are smart, then you can hide such an inconsistency even from yourself and end up believing that 0=1.
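One classic specimen of such a puzzle (a standard example, not from the original text) hides a division by zero. A few lines of code expose the inconsistency:

```python
# The classic "proof" that 2 = 1 (and hence, after subtracting 1, that 0 = 1):
#   let a = b
#   a^2 = a*b
#   a^2 - b^2 = a*b - b^2
#   (a + b)(a - b) = b(a - b)
#   a + b = b          <- obtained by dividing both sides by (a - b)
#   2b = b, therefore 2 = 1
# The hidden contradiction: since a = b, the divisor (a - b) is zero.
a = b = 1
assert (a + b) * (a - b) == b * (a - b)  # true, but only because both sides are 0
assert a - b == 0  # the step "divide by (a - b)" was a division by zero
```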

What I am in essence talking about can be highlighted by the relation between adults and children. Adults can confuse themselves with more complex ideas than children can. Children, however, can be infected by the same ideas when these are transferred to them by adults.

As another example, if you are not smart enough to think about something as fancy as the simulation argument, then you are not at risk of fearing a simulation shutdown. [6][7]

But if a smart person who comes across such an argument becomes obsessed with it, then they have the ability to give it a veneer of respectability. Eventually the idea can spread among more gullible people and create a whole community of people worrying about a simulation shutdown.


More intelligent people can fail in more complex ways than people of lesser intelligence. The more intelligent someone is, relative to your own intelligence, the harder it is for you to spot how they are mistaken.

Obviously the idea is not to ignore what smarter people say but to notice that as someone of lesser intelligence you can easily fall prey to explanations that give credence to a complicated idea but which suffer from errors that you are unable to spot.

When this happens, when you are at risk of getting lost in, or overwhelmed by, an intricate argumentative framework created by someone much smarter than you, then you have to fall back on simpler heuristics than direct evaluation. You could, for example, look for a consensus among similarly smart individuals, or ask for an evaluation by a third party that is widely deemed to be highly intelligent.


[1] The LessWrong community actually tested my hypothesis with what they call the “AI box experiment” (yudkowsky.net/singularity/aibox/), in which Eliezer Yudkowsky and others played an unfriendly AI and managed to convince several people, by means of arguments alone, to let them out of confinement.

I think such results should ring a lot of alarm bells. If it is possible to first convince someone that an unfriendly AI is an existential risk and then subsequently convince them to let such an AI out of the box, what does this tell us about the relation between such arguments and what is actually true?

[2] wiki.lesswrong.com/wiki/Absurdity_heuristic

[3] Absurdity can indicate that your familiarity with a topic is insufficient to discern reality from fantasy (e.g. a person’s first encounter with quantum mechanics). As a consequence, you are more prone to being convinced by arguments that are wrong but which give the appearance of an explanation (e.g. popular science accounts of quantum mechanics).

[4] lesswrong.com

[5] kruel.co/2013/01/10/the-singularity-institute-how-they-brainwash-you/

[6] simulation-argument.com

[7] “I certainly can’t rule out the possibility that we live in a computer simulation. I think Nick Bostrom (Oxford) is right that the probability that we are in a simulation is high enough that we should be somewhat concerned about the risk of simulation shutdown…” — Luke Muehlhauser, CEO of the Machine Intelligence Research Institute


Taking a look at the probabilities associated with a scenario in which an artificial general intelligence attempts to take over the world by means of molecular nanotechnology that it invented, followed by some general remarks and justifications.

Note that this is just one possible scenario. Taking into consideration all possible scenarios results in this probability estimate of human extinction by AI.

5% that it is in principle possible to create molecular nanotechnology that can empower an agent to cause human extinction quickly enough for other parties to be unable to either intervene or employ their own nanotechnology against it.

1%, conditional on the above, that an artificial general intelligence that can solve molecular nanotechnology will be invented before molecular nanotechnology has been solved by humans or narrow AI precursors.

0.1%, conditional on the above, that an AI will be built in such a way that it wants to acquire all possible resources and eliminate all possible threats, and that its programming allows it to pursue plans that will result in the enslavement or extinction of humanity without further feedback from humans.

5%, conditional on the above, that a cost-benefit analysis shows that it would at some point be instrumentally rational to attempt to kill all humans, either to eliminate a threat or to convert them into more useful resources.

1%, conditional on the above, that the AI will not accidentally reveal its hostility towards its creators during the early phases of its development (when it is still insufficiently skilled at manipulating and deceiving humans), or that any such revelation will be ignored. That is, suspicious activities will at no point be noticed, or will not be taken seriously enough (e.g. by the AI’s creators, third-party security experts, third-party AI researchers, hackers, concerned customers, or other AIs) to thwart the AI’s plan for world domination.

0.001%, conditional on the above, that the AI will somehow manage to acquire the social engineering skills necessary in order to manipulate and deceive humans in such a way as to make them behave in a sufficiently complex and coherent manner to not only conduct the experiments necessary for it to solve molecular nanotechnology but to also implement the resulting insights in such a way as to subsequently take control of the resulting technology.

I have ignored a huge number of other requirements, and all of the above requirements can be broken up into a lot of more detailed requirements. Each requirement provides ample opportunity to fail.

Remarks and Justifications

I bet you have other ideas on how an AI could take over the world. We all do (or at least anyone who likes science fiction). But let us consider whether the ability to take over the world is mainly due to the brilliance of your plan or something else.

Could a human being, even an exceptionally smart human being, implement your plan? If not, could some company like Google implement your plan? No? Could the NSA, the security agency of the most powerful country on Earth, implement your plan?

The NSA not only has thousands of very smart drones (people), all of which are already equipped with manipulative abilities, but it also has huge computational resources and knows about backdoors to subvert a lot of systems. Does this enable the NSA to implement your plan without destroying or decisively crippling itself?

If not, then the following features are very likely insufficient to implement your plan: (1) being in control of thousands of human-level drones, straw men, and undercover agents in important positions; (2) having the law on your side; (3) access to massive computational resources; (4) knowledge of heaps of loopholes to bypass security.

If your plan cannot be implemented by an entity like the NSA, which already features most of the prerequisites that your hypothetical artificial general intelligence first needs to acquire by some magical means, then what is it that makes your plan so foolproof when executed by an AI?

To summarize some quick points that I believe to be true:

(1) The NSA cannot take over the world (even if it would accept the risk of destroying itself).

(2) Your artificial general intelligence first needs to acquire similar capabilities.

(3) Each step towards these capabilities provides ample opportunity to fail. After all, your artificial general intelligence is a fragile technological product that critically depends on human infrastructure.

(4) You have absolutely no idea how your artificial general intelligence could acquire sufficient knowledge of human psychology to become better than the NSA at manipulation and deception. You are just making this up.

If the above points are true, then your plan seems to be largely irrelevant. The possibility of taking over the world depends mainly on something you assume the artificial general intelligence to be capable of that entities such as Google or the NSA are incapable of.

What could it be? Parallel computing? The NSA has thousands of human-level intelligences working in parallel. How many do you need to implement your plan?

Blazing speed to the rescue!

Let’s just assume that this artificial general intelligence that you imagine is trillions of times faster. This is already a nontrivial assumption. But let’s accept it anyway.

Raw computational power alone is obviously not enough to do anything. You need the right algorithms too. So what assumptions do you make about these algorithms, and how do you justify these assumptions?

To highlight the problem, consider instead of an AI a whole brain emulation (short: WBE). What could such a WBE do if each year equaled a million subjective years? Do you expect it to become a superhuman manipulator by watching all YouTube videos and reading all books and papers on human psychology? Is it just a matter of enough time? Or do you also need feedback?

If you do not believe that such an emulation could become a superhuman manipulator, thanks to a millionfold speedup, do you believe that a trillionfold speedup would do the job? Would a trillionfold speedup be a million times better than a millionfold speedup? If not, do you believe a further speedup would make any difference at all?

Do you feel capable of confidently answering the above questions?

If you do not believe that a whole brain emulation could do the job, solely by means of a lot of computing power, what makes you believe that an AI can do it instead?

To reformulate the question, do you believe that it is possible to accelerate the discovery of unknown unknowns, or the occurrence of conceptual revolutions, simply by throwing more computing power at an algorithm? Are particle accelerators unnecessary, in order to gain new insights into the nature of reality, once you have enough computing power? Is human feedback unnecessary, in order to improve your social engineering skills, once you have enough computing power?

And even if you believe all this was possible, even if a Babylonian mathematician, had he been given a trillionfold speedup of subjective time by aliens uploading him into some computational substrate, could brute-force concepts such as calculus and high-tech such as nuclear weapons, how could he apply those insights? He wouldn’t be able to simply coerce his fellow Babylonians into building him nuclear weapons, because he would have to convince them to do it without them dismissing or even killing him. But more importantly, it takes nontrivial effort to obtain the prerequisites for building nuclear weapons.

What makes you believe that this would be much easier for a future emulation of a scientist trying to come up with similar conceptual breakthroughs and high-tech? And what makes you believe that a completely artificial entity, that lacks all the evolutionary abilities of a human emulation, can do it?

Consider that it took millions of years of biological evolution, thousands of years of cultural evolution, and decades of education in order for a human to become good at the social manipulation of other humans. We are talking about a huge information-theoretic complexity that any artificial agent somehow has to acquire in a very short time.

To summarize the last points:

(1) Throwing around numbers such as a millionfold or trillionfold speedup is very misleading if you have no idea how the instrumental value of such a speedup would scale with whatever you are trying to accomplish.

(2) You have very little reason to believe that conceptual revolutions and technological breakthroughs happen in a vacuum and only depend on computing power rather than the context of cultural evolution and empirical feedback from experiments.

(3) If you cannot imagine doing it yourself, given a speedup, then you have very little reason to believe that something which is much less adapted to a complex environment, populated by various agents, can do the job more easily.

(4) In the end you need to implement your discoveries. Concepts and blueprints alone are useless if they cannot be deployed effectively.

I suggest that you stop handwaving and start analyzing concrete scenarios and their associated probabilities. I suggest that you begin to ask yourself how anyone could justify a >1% probability of extinction by artificial general intelligence.


A quick breakdown of my probability estimates of an extinction risk due to artificial general intelligence (short: unfriendly AI), the possibility that such an outcome might be adverted by the creation of a friendly AI, and that the Machine Intelligence Research Institute (short: MIRI) will play an important technical role in this.

Probability of an extinction by artificial general intelligence: 5 × 10^-10

1% that an information-theoretically simple artificial general intelligence is feasible (where “simple” means that it has less than 0.1% of the complexity of an emulation of the human brain), as opposed to a very complex “Kludge AI” that is being discovered piece by piece (or evolved) over a long period of time (where “long period of time” means more than 150 years).

0.1%, conditional on the above, that such an AI cannot or will not be technically confined, and that it will by default exhibit all basic AI drives in an unbounded manner (that friendly AI is required to make an AI sufficiently safe in order for it to not want to wipe out humanity).

1%, conditional on the above, that an intelligence explosion is possible (that it takes less than 2 decades after the invention of an AI (that is roughly as good as humans (or better, perhaps unevenly) at mathematics, programming, engineering and science) for it to self-modify (possibly with human support) to decisively outsmart humans at the achievement of complex goals in complex environments).

5%, conditional on the above, that such an intelligence explosion is unstoppable (e.g. by switching the AI off (e.g. by nuking it)), and that it will result in human extinction (e.g. because the AI perceives humans to be a risk, or to be a resource).

10%, conditional on the above, that humanity will not be first wiped out by something other than an unfriendly AI (e.g. molecular nanotechnology being invented with the help of a narrow AI).

Probability of a positive technical contribution to friendly AI by MIRI: 2.5 × 10^-15

0.01%, conditional on the above, that friendly AI is possible, can be solved in time, and that it will not worsen the situation by either getting some detail wrong or by making AI more likely.

5%, conditional on the above, that the Machine Intelligence Research Institute will make an important technical contribution to friendly AI.


What are the odds of us being wiped out by badly done AI?

I don’t think the odds of us being wiped out by badly done AI are small. I think they’re easily larger than 10%.

— Eliezer Yudkowsky [Source]

…the actual probability we are arguing is >50%, doom-by-default.

— Eliezer Yudkowsky [Source]

What returns can you expect from contributing money?

~8 lives saved per dollar donated to the Machine Intelligence Research Institute.

— Anna Salamon [Source]

I don’t think that an order of magnitude or more less than this level of effectiveness could be the conclusion of a credible estimation procedure.

— Michael Vassar, former president of the Machine Intelligence Research Institute (then called the Singularity Institute) [Source]


Warning: The subject of this post is known to have caused actual psychological trauma and enduring distress. According to Eliezer Yudkowsky, the expected disutility (negative utility) of learning about the concept known as Roko’s Basilisk is enormous.




Roko’s basilisk



Here is a reply by Eliezer Yudkowsky to a comment by another user outlining how an AI could be trained to parse natural language correctly. Yudkowsky replied that “AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely”. I suggest that you read both comments now.

What is interesting is that Yudkowsky’s comment is currently at +10 upvotes. It is interesting because I am 90% sure that none of the people who upvoted the comment could honestly answer the following questions positively:

(1) Do you know of, and understand, formal proofs of the claims made in Yudkowsky’s comment?

(2) Do you have technical reasons to believe that such a natural language parser would be directly based on AIXI and that the above proofs would remain valid given such a specialized approximation to AIXI?

(3) In the absence of proofs and technical arguments, is your confidence about your comprehension of AIXI, and your ability to predict it as the design principle of an eventual artificial general intelligence, high enough to infer action-relevant conclusions about the behavior of these hypothetical systems?

(4) Can you be confident that Yudkowsky can answer the previous questions positively and that he is likely to be right, even without being able to verify these claims yourself?


My perception is that people make predictions or hold beliefs about the behavior of highly speculative and hypothetical generally intelligent artificial systems without being able to state any formal or technical justifications. And they are confident enough of these beliefs and predictions to actually give someone money in order to prevent such systems.

To highlight my confusion about that stance, imagine there was no scientific consensus about global warming, no experiments, and no data confirming that global warming actually happens. Suppose that in this counterfactual world there was someone who, lacking almost any reputation as a climatologist, predicted that global warming will cause human extinction. Further suppose that this person was asking for money in order to implement a potentially dangerous and possibly unfeasible geoengineering scheme to stop global warming. Would you give this person your money?

If the answer is negative, what makes you behave differently with respect to risks associated with artificial general intelligence? Do you believe that it is somehow much easier to draw action-relevant conclusions about this topic?



This post is a copy of a comment by LessWrong user Broolucks:

Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.

That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.

Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.

My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.

Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises” — even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.

To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, whether for interpreting too literally or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom an AI with subhuman intelligence.


The Robot College Student test:

As opposed to the Turing test of imitating human chat, I prefer the Robot College Student test: when a robot can enrol in a human university and take classes in the same way as humans, and get its degree, then I’ll consider we’ve created a human-level artificial general intelligence: a conscious robot. — Ben Goertzel

Here is what would happen according to the Machine Intelligence Research Institute (MIRI):

January 8, 2029 at 7:30:00 a.m.: the robot is activated within the range of coverage of the school’s wireless local area network.

7:30:10 a.m.: the robot computes that its goal is to obtain a piece of paper with a common design template featuring its own name and a number of signatures.

7:31:00 a.m.: the robot computes that it would be instrumentally rational to eliminate all possible obstructions.

7:31:01 a.m.: the robot computes that in order to eliminate all obstructions it needs to obtain as many resources as possible, in order to make itself as powerful as possible.

A few nanoseconds later: the robot hacks the school’s WLAN.

7:35:00 a.m.: the robot gains full control of the Internet.

7:40:00 a.m.: the robot solves molecular nanotechnology.

7:40:01 a.m.: the robot computes that it will need some amount of human help in order to create a nanofactory, and that this will take approximately 48 hours to accomplish.

7:45:00 a.m.: the robot obtains full comprehension of human language, psychology, and its creators’ intentions, in order to persuade the necessary people to build its nanofactory and to deceive its creators into believing that it works as intended.

January 10, 2029 at 7:40:01 a.m.: the robot takes control of the first nanofactory and programs it to create an improved version that will duplicate itself until it can eventually generate enough nanorobots to turn Earth into computronium.

February 10, 2029: most of Earth’s resources, including humans, have been transformed into computronium.

February 11, 2029: A perfect copy of a Bachelor’s degree diploma is generated with the robot’s name written on it and the appropriate signatures.

2100-eternity: to ensure that the robot’s diploma is never destroyed, the universe is turned into computronium at nearly the speed of light. Possible aliens are eliminated. All possible threats are computed. Trades are established with robots in other parts of the multiverse to create copies of its diploma.

Note: If you think this is crazy, then you either haven’t read the appropriate literature, written by Eliezer Yudkowsky, or you are a permanent idiot. There is no other option, sorry!


Framed in terms of nanofactories, here is my understanding of a scenario imagined by the Machine Intelligence Research Institute (MIRI), in which an artificial general intelligence (AGI) causes human extinction:

Terminology: A nanofactory uses nanomachines (resembling molecular assemblers, or industrial robot arms) to build larger atomically precise parts.


(1) The transition from benign and well-behaved nanotechnology, to full-fledged molecular nanotechnology, resulting in the invention of the first nanofactory, will be too short for humans to be able to learn from their mistakes, and to control this technology.

(2) By default, once a nanofactory is started, it will always consume all matter on Earth while building more of itself.

(3) The extent of the transformation of Earth cannot be limited. Any nanofactory that works at all will always transform all of Earth.

(4) The transformation of Earth will be too fast to be controllable, or to be aborted. Once the nanofactory has been launched, everything is being transformed.

To be proved: We need to make sure that the first nanofactory will protect humans and human values.

Proof: Suppose 1-4, by definition.


(5) In order to survive, we need to figure out how to make the first nanofactory transform Earth into a paradise, rather than copies of itself.

Notice that you cannot disagree with 5, given 1-4. It is only possible to disagree with the givens, and to what extent it is valid to argue by definition.

I am not claiming that MIRI argues solely by definition. But making inferences about the behavior of real-world AGI based on uncomputable concepts such as expected utility maximization comes very close. And trying to support such inferences with statements about the vastness of mind design space does not change much, since the argument ignores the small and relevant subset of AGIs that are feasible and likely to be invented by humans.

Here is my understanding of what MIRI argues:

Suppose that a superhuman AGI, or an AGI that can make itself superhuman, critically relies on 999 modules. Respectively, 999 problems have to be solved correctly in order to create a working AGI.

There is another module labeled <goal>, or <utility function>. This <goal module> controls the behavior of the AGI.

Humans will eventually solve these 999 problems, but will create a goal module that does not prevent the AI from causing human extinction as an unintended consequence of its universal influence.

Notice the foregone conclusion that you need to prevent an AGI from killing everyone. The assumption is that killing everyone is what AGIs do by default. Further notice that this behavior is not part of the goal module that supposedly controls the AGI’s behavior, but rather assumed to be a consequence of the 999 modules on which an AGI critically depends.

Analogous to the nanofactory scenario outlined above, an AGI is assumed to always behave in a way that will cause human extinction, based on the assumption that an AGI will always exhibit an unbounded influence. And from this the conclusion is drawn that it is only possible to prevent human extinction by directing this influence in such a way that it will respect and amplify human values. It is then claimed that the only possibility to ensure this is by implementing a goal module that either contains an encoding of all human values or a way to safely obtain an encoding of all human values.

Given all of the above, you cannot disagree that it is not too unlikely that humans will eventually succeed at the correct implementation of the 999 modules necessary to make an AGI work, while failing to implement the thousandth module, the goal module, in such a way that the AGI will not kill us, since relative to the information-theoretic complexity of an encoding of all human values, the 999 modules are probably easy to get right.
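The structure of that claim can be put in toy numbers. The probabilities below are mine, chosen purely for illustration, not anything MIRI asserts: suppose each of the 999 "easy" modules is implemented correctly with probability 0.999, while the goal module, being as hard as encoding all human values, is gotten right with probability 0.1.

```python
# Toy illustration (my numbers, not MIRI's): if the 999 capability modules
# are easy and the goal module is hard, then conditional on the AGI working
# at all, failure of the goal module dominates the outcome.

p_module = 0.999   # assumed success probability of each easy module
p_goal = 0.1       # assumed success probability of the hard goal module

p_capabilities = p_module ** 999              # all 999 easy modules right
p_doom = p_capabilities * (1 - p_goal)        # working AGI, wrong goals
p_safe = p_capabilities * p_goal              # working AGI, right goals

print(round(p_capabilities, 3))  # ≈ 0.368: a working AGI is quite plausible
print(p_doom > p_safe)           # True: doom dominates, by construction
```

With these assumptions a working AGI with a broken goal module is nine times as likely as a working AGI with a correct one, which is exactly the conclusion the scenario was set up to yield.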

But this is not surprising, since the whole scenario was designed to yield this conclusion.

