LessWrong user RobBB posted what he calls a mixtape of blog posts to introduce people to the dangers of artificial superintelligence (short: AI risk).

For my own introduction to AI risk see here.

(1) Power of Intelligence, (9) Plenty of Room Above Us

Response: (1) superhuman intelligence is not the same as superapish intelligence (2) it is far from clear that intelligence is a decisive factor in a war between AI and humanity (3) current AI is pathetic and far from human-level AI.

(2) Ghosts in the Machine, (11) Basic AI drives

Response: People read my posts about how AI is much less of a risk than other people want them to believe, and this is one of the top three initial reactions:

“But according to Omohundro there will be certain AI Drives which will cause human extinction, no matter what goal the AI has.”

And where would these drives come from? Terminal and instrumental goals are orthogonal. An artificial intelligence can have any combination of terminal goals and instrumental goals. In other words, more or less any terminal goal implies infinitely many sets of instrumental goals.

There is a tendency to imagine that an AI will be pulled at random from mind design space. This ignores how real-world AI is actually developed, and that virtually all AI is constantly improved to be better at understanding and doing what humans want.

AI is much harder than people instinctively imagined, exactly because there is no relevant difference between goals and capabilities in artificial intelligence. To beat humans you have to define “winning”.

This doesn’t mean you program in every decision explicitly. Any general intelligence will have to be able to hit very small targets in large and unstructured spaces. Any superhuman AI will eventually be better at understanding what humans want it to do than humans themselves. AI risk advocates in turn base their ideas on what can be called the fallacy of dumb superintelligence.

(3) Artificial Addition

Response: Either general intelligence requires one conceptual breakthrough or many small incremental breakthroughs. And I don’t know of any good reason to believe that e.g. the ability to generate novel and useful mathematics can be captured by a set of rules that are both simple and efficient. 

What is useful and interesting depends on the context. In other words, the context defines what constitutes winning.  And since you cannot guess the context, you won’t be able to implement a simple and efficient rule that outputs <success> given any arbitrary context.

(4) Adaptation-Executers, not Fitness-Maximizers

Response: I wasted time reading this post.

(5) The Blue-Minimizing Robot

Response: Any behavior-executor can be framed as a utility-maximizer and vice versa. Your robot will only try to prevent you from messing with it if you programmed it to do so. In other words, no AI is going to be an existential risk as long as you did not explicitly make it one.
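The equivalence is easy to make concrete. Here is a minimal sketch (my own illustration, not code from the post): a fixed behavior rule, and the same rule reframed as maximizing a utility function that simply rewards doing whatever the rule prescribes.

```python
# A behavior-executing "robot": a fixed rule mapping states to actions.
def policy(state):
    return "fire_laser" if state == "blue_object_ahead" else "move_forward"

# The same robot reframed as a utility maximizer: a utility function
# that assigns 1 to whatever the rule would have done, 0 otherwise.
def utility(state, action):
    return 1 if action == policy(state) else 0

def maximize(state, actions=("fire_laser", "move_forward")):
    return max(actions, key=lambda a: utility(state, a))

# Both framings produce identical behavior in every state.
for s in ("blue_object_ahead", "nothing_ahead"):
    assert maximize(s) == policy(s)
```

The reframing is vacuous on its own: nothing about calling the robot a "utility maximizer" adds a drive to resist interference unless such behavior was put in.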

(6) Optimization and the Singularity, (7) Efficient Cross-Domain Optimization

Response: Evolution was able to come up with cats. Cats are immensely complex objects. Evolution did not intend to create cats. Now consider you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks, instead of cats. How would it do this?

Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator.

The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis.

Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.
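The analogy can be made concrete with a toy route optimizer (a sketch of my own, with made-up distances): Dijkstra's algorithm happily optimizes over any well-defined road network, but a query involving cities that are not part of the problem definition simply has no answer.

```python
from heapq import heappush, heappop

# A toy road network (hypothetical distances). The optimizer is only
# as good as this problem definition.
roads = {
    "Berlin":  {"Hamburg": 290, "Munich": 585},
    "Hamburg": {"Berlin": 290, "Munich": 775},
    "Munich":  {"Berlin": 585, "Hamburg": 775},
}

def shortest_route(graph, start, goal):
    # Dijkstra's algorithm: the optimizing is the easy, mechanical part.
    queue, seen = [(0, start, [start])], set()
    while queue:
        dist, node, path = heappop(queue)
        if node == goal:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, d in graph.get(node, {}).items():
            heappush(queue, (dist + d, neighbor, path + [neighbor]))
    return None  # the problem, as defined, has no solution

print(shortest_route(roads, "Berlin", "Munich"))         # well-defined query
print(shortest_route(roads, "Gotham City", "Atlantis"))  # undefined: None
```

All the difficulty of the Gotham-to-Atlantis trip lies in the missing problem definition, not in the search procedure.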

Humans in turn receive constant feedback on what to optimize by a cultural and evolutionary process. There is no simple way to automate that.

(8) The Design Space of Minds-In-General

Response: The only relevant AIs are those which are designed by humans. And such AIs should be expected to be better at doing what humans want, because they are the improved successors of previous generations of AIs which were doing what humans wanted. For more on this, see here.

(10) The True Prisoner’s Dilemma

Response: I do not have the time and background knowledge to comment on any possible relation to AI risks at this point in time.

(12) Anthropomorphic Optimism

Response: I did not read the post since it did not seem to be relevant, and I already wasted more time on this than I now feel comfortable about.

(13) The Hidden Complexity of Wishes, (14) Magical Categories

Response: Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI was going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? Do you believe that the following scenario could persuade the gatekeeper:

Gatekeeper: What would you do if I asked you to minimize suffering?

AI: I will kill all humans.

I don’t think so.

So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by “minimize human suffering”, then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possibly convince anyone to set it free, let alone take over the world?

Here is what I want AI risk advocates to show,

(1) natural language request -> goal(“minimize human suffering”) -> action(negative utility outcome)

(2) natural language query -> query(“minimize human suffering”) -> answer(“action(positive utility outcome)”).

Point #1 is, according to AI risk advocates, what is supposed to happen if I supply an artificial general intelligence (AGI) with the natural language goal “minimize human suffering”, while point #2 is what is supposed to happen if I ask the same AGI, this time caged in a box, what it would do if I supplied it with the natural language goal “minimize human suffering”.

Notice that if you disagree with point #1 then that AGI does not constitute an existential risk given that goal. Further notice that if you disagree with point #2, then that AGI won’t be able to escape its prison to take over the world and would therefore not constitute an existential risk.
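The structure of the disagreement can be sketched in code (a deliberately simplistic illustration of my argument, assuming, as I do above, that both paths share a single interpretation step):

```python
# One shared interpretation step feeds both the acting path (#1) and
# the answering path (#2). The hypothetical interpret() output below is
# my assumption of what a capable AGI would resolve the request to.
def interpret(request):
    return "reduce suffering without killing anyone"

def act_on_goal(request):
    return f"execute: {interpret(request)}"    # path #1: supplied as goal

def answer_query(request):
    return f"I would: {interpret(request)}"    # path #2: asked in the box

# If interpretation is shared, the action taken and the plan stated
# cannot systematically contradict each other.
req = "minimize human suffering"
assert interpret(req) in act_on_goal(req)
assert interpret(req) in answer_query(req)
```

On this picture, claiming a perverse outcome for path #1 and a persuasive, sensible answer for path #2 requires positing two different interpreters inside one agent.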

You further have to show,

(1) how such an AGI is a probable outcome of any research conducted today or in future


(2) the decision procedure that leads the AGI to act in such a way.


Response: I am not going to read posts 15-20 because the previous posts were already unconvincing and I don’t expect those other posts to make any difference. I also have better things to do.


If you believe that an artificial general intelligence is able to comprehend its own algorithmic description to such an extent as to be able to design improved versions of itself, then you must believe that it is in principle possible for an agent to mostly understand how it functions. Which in turn means that it should be in principle possible to amplify human capabilities to such an extent as to enable someone to understand and directly perceive their own internal processes and functions.

What would it mean for a human being to have nearly perfect introspection? Or more specifically, what would it mean for someone to comprehend their hypothetical algorithmic description to such an extent that their own actions could be interpreted and understood in terms of that algorithmic description? Would it be desirable to understand oneself sufficiently well, to be able to predict and interpret one’s actions in terms of a mechanistic internal self-model?

Such an internal self-model would allow you to understand your consciousness, and states such as happiness or sadness, as what they are: purely mechanistic and predictable procedures.

Intracranially self-stimulating rat.

How will such insight affect a being with human values?

Humans value novelty and become bored of tasks that are dull. Boredom is described as a response to a moderate challenge for which the subject has more than enough skill. Which means that once you cross an intelligence threshold where your own values start to appear dull, you will become bored of yourself.

You would understand that you are a robot, a function whose domain is internal and external forces and whose range is the robot’s internal states and actions. Your near-total internal understanding would render any conversation a trivial and dull game, on a par with watching two machines play Pong or Rock-paper-scissors. You would still be able to experience happiness, but you would now also perceive it to be conceptually no more interesting than an involuntary muscle contraction.

Perfect introspection would reduce the previously incomprehensible complexity of your human values to a conceptually simplistic and transparent set of rules. Such insight would expose your behavior as what it is: the stimulation of your reward or pleasure center. Where before life seemed inscrutable, it would now appear to be barely more interesting than a rat pressing a lever in order to receive a short electric stimulation of its reward center.

What can be done about this? Nothing. If you value complexity and novelty then you will eventually have to amplify your own capabilities and intelligence. Which will ultimately expose the mechanisms that drive your behavior.

You might believe that there will always be new challenges and problems to solve. And this is correct. But you will perfectly grasp the nature of problem solving itself. Discovering, proving and incorporating new mathematics will, like everything else you do, be understood as a mechanical procedure that is being executed in order to feed your reward center.

The problem is thus that understanding happiness, and how to mechanically maximize what makes you happy, such as complexity and novelty, will eventually cause you to become bored with those activities in the same sense that you would now quickly become bored with watching cellular automata generate novel music.


Premise 1: There exists a procedure (P1) that can compute optimal creativity and an optimal experience of fun.

Justification: If artificial general intelligence and whole brain emulation is possible then this implies that it is possible to capture creativity and experiences such as fun in a purely mechanical, algorithmic fashion.

Premise 2: There exists a procedure (P2) for which it is possible to perfectly comprehend P1, in the same sense that it is possible for humans to comprehend the rules of Tic-tac-toe.

Justification: If it is possible for an artificial general intelligence or whole brain emulation to improve itself considerably then this implies that it is possible for those agents to understand themselves sufficiently.

Tic Tac Toe

From the subjective viewpoint of P1, being computed is fun and creative. I will label this view, in function notation, as inside_view(P1). Or, in other words, how an algorithm feels from inside.

From the subjective viewpoint of P2, being computed means to perfectly understand what P1 is doing and how it is doing it. I will call this function outside_view(P1).
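The two functions can be loosely illustrated in code (my own toy example; the stand-in procedure is arbitrary): running a procedure versus possessing a complete mechanical description of it.

```python
import dis
import io

def P1():
    """A procedure whose computation is, from the inside, fun and creative."""
    return sum(i * i for i in range(10))  # stand-in for the creative work

def inside_view(procedure):
    # inside_view(P1): what it is like to *be* the computation
    return procedure()

def outside_view(procedure):
    # outside_view(P1): a complete mechanical description of P1,
    # comprehensible the way the rules of Tic-tac-toe are
    buf = io.StringIO()
    dis.dis(procedure, file=buf)
    return buf.getvalue()

print(inside_view(P1))    # the result of being computed
print(outside_view(P1))   # the rules, without running them
```

The bytecode listing is of course only a stand-in for "perfect comprehension"; the point is that the outside view is a finite, inspectable rule table, however the inside view may feel.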

Premise 3: A human being (possibly given a hypothetical intelligence amplification) could incorporate P2. I will label this function human_P2().

What value would human_P2() assign to the computation of P1? I will label the computation of P1 compute(P1).

human_P2(compute(P1)) =

(1) Uninteresting (dull). Similarly to computing all possible games of Tic-tac-toe.

(2) Intrinsically valuable. The more resources are used to compute P1, the better.

What I perceive to be problematic is #2. What differences would it make to run P1, (1) once (2) N times (3) not at all?

Personally I assign little value to the repeated computation of something that I already understand thoroughly. Which does not mean that the algorithm itself would share my perception. But why should I care about that? As long as suffering has been eradicated, what difference would it make if the whole universe was used to compute an uninteresting algorithm (outside view) compared to a universe that does nothing in particular?

There are two possibilities:

(1) I could observe the computation of P1 from the outside (possibly until the heat death of the universe).

(2) I could turn myself into P1 and experience fun and creativity.

Why would I care about either 1 or 2 if I completely understand those possibilities and don’t expect any surprises that are conceptually more interesting than coming across a Feynman point?


The thought experiment:

Jürgen is a brilliant artificial general intelligence (AGI) researcher who investigates the world from a purely mathematical point of view.

He specializes in a formal theory of fun and creativity and acquires, let us suppose, all the physical information there is to obtain about what goes on when we enjoy music, or create art, and utter sentences like ‘This is fun!’, ‘I feel great!’,  ‘This music is very emotive!’, and so on.

He discovers, for example, just which musical pieces are more interesting or aesthetically rewarding than others, and exactly how this can be mechanistically described and produced by a computable algorithm that results in the creation and perception of optimally appealing art and music.

What will happen when Jürgen eventually computes the algorithm? Will he learn anything or not?


The past two posts – The value of philosophy in a universe ruled by a friendly AI and Utopia is dull – have been written in order to analyze the expected value of living in a world after a benevolent technological singularity took place.

The expected value of the event of a technological singularity itself is distinct from the value of the time between this event and the heat death of the universe. Given that a positive technological singularity could end all suffering, it is intrinsically valuable to achieve it.

Jürgen, our AGI researcher, has two options:

(1) Jürgen could decide that the discovery of an algorithm that can produce optimal fun and creativity makes any human attempt to have fun, and to be creative, futile and of no additional value.

(2) Jürgen could decide that it would be valuable to turn the whole universe into a computational substrate computing his algorithm in order to maximize fun and creativity.

With which option you agree partly depends on the answer to the thought experiment outlined above. Will Jürgen learn anything new from computing the algorithm?

(1) Given that you believe that Jürgen will not learn anything new from computing his algorithm, what difference is there between a universe that contains his algorithm, and a universe that computes the algorithm as often as possible? In other words, Jürgen solved and proved a mathematical problem. What value is there in solving and proving it over and over again?

(2) Given that you believe that Jürgen will learn something from computing his algorithm, then once his algorithm computed an optimal, or nearly optimal, result, what difference would it make to compute it N times?

Again, the question here is not about the value of discovering such an algorithm. The question is not even about computing it once. The question is about the expected value of living in a universe where such an algorithm already exists.

To clarify the above, consider a different algorithm. Let’s call it Much-Better-Life Simulator™. Running Much-Better-Life Simulator™ is equivalent to living the most enjoyable life a human being could ever experience.

What difference would it make to run Much-Better-Life Simulator™ (1) once (2) N times (3) not at all? What do you estimate is the expected value of 1, 2 and 3? And how confident are you about that estimate? Can you explain what difference it makes?

More specifically, consider the value humans assign to music and art. As described in my previous posts, the value of creating music and art will be diminished by (1) the instant availability and integrability of the best possible and perceptible permutations of art and music (2) a perfectly understood, integrable and implementable mechanistic algorithm which can yield the most emotive and appealing music and art that is provably possible. In other words, anything you could ever achieve has already been achieved in the best possible way when that algorithm had been discovered.

But what about enjoying art and music? As human enjoyment is perfectly understood as well, it will be possible to generate an optimal experience of either listening to music, enjoying art, or composing and creating it. All other permutations will be provably less desirable. There will be exactly one perfect experience of either enjoying or creating music and art.

You could either integrate such an experience, as if it has already happened, or run a simulation. Afterwards you could run all less desirable permutations of it, or run it over and over again. Which raises the question of what difference it makes to have a universe in which all matter is converted in order to be excited, compared to a universe where you perfectly understand what excitement is, but choose not to compute it?


Imagine that, after your death, you were cryogenically frozen and eventually resurrected in a benevolent utopia ruled by a godlike artificial intelligence.

Naturally, you desire to read up on what has happened after your death. It turns out that you do not have to read anything, but merely desire to know something and the knowledge will be integrated as if it had been learnt in the most ideal and unbiased manner. If certain cognitive improvements are necessary to understand certain facts, your computational architecture will be expanded appropriately.

You now perfectly understand everything that has happened and what has been learnt during and after the technological singularity, that took place after your death. You understand the nature of reality, consciousness, and general intelligence.

Concepts such as creativity or fun are now perfectly understood mechanical procedures that you can easily implement and maximize, if desired. If you wanted to do mathematics, you could trivially integrate the resources of a specialized Matrioshka brain into your consciousness and implement and run an ideal mathematician.

But you also learnt that everything you could do has already been done, and that you could just integrate that knowledge as well, if you like. All that is left to be discovered is highly abstract mathematics that requires the resources of whole galaxy clusters.

So you instead consider exploring the galaxy. But you instantly become aware that the galaxy is unlike the way it has been depicted in old science fiction novels. It is just a wasteland, devoid of any life. There are billions of barren planets, differing from each other only in the most uninteresting ways.

But surely, you wonder, there must be fantastic virtual environments to explore. And what about sex? Yes, sex! But you realize that you already thoroughly understand what it is that makes exploration and sex fun. You know how to implement the ideal adventure in which you save people of maximal sexual attractiveness. And you also know that you could trivially integrate the memory of such an adventure, or simulate it a billion times in a few nanoseconds, and that the same is true for all possible permutations that are less desirable.

You realize that the universe has understood itself.

The movie has been watched.

The game has been won.

The end.

A quote from the novel Ventus, by Karl Schroeder:

The view was breathtaking. From here, beyond the orbit of Neptune, Axel could see the evidence of humanity’s presence in the form of a faint rainbowed disk of light around the tiny sun. Scattered throughout it were delicate sparkles, each some world-sized Dyson engine or fusion starlette. Earth was just one of a hundred thousand pinpricks of light in that disk. Starlettes lit the coldest regions of the system, and all the planets were ringed with habitats and the conscious, fanatical engines of the solarforming civilization. This was the seat of power for the human race, and for many gods as well. It was ancient, implacably powerful, and in its trillions of inhabitants harbored more that was alien than the rest of the galaxy put together.

Axel hated the place.


If he shut his eyes he could open a link to the outer edge of the inscape, the near-infinite datanet that permeated the Archipelago. He chose not to do this.


“Isn’t it marvellous?” she said as she came to stand next to him. “I have never been here! Not physically, I mean.” She was dressed in her illusions again, today in a tiny whirlwind of strategically timed leaves: Eve in some medieval painter’s fantasy.

“You haven’t missed much,” he said.

Marya blinked. “How can you say that?” She went to lean on the window, her fingers indenting its resilient surface. “It is everything!”

“That’s what I hate about it.” He shrugged. “I don’t know how people can live here, permanently linked into inscape. All you can ever really learn is that everything you’ve ever done or thought has been done and thought before, only better. The richest billionaire has to realize that the gods next door take no more notice of him than he would a bug. And why go explore the galaxy when anything conceivable can be simulated inside your own head?”


For the sake of the argument, suppose that AI risk advocates succeed at implementing an artificial general intelligence that protects and amplifies human values (friendly AI).

Such a friendly AI (FAI) would have to (1) disallow any entity smarter than itself that isn’t provably friendly (2) know exactly what humans value and how to protect and amplify those values in a way that humans desire.

How valuable would such an outcome be? Let’s look at a specific human value and its expected value in the context of a universe ruled by such an FAI. Let’s look at doing philosophy.

I can see two possibilities,

(1) The FAI had to solve all of philosophy in order to do its job.

(2) The FAI did not have to solve philosophy but would in principle be capable of doing so.

Given either possibility, how much would humans value doing philosophy if all interesting questions either had already been answered or could easily be answered by the FAI?

That partly depends on whether it would be possible to just ask the FAI for any answer. But why would that not be possible? There seem to be two answers,

(1) The FAI learnt that humans don’t want it to answer such questions.

(2) The FAI was programmed to not answer such questions.

The first possibility seems to imply that humans want to figure out philosophy in a certain way, which does not include just asking for an answer or looking it up. But how likely is this possibility? How many philosophers would desire that the Stanford Encyclopedia of Philosophy would not exist so that they could figure out all of it on their own?

The second possibility is itself problematic. In a universe ruled by an FAI, artificial general intelligence and friendly AI have obviously been solved. Which means that people could either desire the FAI to alter itself in such a way that it would be able to answer such questions, or implement a less capable version that can answer philosophy questions. And if that isn’t allowed, which would mean that pretty much the whole field of machine learning would be forbidden, then people could just ask the FAI to improve them in such a way as to make them capable of easily solving any philosophical puzzle.

To recapitulate the situation: given any human intellectual activity, not just philosophy, in a universe controlled by an FAI it should be possible to either,

(1) Directly ask the FAI for an answer to any question.

(2) Implement a superintelligence that could answer those questions.

(3) Ask to have your cognitive abilities improved in such a way as to easily answer those questions.

Whether or not the above possibilities are allowed, in either case a wide range of human values would be dramatically reduced. Because either all human intellectual activity becomes as trivial as asking a question, or humans are forever stuck with the mental capabilities they have been equipped with by evolution, while being forbidden to create another intelligence more capable than themselves.

The only way out that I can imagine is to choose ignorance. To ask the FAI to be oblivious of its existence and of how to create an FAI. But who would desire that? Who would desire to forever fail at solving philosophy, amplifying human intelligence, or to create an artificial one? I would certainly hate not to know the truth, to be forever fooled.


Just playing around with GIMP a little bit. And yes, that’s a Matrix plug 🙂

Alexander Kruel, 2029


I made 3 changes to the RationalWiki entry for LessWrong today:

(1) I removed the sentence “…although in 10 years nothing has been published in a peer reviewed journal.”, because it turned out to be factually incorrect.

(2) I changed the following two sentences,

A disengagement from the practical, beyond self-improvement, is another feature of LessWrong’s culture, explicitly and strongly affirmed.[24] This refusal to delve into contemporary politics or policy is held up as laudable, because it is seen as a way to preserve objective rationality.

to:

LessWrong is mainly concerned with achieving accurate beliefs about the world, rather than achieving goals. The refusal to delve into contemporary politics or policy is held up as laudable, because it is seen as a way to preserve objective rationality.[24]

(3) I changed the sentence “Yudkowsky has also advocated total utilitarianism…” to “Yudkowsky has also advocated utilitarianism…”, because Yudkowsky claims to be an average utilitarian and the distinction between average and total utilitarianism was irrelevant in the context of the sentence.

What else would you change? But keep in mind that the entry is not an advertisement for LessWrong but rather a critical view from the outside.


Here is an interesting answer, posted on Quora, by Josh Siegle. His answer paraphrases some of what I tried to highlight in my post “Substrate Neutrality: Representation vs. Reproduction”.

Here are some quotes, starting with a comment by Josh Siegle from an ensuing discussion:

I’m saying that meat has properties and causal powers that algorithms do not. If the properties we’re talking about are mass, acidity, or opacity, this statement would be trivially true. A simulation of an apple will not weigh 0.1 kg, taste delicious, and appear red, although all those properties could be represented. Claiming that awareness is somehow different—that it would be present in the simulation—suggests that it is not part of the physical world. This leads very quickly down the path to a dualistic separation between the mental and the physical, which I imagine is exactly what you’re trying to avoid.

Quotes from the original answer:

First of all, imagine that the book in the room is a Chinese–English dictionary. When the Chinese characters come in, the man translates them into English, thinks of a reply, and translates that into Chinese. BOOM! The room now acts as though it understands Chinese, but does it actually? I don’t think we gain anything by saying that it does. It should be obvious that the true understanding lies in the person that wrote the dictionary, and the man in the room is just piggy-backing on this knowledge. If not, then I could claim that I understand every major language because I know how to use Google Translate.


If instead of a string of Chinese characters, the man received a string of ones and zeros encoding a visual scene, would the room be having its own, separate visual experience while the man moves some paper around and reads the ink that adorns it? People make it sound like Searle was bonkers for claiming that such subjective experience wouldn’t arise. But what makes you so certain that it would?

What Josh Siegle appears to be saying is that consciousness is, in some respect, similar to properties such as mass or wetness. In the same sense that you cannot extinguish a physical fire with simulated water, a digital computer will not possess the same sort of conscious understanding that humans do.

In his original answer, Josh Siegle wrote that it is a given that the Chinese room passes the Turing test. That is not being disputed. The claim is rather that human understanding is more delicate (qualitatively different) than e.g. the ability of a system made up of a human and Google Translate to understand various languages.

I consider this a relatively weak claim, but nonetheless something that should not be dismissed. Namely that one of the most important, and morally relevant, features of human understanding could be related to the hard problem of consciousness, and that consciousness is a property that is in some relevant respect similar to physical properties such as mass or wetness.

Consider the following. Knowing every physical fact about gold does not make us own any gold. A representation of the chemical properties of gold on a computer cannot be traded on the gold market, established as a gold reserve, or used to create jewelry. It takes a particle accelerator or nuclear reactor to create gold. No Turing machine can do the job.

There is nothing spooky about this. The point is that a representation is distinct from a reproduction. Only if you reproduce all relevant physical properties of e.g. water can it be considered water in the context of the physical world.

The evidence that consciousness requires a similarly detailed reproduction is our strong intuition that a person equipped with a Chinese–English dictionary does not possess the same understanding of Chinese as a person who actually “understands” Chinese.


A Turing machine (Rules table not represented).

Can you program a Turing machine in such a way that it would end up in a state mimicking all relevant physical properties of water, in order to drink it? It seems rather weird to claim that a device that manipulates symbols on a strip of tape could configure itself in such a way as to mimic water. In the same sense it would be really weird to look at a configuration of logic gates and proclaim, “This configuration of logic gates experiences pain!”.


A Turing machine can exhibit a certain number of states. None of those states can possibly correspond to a number of physical properties such as acidity or wetness. The Chinese room thought experiment highlights the intuition that none of the states of a Turing machine could mimic conscious understanding of Chinese.
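To make the premise concrete, here is a minimal Turing machine (my own sketch, not from the quoted discussion): its entire repertoire of "states" is a small rule table operating on symbols on a tape, and nothing else.

```python
# A minimal Turing machine. Each "state" is nothing but a row in a
# rule table; none of the rows is wet, acidic, or (the argument goes)
# conscious. "_" marks a blank tape cell.
def run_turing_machine(rules, tape, state="start"):
    cells, head = dict(enumerate(tape)), 0
    while state != "halt":
        symbol = cells.get(head, "_")
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells))

# Rule table that inverts a binary string, then halts at the blank.
rules = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run_turing_machine(rules, "1011"))  # -> "0100_"
```

Whatever one concludes about consciousness, everything such a machine ever does is exhausted by this kind of table lookup and symbol rewriting.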

Or to quote Magnus Vinding, “just like a computer will not have Wi-Fi if it does not have the hardware that supports it – regardless of what software we implement in it – a computer will not be conscious if it does not have the right hardware.”


WARNING: This post contains information related to Roko’s basilisk.

Abstract: If a part of an agent’s utility function describes a human in a box, maximizing expected utility could become self-referential if both the agent and the boxed human engage in acausal trade.

For the sake of a thought experiment let us stipulate, (1) the existence of a superintelligent expected utility maximizer (short: AI), (2) a precise mathematical characterization of a particular human’s brain, (3) an unbounded simulated environment containing the whole brain emulation (short: WBE) from #2, (4) that the WBE is tasked with formalizing its values as a utility function, (5) that part of the utility function of the AI from #1 describes #3.

Here is the problem, which I will call acausal wireheading. While refining its own utility function, the WBE might reason about the relation between itself and the AI. That kind of reasoning will affect the eventual utility function of the WBE, which will in turn affect the ultimate behavior of the AI, whose utility function contains that of the WBE.

If the WBE comes to the conclusion that the AI’s decision theory causes it to try to influence other agents by means of blackmail, then in order to avoid negative consequences the WBE could adopt a utility function that it predicts that the AI will eventually want it to adopt.

The AI will want to influence the WBE because its success at maximizing expected utility will depend on the kind of utility function that the WBE eventually adopts, since some utility functions can be maximized more effectively than others. This means that any action that causes the WBE to adopt a simple, easily maximizable utility function will itself maximize expected utility. Therefore, if the AI expects blackmailing the WBE to positively control the probability that such a utility function is adopted, then, given any utility function, it will precommit to doing so. Which in turn means that the WBE might come to the same conclusion. Which will cause the WBE to do so.


