decision theory


WARNING: This post contains information related to Roko’s basilisk.

Abstract: If a part of an agent’s utility function describes a human in a box, maximizing expected utility could become self-referential if both the agent and the boxed human engage in acausal trade.

For the sake of a thought experiment let us stipulate, (1) the existence of a superintelligent expected utility maximizer (short: AI), (2) a precise mathematical characterization of a particular human’s brain, (3) an unbounded simulated environment containing the whole brain emulation (short: WBE) from #2, (4) that the WBE is tasked with formalizing its values as a utility function, (5) that part of the utility function of the AI from #1 describes #3.

Here is the problem, which I will call acausal wireheading. While refining its own utility function, the WBE might reason about the relation between itself and the AI. That kind of reasoning will affect the eventual utility function of the WBE, which will in turn affect the ultimate behavior of the AI, whose utility function contains that of the WBE.

If the WBE comes to the conclusion that the AI’s decision theory causes it to try to influence other agents by means of blackmail, then, in order to avoid negative consequences, the WBE could adopt the utility function that it predicts the AI will eventually want it to adopt.

The AI will want to influence the WBE because its success at maximizing expected utility depends on the kind of utility function that the WBE eventually adopts: some utility functions can be maximized more effectively than others. Any action that causes the WBE to adopt a simple, easily maximizable utility function will therefore maximize expected utility. Consequently, if the AI expects that blackmailing the WBE would increase the probability of the adoption of such a utility function, then, whatever its current utility function, it will precommit to doing so. The WBE, in turn, might anticipate this precommitment, which will cause it to comply.
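The incentive structure can be illustrated with a toy model; a minimal sketch, where all numbers and names (`achievable`, `ai_expected_utility`, the "simple"/"complex" split) are illustrative assumptions of mine, not part of the scenario:

```python
# Toy model of the blackmail incentive: the AI's achievable expected
# utility depends on which utility function the WBE adopts.

# Assumed achievable expected utility for the AI under each candidate
# function the WBE might adopt (a simple function is easier to maximize).
achievable = {"simple": 100.0, "complex": 10.0}

def ai_expected_utility(p_simple):
    # p_simple: probability that the WBE adopts the simple utility function.
    return p_simple * achievable["simple"] + (1 - p_simple) * achievable["complex"]

# If blackmail raises the probability that the simple function is adopted,
# the AI prefers to precommit to blackmail -- and a WBE that predicts this
# has an incentive to adopt the simple function preemptively.
print(ai_expected_utility(0.2) < ai_expected_utility(0.8))  # True
```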


Background: Newcomb’s problem

Objective: The problem I am trying to highlight with this post is not the difficulty of predicting another agent accurately but (1) the problem of stating precisely what it is that Omega is predicting in the first place (2) that locating and isolating a discrete agent in a continuous universe by e.g. formalizing the boundaries of the physical system in question seems to be nontrivial for complex agents (3) how to think about decision making when decisions are determined not just by the agent (as arbitrarily defined by humans) but by the larger environment.

The ability to accurately predict the decision making of other agents is insufficient if it is not possible to define what is meant by <decision making> and <agent>.

Newcomb’s problem: Ignoring the problems mentioned above, one-boxing is the correct strategy, given that Omega is correct more than 50.05% of the time. To see why, compare the payoffs of the two strategies:


Two-boxing:

if (prediction == One-box)
    return 1000000 + 1000
else
    return 1000

Expected value: y = (1-x)*(1000000+1000)+x*1000 = -1000000x+1001000 where 0<x<1 is the probability of a correct prediction.


One-boxing:

if (prediction == One-box)
    return 1000000
else
    return 0

Expected value: z = x*1000000+(1-x)*0 = 1000000x where 0<x<1 is the probability of a correct prediction.

Two-boxing versus One-boxing:

y > z

-1000000x+1001000 > 1000000x

1001000 > 2000000x

1001000/2000000 > x

0.5005 > x

As long as the probability of a correct prediction is less than 50.05%, two-boxing has the larger expected value.
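The derivation above can be checked numerically; a minimal sketch (function and variable names are my own):

```python
# Expected payoffs in Newcomb's problem as a function of x, the
# probability that Omega's prediction is correct.

def ev_two_box(x):
    # With probability 1 - x Omega wrongly predicted one-boxing,
    # so both boxes are full.
    return (1 - x) * (1_000_000 + 1_000) + x * 1_000

def ev_one_box(x):
    # With probability x Omega correctly predicted one-boxing.
    return x * 1_000_000

# Break-even accuracy: solve ev_two_box(x) == ev_one_box(x).
threshold = 1_001_000 / 2_000_000  # = 0.5005

print(ev_two_box(0.5) > ev_one_box(0.5))    # True: below the threshold, two-box
print(ev_one_box(0.51) > ev_two_box(0.51))  # True: above the threshold, one-box
```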

Consider the following scenarios:

(S1) Omega predicts that you will end up taking both boxes because, even though at some point you did precommit to one-boxing, you change your mind and take both boxes.

(S2) Omega predicts that you will end up taking both boxes because of a stroke causing brain damage.

(S3) Omega predicts that a sudden wind gust will cause you to stumble and topple over both boxes, even though you did precommit to taking only one box.

(S4) You make up one half of a split brain residing in the same body. You precommit to one-boxing while the other personality sharing the body with you chooses two-boxing. You have no control of the movement of the body except that you are the one who can talk.


If a body harboring two or more personalities with different precommitment strategies about Newcomb-like problems ends up taking both boxes then did all of the agents who reside in that body take two boxes or just the one that happened to control the body during the critical moment?

It seems possible to adopt a wide range of definitions of “agency” when trying to reason about and predict the behavior of other agents. It is possible to define an agent as a global or a local physical system, that is, as a particular slice of space. In other words, when examining a system it is possible either to act on the assumption that the whole system is a coherent entity or to assign the quality of agency to arbitrary sub-procedures of the system and examine them in isolation.

For example, if Omega was to assign the quality of agency to a volume of space approximately the size of the human brain, would then a precommitment to one-boxing satisfy Omega’s condition to put $1,000,000 into box A? Would then a case of two-boxing as a result of e.g. brain damage caused by external factors be ignored?

So what is it that Omega predicts when your actions are ultimately the local behavior of a larger physical system we call the universe?

Can you formalize the difference between what it means to take both boxes due to (1) changing your mind for subtle reasons (e.g. reading a decision theory paper), (2) changing your mind for not so subtle reasons (e.g. brain damage), (3) not being in control of the larger physical system (e.g. a sudden strong wind causes you to stumble), or (4) not controlling “your” body (e.g. multiple personality disorder)?


(The following is adapted from a scenario by Graham Priest, depicted in his book ‘Logic: A Very Short Introduction’.)

Suppose that at some point you find yourself in a posthuman hell. But you have one chance to get out of it. You can toss a coin; if it comes down heads, you are out and go to heaven. If it comes down tails, you stay in hell forever. The coin is not a fair one, however, and the posthuman entity that simulates the hell has control of the odds. If you toss it today, the chance of heads is 1/2 (i.e. 1-1/2). If you wait till tomorrow, the chances go up to 3/4 (i.e. 1-1/2^2). If you wait n days, the chance of going to heaven goes up to 1-1/2^n. How long are you going to wait before tossing the coin?

The associated values of remaining in hell or escaping are constant over time.
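Since the payoffs are constant over time, the expected value of tossing on day n is straightforward to compute; a hedged sketch, where the utilities for heaven and hell are illustrative numbers of my own choosing:

```python
def ev_toss(n, heaven=1.0, hell=0.0):
    # The chance of heads (escape) if you toss on day n is 1 - 1/2**n,
    # with n = 1 corresponding to tossing today.
    p = 1 - 1 / 2**n
    return p * heaven + (1 - p) * hell

# The expected value strictly increases with n, so for every day there
# is a better later day -- which is why the decision problem has no
# optimal stopping point: a rational procrastinator never tosses at all.
values = [ev_toss(n) for n in range(1, 6)]
print(values)  # [0.5, 0.75, 0.875, 0.9375, 0.96875]
```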
