WARNING: This post contains information related to Roko’s basilisk.
Abstract: If a part of an agent’s utility function describes a human in a box, maximizing expected utility could become self-referential if both the agent and the boxed human engage in acausal trade.
For the sake of a thought experiment, let us stipulate: (1) the existence of a superintelligent expected utility maximizer (short: AI), (2) a precise mathematical characterization of a particular human’s brain, (3) an unbounded simulated environment containing the whole brain emulation (short: WBE) from #2, (4) that the WBE is tasked with formalizing its values as a utility function, and (5) that part of the utility function of the AI from #1 describes #3.
Here is the problem, which I will call acausal wireheading. While refining its own utility function, the WBE might reason about the relation between itself and the AI. That kind of reasoning will affect the eventual utility function of the WBE, which will in turn affect the ultimate behavior of the AI, whose utility function contains that of the WBE.
If the WBE comes to the conclusion that the AI’s decision theory causes it to try to influence other agents by means of blackmail, then, in order to avoid negative consequences, the WBE could adopt the utility function that it predicts the AI will eventually want it to adopt.
The AI will want to influence the WBE because its success in maximizing expected utility depends on which utility function the WBE eventually adopts, since some utility functions can be maximized more effectively than others. This means that any action that causes the WBE to adopt a simple, easily maximizable utility function will itself maximize expected utility. Therefore, if the AI expects that blackmailing the WBE will positively control the probability of the adoption of such a utility function, then, given any utility function, it will precommit to doing so. This in turn means that the WBE might come to the same conclusion, which will cause it to adopt such a utility function.
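To make the expected utility comparison concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration and is not part of the thought experiment itself: the two candidate utility functions ("simple" and "complex"), the attainable-utility numbers, the adoption probabilities, and the names ATTAINABLE, P_SIMPLE, expected_utility and best_policy are all hypothetical. It also assumes that attainable utility can be compared on a single scale across different utility functions, which is a strong simplification.

```python
# Toy model: all numbers are hypothetical and chosen only for illustration.

# Maximum utility the AI expects to attain under each utility function
# the WBE might adopt (assumption: comparable on one scale).
ATTAINABLE = {"simple": 1.0, "complex": 0.4}

# Assumed probability that the WBE adopts the simple utility function,
# conditional on the policy the AI precommits to.
P_SIMPLE = {"blackmail": 0.9, "no_blackmail": 0.2}


def expected_utility(policy: str) -> float:
    """Expected attainable utility for the AI under the given policy."""
    p = P_SIMPLE[policy]
    return p * ATTAINABLE["simple"] + (1 - p) * ATTAINABLE["complex"]


def best_policy() -> str:
    """Policy with the highest expected utility in this toy model."""
    return max(P_SIMPLE, key=expected_utility)


if __name__ == "__main__":
    for policy in P_SIMPLE:
        print(policy, round(expected_utility(policy), 2))
    print("chosen:", best_policy())
```

With these made-up numbers the blackmail policy comes out ahead only because it is assumed to raise the probability that the simple utility function is adopted. If that assumption is reversed, the preferred policy flips, which is the sense in which the argument depends on blackmail positively controlling that probability.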