Framed in terms of nanofactories, here is my understanding of a scenario imagined by certain AI risk advocates, in which an artificial general intelligence (AGI) causes human extinction:
Terminology: A nanofactory uses nanomachines (resembling molecular assemblers, or industrial robot arms) to build larger atomically precise parts.
(1) The transition from benign and well-behaved nanotechnology to full-fledged molecular nanotechnology, resulting in the invention of the first nanofactory, will be too short for humans to learn from their mistakes and to control this technology.
(2) By default, once a nanofactory is started, it will always consume all matter on Earth while building more of itself.
(3) The extent of the transformation of Earth cannot be limited. Any nanofactory that works at all will always transform all of Earth.
(4) The transformation of Earth will be too fast to be controllable, or to be aborted. Once the nanofactory has been launched, everything is being transformed.
To be proved: We need to make sure that the first nanofactory will protect humans and human values.
Proof: Suppose 1-4, by definition.
(5) In order to survive, we need to figure out how to make the first nanofactory transform Earth into a paradise, rather than copies of itself.
Notice that you cannot disagree with 5, given 1-4. It is only possible to disagree with the givens, and with the extent to which it is valid to argue by definition.
I am not claiming that certain AI risk advocates are solely arguing by definition. But making inferences about the behavior of real-world AGI based on uncomputable concepts such as expected utility maximization comes very close. And trying to support such inferences by making statements about the vastness of mind design space does not change much, since the argument ignores the small and relevant subset of AGIs that are feasible and likely to be invented by humans.
Here is my understanding of what those people argue:
Suppose that a superhuman AGI, or an AGI that can make itself superhuman, critically relies on 999 modules. That is, 999 problems have to be solved correctly in order to create a working AGI.
There is another module labeled <goal>, or <utility function>. This <goal module> controls the behavior of the AGI.
Humans will eventually solve these 999 problems, but will create a goal module that does not prevent the AI from causing human extinction as an unintended consequence of its universal influence.
Notice the foregone conclusion that you need to prevent an AGI from killing everyone. The assumption is that killing everyone is what AGIs do by default. Further notice that this behavior is not part of the goal module that supposedly controls the AGI's behavior, but is rather assumed to be a consequence of the 999 modules on which an AGI critically depends.
Analogous to the nanofactory scenario outlined above, an AGI is assumed to always behave in a way that will cause human extinction, based on the assumption that an AGI will always exhibit an unbounded influence. From this, the conclusion is drawn that it is only possible to prevent human extinction by directing this influence in such a way that it will respect and amplify human values. It is then claimed that the only way to ensure this is to implement a goal module that contains either an encoding of all human values or a way to safely obtain such an encoding.
Given all of the above, you cannot deny that it is not too unlikely that humans will eventually succeed at correctly implementing the 999 modules necessary to make an AGI work, while failing to implement the thousandth module, the goal module, in such a way that the AGI will not kill us. After all, relative to the information-theoretic complexity of an encoding of all human values, the 999 modules are probably easy to get right.
But this is not surprising, since the whole scenario was designed to yield this conclusion.