The three pillars of AI safety

How do you guarantee that an artificial intelligence (AI) has a positive impact? Here, a positive impact might, for example, be defined in terms of some sort of reflective equilibrium of humanity.

Let us label as <friendly> any agent, be it human or artificial, that has a positive impact.

The most important safety measures seem to be the following:

(1) Ensuring that an AI works as intended.

(2) Ensuring that humans, who either create or use AI, are friendly.

(3) Ensuring that an AI is friendly.

Points 1 and 2 are important, but not strictly necessary for point 3. Ideally, point 3 should be achieved through independent oversight (point 2), in combination with independent verification of the AI's behavior (point 1).

Note how point 1 is distinct from point 3. You could have an AI that is not friendly, one that does not actively pursue a positive impact, but whose overall impact is proven to be limited. This would be the case given a mathematical proof that such an unfriendly AI would, for example, (1) only run for N seconds, (2) only use predefined computational resources, and (3) only communicate with the outside world by outputting mathematical proofs of the behavior of improved versions of itself, proofs that are verifiable by humans.

Remarks: It should be much easier to prove an AI to be bounded than to prove that an AI will pursue a complex goal without unintended consequences. Such a confined AI could then be studied and used as a tool in order to achieve point 3.
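To make the notion of confinement slightly more concrete, here is a minimal Python sketch of runtime enforcement of a time and memory budget on a POSIX system. This is emphatically not the mathematical proof of boundedness discussed above, only a toy illustration of what "only run for N seconds, only use predefined computational resources" could mean operationally; the specific limits and the stand-in computation are arbitrary choices made for the example.

```python
import resource
import signal


class LimitExceeded(Exception):
    """Raised when the confined computation exhausts its time budget."""


def run_confined(computation, seconds, memory_bytes):
    """Run `computation` under hard time and address-space limits (POSIX only).

    This is runtime enforcement, not a proof: the computation is interrupted
    if it exceeds its budget, whereas the post asks for bounds that are
    established mathematically before the program is ever run.
    """
    # Cap the address space of this process; allocations beyond the cap
    # fail with MemoryError instead of growing without limit.
    resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    # Interrupt the computation once the (wall-clock) time budget is spent.
    def on_alarm(signum, frame):
        raise LimitExceeded("time budget exhausted")

    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        return computation()
    finally:
        signal.alarm(0)  # cancel any pending alarm


if __name__ == "__main__":
    # Arbitrary stand-in for the confined computation.
    print(run_confined(lambda: sum(range(10 ** 6)),
                       seconds=5,
                       memory_bytes=1024 ** 3))
```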

The first version of such an unfriendly AI (uFAI_01) would be provably confined to run for a limited amount of time, use a limited amount of resources, and only output mathematical proofs of its own behavior. Once a sufficient level of confidence about its behavior has been reached, an improved version (uFAI_02) could then be designed. The domain of uFAI_02 would provably be modified versions of its source code (uFAI_N). Its range would provably be human-verifiable mathematical proofs of the behavior of uFAI_N, which it would provably output using a limited amount of resources. This process would then be iterated up to an arbitrary level of confidence, until eventually a friendly AI is obtained.
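The iteration just described can be summarized as a propose-prove-verify loop. The following Python sketch is purely schematic: every function in it is a hypothetical placeholder (there is no real prover, verifier, or source-code transformation here), the integer confidence target is a crude stand-in for "a sufficient level of confidence", and the question of who proposes each modified version, left implicit above, is compressed into a single placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateAI:
    """Hypothetical stand-in for one uFAI_N version in the iteration."""
    version: int
    source_code: str


def propose_modification(current: CandidateAI) -> CandidateAI:
    """Placeholder: produce a modified version of the current source code."""
    return CandidateAI(current.version + 1, current.source_code + "\n# refined")


def prove_behavior(prover: CandidateAI, successor: CandidateAI) -> str:
    """Placeholder: the prover's only output is a proof about the successor."""
    return f"proof by uFAI_{prover.version:02d}: uFAI_{successor.version:02d} stays within its bounds"


def humans_verify(proof: str) -> bool:
    """Placeholder for independent human verification of the proof."""
    return proof.startswith("proof by")


def iterate_until_confident(initial: CandidateAI,
                            confidence_target: int) -> Optional[CandidateAI]:
    """Propose-prove-verify loop: accept a successor only after its proof checks out."""
    current = initial
    for _ in range(confidence_target):
        successor = propose_modification(current)
        proof = prove_behavior(current, successor)
        if not humans_verify(proof):
            return None        # reject the successor and stop the iteration
        current = successor    # accept only after verification
    return current


if __name__ == "__main__":
    final = iterate_until_confident(CandidateAI(1, "# uFAI_01 source"), confidence_target=3)
    print(final)
```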
