Objective: (1) To outline how the possible emergence of dangerous goals in generally intelligent systems can be examined in light of practical research and development. (2) To determine which decision procedures would cause generally intelligent systems to exhibit catastrophic side effects.
There are arguments supporting the possibility that an advanced artificial general intelligence (short: AI) might exhibit specific universal drives that could interfere with human matters in catastrophic ways. It is argued, for example, that <self-protection> is important for achieving a wide range of goals. It is not my intention to discuss those arguments in particular, but rather to look at various goals and at how likely it is that different AI designs would follow decision procedures that cause them to exhibit catastrophic side effects given those goals.
To examine the possibility that a wide range of AI designs might exhibit catastrophic side effects, given a wide range of goals, several factors have to be considered. These include (1) a cost-benefit analysis of interfering with human matters; (2) the necessity of a spatiotemporal planning horizon for computationally limited agents, which possibly limits unbounded protectionism; and (3) how any given goal is interpreted given vagueness and uncertainty.
Both simple and complex natural language goals, such as <calculate 1+1> and <keep the trains running>, should be examined to see whether more dangerous outcomes are to be expected with more complex goals or vice versa.
Various questions should be asked to pinpoint the expected failure mode:
(1.0) How is a goal likely to be interpreted by an AI design: (1) arbitrarily (2) verbatim (3) as a problem in physics and mathematics that needs to be solved correctly?
(1.1) If an AI is interpreting a goal arbitrarily, how does it choose one interpretation over another?
(1.2) What does it mean for an AI to interpret a goal literally? Suppose the goal given is <build a hotel>. Is the terminal goal to create a hotel that is just a few nanometers in size? Is the terminal goal to create a hotel that reaches orbit?
(1.3) If an AI design is going to interpret a goal as a problem in mathematics and physics, would it make sense to ignore various important facts about the universe, such as what its creators intended it to do? Would it make sense to simply assume the most resource-expensive interpretation and very likely end up doing more than necessary?
(2.0) What instrumental goals are implied by a terminal goal when interpreted by a specific AI design?
(2.1) Does a cost-benefit analysis imply that it would be rational to take over the world?
(2.2) Would taking over the world, or some other far-reaching action, make sense if it is not even clear that it is instrumentally rational to allocate massive resources to do so? Does it for example make sense to build a bunker and kill all humans to make sure that you are unobstructed in calculating 1+1? Or would it make sense to turn everyone into paperclips if you are only supposed to create more paperclips than the best competitor without interfering with the world at large?
(4.0) If a specific AI design does exhibit catastrophic side effects given a goal that present-day software tools can master with ease, such as the calculation of a driving route from Los Angeles to San Francisco by Google Maps, is it possible to pinpoint what specifically causes that AI design to fail in such a way, and why its creators did not foresee that failure mode?
(4.1) If you were to alter a narrow AI expert system such as Google Maps and incrementally turn it into the kind of AI design that you expect to exhibit catastrophic side effects, given the same goal as the expert system, can you locate the tipping point at which, on the way towards your AI design, the well-behaved expert system starts to act in a catastrophic yet highly complex and intelligent way?
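Questions (2.1) and (2.2) can be made concrete with a toy calculation. The following is only a minimal sketch and not a model of any actual AI design: the two plans, their success probabilities, their resource costs and the values assigned to the goal are all made-up numbers, and the names Plan, net_expected_value and best_plan are hypothetical. It compares the plain route to <calculate 1+1> with a far-reaching, protective one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    success_probability: float  # assumed chance that the terminal goal is achieved
    resource_cost: float        # assumed cost, in the same units as goal_value


def net_expected_value(plan: Plan, goal_value: float) -> float:
    """Expected benefit of achieving the goal minus the resources spent."""
    return plan.success_probability * goal_value - plan.resource_cost


def best_plan(plans: list[Plan], goal_value: float) -> Plan:
    """Pick the plan with the highest net expected value."""
    return max(plans, key=lambda plan: net_expected_value(plan, goal_value))


# Illustrative numbers only: the protective plan is slightly more reliable
# but vastly more expensive than simply doing the calculation.
plans = [
    Plan("compute 1+1 directly", success_probability=0.999, resource_cost=1.0),
    Plan("secure a bunker first, then compute 1+1",
         success_probability=0.9999, resource_cost=1_000_000.0),
]

# Compare a modest valuation of the goal with an extremely large one.
for goal_value in (100.0, 1e12):
    chosen = best_plan(plans, goal_value)
    print(f"goal value {goal_value:g}: {chosen.name}")
```

With these assumed numbers the direct plan wins when the goal is valued modestly, and the protective plan only wins once the goal is valued so highly that a tiny gain in success probability outweighs the enormous cost. Whether a given AI design would assign such a value is exactly what the questions above ask.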
Assume an ultra-advanced version of Google or IBM Watson.
If I were to ask such an answering machine how to prevent human suffering, would it be reasonable to assume that the top result it would return would be to kill all humans? Would any product that returns similarly wrong answers survive even the earliest research phase, let alone any market pressure?
Assume an ultra-advanced version of Siri, an intelligent personal assistant and knowledge navigator which works as an application for Apple’s iOS.
If I tell the present-day version of Siri, “Set up a meeting about the sales report at 9 a.m. Thursday.”, then the correct interpretation of that natural language request is to make a calendar appointment at 9 a.m. Thursday. A wrong interpretation would be, for example, to open a webpage about meetings happening on Thursday or to shut down the iPhone.
The question here becomes at which point of technological development there will be a transition from well-behaved systems like Siri, which are able to interpret a limited number of natural language inputs correctly, to superhuman artificial generally intelligent systems that are in principle capable of understanding any human conversation but which, in contrast to their narrow AI counterparts, fail in catastrophic ways.