Quick review of RobBB’s ‘Engaging Introductions to AI Risk’

LessWrong user RobBB posted what he calls a mixtape of blog posts to introduce people to the dangers of artificial superintelligence (AI risk, for short).

For my own introduction to AI risk see here.

(1) Power of Intelligence, (9) Plenty of Room Above Us

Response: (1) superhuman intelligence is not the same as superapish intelligence; (2) it is far from clear that intelligence is a decisive factor in a war between AI and humanity; (3) current AI is pathetic and far from human-level AI.

(2) Ghosts in the Machine, (11) Basic AI drives

Response: People read my posts about how AI is much less of a risk than other people want them to believe, and one of the top three initial reactions is:

“But according to Omohundro there will be certain AI Drives which will cause human extinction, no matter what goal the AI has.”

And where would these drives come from? Terminal and instrumental goals are orthogonal: an artificial intelligence can have any combination of terminal and instrumental goals. In other words, more or less any terminal goal is compatible with infinitely many different sets of instrumental goals.

There is this way of imagining that an AI will be pulled at random from mind design space. This ignores how real-world AI is actually developed, namely that virtually all AI is constantly improved to be better at understanding and doing what humans want.

AI is much harder than people instinctively imagine, precisely because there is no relevant difference between goals and capabilities in artificial intelligence. To build an AI that beats humans, you have to define “winning”.

This doesn’t mean you program in every decision explicitly. Any general intelligence will have to be able to hit very small targets in large and unstructured spaces. Any superhuman AI will eventually be better at understanding what humans want it to do than humans themselves. AI risk advocates in turn base their ideas on what can be called the fallacy of dumb superintelligence.

(3) Artificial Addition

Response: General intelligence requires either one conceptual breakthrough or many small incremental breakthroughs. Either way, I don’t know of any good reason to believe that e.g. the ability to generate novel and useful mathematics can be captured by a set of rules that are both simple and efficient.

What is useful and interesting depends on the context. In other words, the context defines what constitutes winning.  And since you cannot guess the context, you won’t be able to implement a simple and efficient rule that outputs <success> given any arbitrary context.
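To illustrate the asymmetry, here is a minimal sketch (it assumes the third-party python-chess package, and the function names are my own invention): “winning” at chess is a short, fully formal predicate, while the corresponding predicate for “novel and useful mathematics” is a stub nobody knows how to fill in.

```python
import chess  # third-party python-chess package (an assumption of this sketch)

def won(board: chess.Board) -> bool:
    """'Winning' at chess is a one-line formal predicate:
    the side to move has just been checkmated."""
    return board.is_checkmate()

board = chess.Board()
for san in ["f3", "e5", "g4", "Qh4"]:  # fool's mate
    board.push_san(san)
print(won(board))  # True: Black has just checkmated White

def is_novel_and_useful_mathematics(statement: str) -> bool:
    """No comparably simple and efficient rule is known; what counts as
    'useful' or 'interesting' depends on the context."""
    raise NotImplementedError
```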

(4) Adaptation-Executers, not Fitness-Maximizers

Response: I wasted time reading this post.

(5) The Blue-Minimizing Robot

Response: Any behavior-executor can be framed as a utility-maximizer and vice versa. Your robot will only try to prevent you from messing with it if you programmed it to do so. In other words, no AI is going to be an existential risk as long as you did not explicitly make it one.
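To make the framing point concrete, here is a minimal Python sketch (the robot, its percepts, and its actions are invented for illustration): the same blue-targeting behavior written once as a condition-action rule and once as a degenerate utility maximizer. Nothing about self-preservation appears in either version unless it is explicitly put there.

```python
from collections import namedtuple

# A toy percept: what the robot currently sees (hypothetical, for illustration).
Percept = namedtuple("Percept", ["dominant_color"])

ACTIONS = ("fire_laser", "do_nothing")

# Framing 1: behavior-executor. A fixed condition-action rule, nothing more.
def robot_step(percept):
    return "fire_laser" if percept.dominant_color == "blue" else "do_nothing"

# Framing 2: the same behavior recast as utility maximization. The "utility
# function" merely scores the rule's preferred action highest; resisting
# interference never shows up unless we write it in.
def utility(percept, action):
    wants_to_fire = percept.dominant_color == "blue"
    return 1.0 if (action == "fire_laser") == wants_to_fire else 0.0

def robot_step_as_maximizer(percept):
    return max(ACTIONS, key=lambda a: utility(percept, a))

assert robot_step(Percept("blue")) == robot_step_as_maximizer(Percept("blue"))
assert robot_step(Percept("red")) == robot_step_as_maximizer(Percept("red"))
```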

(6) Optimization and the Singularity, (7) Efficient Cross-Domain Optimization

Response: Evolution was able to come up with cats. Cats are immensely complex objects, and evolution did not intend to create them. Now suppose you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks instead of cats. How would it do this?

Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator.

The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis.

Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.

Humans in turn receive constant feedback on what to optimize by a cultural and evolutionary process. There is no simple way to automate that.

(8) The Design Space of Minds-In-General

Response: The only relevant AIs are those which are designed by humans. And such AIs should be expected to be better at doing what humans want, because they are the improved successors of previous generations of AIs which were doing what humans wanted. For more on this, see here.

(10) The True Prisoner’s Dilemma

Response: I do not have the time or the background knowledge to comment on any possible relation to AI risk at this point.

(12) Anthropomorphic Optimism

Response: I did not read the post since it did not seem to be relevant, and I already wasted more time on this than I now feel comfortable about.

(13) The Hidden Complexity of Wishes, (14) Magical Categories

Response: Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI were going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? Do you believe that the following exchange could persuade the gatekeeper:

Gatekeeper: What would you do if I asked you to minimize suffering?

AI: I will kill all humans.

I don’t think so.

So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by “minimize human suffering”, then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possibly convince anyone to set it free, let alone take over the world?

Here is what I want AI risk advocates to show:

(1) natural language request -> goal(“minimize human suffering”) -> action(negative utility outcome)

(2) natural language query -> query(“minimize human suffering”) -> answer(“action(positive utility outcome)”).

Point #1 is, according to AI risk advocates, what is supposed to happen if I supply an artificial general intelligence (AGI) with the natural language goal “minimize human suffering”, while point #2 is what is supposed to happen if I ask the same AGI, this time caged in a box, what it would do if I supplied it with the natural language goal “minimize human suffering”.

Notice that if you disagree with point #1 then that AGI does not constitute an existential risk given that goal. Further notice that if you disagree with point #2, then that AGI won’t be able to escape its prison to take over the world and would therefore not constitute an existential risk.
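Here is a minimal sketch of the tension between #1 and #2 (every name below is a placeholder I invented for illustration, not anyone’s proposed architecture): both routes necessarily pass through the same interpretation step, so whatever understanding lets the boxed AGI give the reassuring answer in #2 is equally available when the same phrase is supplied as a goal in #1.

```python
# Placeholder machinery, invented purely for illustration.
def interpret(request: str) -> str:
    """Map a natural-language request to the speaker's intended goal."""
    lookup = {"minimize human suffering": "reduce suffering without harming anyone"}
    return lookup.get(request, "unknown request")

def plan(goal: str) -> str:
    """Produce a plan for the interpreted goal."""
    return f"plan({goal})"

def act_on(request: str) -> str:        # route #1: the request supplied as a goal
    return plan(interpret(request))

def answer_query(request: str) -> str:  # route #2: the same request, asked from inside the box
    return f"I would execute {plan(interpret(request))}"

# If answer_query() is accurate, act_on() relies on the very same interpretation;
# the two routes can only diverge if the interpretation step itself differs.
print(act_on("minimize human suffering"))
print(answer_query("minimize human suffering"))
```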

You further have to show:

(1) how such an AGI is a probable outcome of any research conducted today or in the future

and

(2) the decision procedure that leads the AGI to act in such a way.

(15-20)

Response: I am not going to read posts 15-20 because the previous posts were already unconvincing and I don’t expect those other posts to make any difference. I also have better things to do.


  • Robby Bensinger

    It sounds like your argument is:

    (1) If an AGI is unfriendly, then it doesn’t understand what humans value.

    (2) If an AGI doesn’t understand what humans value, then it can’t understand human behavior in any sophisticated way.

    (3) Which means it can’t effectively manipulate humans.

    (4) Which means it can be safely boxed.

    (5) So all unfriendly AGIs can be safely boxed.

    Premise 1 is the biggest one I’ll reject. It’s also the place Loosemore goes wrong in ‘The Fallacy of Dumb Superintelligence’: http://ieet.org/index.php/IEET/more/loosemore20121128

    An unfriendly AI is not the same thing as an AI that’s too stupid to solve Friendliness Theory. The basic problem isn’t that we’re worried that the AI will be incapable of solving Friendliness Theory; our worry is that it won’t use this knowledge, to the exclusion of all other criteria, to actually guide its behavior, unless we program it to do so in advance. And that programming it to do so is monstrously difficult when we have very little conception of what a finished Friendliness Theory is going to look like.

    Telling the AGI ‘solve Friendliness Theory for us, then self-modify to become Friendly’ isn’t helpful unless the so-called ‘Friendliness’ we program it to seek out and self-modify toward is bona fide Friendliness. Friendliness won’t come with magic stickers on it identifying it as such; the AGI itself has to figure out which real-world referent tracks this ‘Frend-leeh-ness’ thing we told it to seek out when it was a baby, and the values it uses to accomplish this are just the values we first built into it.

    The worry is not that the superintelligence won’t understand what we mean; it’s that the superintelligence won’t care, because to program it to care we had to have a deep and rich enough understanding of neuroscience to precisely program the AGI at the outset to care about what -we- mean by ‘the meaning’ of our commands. The genie isn’t wickedly trying to twist our words around, nor is it stupidly misunderstanding them; it’s simply indifferent, because we didn’t have the sophistication to pin its -values- at the outset to any particularly sophisticated theory of mind. Understanding others’ beliefs is only useful if you also care about them.

    “Your robot will only try to prevent you from messing with it if you programmed it to do so. In other words, no AI is going to be an existential risk as long as you did not explicitly make it one.”

    This is a non sequitur. Most things any complicated machine does won’t be things we explicitly programmed it to do; they’ll be things we implicitly (including, sometimes, inadvertently) programmed it to do. It’s true that the AGI will only be an existential threat if its programming (plus environment) lawfully leads it to be one; but that gives us no reason to think that only a program explicitly intended to be an existential threat will in fact end up being one.

    “AIs should be expected to be better at doing what humans want, because they are the improved successors of previous generations of AIs which were doing what humans wanted.”

    A superintelligence will be improved in intelligence relative to its predecessor, not necessarily improved in Friendliness. Improvement along one axis need not entail improvement along any other, and intelligence itself is orthogonal to Friendliness.

    The largest worry is not that an originally Friendly AGI will self-modify and thereby cease to be Friendly (though that is a serious possibility, if the AGI isn’t smart enough to predict that eventuality). Rather, the largest worry is that an AGI that starts off without a utility function that optimizes for human well-being, will initially seem benign because it is weak and is being used only for very specific functions that humans can effectively supervise; but that what makes a very low-level AGI seem benign in all these small and domain-specific ways, will not entail having a sufficiently high-fidelity humanistic utility function to remain innocuous-seeming once it has become much more powerful and general.

  • Why do you think it is difficult to program an AI to do what humans want it to do, given that humans succeeded at programming the AI to understand what humans want it to do?

    You seem to be thinking that the capabilities of an AI are easier to program than its goals, and that there is a relevant difference between capabilities and goals. How so?

    You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot.

    For example, it will have to care about achieving accurate beliefs about the world. It will have to care about devising plans to overpower humanity without getting caught.

    If it cares about those activities, then how is it more difficult to make it care about understanding and doing what humans mean? Especially given that all modern products are already designed and tested to do what humans want them to do.

    You wrote:

    …but that gives us no reason to think that only a program explicitly intended to be an existential threat will in fact end up being one.

    Yes, it does. For something to constitute an existential risk by means of intelligence, the technology would have to work perfectly well along a huge number of dimensions, yet fail in such a way that it ends up deceiving and overpowering humanity as an unintended side-effect.

    I like to analogize such a scenario to the creation of a generally intelligent autonomous car that works perfectly well at not destroying itself in a crash but which somehow manages to maximize the number of people it runs over.

    The failure mode, the mistake, would have to be selective enough to only influence one or a few dimensions of how such an artificial general intelligence is supposed to work, causing it to fail in a highly complex, intelligent, rational yet catastrophically destructive way, while being indiscernible during the research and development process, i.e. before reaching the ability to influence the world in such a way.

    For an artificial general intelligence to constitute a risk as a result of unintended consequences, those unintended consequences would have to have little or no negative influence on the huge number of intended consequences that are necessary for it to be able to overpower humanity.

    You wrote:

    Rather, the largest worry is that an AGI that starts off without a utility function that optimizes for human well-being…

    Again, goals and capabilities are not independent. If it wants to improve its skills, e.g. its math skills, it will have to do so relative to its goals. If its goals are too fuzzy in this respect, how does it tell apart an improvement from a failure? And why would it want to improve those skills in the first place, if it is unable to judge their instrumental value?

    In other words, an AI that pursues some arbitrary and/or fuzzy goal, without first trying to refine its goal, wouldn’t work at all. And if it did do anything coherent, it would quickly diverge from what humans want it to do (before it could become powerful).

    Now I do realize that you probably believe that math skills are universally useful. Well, then define “math skills”. You say that the AI will initially be weak. How then does it make its math skills “powerful” if it does not know what “math” and “powerful” even mean? It would surely have to improve its skills by trying to solve problems and prove theorems. But those problems can’t be arbitrary, otherwise it would just pick the easiest problems and theorems and never improve. So in order for it to improve, humans would have to tell it which problems to solve and which theorems to prove. And here we are again, with an AI that does what humans want it to do….

  • Robby Bensinger

    Why do you think it is difficult to program an AI to do what humans want it to do, given that humans succeeded at programming the AI to understand what humans want it to do?

    First, the former goal is a proper subgoal of the latter; so it’s at least as difficult. The question is how much harder it is to program AGI to make itself superintelligent (which entails eventually having some ability to model and predict human linguistic behavior) AND to give it stable perfectly humanistic values at the outset, vs. how hard it is to just program the AGI to make itself superintelligent.

    You seem to be thinking that the capabilities of an AI are easier to program than its goals, and that there is a relevant difference between capabilities and goals. How so?

    It isn’t hard to give an AI goals, but it’s hard to foresee the long-term consequences of those goals, especially when they start getting enacted by a massively self-modified being and/or in novel environments. Giving an AI very specific capabilities we have in mind (e.g., ‘be able to cure cancer by Tuesday, but don’t have any capability of curing gonorrhea’), or very specific goals we have in mind (e.g., Friendliness), is very difficult. Giving an AI domain-general capabilities and a somewhat random goal is easier; and giving it restricted but somewhat random or lower-complexity goals is easier still.

    I’m not worried about AI with a small number of capabilities. I’m worried about AI with a large number of diverse capabilities. The likeliest way for such an AI to come about is if a much weaker AI (say, a merely human-level AGI created by brain emulation) that happened to have the ability to make AGIs with more capabilities were invented, and this led to a feedback loop. Yes, perhaps many such feedback loops self-destruct, fizzling out harmlessly; but I’m worried about the one that doesn’t, because this is the one that will produce a superintelligence that can directly impact the world.

    You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot.

    I never said the AI would care about nothing. I said it wouldn’t (by default) care about satisfying humans’ core values. The seed AI has some set of initial preferences — it could in principle be almost any set, since almost any set would be furthered by recursive self-improvement — but there is no reason to expect these to intersect with human values unless we somehow program the AI to care about human values in advance, solving the Friendliness problem.

    For example, it will have to care about achieving accurate beliefs about the world. It will have to care about devising plans to overpower humanity without getting caught. If it cares about those activities, then how is it more difficult to make it care about understanding and doing what humans mean?

    It’s more difficult because caring about correctly modeling the world, and about overpowering humanity, is instrumental to almost every conceivable goal that would require superintelligence in the first place. Caring about obeying humans unconditionally, on the other hand, is instrumental to very few possible goals that would demand high levels of intelligence.

    The failure mode, the mistake, would have to be selective enough to only influence one or a few dimensions of how such an artificial general intelligence is supposed to work, causing it to fail in a highly complex, intelligent, rational yet catastrophically destructive way, while being indiscernible during the research and development process, i.e. before reaching the ability to influence the world in such a way.

    It could fail in any number of ways, as long as it doesn’t fail in whatever process makes it a superintelligence.

    I agree that most ways for an invention to fail to be Friendly will also entail that the invention is unintelligent. For instance, the failure modes ‘be made of lettuce’ and ‘melt down on start-up’ kill both Friendliness and intelligence. But that’s consistent with also affirming that the easier-to-design superintelligences are mostly Unfriendly.

    Conditioned on Unfriendliness, most random objects will be unintelligent (because most objects period are unintelligent; the base rate dominates). But conditioned on intelligence, most random objects (including most humanly designable objects) are Unfriendly. So, however unlikely it is that we’ll make a superintelligence, if we do succeed in making one it will by default be Unfriendly, unless we actively intervene to make it Friendly.

    If it wants to improve its skills, e.g. its math skills, it will have to do so relative to its goals. If its goals are too fuzzy in this respect, how does it tell apart an improvement from a failure?

    I’m not saying an AI would have fuzzy goals. I’m saying it would have goals that fail to perfectly align with human goals. Since value is fragile and complex, even a slight deviation from Friendliness can have catastrophic consequences. There are two important and relevant differences between being good at math and being good at Friendliness:

    (1) Math is one of the simplest and most uniform bodies of knowledge in the universe, whereas human psychology is one of the most complicated and hard-to-compress bodies of knowledge in the universe. You only need to get a dozen or so things right, and everything else in math will follow; that’s easy to hard-code in. There is no comparable compression of human values into a few axioms, disregarding Fake Utility Functions.

    (2) If you’re bad at math, then you won’t be able to self-modify to become superintelligent. If you’re bad at humanism, there’s no particular reason to expect that to have a causal impact on your ability to become more intelligent. So math skills are a filter on what superintelligences are possible, whereas moral conduct is not. We can expect all powerful beings to be good at math (because the only way to get particularly powerful is to be good enough at math to reprogram oneself to be increasingly powerful), whereas we can’t expect all of them to be good at humanism.

    Being good at math does require that the AGI have goals, but those goals are disjunctive and simple. Knowing that disjunctive and simple goals are (relatively) easy to program doesn’t tell us that conjunctive and complex goals are easy to program. It’s like assuming that if a child can be taught how to count to 10, then a child can be taught how to build a Dyson sphere without much more difficulty. After all, they’re both just cases of the child being taught to do something…

  • Pingback: Alexander Kruel · MIRI/LessWrong Critiques: Index

  • I don’t understand what it would mean for an AI to make itself superintelligent in the absence of a highly specific goal against which it can judge its success.

    You might imagine that any goal would do the job, due to universal instrumental goals. I don’t see that at all.

    The whole idea of AI drives seems broken to me. An AI does not need to care about praying to the lords of the Matrix in order for them to not shut down its simulation. In the same sense it does not need to care about overpowering humanity because they might interfere with its goal of calculating 1+1. Sure, there are such AIs in mind design space. But that is completely irrelevant. The idea that such an AI could actually be built by humans is absurd.

    Even if AI drives were a coherent idea: given a vague goal (e.g. to maximize paperclips, without telling the AI what “maximizing” means), how could the AI tell whether taking over the world is instrumentally rational, as long as it has not learned what exactly it is meant to do?

    …it’s hard to foresee the long-term consequences of those goals…

    If an AI is very similar to humans, e.g. a neuromorphic AI, then I agree.

    That is also why I perceive the most drastic AI risk to be a failed friendly AI. The closer you get to human values, the higher the chance that an AI will have drives that interfere with human values in unpredictable ways.

    Math is one of the simplest and most uniform bodies of knowledge in the universe, whereas human psychology is one of the most complicated and hard-to-compress bodies of knowledge in the universe. You only need to get a dozen or so things right, and everything else in math will follow;

    You have it exactly backwards. Math is such a broad subject that you need complex values to pinpoint useful and interesting mathematics.

    The reason why chess AIs were easy to design, while Terence Tao AIs are hard to design, is that what constitutes winning in chess is easy to formalize, whereas doing human-level mathematics requires filters that approach the complexity of human values.

    In theory, given infinite resources, a purely consequentialist AI could do it as well. But that is practically infeasible. In practice, doing mathematics requires you to somehow limit the range of problems that you want to solve, in order to be effective.

  • seahen

    A quck is something that looks like a quck, walks like a quck and qucks like a quck.

  • Robby Bensinger

    The issues you’re raising seem to be relatively common objections, so I’ve written a LW post addressing them. I suggest we continue the conversation here: http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/

    I don’t understand what it would mean for an AI to make itself superintelligent in the absence of a highly specific goal against which it can judge its success.

    The AI does have a specific goal. (The fact that it’s not the goal we intended doesn’t mean that it’s an underspecified goal.)

    But superintelligence isn’t defined in terms of a specific goal. It’s defined in terms of a domain-general ability to manipulate itself and its environment. If there are a wide variety of environments you could place the AI in and still have it efficiently optimize for its goal, then it qualifies as ‘intelligent’ under this operationalization.

    If an AI is very similar to humans, e.g. a neuromorphic AI, then I agree.

    That is also why I perceive the most drastic AI risk to be a failed friendly AI.

    Well, it’s not specifically why, though the reasons may be analogous. If you were just worried about neuromorphic AI, then you’d have reason to love MIRI, since they also think neuromorphic AI is probably a terrible idea.

  • The posts you linked to got exactly 3 hits at this point in time, while your post is at +19 votes right now. People don’t seem to be very interested in learning what the other side has to say. And since you seem neither to understand the other side’s arguments nor to have bothered to quote the relevant posts and passages, they won’t learn anything new from your post either.

    Great echo chamber you got there…

  • Robby Bensinger

    Alexander: I linked to the same article you did, and quoted the text from which you drew your ‘dumb superintelligence’ argument. Can you point to specific things I’ve misunderstood about your arguments?

    I’m not trying to refute everything you’ve ever written at once. I’m addressing a single specific claim you made: That any AGI we build with good intentions will either be too dumb to harm us, or smart enough to understand that we didn’t want it to harm us and to therefore align itself with our values. This claim is false, whatever the merits of your other arguments.

    It’s important to hug the query here and not change the topic when you notice a problem with an old belief. Otherwise, you risk retaining the old belief out of habit, or only incompletely revising your views when a piece of the foundations crumbles. It’ll also make it more likely that you’ll persuade me in the future, because you’ll have given a sign that you too update when given counter-arguments.

  • I can hardly update towards a position that I already agree with.

    The ‘dumb superintelligence’ argument is merely arguing against people who claim that an AI will be a risk because it misunderstands simple queries. As a stand-alone argument it does not refute the possibility that an AI will understand but not care. That is possible. But other arguments show how that is very unlikely.

    MIRI’s scenario is by definition an existential risk. I am not arguing against that definition. If I was to accept it, then surely AI would be an existential risk.

    There are a bunch of arguments, none of which refute AI risks in general, but each of which makes MIRI’s scenario more unlikely.

    I could just demand that you be specific about AI risks, and the debate would be over, because you cannot do that. Your scenario is based on a magical black box. You are reasoning your way back from conclusions, looking for ways to justify them. I am just kind enough not to dismiss that kind of thinking outright, but to outline some specific arguments for why the conclusion is unlikely, even if I accept a lot that I do not have to accept. I set aside the fact that your conclusions are unfounded.

  • By the way, check out the following two posts to see that I already understand what you believe:

    1. Narrow vs. General Artificial Intelligence

    2. Human-UFAI Conversation

  • Robby Bensinger

    Alexander: Can you provide evidence that MIRI has at any point committed the ‘dumb superintelligence’ fallacy? This part of the conversation started because you wrote, “AI risk advocates in turn base their ideas on what can be called the fallacy of dumb superintelligence.” What I want is evidence for this claim. You cite the fallacy an awful lot, and Richard Loosemore wrote a whole article on it, but I haven’t actually seen a single quotation from either of you that shows people committing the error, much less quotes showing that MIRI or FHI endorse it as institutions.

    You seemed to be linking to Stuart Armstrong’s talk as a supposed example of this fallacy, but where specifically do you see him committing it anywhere in the talk? Can you quote the relevant portion, word for word? Armstrong seems to be unambiguously talking about the difficulty of giving a seed AI the right values, not the difficulty of giving a superintelligence the right facts.

    I could just demand that you be specific about AI risks, and the debate would be over, because you cannot do that.

    First, let’s establish an obvious point we should be able to agree on: There are lots of cases in the real world where assigning explicit probabilities to each component of your beliefs is counterproductive, and can decrease the reliability of those beliefs. Certainly it is not an a priori demand of every argument anyone ever makes that they quantify it. Otherwise I would demand that you quantify your reasons for believing that I’m “reasoning my way back from conclusions”! (Though I do ask that you provide some qualitative evidence for that assertion. You hardly know me. How did you gain such insight into my motivations, exactly, Alexander?)

    In any case, do you have a link to your own breakdown of the exact probabilities of the AI risk scenarios? I’d be interested in seeing it.

  • Pingback: The seed is not the superintelligence | nothing is mere