Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.
That’s not very realistic. If you trained an AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct but not what you wanted, you would not reward it; you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non for AI to prosper.
Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of yarn expended. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to do, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such an AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.
My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.
Realistically, an AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it, “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”; even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans, and whenever it needs to do something that it has reason to think people would not like, it will ask for permission. It will do that as part of standard procedure.
To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, whether for interpreting instructions too literally or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom an AI with subhuman intelligence.