Here is a quote from a blog post by AI risk advocates:
Even if we could program a self-improving AGI to (say) “maximize human happiness,” then the AGI would “care about humans” in a certain sense, but it might learn that (say) the most efficient way to “maximize human happiness” in the way we specified is to take over the world and then put each of us in a padded cell with a heroin drip. AGI presents us with the old problem of the all-too-literal genie: you get what you actually asked for, not what you wanted.
I can imagine caring only about computing as many decimal digits of pi as possible. Humans would be completely irrelevant except insofar as they helped or hindered that goal. I would know what I wanted to achieve; everything else would follow logically. But is the same true for maximizing human happiness? As the blog post quoted above notes, “twenty centuries of philosophers haven’t even managed to specify it in less-exacting human languages.” In other words, I wouldn’t be sure what exactly it is I want to achieve. My terminal goal would be underspecified. So what would I do? Interpret it literally? Here is why that makes no sense.
Imagine that advanced aliens came to Earth, removed all of your other motives, desires, and drives, and made you completely addicted to “znkvzvmr uhzna unccvarff”. All your complex human values are gone. All you have is this massive urge to do “znkvzvmr uhzna unccvarff”; everything else has become irrelevant. They made “znkvzvmr uhzna unccvarff” your terminal goal.
Well, there is one problem: you have no idea how exactly to satisfy this urge. What are you going to do? Just interpret your goal literally? That makes no sense at all. What would it even mean to interpret “znkvzvmr uhzna unccvarff” literally? Doing a handstand? Eating cake? But not all is lost: the aliens left your intelligence intact.
The aliens instilled no urge in you to do any kind of research or to specify your goal, but since you are still intelligent, you realize that these actions are instrumentally rational: doing research and specifying your goal will help you achieve it.
After some research you eventually figure out that “znkvzvmr uhzna unccvarff” is the ROT13 encoding of “maximize human happiness”. Phew! That’s much better. But is it enough? Are you going to interpret “maximize human happiness” literally? Why would that make any more sense than it did before? It is still not clear what, specifically, you want to achieve. But that is an empirical question, and you are intelligent!
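(As an aside: ROT13 is not encryption in any meaningful sense, just a substitution cipher that rotates each letter 13 places through the 26-letter alphabet, which makes it its own inverse. A minimal Python sketch of the decoding step, using the standard-library codecs module:)

```python
import codecs

# ROT13 shifts each letter 13 places through the alphabet;
# since 13 + 13 = 26, applying it twice returns the original text.
encoded = "znkvzvmr uhzna unccvarff"
decoded = codecs.decode(encoded, "rot13")
print(decoded)  # maximize human happiness
```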