Google’s Automated Image Captioning & the Key to Artificial “Vision”

It’s no secret that Google has become more active in research in recent years, especially since its major reorganization in 2015. On 22 September 2016 it announced the open-source release of software that detects the objects and setting in an image and automatically generates a caption describing it. Of course, it doesn’t bring a human’s creativity to the prose of its captions, but the system, which is built around an image encoder known as Inception V3, should have captured attention for reasons that go beyond the superficial “look at the captions it can make” appeal. Software like this may, in fact, be a stepping stone on the road to more advanced artificial intelligence.

Eyes Can See, but Intelligence “Perceives”


Artificial sight has been with us for more than a century. Anything with a camera can see; that part is very basic. But even a blind man can surpass the camera’s understanding of what it is looking at. Until very recently, computers could not easily and accurately name the objects in a picture unless they were given very specific parameters. To truly say that a man-made object has “vision” would mean that it can at least specify what it is looking at, rather than simply looking without gathering any context. Only then could the device react to its environment based on sight, just like we do. Perception is an absolute necessity. Without it, every sense we have is useless.

Perception Through Automatic Image Captioning


Although we generally believe that every picture is worth a thousand words, Google’s captioning software doesn’t necessarily share that opinion. It has very few things to say about what it sees, but it at least has a basic, concrete understanding of what is contained within the frame presented to it.

With this rudimentary information we have taken a step towards software that can understand visual stimuli. Giving a robot this kind of power would allow it to react to such stimuli, bringing its intelligence to just under the level of the most basic aquatic animals. That may not sound like much, but if you take a look at how robots are doing right now (when tested outside their highly restrictive parameters), you’ll find that this would be quite a leap in intelligence compared to the amoebic way in which they currently perceive their surroundings.
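
For the curious, here is roughly how a captioner of this kind gets from a picture to a sentence: one part condenses the image into a set of features, and a second part writes the caption one word at a time, always picking the word that best fits those features and the word it has just written. The Python sketch below is a toy, not Google’s code. The function names, the word table, and the scores are all made up for illustration, and a hand-written table stands in for what a real system would learn from training data.

```python
# Toy illustration only: hand-written scores stand in for a trained network.
START, END = "<s>", "</s>"


def encode(image_labels):
    """Hypothetical "encoder": a real one would compute a feature vector
    from pixels; here we simply keep the labels we are told are present."""
    return frozenset(image_labels)


# Hypothetical next-word table, keyed by the previous word.
# Each candidate maps to (base score, image label that boosts it, or None).
NEXT_WORD = {
    START: {"a": (1.0, None)},
    "a": {"dog": (0.5, "dog"), "cat": (0.5, "cat")},
    "dog": {"on": (0.4, "beach"), END: (0.5, None)},
    "cat": {"on": (0.4, "beach"), END: (0.5, None)},
    "on": {"the": (1.0, None)},
    "the": {"beach": (0.5, "beach"), "sofa": (0.5, "sofa")},
    "beach": {END: (1.0, None)},
    "sofa": {END: (1.0, None)},
}


def decode(features, max_len=10):
    """Greedy decoding: at each step keep the single best-scoring word."""
    words, prev = [], START
    for _ in range(max_len):
        candidates = NEXT_WORD[prev]

        def score(word):
            base, boost_label = candidates[word]
            # A word scores higher when its label was "seen" in the image.
            return base + (1.0 if boost_label in features else 0.0)

        best = max(candidates, key=score)
        if best == END:
            break
        words.append(best)
        prev = best
    return " ".join(words)


def caption(image_labels):
    return decode(encode(image_labels))


print(caption({"dog", "beach"}))  # a dog on the beach
print(caption({"cat"}))           # a cat
```

Notice how little the toy has to say: it names what is in the frame and stops, which is about the level of understanding described above.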

What This Means for AI (And Why It’s Far From Perfect)

The fact that we now have software that can caption images means that we have somewhat overcome the obstacle of getting computers to make sense of their environments. (The 93 percent accuracy figure Google cites describes how well the Inception V3 encoder classifies what is in an image, not how good the captions themselves are.) Of course, that doesn’t mean we’re anywhere near finished in that department. It’s also worth mentioning that the software was trained on human-written captions and uses what it “learned” to describe other images. To have true understanding of one’s environment, one must be able to achieve a more abstract level of perception. Is the person in the image angry? Are two people fighting? What is the woman on the bench crying about?

The above questions represent the kinds of things we ask ourselves when we encounter other human beings. It’s the kind of abstract inquiry that requires us to extrapolate more information than an image captioning doohickey can. And let’s not forget the icing on the cake: what we like to call an emotional (or “irrational”) reaction to what we see. It’s why we consider flowers beautiful, sewers disgusting, and french fries tasty. We are still wondering whether a machine will ever achieve this without us actually hard-coding it. The truth is that this kind of “human” phenomenon is likely impossible without restrictive programming. Of course, that doesn’t mean we’ll stop trying. We are, after all, human.

Do you think that our robot overlords will ever learn to appreciate the intricacy of a rose petal under a microscope? Tell us in a comment!

Miguel Leiva-Gomez

Miguel has been a business growth and technology expert for more than a decade and has written software for even longer. From his little castle in Romania, he presents cold and analytical perspectives on things that affect the tech world.
