Extract Embedded Images from a PDF File in Ubuntu

While we already know how to edit existing PDF files in Ubuntu, there are times when the requirement is to use all or some of the images contained in a PDF file. Manual copy-pasting is definitely an option, but it’s not a time-saving one, especially when the PDF file contains a large number of images.

A tool exists, dubbed PDFImages, that makes image extraction from PDF files a cakewalk. In this article we will discuss this tool using easy-to-understand examples. Note that all the examples used in the article were tested on Ubuntu 14.04 LTS using version 0.24.5 of the tool.

What is PDFImages?

As already discussed, PDFImages is a command line tool that you can use to extract images from a PDF file. The tool’s man page says that it reads the input PDF file, scans it, and produces one Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG file for each image it encounters in the PDF file.

Download and Install

If the tool isn’t already installed on your Ubuntu box, you can download and install it using the following command:

sudo apt-get install poppler-utils

In addition to PDFImages, the package “poppler-utils” also contains several other command line utilities for getting information from PDF documents, converting them to other formats, or manipulating them.

Usage

The command line tool PDFImages, in its most basic form, requires two arguments: the input PDF file and the path under which you want the tool to save the images (the man page calls this second argument the image root). If that path is a directory followed by a trailing slash, the images are saved inside that directory. For example, in my case I tried extracting images from a PDF file named “christmas_story.pdf” and saving them to a directory named “pdfimages”.

pdfimages /home/himanshu/Downloads/christmas_story.pdf /home/himanshu/Downloads/pdfimages/

The above command produced the following files in the target directory:

ls /home/himanshu/Downloads/pdfimages/
-000.ppm  -001.ppm  -002.ppm  -003.ppm  -004.ppm  -005.ppm  -006.ppm  -007.ppm

As you can see in the output above, each file name begins with a hyphen (-) followed by a number. For those wondering why, the tool builds every output file name by appending a hyphen, the image number, and the file extension to the second argument. Because that argument ended with a slash here, nothing precedes the hyphen. This gives you the flexibility to prefix any word before the hyphen so that you can create custom names for the output images. You can do this by adding that particular word to the path of the destination directory while running the command.

For example, I added the word “image” to the path of the destination directory:

pdfimages /home/himanshu/Downloads/christmas_story.pdf /home/himanshu/Downloads/pdfimages/image

And the output files produced in this case carried the following names:

ls /home/himanshu/Downloads/pdfimages/
image-000.ppm  image-001.ppm  image-002.ppm  image-003.ppm  image-004.ppm  image-005.ppm  image-006.ppm  image-007.ppm
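
The naming rule above can be wrapped in a small script. The following is only a sketch: the function names image_root and extract_images are hypothetical helpers, not part of the tool, and the script assumes the output directory has to exist before pdfimages writes to it, which is why it is created first.

```shell
# Hypothetical helper: build an image root such as
# /some/dir/christmas_story from a PDF path and an output directory.
image_root() {
    local pdf="$1" outdir="$2" name
    name="$(basename "$pdf" .pdf)"            # strip the directory and the .pdf suffix
    printf '%s/%s\n' "${outdir%/}" "$name"    # drop one trailing slash from outdir
}

# Hypothetical wrapper: create the output directory, then run pdfimages
# with the PDF's own base name as the file name prefix.
extract_images() {
    local pdf="$1" outdir="$2"
    mkdir -p "$outdir" || return 1
    pdfimages "$pdf" "$(image_root "$pdf" "$outdir")"
}

# Example call (not run here):
# extract_images /home/himanshu/Downloads/christmas_story.pdf /home/himanshu/Downloads/pdfimages
```

With the example call shown in the comment, the output files should be named christmas_story-000.ppm, christmas_story-001.ppm, and so on.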

It’s worth mentioning that, contrary to what the tool’s man page says, two files were produced for each image in my PDF file, of which one was blank while the other was usable. In my case, the odd-numbered files were blank.

Moving on, you can also have the tool write JPEG files instead of PPM files by using the -j option. Keep in mind, however, that with this option only images stored in DCT format are saved as JPEG files (with a .jpg extension); all non-DCT images are saved in PBM/PPM format as usual.
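
Because the -j option can leave you with a mix of file types, it may help to check what actually landed in the destination directory. The sketch below is an illustration only: count_by_ext is a hypothetical helper, and the paths in the commented example are the ones used earlier in this article.

```shell
# Hypothetical helper: count the extracted files per extension in a directory.
count_by_ext() {
    local dir="$1" ext count
    for ext in jpg ppm pbm; do
        # find prints one line per matching file; wc -l counts those lines
        count="$(find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l | tr -d ' ')"
        printf '%s %s\n' "$ext" "$count"
    done
}

# Example (not run here): extract with -j, then summarise the result.
# pdfimages -j /home/himanshu/Downloads/christmas_story.pdf /home/himanshu/Downloads/pdfimages/image
# count_by_ext /home/himanshu/Downloads/pdfimages
```

A non-zero count for ppm or pbm after a -j run would indicate images that were not stored in DCT format in the PDF file.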

You can also specify which pages you want the tool to scan, so that the output contains only the images found on those pages. To do this, use the -f option (followed by a page number) and the -l option (followed by a page number) to specify the first and last pages respectively.

For example, I wanted the tool to only extract images present on the first page of the PDF file, so I used the following command:

pdfimages -f 1 -l 1 /home/himanshu/Downloads/christmas_story.pdf /home/himanshu/Downloads/pdfimages/

And in the destination directory, only two images (total of four including the blank ones) were produced:

ls /home/himanshu/Downloads/pdfimages/
-000.ppm  -001.ppm  -002.ppm  -003.ppm
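
The -f and -l options can also be combined with a loop to keep the images of each page apart. This is a rough sketch under a few assumptions: parse_pages, page_root, and extract_per_page are hypothetical helper names, the page count is read from the "Pages:" line printed by pdfinfo (another utility from the poppler-utils package), and the image numbering is assumed to restart at 000 on every run of pdfimages.

```shell
# Read pdfinfo output on stdin and print the page count.
parse_pages() {
    awk '/^Pages:/ { print $2 }'
}

# Hypothetical helper: build a per-page image root such as out/page-001.
page_root() {
    printf '%s/page-%03d\n' "${1%/}" "$2"
}

# Hypothetical wrapper: run pdfimages once per page so that every page
# gets its own file name prefix.
extract_per_page() {
    local pdf="$1" outdir="$2" pages page
    pages="$(pdfinfo "$pdf" | parse_pages)"
    [ -n "$pages" ] || return 1        # stop if the page count could not be read
    mkdir -p "$outdir" || return 1
    page=1
    while [ "$page" -le "$pages" ]; do
        pdfimages -f "$page" -l "$page" "$pdf" "$(page_root "$outdir" "$page")"
        page=$((page + 1))
    done
}

# Example call (not run here):
# extract_per_page /home/himanshu/Downloads/christmas_story.pdf /home/himanshu/Downloads/pdfimages
```

Under those assumptions, the images from the first page would get names starting with page-001-, those from the second page names starting with page-002-, and so on.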

Conclusion

PDFImages is definitely a handy tool if your work involves dealing with PDF files and the images they contain, and as you might have observed by now, it’s easy to learn as well as simple to use. To learn more about the tool, head to its man page.