Find and Remove Duplicate Files in Linux

It might seem unnecessary to worry about duplicate files when you have terabytes of storage. However, if you care about file organization, you’ll want to avoid duplicates on your Linux system. You can find and remove duplicate files either via the command line or with a specialized desktop app.

In case you’re not familiar with the find command, you can learn about it in our guide. By combining find with other essential Linux commands, like xargs, we can get a list of duplicate files in a folder (and all its subfolders). The command first groups files by size, then compares the MD5 hashes of same-sized files. An MD5 hash is a short fingerprint calculated from a file’s contents, so files with identical contents produce identical hashes. To scan for duplicate files, open your console, navigate to the desired folder and type:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

This one-liner does the following:

find -not -empty -type f -printf "%s\n" – looks for regular files which are not empty and prints their size in bytes.

sort -rn – sorts the file sizes numerically, largest first, so that equal sizes end up on adjacent lines.

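uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 – uniq -d keeps only the sizes that occur more than once. For each of those sizes, xargs runs find again to list every file of exactly that size (the c suffix means bytes), separating the file names with null characters so that names containing spaces are handled safely.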

xargs -0 md5sum | sort – calculates the MD5 hash of every candidate file and sorts the results, so that files with the same hash end up on adjacent lines.

uniq -w32 --all-repeated=separate – compares only the first 32 characters of each line (the MD5 hash itself) and prints every line whose hash appears more than once, with a blank line between groups of duplicates.
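The steps above can also be written as a small shell function, which is easier to read and reuse than the one-liner. This is only a sketch: the names find_dupes and list_extra_copies are made up for illustration, and it assumes the GNU versions of find, xargs, uniq and md5sum (including the GNU-specific xargs -r option, which skips running a command when there is no input). The second helper simply prints every file in a group except the first, as a list to review by hand; it does not delete anything.

```shell
# find_dupes DIR
# Hypothetical helper: lists groups of files under DIR (default: current
# folder) that share the same MD5 hash. Assumes GNU find/xargs/uniq/md5sum.
find_dupes() {
    # Step 1: print the size in bytes of every non-empty regular file,
    # sort the sizes, and keep only the sizes that occur more than once.
    find "${1:-.}" -type f -not -empty -printf '%s\n' |
        sort -rn |
        uniq -d |
        # Step 2: for each repeated size, list the files of exactly that size.
        xargs -r -I{} find "${1:-.}" -type f -size {}c -print0 |
        # Step 3: hash the candidates; sorting puts equal hashes side by side.
        xargs -r -0 md5sum |
        sort |
        # Step 4: keep lines whose first 32 characters (the hash) repeat,
        # with a blank line between groups.
        uniq -w32 --all-repeated=separate
}

# list_extra_copies
# Hypothetical helper: reads the grouped output of find_dupes on standard
# input and prints the path of every file except the first in each group.
# Assumes ordinary md5sum lines: 32 hash characters, two separator
# characters, then the path (names with newlines or backslashes are
# escaped by md5sum and are not handled here).
list_extra_copies() {
    awk '/^$/ { seen = 0; next }
         seen { print substr($0, 35) }
              { seen = 1 }'
}
```

For example, find_dupes ~/Documents prints the duplicate groups, and find_dupes ~/Documents | list_extra_copies prints only the extra copies, which you can then inspect before deciding what to delete.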

Note that this command doesn’t automatically remove duplicates – it only outputs a list, and you can delete files manually if you want. If you prefer to manage your files in an application that offers more options at once, the next solution might suit you.

DupeGuru is a cross-platform application that comes in three editions: Standard (SE), Music and Picture. It’s designed to find duplicate files based on multiple criteria (file names, file size, MD5 hashes) and uses fuzzy-matching to detect similar files. Windows and OS X users can download the installation files from the official website, and Ubuntu users can pull dupeGuru from the repository:

sudo add-apt-repository ppa:hsoft/ppa
sudo apt-get update
sudo apt-get install dupeguru

To search for duplicates, first add some folders by pressing the “+” button. Setting a folder state to “Reference” means that other folders’ contents are compared to it. Before clicking “Scan,” check the “View -> Preferences” dialog to ensure that everything is properly set up.

“Scan Type” varies across dupeGuru editions; in Standard, you can compare files and folders by contents and filename. Picture edition offers comparison by EXIF timestamp and “Picture blocks” – a time-consuming option that divides each picture into a grid and calculates the average color for every tile. In Music edition, you can analyze “Fields,” “Tags” and “Audio content.” Some settings depend on the scan type: “Word weighting” and “Match similar words” work only when you search for file names. Conversely, “Filter Hardness” doesn’t apply when you perform a “Contents” scan.

DupeGuru can ignore small files and links (shortcuts) to a file, and lets you use regular expressions to further customize your query. You can also save search results to work on them later. Apple fans will love the fact that dupeGuru supports iPhoto and Aperture libraries and can manage iTunes libraries.

When dupeGuru finds duplicates, a new window opens with reference files colored in blue and their duplicates listed below. The toolbar displays basic information, and you can see more about every file if you select it and click the “Details” button.

You can manage duplicate files directly from dupeGuru – the “Actions” menu shows everything you can do. Select files by ticking the checkbox or clicking their name; you can select all or multiple files using keyboard shortcuts (hold Shift/Ctrl and click on desired files). If you’re interested in differences between duplicate files, toggle Delta Values. The results can be re-prioritized (so the files listed as dupes become references) and sorted according to various criteria like modification date and size. The official dupeGuru user guide is helpful and clearly written, so you can rely on it if you ever get stuck.

Naturally, it would be more practical if dupeGuru weren’t split into three editions – after all, most users love one-stop solutions. Still, if you don’t want to use the find command, dupeGuru provides a neat and quick way to eradicate dupes from your filesystem. Can you recommend some other tools for removing duplicate files? Do you prefer the command line for this task? Tell us in the comments.