Find and Remove Duplicate Files in Linux

It might seem unnecessary to worry about duplicate files when you have terabytes of storage. However, if you care about file organization, you’ll want to avoid duplicates on your Linux system. You can find and remove duplicate files either via the command line or with a specialized desktop app.

duplicates-find-command

In case you’re not familiar with the powerful find command, you can learn about it in our guide. By combining find with other essential Linux commands, like xargs, we can get a list of duplicate files in a folder (and all its subfolders). The command first compares files by size, then checks their MD5 hashes, checksums that are, for practical purposes, unique to each file’s contents. To scan for duplicate files, open your console, navigate to the desired folder and type:
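Assembled from the steps explained below, the pipeline looks like this (shown here against a throwaway directory containing two identical files and one unique file, so the expected result is easy to check):

```shell
# Create a scratch directory with two identical files and one unique file.
tmp=$(mktemp -d)
printf 'same content\n' > "$tmp/a.txt"
printf 'same content\n' > "$tmp/b.txt"
printf 'different\n'    > "$tmp/c.txt"

# List duplicate files: group by size first, then confirm with MD5 hashes.
dupes=$(cd "$tmp" && find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d \
    | xargs -I{} -n1 find . -type f -size {}c -print0 \
    | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate)
echo "$dupes"
```

When run from inside the folder you want to scan, the core is just the pipeline inside `$( … )`; the surrounding lines only build a reproducible example. Only a.txt and b.txt show up in the output, since c.txt has no twin.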

This one-liner does the following:

find -not -empty -type f -printf "%s\n" – looks for regular files which are not empty and prints the size of each in bytes.

sort -rn – sorts the file sizes numerically in descending order, so equal sizes end up on adjacent lines.

uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 – keeps only the sizes that occur more than once, then runs find again to print (null-terminated) the names of all files with each of those sizes.

xargs -0 md5sum | sort – computes the MD5 hash of every candidate file and sorts the results so that identical hashes appear on adjacent lines.

uniq -w32 --all-repeated=separate – compares only the first 32 characters of each line (the MD5 hash) and prints the groups that repeat – the actual duplicates – separated by blank lines.

Note that this command doesn’t automatically remove duplicates – it only outputs a list, and you can delete files manually if you want. If you prefer to manage your files in an application that offers more options at once, the next solution might suit you.
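If, after reviewing the list, you do want to clean up, the same pipeline can feed a small awk filter that keeps the first file of each hash group and prints an rm command for every extra copy. This is a sketch that assumes filenames contain no spaces (md5sum output is split on whitespace), and it only prints the commands rather than executing them, so you can inspect the plan first:

```shell
# Build a small test set: two identical files, one unique file.
tmp=$(mktemp -d)
printf 'same\n'  > "$tmp/a.txt"
printf 'same\n'  > "$tmp/b.txt"
printf 'other\n' > "$tmp/c.txt"

# Hash the size-collisions as before, then print an "rm" command for
# every file after the first one in each hash group.
plan=$(cd "$tmp" && find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d \
    | xargs -I{} -n1 find . -type f -size {}c -print0 \
    | xargs -0 md5sum | sort \
    | awk 'seen[$1]++ { print "rm \"" $2 "\"" }')
echo "$plan"
```

Pipe the printed commands to sh only once you are satisfied that nothing on the list is a false positive.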

DupeGuru is a cross-platform application that comes in three editions: Standard (SE), Music and Picture. It’s designed to find duplicate files based on multiple criteria (file names, file size, MD5 hashes) and uses fuzzy matching to detect similar files. Windows and OS X users can download the installation files from the official website, and Ubuntu users can install dupeGuru from the developer’s repository.

duplicates-dupeguru-search

To search for duplicates, first add some folders by pressing the “+” button. Setting a folder state to “Reference” means that other folders’ contents are compared to it. Before clicking “Scan,” check the “View -> Preferences” dialog to ensure that everything is properly set up.

duplicates-dupeguru-preferences

“Scan Type” varies across dupeGuru editions; in Standard, you can compare files and folders by contents and filename. Picture edition offers comparison by EXIF timestamp and “Picture blocks” – a time-consuming option that divides each picture into a grid and calculates the average color for every tile. In Music edition, you can analyze “Fields,” “Tags” and “Audio content.” Some settings depend on the scan type: “Word weighting” and “Match similar words” work only when you search for file names. Conversely, “Filter Hardness” doesn’t apply when you perform a “Contents” scan.

DupeGuru can ignore small files and links (shortcuts) to a file, and lets you use regular expressions to further customize your query. You can also save search results to work on them later. Apple fans will love the fact that dupeGuru supports iPhoto and Aperture libraries and can manage iTunes libraries.

duplicates-dupeguru-details

When dupeGuru finds duplicates, a new window opens with reference files colored in blue and their duplicates listed below. The toolbar displays basic information, and you can see more about every file if you select it and click the “Details” button.

duplicates-dupeguru-actions

You can manage duplicate files directly from dupeGuru – the “Actions” menu shows everything you can do. Select files by ticking the checkbox or clicking their name; you can select all or multiple files using keyboard shortcuts (hold Shift/Ctrl and click on desired files). If you’re interested in differences between duplicate files, toggle Delta Values. The results can be re-prioritized (so the files listed as dupes become references) and sorted according to various criteria like modification date and size. The official dupeGuru user guide is helpful and clearly written, so you can rely on it if you ever get stuck.

Naturally, it would be more practical if dupeGuru weren’t split into three editions – after all, most users love one-stop solutions. Still, if you don’t want to use the find command, dupeGuru provides a neat and quick way to eradicate dupes from your filesystem. Can you recommend some other tools for removing duplicate files? Do you prefer the command line for this task? Tell us in the comments.

12 comments

  1. Title should be “Remove Duplicate Files in UBUNTU” The install commands for dupeGURU do not work on any distro other than Ubuntu and derivatives.

    • The find command can be used on any Linux distribution, so this fact alone justifies the title (IMHO).

      As for dupeGuru, there’s a link to the official website in the article, where even Windows and Mac users can download installation packages. You can also build dupeGuru from source on any distribution; downloads and explanation here: https://github.com/hsoft/dupeguru
      Hope this helps! :)

    • Great choice! FSlint can do much more than just remove duplicate files, so it’s a really good tool to have.

  2. LOVE the use of the md5 in the first command. I think that is something that is very underused by sysadmins. When I first learned about what it was and how to use, I found many ways to use md5, especially when downloading files, copying files, rsync files, backup oracle dumps, etc (link below on an article I wrote re: using hashing as a sysadmin). Also, definitely a good idea to just get an output so you can eyeball everything before haphazardly removing something that somehow is a false positive duplicate.

    http://geekswing.com/geek/the-magic-of-hash-and-i-mean-of-the-md5-and-sha-1-vintage/

    • Thanks for your comment! I agree – md5 is quite a reliable method for finding duplicates. Even the aforementioned FSlint uses it :)

      • Yup. And thanks for writing this article. While I am a command line guy (and for scripting and cron scripts it is very helpful), the gui tool you mentioned (dupeguru) is going to be very helpful for a lot of users esp since it seems quite powerful too. Thanks again for this writeup!

  3. You make things much harder on the command line than you need to. Every Linux distro I have tried out has fdupes available in the repositories which takes care of the locating, hashing and comparing operations by itself and is much faster than anything that is piping through so many other utilities AND uses xargs. Talk about an inefficient time-suck. If you just want a list, you can redirect the output or you can have it automatically handle the deletions for you.

    • I wanted to show a way to do this by specifically using the find command since we’ve recently had an article about it. Of course, it’s just *a way* – one of many – to find duplicates; I never claimed it was the best one! :)

      Fdupes is probably a more elegant solution, but having two command-line tools and just one GUI tool in such a short text would be a bit unbalanced. Perhaps we could take a look at Fdupes and FSlint in one of our future texts. :)

  4. An example of a good tool that could have been used is the DuplicateFilesDeleter program

