
Dupinator

important script information
company name:
code.activestate.com
license: Free
minimum requirements: Python
functional limitations:
Dupinator description
Point this script at one or more folders and it will find and delete all duplicate files within them, leaving behind the first file found in each set of duplicates. It is designed to handle hundreds of thousands of files of any size at a time and to do so quickly. It was written to eliminate duplicates across several photo libraries that had been shared between users. As the script was a one-off to solve a very particular problem, there are no options, nor is it refactored into any kind of modules or reusable functions.

The script uses a multipass approach to finding duplicate files. First, it walks all of the directories passed in and groups all files by size. In the next pass, the script walks each set of files of the same size and checksums the first 1024 bytes. Finally, the script walks each set of files that are the same size with the same hash of the first 1024 bytes and checksums each file in its entirety.

The very last step is to walk each set of files of the same length/hash and delete all but the first file in the set.
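
The passes described above can be sketched roughly as follows. This is an illustrative reconstruction, not the Dupinator source: the function names (find_duplicates, delete_duplicates, and the two helpers), the choice of SHA-256 as the checksum, and the dry_run safeguard are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def _digest(path, limit=None, chunk_size=1 << 20):
    """Hash the first `limit` bytes of a file, or the whole file if limit is None.

    SHA-256 is an assumption; the description does not name the checksum used.
    """
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = chunk_size if remaining is None else min(chunk_size, remaining)
            block = f.read(n)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.hexdigest()


def _regroup(groups, key):
    """Split each group by key(path) and keep only subgroups of two or more files."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_duplicates(folders):
    """Return lists of paths whose contents hash identically (hypothetical helper)."""
    all_files = []
    for folder in folders:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks so a link and its target are not treated as duplicates.
                if os.path.isfile(path) and not os.path.islink(path):
                    all_files.append(path)

    groups = _regroup([all_files], os.path.getsize)                # pass 1: size
    groups = _regroup(groups, lambda p: _digest(p, limit=1024))    # pass 2: first 1024 bytes
    groups = _regroup(groups, _digest)                             # pass 3: whole file
    return groups


def delete_duplicates(groups, dry_run=True):
    """Remove all but the first file in each group; only report them when dry_run is True."""
    removed = []
    for group in groups:
        for path in group[1:]:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Each pass only reads files that survived the previous one, so most files are never read past their size or first 1024 bytes. Calling delete_duplicates(find_duplicates(["/some/folder"])) with the default dry_run=True lists what would be removed without deleting anything; the path here is a placeholder.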

The script was run against a 3.5 gigabyte set of about 120,000 files, of which about 50,000 were duplicates, most of them over 1 megabyte. The total run took about 2 minutes on a 1.33 GHz G4 PowerBook. Fast enough for me, and fast enough without actually optimizing anything beyond the obvious.
Similar scripts
Convert PDF to TIFF: This script is a very short code snippet illustrating how to convert individual pages of PDF documents to TIFF files, one TIFF file per page. It works only on Mac OS X with PyObjC installed. As a recipe the code is ...
Count PDF pages: The Count PDF pages script is a simple, pure-Python way to count the pages of a PDF.
Iterate over .MP4 atoms: This script yields the atoms contained in an MP4 file. It is mostly used for extracting the tags contained in it (artist, title, etc.) using a convenience class (M4ATags). This script could be implemented as a generator.
Counting pages of PDF documents on Mac OS X: Given that PDF is a "native" data format on Mac OS X, it is very easy to get access to some properties of such documents. One is the number of pages. Using Python the necessary code to do this is ...
Disk: This script provides a simple simulation of secondary memory and is primarily designed to provide a driver interface to a virtual hard drive. The interface is simple and allows the simulation of IO errors. Also provided are methods that allow ...
Cross Platform Excel Parsing With Xlrd: This script easily extracts data from Microsoft Excel files using a wrapper class for xlrd. The class allows you to create a generator which returns Excel data one row at a time as either a list or dictionary. This script ...
A Singleton log file creator: This class is a basic Singleton log file creator. It allows separate classes/modules to log their activities to the same file (even the same line if they want to). This is a quite basic log file creator, intended to assist in ...
Backup your files: The Backup your files script makes backup versions of your files. It can also be used for non-Python source code.
A Python script to test download mirrors: The concept of the script is straightforward: read the mirrors page from RedHat's web site, make a list of all the mirrors, test how long it takes to download from each, and present a sorted list of the results.

The first ...

supported os
All
stats
downloads 14
version 1.0
size in Kb
popularity   
1066/371346
user rating 5/10