API Documentation

Find Dupes Fast By Stephan Sokolow (ssokolow.com)

A simple script which identifies duplicate files several orders of magnitude more quickly than fdupes by using smarter algorithms.


Todo: Figure out how to do ePyDoc-style grouping here without giving up automodule-level comfort.

fastdupes.CHUNK_SIZE = 65536

Size for chunked reads from file handles

fastdupes.DEFAULTS = {'min_size': 25, 'exclude': ['*/.svn', '*/.bzr', '*/.git', '*/.hg'], 'delete': False}

Default settings used by optparse and some functions

fastdupes.HEAD_SIZE = 16384

Limit how many bytes will be read to compare headers

class fastdupes.OverWriter(fobj)[source]

Bases: object

Output helper for handling overdrawing the previous line cleanly.

write(text, newline=False)[source]

Use \r to overdraw the current line with the given text.

This function transparently handles tracking how much overdrawing is necessary to erase the previous line when used consistently.

Parameters:
  • text (str) – The text to be output
  • newline (bool) – Whether to start a new line and reset the length count.
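For illustration, the overdraw bookkeeping described above can be sketched as follows (a hypothetical minimal version, not the class's actual code):

```python
class OverWriterSketch:
    """Minimal sketch of \\r-based line overdrawing (illustrative only)."""

    def __init__(self, fobj):
        self.fobj = fobj
        self.last_len = 0  # length of the previously drawn line

    def write(self, text, newline=False):
        # Pad with spaces so leftovers from a longer previous line are erased
        padding = ' ' * max(0, self.last_len - len(text))
        self.fobj.write('\r' + text + padding)
        if newline:
            self.fobj.write('\n')
            self.last_len = 0
        else:
            self.last_len = len(text)
        self.fobj.flush()
```

Used consistently, each write() erases whatever the previous, possibly longer, line left behind.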
fastdupes.compareChunks(handles, chunk_size=65536)[source]

Group a list of file handles based on equality of the next chunk of data read from them.

Parameters:
  • handles – A list of open handles for file-like objects with potentially-identical contents.
  • chunk_size – The amount of data to read from each handle every time this function is called.

Returns:

Two lists of lists:

  • Lists to be fed back into this function individually
  • Finished groups of duplicate paths. (including unique files as single-file lists)

Return type:

(list, list)


Note: File handles will be closed when no longer needed.


Todo: Discard chunk contents immediately once they’re no longer needed.
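The regrouping contract described above can be sketched like this (an illustrative approximation that assumes each entry pairs a path with its open handle; not the function's actual code):

```python
def compare_chunks_sketch(handles, chunk_size=65536):
    """Illustrative regrouping by the next chunk read from each handle.

    Assumes each entry in `handles` is a (path, file_object) pair.
    Returns (still_alike, finished): groups that read identical,
    non-empty chunks go back for another round; groups that hit EOF,
    and single-file groups, are finished.
    """
    groups = {}
    for path, handle in handles:
        groups.setdefault(handle.read(chunk_size), []).append((path, handle))

    still_alike, finished = [], []
    for chunk, group in groups.items():
        if not chunk:
            # EOF reached: these files matched all the way through
            finished.append([path for path, _ in group])
            for _, handle in group:
                handle.close()
        elif len(group) == 1:
            # Unique prefix: a single-file "group", per the docs
            finished.append([group[0][0]])
            group[0][1].close()
        else:
            still_alike.append(group)
    return still_alike, finished
```

Feeding the first return value back in until it is empty yields a full byte-for-byte comparison.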

fastdupes.delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False)[source]

Code to handle the --delete command-line option.

Parameters:
  • groups (iterable) – A list of groups of paths.
  • prefer_list – A whitelist to be compiled by multiglob_compile() and used to skip some prompts.
  • interactive (bool) – If False, assume the user wants to keep all copies when a prompt would otherwise be displayed.
  • dry_run (bool) – If True, only pretend to delete files.


Todo: Add a secondary check for symlinks for safety.
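The documented behaviour can be sketched as a skeleton like the following (illustrative only; prefer_rx stands in for the compiled whitelist and prompt for the interactive chooser, both hypothetical names):

```python
import os

def delete_dupes_sketch(groups, prefer_rx=None, interactive=True,
                        dry_run=False, prompt=None):
    """Illustrative skeleton of --delete handling (an approximation of
    delete_dupes, not its actual code)."""
    for group in groups:
        # Whitelisted paths are kept without prompting
        keep = [p for p in group if prefer_rx and prefer_rx.match(p)]
        if not keep:
            if interactive and prompt:
                keep = prompt(group)
            else:
                keep = list(group)  # assume the user wants to keep all copies
        for path in group:
            if path in keep:
                continue
            if dry_run:
                print("Would delete %s" % path)
            else:
                os.remove(path)
```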

fastdupes.find_dupes(paths, exact=False, ignores=None, min_size=0)[source]

High-level code to walk a set of paths and find duplicate groups.


Returns:

A list of groups of files with identical contents

Return type:

[[path, ...], [path, ...]]
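The coarse-to-fine filtering suggested by HEAD_SIZE and the size/hash classifiers can be sketched as follows (an assumed pipeline for illustration; find_dupes' real implementation may differ):

```python
import hashlib
import os
from collections import defaultdict

def find_dupes_sketch(paths, head_size=16384):
    """Illustrative size -> header-hash -> full-hash pipeline
    (assumed to mirror the coarse-to-fine filtering, not actual code)."""
    def group_by(items, key):
        groups = defaultdict(list)
        for item in items:
            groups[key(item)].append(item)
        # Only groups with 2+ members can still contain duplicates
        return [g for g in groups.values() if len(g) > 1]

    def hash_head(path, limit):
        digest = hashlib.sha1()
        with open(path, 'rb') as handle:
            digest.update(handle.read(limit) if limit else handle.read())
        return digest.digest()

    # Each stage is cheap relative to the next and prunes candidates early
    groups = group_by(paths, os.path.getsize)
    groups = [g for grp in groups
              for g in group_by(grp, lambda p: hash_head(p, head_size))]
    groups = [g for grp in groups
              for g in group_by(grp, lambda p: hash_head(p, None))]
    return groups
```

Cheap stat() calls eliminate most candidates before any file contents are read, which is where the speedup over naive approaches comes from.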

fastdupes.getPaths(roots, ignores=None)[source]

Recursively walk a set of paths and return a listing of contained files.

Parameters:
  • roots (list of str) – Relative or absolute paths to files or folders.
  • ignores (list of str) – A list of fnmatch globs to avoid walking and omit from results.

Returns:

Absolute paths to files only.

Return type:

list of str


Todo: Try to optimize the ignores matching. Running a regex on every filename is a fairly significant percentage of the time taken according to the profiler.
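A walk of this shape can be sketched as follows (an illustrative approximation of getPaths, not its actual code):

```python
import fnmatch
import os
import re

def get_paths_sketch(roots, ignores=None):
    """Illustrative recursive walk with fnmatch-style ignores."""
    # One combined regex in the spirit of multiglob_compile();
    # '(?!)' is a pattern that can never match (for empty ignores)
    ignore_re = re.compile('|'.join(
        fnmatch.translate(glob) for glob in (ignores or [])) or '(?!)')
    results = []
    for root in roots:
        root = os.path.realpath(root)
        if os.path.isfile(root):
            results.append(root)
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune ignored directories in place so walk() skips them
            dirnames[:] = [d for d in dirnames
                           if not ignore_re.match(os.path.join(dirpath, d))]
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not ignore_re.match(path):
                    results.append(path)
    return results
```

Pruning dirnames in place is what prevents os.walk() from descending into ignored directories at all.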

fastdupes.groupBy(groups_in, classifier, fun_desc='?', keep_uniques=False, *args, **kwargs)[source]

Subdivide groups of paths according to a function.

Parameters:
  • groups_in (dict of iterables) – Grouped sets of paths.
  • classifier (function(list, *args, **kwargs) -> str) – Function to group a list of paths by some attribute.
  • fun_desc (str) – Human-readable term for what the classifier operates on. (Used in log messages)
  • keep_uniques (bool) – If False, discard groups with only one member.

Returns:

A dict mapping classifier keys to groups of matches.

Return type:

dict

Note: Grouping functions generally use sets for groups as extra protection against accidentally counting a given file twice. (Complementary to the use of os.path.realpath() in getPaths())


Todo: Find some way to bring back the file-by-file status text.
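The subdivision step can be sketched as follows (illustrative only; it assumes classifiers of the groupify-decorated shape that return {key: set(paths)} mappings):

```python
def group_by_sketch(groups_in, classifier, keep_uniques=False,
                    *args, **kwargs):
    """Illustrative subdivision of grouped paths by a classifier
    (an approximation of groupBy, not its actual code)."""
    groups_out = {}
    for paths in groups_in.values():
        for key, group in classifier(list(paths), *args, **kwargs).items():
            # Sets guard against counting the same path twice
            groups_out.setdefault(key, set()).update(group)
    if not keep_uniques:
        # Groups with a single member cannot contain duplicates
        groups_out = {key: group for key, group in groups_out.items()
                      if len(group) > 1}
    return groups_out
```

Chaining this with successively more expensive classifiers (size, then hash) gives the overall filtering pipeline.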


Byte-for-byte comparison on an arbitrary number of files in parallel.

This operates by opening all files in parallel and comparing chunk-by-chunk. This has the following implications:

  • Reads the same total amount of data as hash comparison.
  • Performs a lot of disk seeks. (Best suited for SSDs)
  • Vulnerable to file handle exhaustion if used on its own.
Parameters:
  • paths (iterable) – List of potentially identical files.
Returns:

A dict mapping one path to a list of all paths (self included) with the same contents.


Todo: Start examining the while handles: block to figure out how to minimize thrashing in situations where read-ahead caching is active. Compare savings by read-ahead to savings due to eliminating false positives as quickly as possible. This is a 2-variable min/max problem.


Todo: Look into possible solutions for pathological cases of thousands of files with the same size and same pre-filter results. (File handle exhaustion)
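The chunk-by-chunk comparison loop described above can be sketched as a driver like this (illustrative only; the regrouping logic is an assumption, not the module's code — note how opening every path at once produces the documented file-handle exhaustion risk):

```python
def compare_files_sketch(paths, chunk_size=65536):
    """Illustrative parallel byte-for-byte comparison driver."""
    pending = [[(path, open(path, 'rb')) for path in paths]]
    finished = []
    while pending:
        group = pending.pop()
        # Regroup by the next chunk read from each handle
        chunks = {}
        for path, handle in group:
            chunks.setdefault(handle.read(chunk_size), []).append((path, handle))
        for chunk, subgroup in chunks.items():
            if not chunk or len(subgroup) == 1:
                # EOF (contents fully matched) or a unique prefix
                finished.append([p for p, _ in subgroup])
                for _, handle in subgroup:
                    handle.close()
            else:
                pending.append(subgroup)
    return finished
```

Because mismatched files are split into separate groups at the first differing chunk, false positives are eliminated as early as possible.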


fastdupes.groupify(function)[source]

Decorator to convert a function which takes a single value and returns a key into one which takes a list of values and returns a dict of key-group mappings.

Parameters:
  • function (function(value) -> key) – A function which takes a value and returns a hash key.
Return type:

function(iterable) -> {key: set([value, ...]), ...}
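The lifting described by that type signature can be sketched as follows (an illustrative version of the decorator, not its actual code; skipping values whose key is falsy is an assumption):

```python
import functools
from collections import defaultdict

def groupify_sketch(function):
    """Illustrative decorator: lift a value -> key function into one
    mapping a list of values to {key: set(values)}."""
    @functools.wraps(function)
    def wrapper(values, *args, **kwargs):
        groups = defaultdict(set)
        for value in values:
            key = function(value, *args, **kwargs)
            if key is not None:  # assumption: None means "skip this value"
                groups[key].add(value)
        return dict(groups)
    return wrapper
```

Decorated this way, a per-file classifier such as one returning a file's size becomes a batch classifier suitable for the grouping pipeline.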
fastdupes.hashClassifier(paths, *args, **kwargs)[source]

Sort a file into a group based on its SHA1 hash.

Parameters:
  • paths – See fastdupes.groupify()
  • limit (int) – Only this many bytes will be counted in the hash. Values which evaluate to False indicate no limit.

Returns:

See fastdupes.groupify()

fastdupes.hashFile(handle, want_hex=False, limit=None, chunk_size=65536)[source]

Generate a hash from a potentially long file. Digesting will obey CHUNK_SIZE to conserve memory.

Parameters:
  • handle – A file-like object or path to hash from.
  • want_hex (bool) – If True, returned hash will be hex-encoded.
  • limit (int) – Maximum number of bytes to read (rounded up to a multiple of CHUNK_SIZE)
  • chunk_size (int) – Size of read() operations in bytes.
Returns:

A binary or hex-encoded SHA1 hash.


Note: It is your responsibility to close any file-like objects you pass in.
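Chunked digesting of this kind can be sketched as follows (an illustrative approximation of hashFile, not its actual code):

```python
import hashlib

def hash_file_sketch(handle, want_hex=False, limit=None, chunk_size=65536):
    """Illustrative chunked SHA1 digest. Reads in chunk_size pieces so
    memory stays bounded; limit is honoured in whole chunks, matching
    the documented rounding up."""
    opened = isinstance(handle, str)
    if opened:
        handle = open(handle, 'rb')
    try:
        digest = hashlib.sha1()
        read = 0
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            read += len(chunk)
            if limit and read >= limit:
                break
        return digest.hexdigest() if want_hex else digest.digest()
    finally:
        if opened:
            handle.close()  # only close handles we opened ourselves
```

File-like objects passed in by the caller are left open, matching the note above.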


The main entry point, compatible with setuptools.

fastdupes.multiglob_compile(globs, prefix=False)[source]

Generate a single “A or B or C” regex from a list of shell globs.

Parameters:
  • globs (iterable of str) – Patterns to be processed by fnmatch.
  • prefix (bool) – If True, then match() will perform prefix matching rather than exact string matching.
Return type:

A compiled regular expression object
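The combination can be sketched as follows (illustrative only; the \Z-stripping trick assumes fnmatch.translate's CPython 3 output format and is not necessarily how multiglob_compile works):

```python
import fnmatch
import re

def multiglob_compile_sketch(globs, prefix=False):
    """Illustrative "A or B or C" regex built from shell globs."""
    if not globs:
        return re.compile('(?!)')  # a regex that can never match
    parts = []
    for glob in globs:
        pattern = fnmatch.translate(glob)
        if prefix:
            # Strip the end-of-string anchor so match() does prefix matching
            pattern = pattern.rsplit(r'\Z', 1)[0]
        parts.append('(?:%s)' % pattern)
    return re.compile('|'.join(parts))
```

Compiling all globs into one regex means each path is tested with a single match() call rather than one call per glob.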

Pretty-print the contents of DEFAULTS

fastdupes.pruneUI(dupeList, mainPos=1, mainLen=1)[source]

Display a list of files and prompt for ones to be kept.

The user may enter all or one or more numbers separated by spaces and/or commas.


Note: It is impossible to accidentally choose to keep none of the displayed files.

Parameters:
  • dupeList (list) – A list of duplicate file paths
  • mainPos (int) – Used to display “set X of Y”
  • mainLen (int) – Used to display “set X of Y”

Returns:

A list of files to be deleted.

Return type:

list
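Parsing a reply of that shape ("all", or numbers separated by spaces and/or commas, with no way to keep nothing) can be sketched as follows (a hypothetical helper for illustration, not the function's actual code):

```python
import re

def parse_keep_choice(reply, count):
    """Illustrative parsing of a pruneUI-style reply. Returns the set
    of 1-based indices to keep; never empty, matching the note above."""
    reply = reply.strip().lower()
    if reply == 'all':
        return set(range(1, count + 1))
    chosen = set()
    for token in re.split(r'[\s,]+', reply):
        # Ignore non-numeric and out-of-range tokens
        if token.isdigit() and 1 <= int(token) <= count:
            chosen.add(int(token))
    # Guard against keeping nothing: fall back to keeping everything
    return chosen or set(range(1, count + 1))
```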
fastdupes.sizeClassifier(paths, *args, **kwargs)[source]

Sort a file into a group based on on-disk size.

Parameters:
  • paths – See fastdupes.groupify()
  • min_size (int) – Files smaller than this size (in bytes) will be ignored.

Returns:

See fastdupes.groupify()


Todo: Rework the calling of stat() to minimize the number of calls. It’s a fairly significant percentage of the time taken according to the profiler.