API Documentation

Find Dupes Fast By Stephan Sokolow (ssokolow.com)

A simple script which identifies duplicate files several orders of magnitude more quickly than fdupes by using smarter algorithms.

Todo

Figure out how to do ePyDoc-style grouping here without giving up automodule-level comfort.

fastdupes.CHUNK_SIZE = 65536

Size for chunked reads from file handles

fastdupes.DEFAULTS = {'min_size': 25, 'exclude': ['*/.svn', '*/.bzr', '*/.git', '*/.hg'], 'delete': False}

Default settings used by optparse and some functions

fastdupes.HEAD_SIZE = 16384

Limit how many bytes will be read to compare headers

class fastdupes.OverWriter(fobj)

Bases: object

Output helper for handling overdrawing the previous line cleanly.

write(text, newline=False)

Use \r to overdraw the current line with the given text.

This function transparently handles tracking how much overdrawing is necessary to erase the previous line when used consistently.

Parameters:

text (`str`) – The text to be output.
newline (`bool`) – Whether to start a new line and reset the length count.
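
The overdrawing behaviour described above can be illustrated with a minimal sketch. `LineOverwriter` is a hypothetical stand-in written for this example, not the actual `OverWriter` source; it assumes the helper pads each line with spaces to cover the longest text drawn so far.

```python
import io


class LineOverwriter:
    """Hypothetical stand-in for an overdrawing output helper."""

    def __init__(self, fobj):
        self.fobj = fobj
        self.max_len = 0  # longest text drawn on the current line so far

    def write(self, text, newline=False):
        # Pad with spaces so leftovers from a longer previous line are erased.
        self.max_len = max(self.max_len, len(text))
        self.fobj.write("\r" + text.ljust(self.max_len))
        if newline:
            self.fobj.write("\n")
            self.max_len = 0  # a fresh line needs no padding


buf = io.StringIO()
out = LineOverwriter(buf)
out.write("Scanning 1000 files")
out.write("Done")
out.write("Finished", newline=True)
```

Writing to an in-memory buffer keeps the example testable; passing `sys.stderr` instead would show the status line being redrawn in a terminal.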

fastdupes.compareChunks(handles, chunk_size=65536)

Group a list of file handles based on equality of the next chunk of data read from them.

Parameters:

handles – A list of open handles for file-like objects with potentially identical contents.
chunk_size – The amount of data to read from each handle every time this function is called.

Returns:

Two lists of lists:

Lists to be fed back into this function individually
Finished groups of duplicate paths (including unique files as single-file lists)

Return type:

(list, list)

Attention

File handles will be closed when no longer needed

Todo

Discard chunk contents immediately once they’re no longer needed
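
A minimal sketch of the chunk-grouping step, under stated assumptions: `compare_next_chunk` is a hypothetical name, and the input here is a list of `(name, handle)` pairs so that finished groups can be reported by name. The real `compareChunks()` may track paths differently.

```python
import io
from collections import defaultdict


def compare_next_chunk(handles, chunk_size=65536):
    """Split (name, handle) pairs by their next chunk.

    Returns (unfinished, finished): groups that still need comparing,
    and groups of names whose comparison is complete.
    """
    by_chunk = defaultdict(list)
    for name, handle in handles:
        by_chunk[handle.read(chunk_size)].append((name, handle))

    unfinished, finished = [], []
    for chunk, group in by_chunk.items():
        # A group is done when it is alone or every member hit end-of-file.
        if len(group) == 1 or not chunk:
            for _, handle in group:
                handle.close()
            finished.append([name for name, _ in group])
        else:
            unfinished.append(group)
    return unfinished, finished


files = [("a", io.BytesIO(b"xxyy")),
         ("b", io.BytesIO(b"xxyy")),
         ("c", io.BytesIO(b"xxzz"))]
pending, done = [files], []
while pending:
    more, finished = compare_next_chunk(pending.pop(), chunk_size=2)
    pending.extend(more)
    done.extend(finished)
```

The driver loop feeds each unfinished group back in individually, mirroring the "lists to be fed back into this function" contract described above.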

fastdupes.delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False)

Code to handle the --delete command-line option.

Parameters:

groups (iterable) – A list of groups of paths.
prefer_list – A whitelist to be compiled by `multiglob_compile()` and used to skip some prompts.
interactive (`bool`) – If `False`, assume the user wants to keep all copies when a prompt would otherwise be displayed.
dry_run (`bool`) – If `True`, only pretend to delete files.

Todo

Add a secondary check for symlinks for safety.

fastdupes.find_dupes(paths, exact=False, ignores=None, min_size=0)

High-level code to walk a set of paths and find duplicate groups.

Parameters:

exact (`bool`) – Whether to compare file contents by hash or by reading chunks in parallel.
paths – See `getPaths()`
ignores – See `getPaths()`
min_size – See `sizeClassifier()`

Returns:	A list of groups of files with identical contents
Return type:	`[[path, ...], [path, ...]]`

fastdupes.getPaths(roots, ignores=None)

Recursively walk a set of paths and return a listing of contained files.

Parameters:

roots (`list` of `str`) – Relative or absolute paths to files or folders.
ignores (`list` of `str`) – A list of `fnmatch` globs to avoid walking and omit from results

Returns:	Absolute paths to only files.
Return type:	`list` of `str`

Todo

Try to optimize the ignores matching. Running a regex on every filename is a fairly significant percentage of the time taken according to the profiler.
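
As a rough illustration of the walk-and-ignore behaviour, here is a sketch built only on `os.walk` and `fnmatch`. `list_files` is a hypothetical name, and matching each path directly with `fnmatch.fnmatch` is an assumption made for brevity; the Todo above indicates that the real code matches with a compiled regex.

```python
import fnmatch
import os
import tempfile


def list_files(roots, ignores=None):
    """Return absolute paths of files under roots, skipping ignored globs."""
    ignores = ignores or []

    def ignored(path):
        return any(fnmatch.fnmatch(path, glob) for glob in ignores)

    found = []
    for root in roots:
        root = os.path.realpath(root)
        if os.path.isfile(root):
            if not ignored(root):
                found.append(root)
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune in place so os.walk never descends into ignored folders.
            dirnames[:] = [d for d in dirnames
                           if not ignored(os.path.join(dirpath, d))]
            found.extend(os.path.join(dirpath, name) for name in filenames
                         if not ignored(os.path.join(dirpath, name)))
    return found


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, ".git"))
    for rel in ("keep.txt", os.path.join(".git", "config")):
        with open(os.path.join(tmp, rel), "w") as fobj:
            fobj.write("data")
    names = [os.path.basename(p) for p in list_files([tmp], ["*/.git"])]
```

Pruning `dirnames` in place is what keeps the walk out of ignored folders, rather than filtering their contents afterwards.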

fastdupes.groupBy(groups_in, classifier, fun_desc='?', keep_uniques=False, *args, **kwargs)

Subdivide groups of paths according to a function.

Parameters:

groups_in (`dict` of iterables) – Grouped sets of paths.
classifier (`function(list, args, kwargs) -> str`) – Function to group a list of paths by some attribute.
fun_desc (`str`) – Human-readable term for what the classifier operates on. (Used in log messages)
keep_uniques (`bool`) – If `False`, discard groups with only one member.

Returns:	A dict mapping classifier keys to groups of matches.
Return type:	`dict`
Attention:	Grouping functions generally use a `set` for `groups` as extra protection against accidentally counting a given file twice. (Complementary to use of `os.path.realpath()` in `getPaths()`)

Todo

Find some way to bring back the file-by-file status text
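
The subdivision step might look roughly like the sketch below. `subdivide` and `by_extension` are hypothetical names, and keying the output by `(old_key, new_key)` pairs is an assumption made here so that subgroups from different parent groups cannot collide; the real function's keys may differ.

```python
def subdivide(groups_in, classifier, keep_uniques=False):
    """Split every input group with classifier and merge the results."""
    groups_out = {}
    for old_key, paths in groups_in.items():
        for new_key, subgroup in classifier(paths).items():
            if keep_uniques or len(subgroup) > 1:
                # Combine keys so groups from different parents stay apart.
                groups_out[(old_key, new_key)] = subgroup
    return groups_out


def by_extension(paths):
    """Toy classifier: group names by the text after the last dot."""
    groups = {}
    for path in paths:
        groups.setdefault(path.rsplit(".", 1)[-1], set()).add(path)
    return groups


by_size = {10: {"a.txt", "b.txt", "c.png"}, 20: {"d.txt"}}
result = subdivide(by_size, by_extension)
with_uniques = subdivide(by_size, by_extension, keep_uniques=True)
```

With `keep_uniques=False`, single-member subgroups are dropped as soon as they appear, so later and more expensive classifiers see fewer files.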

fastdupes.groupByContent(paths)

Byte-for-byte comparison on an arbitrary number of files in parallel.

This operates by opening all files in parallel and comparing chunk-by-chunk. This has the following implications:

Reads the same total amount of data as hash comparison.

Performs a lot of disk seeks. (Best suited for SSDs)

Vulnerable to file handle exhaustion if used on its own.

Parameters:	paths (iterable) – List of potentially identical files.
Returns:	A dict mapping one path to a list of all paths (self included) with the same contents.

Todo

Start examining the `while handles:` block to figure out how to minimize thrashing in situations where read-ahead caching is active. Compare savings by read-ahead to savings due to eliminating false positives as quickly as possible. This is a 2-variable min/max problem.

Todo

Look into possible solutions for pathological cases of thousands of files with the same size and same pre-filter results. (File handle exhaustion)

fastdupes.groupify(function)

Decorator to convert a function which takes a single value and returns a key into one which takes a list of values and returns a dict of key-group mappings.

Parameters:	function (`function(value) -> key`) – A function which takes a value and returns a hash key.
Return type:	`function(iterable) -> {key: set([value, ...]), ...}`
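
A decorator with this shape can be sketched in a few lines. `groupify_sketch` and `first_letter` are hypothetical names for this example; forwarding extra `*args` and `**kwargs` to the wrapped function is an assumption based on the classifier signatures listed on this page.

```python
import functools


def groupify_sketch(function):
    """Turn function(value) -> key into function(values) -> {key: set}."""
    @functools.wraps(function)
    def wrapper(values, *args, **kwargs):
        groups = {}
        for value in values:
            key = function(value, *args, **kwargs)
            # A set guards against counting the same value twice.
            groups.setdefault(key, set()).add(value)
        return groups
    return wrapper


@groupify_sketch
def first_letter(word):
    return word[0]


grouped = first_letter(["apple", "avocado", "banana", "apple"])
```

The duplicated `"apple"` collapses into a single set member, which is the double-counting protection the set provides.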

fastdupes.hashClassifier(paths, *args, **kwargs)

Sort a file into a group based on its SHA1 hash.

Parameters:

paths – See `fastdupes.groupify()`
limit (`__builtins__.int`) – Only this many bytes will be counted in the hash. Values which evaluate to `False` indicate no limit.

Returns:	See `fastdupes.groupify()`

fastdupes.hashFile(handle, want_hex=False, limit=None, chunk_size=65536)

Generate a hash from a potentially long file. Digesting will obey CHUNK_SIZE to conserve memory.

Parameters:

handle – A file-like object or path to hash from.
want_hex (`bool`) – If `True`, returned hash will be hex-encoded.
limit (`int`) – Maximum number of bytes to read (rounded up to a multiple of `CHUNK_SIZE`)
chunk_size (`int`) – Size of `read()` operations in bytes.

Return type:	`str`
Returns:	A binary or hex-encoded SHA1 hash.

Note

It is your responsibility to close any file-like objects you pass in
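
Chunked hashing with an optional byte limit can be sketched with `hashlib` alone. `sha1_chunked` is a hypothetical name, and unlike the documented function this sketch accepts only file-like objects, not paths. Because it always consumes whole chunks, the limit is effectively rounded up to a multiple of the chunk size.

```python
import hashlib
import io


def sha1_chunked(handle, want_hex=False, limit=None, chunk_size=65536):
    """Hash a file-like object chunk by chunk, optionally stopping early."""
    digest = hashlib.sha1()
    read = 0
    # A falsy limit (None or 0) means "read to end of file".
    while not limit or read < limit:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        read += len(chunk)
    return digest.hexdigest() if want_hex else digest.digest()


data = b"a" * 10
full = sha1_chunked(io.BytesIO(data), want_hex=True)
# limit=3 with chunk_size=4 still consumes one whole 4-byte chunk.
head = sha1_chunked(io.BytesIO(data), want_hex=True, limit=3, chunk_size=4)
```

Only one chunk is held in memory at a time, so the cost of hashing a large file is bounded by `chunk_size` rather than by file size.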

fastdupes.main()

The main entry point, compatible with setuptools.

fastdupes.multiglob_compile(globs, prefix=False)

Generate a single “A or B or C” regex from a list of shell globs.

Parameters:

globs (iterable of `str`) – Patterns to be processed by `fnmatch`.
prefix (`bool`) – If `True`, then `match()` will perform prefix matching rather than exact string matching.
Return type:	`re.RegexObject`
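
One way to build such a combined regex is to join the output of `fnmatch.translate()` with `|`. The sketch below assumes Python 3.7 or later, where `translate()` emits scoped flags that are safe to concatenate. `compile_globs` is a hypothetical name, and appending `*` to each glob to get prefix matching is a shortcut chosen for this example rather than the documented implementation.

```python
import fnmatch
import re


def compile_globs(globs, prefix=False):
    """Combine shell globs into one alternation regex."""
    globs = list(globs)
    if not globs:
        return re.compile(r"(?!)")  # an empty list should match nothing
    if prefix:
        # "glob*" matches exactly the strings that start with a glob match.
        globs = [glob + "*" for glob in globs]
    return re.compile("|".join("(?:%s)" % fnmatch.translate(glob)
                               for glob in globs))


exact = compile_globs(["*/.git", "*/.svn"])
by_prefix = compile_globs(["*/.git"], prefix=True)
```

A single compiled alternation means each path is tested with one `match()` call instead of one call per glob.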

fastdupes.print_defaults()

Pretty-print the contents of DEFAULTS

fastdupes.pruneUI(dupeList, mainPos=1, mainLen=1)

Display a list of files and prompt for ones to be kept.

The user may enter `all`, or one or more numbers separated by spaces and/or commas.

Note

It is impossible to accidentally choose to keep none of the displayed files.

Parameters:

dupeList (`list`) – A list of duplicate file paths
mainPos (`int`) – Used to display “set X of Y”
mainLen (`int`) – Used to display “set X of Y”

Returns:	A list of files to be deleted.
Return type:	`list`
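
The input format described above could be parsed along these lines. `parse_keep_choice` is a hypothetical helper, and treating the displayed numbers as 1-based is an assumption. Returning `None` for empty or invalid input lets a caller re-prompt, which is one way to make keeping nothing impossible.

```python
import re


def parse_keep_choice(answer, count):
    """Return zero-based indexes to keep, or None if the answer is invalid."""
    answer = answer.strip().lower()
    if answer == "all":
        return list(range(count))
    # Accept spaces, commas, or any mix of the two as separators.
    tokens = [tok for tok in re.split(r"[\s,]+", answer) if tok]
    if not tokens or not all(tok.isdecimal() for tok in tokens):
        return None
    indexes = sorted({int(tok) - 1 for tok in tokens})
    if indexes[0] < 0 or indexes[-1] >= count:
        return None
    return indexes


keep = parse_keep_choice("1, 3", count=3)
to_delete = [path for i, path in enumerate(["a", "b", "c"]) if i not in keep]
```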

fastdupes.sizeClassifier(paths, *args, **kwargs)

Sort a file into a group based on on-disk size.

Parameters:

paths – See `fastdupes.groupify()`
min_size (`__builtins__.int`) – Files smaller than this size (in bytes) will be ignored.

Returns:	See `fastdupes.groupify()`

Todo

Rework the calling of stat() to minimize the number of calls. It’s a fairly significant percentage of the time taken according to the profiler.
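
Grouping by size with a minimum-size filter can be sketched as follows. `group_by_size` is a hypothetical name, and the sketch calls `os.stat()` once per path without the `groupify()` decorator, so it illustrates the idea rather than the actual implementation.

```python
import os
import tempfile
from collections import defaultdict


def group_by_size(paths, min_size=0):
    """Map each file size to the set of paths having that size."""
    groups = defaultdict(set)
    for path in paths:
        size = os.stat(path).st_size
        if size >= min_size:  # files smaller than min_size are ignored
            groups[size].add(path)
    return dict(groups)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, data in (("a", b"abc"), ("b", b"xyz"), ("c", b"q")):
        path = os.path.join(tmp, name)
        with open(path, "wb") as fobj:
            fobj.write(data)
        paths.append(path)
    sizes = group_by_size(paths, min_size=2)
    summary = {size: sorted(os.path.basename(p) for p in group)
               for size, group in sizes.items()}
```

Size is the cheapest attribute to compare, since it needs only file metadata and no reads of file contents.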