API Documentation
Find Dupes Fast
By Stephan Sokolow (ssokolow.com)
A simple script which identifies duplicate files several orders of magnitude more quickly than fdupes by using smarter algorithms.
Todo
Figure out how to do ePyDoc-style grouping here without giving up automodule-level comfort.
-
fastdupes.CHUNK_SIZE = 65536
Size for chunked reads from file handles
-
fastdupes.DEFAULTS = {'min_size': 25, 'exclude': ['*/.svn', '*/.bzr', '*/.git', '*/.hg'], 'delete': False}
Default settings used by optparse and some functions
-
fastdupes.HEAD_SIZE = 16384
Limit how many bytes will be read to compare headers
-
class fastdupes.OverWriter(fobj)
Bases: object
Output helper for handling overdrawing the previous line cleanly.
-
write(text, newline=False)
Use \r to overdraw the current line with the given text.
This function transparently handles tracking how much overdrawing is necessary to erase the previous line when used consistently.
Parameters:
- text (str) – The text to be outputted
- newline (bool) – Whether to start a new line and reset the length count.
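The overdrawing technique described for write() can be illustrated with a minimal Python 3 sketch. LineOverwriter and its prev_len attribute are hypothetical names chosen for this example; this is not the fastdupes implementation, only one plausible way to track how much of the previous line needs erasing.

```python
import io


class LineOverwriter:
    """Hypothetical stand-in for OverWriter; not the fastdupes code."""

    def __init__(self, fobj):
        self.fobj = fobj
        self.prev_len = 0  # length of the text currently on the line

    def write(self, text, newline=False):
        # Pad with spaces so leftovers from a longer previous line are erased.
        padding = " " * max(self.prev_len - len(text), 0)
        self.fobj.write("\r" + text + padding)
        if newline:
            self.fobj.write("\n")
            self.prev_len = 0
        else:
            self.prev_len = len(text)
        self.fobj.flush()


buf = io.StringIO()
out = LineOverwriter(buf)
out.write("Scanning 1000 files")
out.write("Done", newline=True)
```

Padding with spaces, rather than emitting a terminal escape sequence, keeps the sketch usable on any output stream that honours \r.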
-
-
fastdupes.compareChunks(handles, chunk_size=65536)
Group a list of file handles based on equality of the next chunk of data read from them.
Parameters:
- handles – A list of open handles for file-like objects with potentially-identical contents.
- chunk_size – The amount of data to read from each handle every time this function is called.
Returns: Two lists of lists:
- Lists to be fed back into this function individually
- Finished groups of duplicate paths. (including unique files as single-file lists)
Return type: (list, list)
Attention
File handles will be closed when no longer needed
Todo
Discard chunk contents immediately once they’re no longer needed
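As a rough illustration of grouping handles by their next chunk, here is a Python 3 sketch. compare_next_chunk is a hypothetical name, and unlike the documented function it returns groups of handles rather than paths; treat it as an approximation of the idea, not the actual code.

```python
import io
from collections import defaultdict


def compare_next_chunk(handles, chunk_size=65536):
    """Hypothetical sketch: split handles by the next chunk read from each."""
    by_chunk = defaultdict(list)
    for handle in handles:
        by_chunk[handle.read(chunk_size)].append(handle)

    unfinished, finished = [], []
    for chunk, group in by_chunk.items():
        if len(group) == 1 or chunk == b"":
            # Unique so far, or the whole group hit end-of-file together.
            finished.append(group)
            for handle in group:
                handle.close()
        else:
            # Still identical and more data remains: compare again later.
            unfinished.append(group)
    return unfinished, finished


a, b, c = io.BytesIO(b"same"), io.BytesIO(b"same"), io.BytesIO(b"diff")
todo, done = compare_next_chunk([a, b, c], chunk_size=4)
```

Each list in todo would be passed back into the function on its own until nothing unfinished remains.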
-
fastdupes.delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False)
Code to handle the --delete command-line option.
Parameters:
- groups (iterable) – A list of groups of paths.
- prefer_list – A whitelist to be compiled by multiglob_compile() and used to skip some prompts.
- interactive (bool) – If False, assume the user wants to keep all copies when a prompt would otherwise be displayed.
- dry_run (bool) – If True, only pretend to delete files.
Todo
Add a secondary check for symlinks for safety.
-
fastdupes.find_dupes(paths, exact=False, ignores=None, min_size=0)
High-level code to walk a set of paths and find duplicate groups.
Parameters:
- exact (bool) – Whether to compare file contents by hash or by reading chunks in parallel.
- paths – See getPaths()
- ignores – See getPaths()
- min_size – See sizeClassifier()
Returns: A list of groups of files with identical contents
Return type: [[path, ...], [path, ...]]
-
fastdupes.getPaths(roots, ignores=None)
Recursively walk a set of paths and return a listing of contained files.
Parameters:
- roots (list of str) – Relative or absolute paths to files or folders.
- ignores (list of str) – A list of fnmatch globs to avoid walking and omit from results
Returns: Absolute paths to only files.
Return type: list of str
Todo
Try to optimize the ignores matching. Running a regex on every filename is a fairly significant percentage of the time taken according to the profiler.
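The walk-and-ignore behaviour described for getPaths() might look roughly like the following Python 3 sketch. list_files is a hypothetical name, and matching each path directly with fnmatch.fnmatch is an assumption made for brevity; the Todo above suggests the real code compiles the globs into a regex instead.

```python
import fnmatch
import os
import tempfile


def list_files(roots, ignores=None):
    """Hypothetical sketch: recursively list files, skipping ignored globs."""
    ignores = ignores or []

    def ignored(path):
        return any(fnmatch.fnmatch(path, glob) for glob in ignores)

    found = []
    for root in roots:
        root = os.path.realpath(root)
        if os.path.isfile(root):
            if not ignored(root):
                found.append(root)
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune ignored directories in place so os.walk never enters them.
            dirnames[:] = [d for d in dirnames
                           if not ignored(os.path.join(dirpath, d))]
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not ignored(path):
                    found.append(path)
    return found


# Demo on a throwaway directory containing a.txt and .git/config.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, ".git"))
    for rel in ("a.txt", os.path.join(".git", "config")):
        with open(os.path.join(tmp, rel), "w") as fobj:
            fobj.write("x")
    kept = [os.path.basename(p) for p in list_files([tmp], ["*/.git"])]
```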
-
fastdupes.groupBy(groups_in, classifier, fun_desc='?', keep_uniques=False, *args, **kwargs)
Subdivide groups of paths according to a function.
Parameters:
- groups_in (dict of iterables) – Grouped sets of paths.
- classifier (function(list, *args, **kwargs) -> str) – Function to group a list of paths by some attribute.
- fun_desc (str) – Human-readable term for what the classifier operates on. (Used in log messages)
- keep_uniques (bool) – If False, discard groups with only one member.
Returns: A dict mapping classifier keys to groups of matches.
Return type: dict
Attention: Grouping functions generally use a set, groups, as extra protection against accidentally counting a given file twice. (Complementary to use of os.path.realpath() in getPaths())
Todo
Find some way to bring back the file-by-file status text
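A hedged Python 3 sketch of the subdivision step follows. subdivide and by_extension are hypothetical names; the sketch assumes the classifier returns a dict of key-to-set mappings (as a groupify()-style function would) and assumes composite (old key, new key) tuples for the output, neither of which is confirmed by the documentation above.

```python
def subdivide(groups_in, classifier, keep_uniques=False):
    """Hypothetical sketch: split each incoming group with a classifier."""
    groups_out = {}
    for key, paths in groups_in.items():
        for sub_key, members in classifier(paths).items():
            if keep_uniques or len(members) > 1:
                # Assumption: combine old and new keys so that groups from
                # different parents never collide.
                groups_out[(key, sub_key)] = members
    return groups_out


def by_extension(paths):
    """Toy classifier: group names by the text after the last dot."""
    groups = {}
    for path in paths:
        groups.setdefault(path.rsplit(".", 1)[-1], set()).add(path)
    return groups


result = subdivide({"all": ["a.txt", "b.txt", "c.png"]}, by_extension)
with_uniques = subdivide({"all": ["a.txt", "b.txt", "c.png"]}, by_extension,
                         keep_uniques=True)
```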
-
fastdupes.groupByContent(paths)
Byte-for-byte comparison on an arbitrary number of files in parallel.
This operates by opening all files in parallel and comparing chunk-by-chunk. This has the following implications:
- Reads the same total amount of data as hash comparison.
- Performs a lot of disk seeks. (Best suited for SSDs)
- Vulnerable to file handle exhaustion if used on its own.
Parameters: paths (iterable) – List of potentially identical files.
Returns: A dict mapping one path to a list of all paths (self included) with the same contents.
Todo
Start examining the while handles: block to figure out how to minimize thrashing in situations where read-ahead caching is active. Compare savings by read-ahead to savings due to eliminating false positives as quickly as possible. This is a 2-variable min/max problem.
Todo
Look into possible solutions for pathological cases of thousands of files with the same size and same pre-filter results. (File handle exhaustion)
-
fastdupes.groupify(function)
Decorator to convert a function which takes a single value and returns a key into one which takes a list of values and returns a dict of key-group mappings.
Parameters: function (function(value) -> key) – A function which takes a value and returns a hash key.
Return type: function(iterable) -> {key:set([value, ...]), ...}
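The decorator's documented contract (value-to-key function in, iterable-to-dict function out) can be sketched in Python 3 as below. groupify_sketch and by_length are hypothetical names used only for illustration; forwarding extra arguments to the wrapped function is an assumption.

```python
import functools


def groupify_sketch(function):
    """Turn function(value) -> key into function(iterable) -> {key: set}."""
    @functools.wraps(function)
    def wrapper(values, *args, **kwargs):
        groups = {}
        for value in values:
            # Assumption: extra arguments are forwarded to the wrapped function.
            key = function(value, *args, **kwargs)
            groups.setdefault(key, set()).add(value)
        return groups
    return wrapper


@groupify_sketch
def by_length(word):
    return len(word)


groups = by_length(["ab", "cd", "xyz"])
```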
-
fastdupes.hashClassifier(paths, *args, **kwargs)
Sort a file into a group based on its SHA1 hash.
Parameters:
- paths – See fastdupes.groupify()
- limit (__builtins__.int) – Only this many bytes will be counted in the hash. Values which evaluate to False indicate no limit.
Returns:
-
fastdupes.hashFile(handle, want_hex=False, limit=None, chunk_size=65536)
Generate a hash from a potentially long file. Digesting will obey CHUNK_SIZE to conserve memory.
Parameters:
- handle – A file-like object or path to hash from.
- want_hex (bool) – If True, returned hash will be hex-encoded.
- limit (int) – Maximum number of bytes to read (rounded up to a multiple of CHUNK_SIZE)
- chunk_size (int) – Size of read() operations in bytes.
Return type: str
Returns: A binary or hex-encoded SHA1 hash.
Note
It is your responsibility to close any file-like objects you pass in
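A minimal Python 3 sketch of chunked SHA1 hashing with a byte limit is shown below. sha1_of_handle is a hypothetical name; unlike the documented function it accepts only binary file-like objects, not paths. Because it always digests whole chunks, the limit is effectively rounded up to a multiple of the chunk size, as the parameter description states.

```python
import hashlib
import io


def sha1_of_handle(handle, want_hex=False, limit=None, chunk_size=65536):
    """Hypothetical sketch: SHA1 a binary file-like object chunk by chunk."""
    digest = hashlib.sha1()
    read = 0
    # A falsy limit (None or 0) means "read until end-of-file".
    while not limit or read < limit:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        read += len(chunk)
    return digest.hexdigest() if want_hex else digest.digest()


data = b"abcdefgh"
full = sha1_of_handle(io.BytesIO(data), want_hex=True)
head = sha1_of_handle(io.BytesIO(data), want_hex=True, limit=3, chunk_size=2)
```

Here head covers the first four bytes rather than three, because the second two-byte chunk is read in full before the limit check stops the loop.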
-
fastdupes.multiglob_compile(globs, prefix=False)
Generate a single “A or B or C” regex from a list of shell globs.
Parameters: Return type:
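One way to build a single "A or B or C" regex from shell globs is to translate each glob with the standard library and join the results with |. The Python 3 sketch below does that; globs_to_regex is a hypothetical name, the prefix parameter is not modelled because its behaviour is not described here, and the handling of an empty list is an assumption.

```python
import fnmatch
import re


def globs_to_regex(globs):
    """Hypothetical sketch: one regex that matches if any glob matches."""
    if not globs:
        # Assumption: with no globs, return a pattern that never matches.
        return re.compile(r"(?!)")
    # fnmatch.translate() turns each shell glob into an anchored regex.
    return re.compile("|".join(fnmatch.translate(glob) for glob in globs))


rx = globs_to_regex(["*/.git", "*.txt"])
```

Compiling once and reusing the pattern avoids re-translating every glob for every path that is tested.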
-
fastdupes.pruneUI(dupeList, mainPos=1, mainLen=1)
Display a list of files and prompt for ones to be kept.
The user may enter "all" or one or more numbers separated by spaces and/or commas.
Note
It is impossible to accidentally choose to keep none of the displayed files.
Parameters:
- dupeList (list) – A list of duplicate file paths
- mainPos (int) – Used to display “set X of Y”
- mainLen (int) – Used to display “set X of Y”
Returns: A list of files to be deleted.
Return type: int
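The answer format described above ("all", or numbers separated by spaces and/or commas) could be parsed along these lines. This Python 3 sketch is an assumption-laden illustration: parse_keep_choice is a hypothetical helper, it assumes the files are displayed with 1-based numbers, and it returns None where a caller would presumably prompt again, which is one way to make keeping nothing impossible.

```python
import re


def parse_keep_choice(answer, count):
    """Hypothetical sketch: turn a prompt answer into 0-based indices to keep."""
    answer = answer.strip().lower()
    if answer == "all":
        return list(range(count))
    try:
        tokens = [tok for tok in re.split(r"[\s,]+", answer) if tok]
        picks = sorted({int(tok) - 1 for tok in tokens})
    except ValueError:
        return None  # not a number: the caller should ask again
    if not picks or any(p < 0 or p >= count for p in picks):
        return None  # empty or out of range: the caller should ask again
    return picks
```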
-
fastdupes.sizeClassifier(paths, *args, **kwargs)
Sort a file into a group based on on-disk size.
Parameters:
- paths – See fastdupes.groupify()
- min_size (__builtins__.int) – Files smaller than this size (in bytes) will be ignored.
Returns:
Todo
Rework the calling of stat() to minimize the number of calls. It’s a fairly significant percentage of the time taken according to the profiler.
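Grouping by size with a minimum-size cutoff can be sketched in Python 3 as follows. group_by_size is a hypothetical name, and using st_size from a single os.stat() call per path is an assumption about what "on-disk size" means here.

```python
import os
import tempfile


def group_by_size(paths, min_size=0):
    """Hypothetical sketch: map size in bytes -> set of paths of that size."""
    groups = {}
    for path in paths:
        size = os.stat(path).st_size  # one stat() call per path
        if size >= min_size:
            groups.setdefault(size, set()).add(path)
    return groups


# Demo: two 30-byte files and one 5-byte file, with a 25-byte minimum.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, size in {"a": 30, "b": 30, "c": 5}.items():
        paths.append(os.path.join(tmp, name))
        with open(paths[-1], "wb") as fobj:
            fobj.write(b"x" * size)
    names = {size: sorted(os.path.basename(p) for p in group)
             for size, group in group_by_size(paths, min_size=25).items()}
```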