Finding unique files between 2 folders based on hash.

Discussion related to "Everything" 1.5 Alpha.
Celestial92
Posts: 4
Joined: Thu Mar 28, 2024 9:34 am

Finding unique files between 2 folders based on hash.

Post by Celestial92 »

Hi, from what I understand, the filter below (from this thread) brings up all files that exist in only one of two folders. To be regarded as a non-duplicate, a file in 'folder A' needs a combination of all three characteristics (name, size, and date modified) that doesn’t occur in any file contained in folder B.
NotNull wrote: Tue Feb 06, 2024 9:41 pm
- Create the following filter:
Name = Difference between 2 folders (or whatever you like to name it)
Search =

Code:

file:   regex:#regex-escape:<#element:<$param:,;,1>>(.*$) | regex:#regex-escape:<#element:<$param:,;,2>>(.*$)    unique:regmatch1;size;dm
Macro = diff

Back in the main window:
- Set the filter to Everything
- Search for:

Code:

diff:"X:\first folder\";"Y:\second folder\"
The result list will show which files are either missing on one of the 'sides' or have a different name, date modified, or size.
I’m wondering if there’s a way to do this with Everything, but where a file in folder A is only regarded as unique when it has a hash value, like SHA-1, which doesn’t occur in any file contained in folder B, and vice versa? I know that hashing all the files each time is time-consuming, but there are a few rare instances where I would find it useful. The main reason is to not display results which are hash-duplicates but have different filenames.

Otherwise, is there any other software which would be better suited to those needs?

thanks
kazzybash
Posts: 98
Joined: Mon Mar 02, 2020 9:55 pm

Re: Finding unique files between 2 folders based on hash.

Post by kazzybash »

Not sure if I’m helping you with this, but maybe you could try ‘AllDup’ (you can search for duplicates by hash OR byte-by-byte AND/OR name). Regards, kazzy
NotNull
Posts: 5270
Joined: Wed May 24, 2017 9:22 pm

Re: Finding unique files between 2 folders based on hash.

Post by NotNull »

I think there is a solution using just Everything. In testing it largely works, but there is an issue with the current implementation of the distinct: function. Or, much more likely, with my understanding of it.
When that is sorted out, I will reply here (might take a couple of days).


The (planned) mechanism in short:
- filter out duplicates of hashes in the same folder ("c:\folderA\file.txt" and "c:\folderA\copy of file.txt" will have the same hash)
- list only unique hashes.
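
For anyone who wants to try the idea outside Everything in the meantime, the two steps above can be sketched in Python. This is only a rough illustration, not how Everything implements distinct: or unique:. The function names (file_hash, hashes_by_folder, unique_between) are made up for this example, and SHA-1 is just the default because it was mentioned in the question.

```python
import hashlib
from pathlib import Path


def file_hash(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashes_by_folder(folder, algorithm="sha1"):
    """Step 1: map each hash to ONE representative file, dropping same-folder copies."""
    seen = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # setdefault keeps the first file found for a given hash
            seen.setdefault(file_hash(path, algorithm), path)
    return seen


def unique_between(folder_a, folder_b, algorithm="sha1"):
    """Step 2: keep only files whose hash occurs on one side but not the other."""
    a = hashes_by_folder(folder_a, algorithm)
    b = hashes_by_folder(folder_b, algorithm)
    only_a = sorted(a[h] for h in a.keys() - b.keys())
    only_b = sorted(b[h] for h in b.keys() - a.keys())
    return only_a, only_b
```

Calling unique_between(r"c:\folderA", r"c:\folderB") would return two lists of paths. Names are ignored completely, so "file.txt" and "copy of file.txt" count as the same file, and every file is hashed again on each run.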
Celestial92
Posts: 4
Joined: Thu Mar 28, 2024 9:34 am

Re: Finding unique files between 2 folders based on hash.

Post by Celestial92 »

@NotNull Thanks for looking into it! Actually, I just remembered that there’s another feature I’d find useful with this. If a file is duplicated 2 or more times in folderA, but that same file doesn’t occur anywhere in folderB, then I’d also like that to show up as a unique result in the search. The main purpose is to make sure that all files which exist in one directory also exist in the other directory, even though some files have different filenames. Is that possible?

Thanks.
NotNull
Posts: 5270
Joined: Wed May 24, 2017 9:22 pm

Re: Finding unique files between 2 folders based on hash.

Post by NotNull »

Celestial92 wrote: Mon Apr 01, 2024 5:40 am The main purpose is to make sure that all files which exist in one directory also exist in the other directory, but some files have different filenames
And that is exactly what it should do.
Will require an Everything update though. When that comes out, I will post a solution here.
klarah
Posts: 8
Joined: Mon Aug 18, 2014 12:33 pm

Re: Finding unique files between 2 folders based on hash.

Post by klarah »

Celestial92 wrote: Sat Mar 30, 2024 8:47 am Otherwise, is there any other software which would be better suited to those needs?
Have a look at Beyond Compare (https://www.scootersoftware.com/). It integrates into the context menu and is (I find) the best "add-on" to Everything for comparing files and folders in detail.
horst.epp
Posts: 1346
Joined: Fri Apr 04, 2014 3:24 pm

Re: Finding unique files between 2 folders based on hash.

Post by horst.epp »

klarah wrote: Fri Apr 05, 2024 2:08 am
Celestial92 wrote: Sat Mar 30, 2024 8:47 am Otherwise, is there any other software which would be better suited to those needs?
Have a look at Beyond Compare (https://www.scootersoftware.com/). It integrates into the context menu and is (I find) the best "add-on" to Everything for comparing files and folders in detail.
The requested function is not something Beyond Compare provides.
It's a very good compare tool, but not applicable here.
Celestial92
Posts: 4
Joined: Thu Mar 28, 2024 9:34 am

Re: Finding unique files between 2 folders based on hash.

Post by Celestial92 »

I had a look at Beyond Compare. It seems very close to what I was looking for, as you can show all the unique (orphan) files between 2 folders.
But yes, it doesn't seem to allow you to display the orphan files solely based on their hash value. Still looks very handy.

thanks
horst.epp
Posts: 1346
Joined: Fri Apr 04, 2014 3:24 pm

Re: Finding unique files between 2 folders based on hash.

Post by horst.epp »

Celestial92 wrote: Fri Apr 05, 2024 10:12 am I had a look at Beyond Compare. It seems very close to what I was looking for, as you can show all the unique (orphan) files between 2 folders.
That is a function which most compare tools and file managers provide.
klarah
Posts: 8
Joined: Mon Aug 18, 2014 12:33 pm

Re: Finding unique files between 2 folders based on hash.

Post by klarah »

Celestial92 wrote: Fri Apr 05, 2024 10:12 am But yes, it doesn't seem to allow you to display the orphan files solely based on their hash value. Still looks very handy.
It's a bit hidden, but under "Actions" there is a function "Compare Contents..." which offers comparison by CRC, binary, or other rules.
NotNull
Posts: 5270
Joined: Wed May 24, 2017 9:22 pm

Re: Finding unique files between 2 folders based on hash.

Post by NotNull »

I think the idea is to compare files, just based on the hash, so 'file1.txt' and 'copy of file1.txt' are considered the same, even when their names differ.
I vaguely remember that Beyond Compare will always include the name when comparing. Or is it possible to ignore the names?
horst.epp
Posts: 1346
Joined: Fri Apr 04, 2014 3:24 pm

Re: Finding unique files between 2 folders based on hash.

Post by horst.epp »

NotNull wrote: Sun Apr 07, 2024 4:07 pm I think the idea is to compare files, just based on the hash, so 'file1.txt' and 'copy of file1.txt' are considered the same, even when their names differ.
I vaguely remember that Beyond Compare will always include the name when comparing. Or is it possible to ignore the names?
In my older version, you can only set it to ignore the extensions,
and use size and CRC as criteria.
klarah
Posts: 8
Joined: Mon Aug 18, 2014 12:33 pm

Re: Finding unique files between 2 folders based on hash.

Post by klarah »

I see... As I also have a use for it, I asked ChatGPT this question:

"i have two folders with files. the name of the first is "1" and lays in the same path as the script i want you to write. the second folder i provide as argument. the script in python shell loop trought all files in both folders and calculates their checksum. as second step it shell remove all files in the folder "1" wich checksum does not exists in the second folder respectivly."

With only a little adaptation I got a working script with the following usage:

Usage: python remove_crc_dupes.py <second_folder_path>

1. Put a copy or a softlink of the files you want to check for existence in <second_folder_path> into <first_folder_path>.
2. Drag and drop the <second_folder_path> onto the script (or use the command line).
3. The script will compare all files in both folders by CRC and remove all file duplicates in <first_folder_path>.

As a result, only the files which don't have CRC duplicates in <second_folder_path> stay in <first_folder_path>.
For softlink creation I use this: https://schinagl.priv.at/nt/hardlinkshe ... nsion.html

It's maybe not the most convenient solution yet, but I find it quite usable as a first shot...
I put the script + folder structure as an attachment.
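
The attached script itself is not reproduced in the thread. As a rough idea of what the three steps could look like, here is a minimal Python sketch. It is not the attached remove_crc_dupes.py: the function names and the dry_run switch are assumptions made for this example, and CRC-32 via zlib is assumed as the checksum.

```python
import zlib
from pathlib import Path


def crc32_of(path, chunk_size=1 << 20):
    """CRC-32 of a file, computed in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def files_with_crc_in(first_folder, second_folder):
    """Return files in first_folder whose CRC-32 also occurs in second_folder."""
    known = {crc32_of(p) for p in Path(second_folder).rglob("*") if p.is_file()}
    return sorted(p for p in Path(first_folder).rglob("*")
                  if p.is_file() and crc32_of(p) in known)


def remove_crc_dupes(first_folder, second_folder, dry_run=True):
    """Delete (or, with dry_run=True, only list) the duplicates in first_folder."""
    dupes = files_with_crc_in(first_folder, second_folder)
    if not dry_run:
        for path in dupes:
            path.unlink()  # on a softlink this removes the link, not its target
    return dupes
```

With dry_run left at True nothing is deleted, so the returned list can be checked first. CRC-32 is short, so unrelated files can occasionally share a value; a longer hash would be safer for anything important.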
Attachments
crc_dupes.zip
(1.17 KiB) Downloaded 17 times
NotNull
Posts: 5270
Joined: Wed May 24, 2017 9:22 pm

Re: Finding unique files between 2 folders based on hash.

Post by NotNull »

@klarah:

It is possible in Everything itself too (tested), but that requires a fix in Everything and that fix likely comes with the next update.
Will post all the details when that fix is ready.
Celestial92
Posts: 4
Joined: Thu Mar 28, 2024 9:34 am

Re: Finding unique files between 2 folders based on hash.

Post by Celestial92 »

I’m still interested in what the next update to 1.5 can do, but I found a nice workaround which involves a few steps.

1a. find the distinct files in folder A, based on hash
e.g.

Code:

"C:\folder A" distinct:MD5
1b. save the search results as a .efu file. e.g. folderA_distinct.efu

2a. find distinct files in folder B, based on hash
e.g.

Code:

"C:\folder B" distinct:MD5
2b. save the search results as a .efu file. e.g. folderB_distinct.efu

3. combine the results of both .efu files into one .efu file (e.g. FolderA+B_distinct.efu), by copying and pasting them together in Notepad

4. open FolderA+B_distinct.efu file and find all the unique results, using

Code:

unique:MD5
This all seems to work quite well so far, and using the efu files means I don’t have to recalculate the hashes for directories that don’t change.
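
Step 3 could also be scripted instead of pasting in Notepad. The Python sketch below assumes that an .efu file is a UTF-8 CSV file with a header row and that both exports use the same columns; merge_efu is a made-up name for this example. It writes the header only once. Python's csv module may quote fields differently from Everything's own export, so treat the result as something to test, not as a guaranteed drop-in.

```python
import csv


def merge_efu(sources, destination, encoding="utf-8-sig"):
    """Concatenate EFU (CSV) file lists, writing the header row only once."""
    header = None
    with open(destination, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for source in sources:
            with open(source, newline="", encoding=encoding) as handle:
                rows = csv.reader(handle)
                first = next(rows, None)
                if first is None:
                    continue  # skip empty files
                if header is None:
                    header = first
                    writer.writerow(header)
                elif first != header:
                    raise ValueError(f"{source} has different columns: {first}")
                writer.writerows(rows)
```

For example, merge_efu(["folderA_distinct.efu", "folderB_distinct.efu"], "FolderA+B_distinct.efu") would produce the combined list used in step 4.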
NotNull
Posts: 5270
Joined: Wed May 24, 2017 9:22 pm

Re: Finding unique files between 2 folders based on hash.

Post by NotNull »

There is a bug in the distinct: function that causes the first file to be left out if multiple criteria are specified (e.g. distinct:name;md5).
That is planned to be fixed in the next update.
link

Your workaround -- using a single criterion -- is not affected by that bug.
NotNull
Posts: 5270
Joined: Wed May 24, 2017 9:22 pm

Re: Finding unique files between 2 folders based on hash.

Post by NotNull »

Try the following:
  • Create a new Filter:

    Code:

    Name =HashDiff (or anything you like)
    Search = file:   regex:([regex-escape:[element:$param:,;,1]])(.*$) | regex:([regex-escape:[element:$param:,;,2]])(.*$)      distinct:regmatch1;size;md5   unique:size;md5
    Macro = hashdiff
  • Back in the main Everything window, set the active filter to Everything
  • Search for:

    Code:

    hashdiff:"c:\the first\folder\";"c:\another folder\"
    (don't forget the backslashes at the end of the folder names)
  • Let Everything calculate all MD5 hashes (which might take a while when not indexed)
    Progress will be shown in the status bar
  • Check the results.
  • Report back

This will report:
- Files in "c:\the first\folder\" that do not exist anywhere in "c:\another folder\", based on MD5-hash and size.
- Files in "c:\another folder\" that do not exist anywhere in "c:\the first\folder\", based on MD5-hash and size.

These folders do not need to have the same folder structure.


Some remarks:
- requires Everything 1.5.0.1372a or later
- you can replace MD5 with any other supported hash in the search query. MD5 was just less typing than CRC64 ;)
- Did not have much time to test, so consider this a beta version.
- Size is not strictly needed, but collisions (2 different files having the same hash value) can theoretically happen, and the chance that 2 colliding files *also* have the same size is negligible.
Specular
Posts: 16
Joined: Sat Dec 27, 2014 5:46 am

Re: Finding unique files between 2 folders based on hash.

Post by Specular »

NotNull wrote: Fri Apr 19, 2024 9:23 pm Try the following:
Thanks for this. Had to perform a checksum-only comparison between two directories the other day, prior to discovering this, so instead had to use a different, more convoluted method via PowerShell.