identifying duplicates from checksums

Off-topic posts of interest to the "Everything" community.
Post Reply
jimspoon
Posts: 161
Joined: Tue Apr 26, 2011 11:39 pm

identifying duplicates from checksums

Post by jimspoon »

I was wondering if duplicate files could be quickly identified by generating and storing checksums for all files. I saw a response that this would be too much of a burden on system resources, particularly with respect to files that are constantly changing. This seems like a good point with respect to constantly changing files, but many files have content that changes rarely if ever. So maybe checksums could be calculated and stored for all files after a certain time has passed without data change. Does anybody know if there is already such a mechanism in place in a filesystem to do this, or if there is utility software for this purpose?
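For what it's worth, the caching idea can be sketched in a few lines of Python. This is purely illustrative (the helper names `sha256_of` and `cached_hash` are made up, not any existing tool): the cache is keyed on path + size + mtime, so a hash is only recomputed when the file data has apparently changed.

```python
import hashlib
import os

def sha256_of(path):
    """Stream a file through SHA-256 in chunks to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def cached_hash(path, cache):
    """Recompute the hash only when size or mtime has changed since the last run."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime_ns)  # stale entries simply stop being hit
    if key not in cache:
        cache[key] = sha256_of(path)
    return cache[key]
```

A real implementation would persist the cache (and prune stale entries), but the size/mtime key is the essence of "only rehash when the data changed".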
horst.epp
Posts: 1344
Joined: Fri Apr 04, 2014 3:24 pm

Re: identifying duplicates from checksums

Post by horst.epp »

AFAIK no file system has such an integrated feature.
therube
Posts: 4610
Joined: Thu Sep 03, 2009 6:48 pm

Re: identifying duplicates from checksums

Post by therube »

There is voidhash by someone we know ;-).
(Search will turn up at least 2 threads.)

---

Burden? Not sure it would be any more of a burden than anything else?
Now, it would/could be a burden during the actual scanning phase (IOW, Everything would be busy during that time).
And then there's the fact that hashes aren't particularly expected to be lost, per se.

---

Programs, two of the best IMO:
AllDup & DuplicateCleaner.

---

Hash programs. Oh, that's a real tough one - depending on your needs.
Fast, xxHash.

ramble (& still a work in progress)...

Code: Select all

---


hash:
	-r, recursive (directory tree)
	-c, check (of .md5 .sha1 ...)
	-xxh, xxh hash & if not, then what?
	file, individual file(s)
	dir, directory(s)
	LFN & Unicode ?

vhash (voidhash)
	-r,   but not "standard"
	-c,   no
	-xxh, no, but has MD5 & up
	file, NO - only on directories, not individual file(s) !
	dir,  YES
	LFN & Unicode, most likely are not an issue	

	-r, (is by default)
		BUT its recursion writes a hash file into EVERY directory (WITHOUT touching date/time ;-))
		rather than a single hash file containing all hashed files
	output is standard, in a DOS manner (so no, '*filename', only 'filename')
	works with Salamander (& anything else that would deal with "standard" hash files) [of which there is no "standard" ;-)]
	LFN & Unicode, most likely are not an issue

hash (FcHash)
	-r,   but not "standard"
	-c,   NO
	-xxh, YES + MD5 & up
	file, YES
	dir,  YES
	LFN & Unicode, most likely are not an issue
	
	no output is "standard"
	- so only good for checking against "other" of its own output
	  (like [externally] comparing 2 output files of its own creation)
	  IOW, not easily working with other tools
	  
	  hash --recurs --non_stop /1/ > out1
	  hash --recurs --non_stop /2/ > out2
	  compare out1 out2
	
	doing something like DEL a dir_tree is NOT really feasible

	-r	BUT it only /demarks/ directories with a directory /HEADER/ - NOT as, dir/name !
			c:/out/ :
				file 1
				file 2
				file 3
			c:/out/X/ :
				xfile 1
				xfile 2
				xfile 3
	so... you cannot then manipulate all the files in the list
	like if you wanted to delete them all
		%s/^/DEL "/
		%s/$/"/
		del "file 1"
		del "file 2"
		del "file 3"
		del "x/xfile 1" - NOT FOUND
		del "x/xfile 2" - NOT FOUND
		del "x/xfile 3" - NOT FOUND

	if all you wanted is something to COMPARE directory trees
	then this is fine (& it even defaults to XXH3 which is theoretically the quickest...)

fsum
	-r,   YES
	-c,   YES
	-xxh, no (OK, so be it, but does have many other hashes, incl. sha512, so /kind of/ futureproof)
	file, YES
	dir,  YES, but... <FAIL, also>!
	
	UNIX-like, so includes * in its output
		which Salamander does not like! (%s/\s\*/ /), but a switch to turn it off... ;-)

	default output is MD5
		& (with -jnc switch) is exactly like md5sum
	but, -sha1 (& most all else)
		then oddly "says" it is sha1, so, *file1, becomes ?SHA1*file1 (or ?SHA256*file1)
			(again, if there were a switch to turn it off...)

	dir <FAIL> !!!
		so if /test/X/ & /test/X/1.txt & /test/X/X.txt
		fsum X/* says:
			hashof 1.txt
		*BUT* any file name that starts with an X (like X.txt) gives:
			NOT FOUND ******* X.txt <------- HUH ?

777 (7zip)
	-r,   YES, by default
	-c,   NO
	-xxh, NO, also no MD5 - essentially... only SHA1 !
	file, YES
	dir,  YES
	LFN & Unicode, most likely are not an issue
	
	output is non-standard, but workable

xxh (xxhsum)
	-r,   NO !!! - a great limitation (so anything done, can only deal with a single directory)
	-c,   YES (but given it doesn't recurse... fine if only dealing with 1 directory, but otherwise...)
	-xxh, of course ;-) - though ONLY xxh, NOTHING else

	for checking/comparing hashes of sets of files (passed to it), it's fine :-)

md5sum | sha1sum
	-r,   NO
	-c,   YES
	-xxh, ONLY does MD5 (SHA1) - NOTHING MORE !

	- only thing different here is you can do stuff like pipe to it or use STDIN
	  so you can, DIR | md5sum - not that i'm sure what value there is in that ?

	(& both of these are rather slow) [so in general, a no-go]
	(both functions are available in 'busybox')

hashmyfiles (nirsoft)
	didn't seem particularly feasible (for the command line)


-


void
	for dir (ONLY)
		so long as you're OK with a hash <sidecar, is that what he calls it ?>
		in /EVERY/ directory... (& if that's what you want, probably no easier
		way to do it...)
	
	(but, i'm still missing the use case?
	for static, relatively static directory structures, for easy comparison /between
	such/, hashtree.bat is fine, i'd think & does all that void does, difference
	being a single file with all the hashes vs. a hash in every directory.)

hash
	in general, is fine
	but output is non-standard
		so if you wanted to manipulate... it'll be hard
fsum
	in general, is fine
	standard output (with addition of '*')
	- odd issue with dir
		fsum -r X, is fine, finding X/x.txt
		fsum -r X/*, or fsum X/x*, FAILS !!
	AH!
		/d specifies the "working directory"
		fsum /dX *, sets the "working directory" to X, then searches for *
		            ^--- WORKS
		fsum   X *, (seemingly) does an fsum on X, & as X is a directory * then finds the files within
		            ^--- WORKS
		fsum /dX/*, presumably ? sets..., oh, it must be looking for /directories/ within X,
		            rather than the /files/ in the directory X
		            ^--- this fails
777
	limited to sha1 ONLY (essentially)
	in general, is fine
	standard output (with addition of file 'Size')
		but is workable
		does (seemingly) CWD, then subdirectories
		- where fsum does subdirectories then CWD
		  (so file ordering between the 2 do differ)
	AH!
		777 does NOT output its file listing in a consistent order as it traverses a tree?
		- that makes it far more difficult to compare (what should be, relatively, the same outputs)!
			/unix/lss	/unix/lss
			/unix/sed	/unix/sed
			/unix/TAR/	/unix/unz
			/unix/TAR/gzip	/unix/zip
			/unix/unz	/unix/TAR/
		well, it does - kind of, but not in a way that is beneficial for being able to compare
		2 trees... /each/ tree can be output, consistently, but 2 of the same trees - presumably
		because of the way (order?) they happened to be written, will be /traversed/ in a
		different order, resulting in differing order listings for the same "data" trees.
		- if it were simply a (alpha) sequential, top down, all would be fine, but as it is,
		  without consistent output, "value" drops for 777...
	fsum IS consistent in its output!
xxh
	limited to xxh ONLY
	limited to files, or directories (whatever can be passed to it)
		but does NOT recurse !
	does do -c
	/probably/ will have issues with LFN ?
	so, for what it does, it is fine
md5sum/sha1sum
	too slow (compared to the others)
	too limited in scope
	- though... they are what all else sprung from ;-)

hashtree.bat:
:: HashTree - SjB 03-08-2022
FcHash.exe --xxh --recurs --non_stop  .  2>&1 | tee 0hashtree


HARDWARE, BABY !!!

	K:\fcp>tail 00*
	==> 0007 <==
	WARNING: Cannot open 1 file
	
	TimeThis :  Command Line :  7hash
	TimeThis :    Start Time :  Thu Oct 20 16:35:01 2022
	
	
	TimeThis :  Command Line :  7hash
	TimeThis :    Start Time :  Thu Oct 20 16:35:01 2022
	TimeThis :      End Time :  Thu Oct 20 16:39:16 2022
	TimeThis :  Elapsed Time :  00:04:15.147
	
	==> 0007-home <==
	WARNING: Cannot open 1 file
	
	TimeThis :  Command Line :  7hash
	TimeThis :    Start Time :  Thu Oct 20 23:50:41 2022
	
	
	TimeThis :  Command Line :  7hash
	TimeThis :    Start Time :  Thu Oct 20 23:50:41 2022
	TimeThis :      End Time :  Fri Oct 21 00:02:17 2022
	TimeThis :  Elapsed Time :  00:11:35.923
	
	K:\fcp>
	
	K:\ffc\LIB>tail 00*
	==> 0007-home <==
	WARNING: Cannot open 1 file
	
	TimeThis :  Command Line :  777.exe h * -r -scrcsha1
	TimeThis :    Start Time :  Thu Oct 20 23:28:42 2022
	
	
	TimeThis :  Command Line :  777.exe h * -r -scrcsha1
	TimeThis :    Start Time :  Thu Oct 20 23:28:42 2022
	TimeThis :      End Time :  Thu Oct 20 23:42:43 2022
	TimeThis :  Elapsed Time :  00:14:01.374
	
	==> 0007-office <==
	WARNING: Cannot open 1 file
	
	TimeThis :  Command Line :  7hash
	TimeThis :    Start Time :  Thu Oct 20 16:26:48 2022
	
	
	TimeThis :  Command Line :  7hash
	TimeThis :    Start Time :  Thu Oct 20 16:26:48 2022
	TimeThis :      End Time :  Thu Oct 20 16:32:01 2022
	TimeThis :  Elapsed Time :  00:05:13.716
	
	K:\ffc\LIB>

we're talking HUGE time diff here...
is it /simply/ hardware (CPU mainly, I guess?)
or is there something more involved ?
(like... 1 USB port vs another, both USB 2.0 AFAIK
- all are black, none are blue...)

"slow" (which i've never considered "slow")
	i5-3570k, 4-core 3.4 GHz, 16 GB RAM, HDD's (but i don't see where that would matter, as i'm accessing USB flash drive)
"fast" dell AIO, i7-?, 8 GB RAM, SSD (but again, i'm accessing USB flash drive)


fsum can specify a "base" /directory/, followed by a /file/ spec to hash
fsum -d/tmp/ *.bat
	- hash *.bat files in /tmp/ directory

rhash - CANNOT ???
rhash /tmp/*.bat
	- /tmp/*.bat: no such file or directory ???

fsum & exf are ALIKE, with diffs
	hash methods
	/d directory spec, /dX vs. /d X (space or not)

exf is more "forgiving" & more interoperable - maybe
	exf -c can check fsum checksum files
	exf -c /may/ be able to check void checksum files
	exf can do sha1 in 2 ways; ?SHA1* & just '*'
	can output full path (rather than none or just relative)
		K:/lib/dac vs /lib/dac
	switches can be in any order, basically
		exf * -sha1 vs. exf -sha1 *
	-c can be verbose or not

most all support forward-slash
	sha1 c:/tmp/out vs. sha1 c:\tmp\out

directory traversal is NOT the same between fsum & exf !!!

exf may ? be using more RAM than fsum, 15% vs. 5% PERHAPS <--- NEED 2 VERIFY !?

***NONE*** of them traverse "the same" trees - with different parents, equally - potentially !!!
	/LIB/ETC2/BASIS/JCS vs /FFC/ETC2/BASIS/JCS
	           ...file1               ...file1
	           ...file2               ...file4
	           ...file3               ...file2
	           ...file4               ...file3
---

I believe FastCopy might (or might have an option to) store hashes in ADS?
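Going back to hashtree.bat above: the single-output-file approach (one recursive pass, every hash captured into one listing) could be sketched like so. `hash_tree` is a hypothetical helper, not FcHash itself; it emits md5sum-style `<hash> *<relative path>` lines and sorts the traversal so two runs over identical trees produce identical listings.

```python
import hashlib
import os

def hash_tree(root):
    """Walk root recursively; return md5sum-style lines '<sha1> *<relative path>'."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            h = hashlib.sha1()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 16), b""):
                    h.update(chunk)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            lines.append(f"{h.hexdigest()} *{rel}")
    return lines
```

Writing the result to one file at the tree root gives you the "single hash file containing all hashed files" layout, as opposed to voidhash's sidecar-per-directory layout.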
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

I've been looking for something to give me a single hash of an entire directory, not just individual hashes of each file.
Any ideas please?
horst.epp
Posts: 1344
Joined: Fri Apr 04, 2014 3:24 pm

Re: identifying duplicates from checksums

Post by horst.epp »

therube wrote: Mon Oct 31, 2022 6:22 pm ---
I believe FastCopy might (or might have option) to store hashes in ADS?
Yes, FastCopy has the option Add VerifyInfo
which stores checksums after a successful verify in the ADS stream :fc_verify
horst.epp
Posts: 1344
Joined: Fri Apr 04, 2014 3:24 pm

Re: identifying duplicates from checksums

Post by horst.epp »

harryray2 wrote: Mon Oct 31, 2022 6:27 pm I've been looking for something to give me a single hash of an entire directory, not just individual hashes of each file.
Any ideas please?
May be the following:

HashCheck Shell Extension (archive) can be used to get a hash of a directory. This can be done by:
Using HashCheck on the directory.
This will generate a .md5 file which contains a listing of the hashes of each file in that directory, including all files in sub-directories.
Use HashCheck again on the .md5 file it generated above.
This final generated .md5 file contains a hash of the entire directory.

http://code.kliu.org/hashcheck/
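The two-step trick described above - hash every file into a digest, then hash the digest itself - can be sketched as follows. `digest_listing` and `folder_hash` are illustrative names (not HashCheck's code), and the listing is sorted by relative path so the result doesn't depend on traversal order.

```python
import hashlib
import os

def md5_hex(data):
    return hashlib.md5(data).hexdigest()

def digest_listing(root):
    """Step 1: an .md5-style listing of every file under root, in sorted order."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                data = f.read()
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            lines.append(f"{md5_hex(data)} *{rel}")
    return "\n".join(lines) + "\n"

def folder_hash(root):
    """Step 2: hash the digest itself - one value for the whole tree."""
    return md5_hex(digest_listing(root).encode())
```

Two trees with identical relative paths and identical file contents produce the same `folder_hash`; change one byte anywhere and the value changes.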
therube
Posts: 4610
Joined: Thu Sep 03, 2009 6:48 pm

Re: identifying duplicates from checksums

Post by therube »

7-zip gives you "something", just not sure what it is, but maybe it is a "directory" (for data:, for data and names:) hash?
(And you could parse its output, if that is in fact the expected data...)

Code: Select all

7-Zip 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning
7 files, 267227 bytes (261 KiB)

SHA1                                              Size  Name
---------------------------------------- -------------  ------------
5a2e3423e5d4890871810a8153fedb4a32fe85ac           283  0hash - Copy.md5
bc70620da9b1f2824624d43b707a30685b3daaf4           335  0hash.md5
e7726eefe954042687f527c7e3a4a36dc21f71f9          3210  fcp-home-LIB-ntfs.TXT
6a5bd73a3906e4aed2e384b4c1c4e34b96f8b20c        112188  UsbTreeView381-home.TXT
c2a006a4aed4226325eac306d31a53cf4ee1555a        149650  UsbTreeView381-office.TXT
381c8c1a61a774d6dffbed64119ca170ce2d84e5           869  UsbTreeView381.ini
7e668b379e2314a5c784bb8b7b808881c8b37da1           692  xxx
---------------------------------------- -------------  ------------
df90fa505f81103fdee9f82fc919112ecc16ac88-00000004        267227

Files: 7
Size: 267227

SHA1   for data:              df90fa505f81103fdee9f82fc919112ecc16ac88-00000004
SHA1   for data and names:    0589c73b768e62094d2d98d26fffb81523b4cc39-00000004
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

Thanks, there seem to be a few ways. I'm using a programme called File Hashes, which appears to be similar to HashCheck. I'm using 7zip on the directory to compress, or using the hash facility in 7zip to generate a file, then saving the hash file. Or I'm just zipping the directory into two different locations and using the hash tab on the properties menu to compare.

Seems to work, but it's a lot of faffing around; there doesn't appear to be an easy way. I'm quite surprised that there isn't anything to hash directories easily.
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: identifying duplicates from checksums

Post by raccoon »

See this link for a list of the active and inactive HashTab/HashCheck project forks, and pick the one to your liking. They're all effectively the same design and system, with some differences. HashCheck (not HashTab) is the only fork I know of that gives the ability to generate and verify checksum digest files for individual files, selected groups of files, or entire folder structures.

https://github.com/gurnec/HashCheck/iss ... -886823066
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

Thanks, I'll look at the latest version of Hashcheck.

Completely off subject, over the past couple of years I've subscribed to various Github projects to tell me when there is an update. Do you have any idea if I can get a list of everything I've subscribed to?
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: identifying duplicates from checksums

Post by raccoon »

harryray2 wrote: Mon Oct 31, 2022 10:25 pm How can I get a list of everything I've subscribed to?
Visit https://github.com/watching and https://github.com/notifications/subscriptions while logged in.

You can find this link by going to your Settings > Notifications (https://github.com/settings/notifications)
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

That's brilliant, thanks so much. I've been looking for that for ages...
jimspoon
Posts: 161
Joined: Tue Apr 26, 2011 11:39 pm

Re: identifying duplicates from checksums

Post by jimspoon »

Thanks all for the great information and links!
void
Developer
Posts: 15351
Joined: Fri Oct 16, 2009 11:31 pm

Re: identifying duplicates from checksums

Post by void »

To find potential duplicates in Everything, include the following in your search:

dupe:size

This search is instant.



To find duplicates in Everything, include the following in your search:

dupe:size;sha256

This search will calculate the checksum for files that share the same size.
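The idea behind dupe:size;sha256 - group by size first (cheap), then hash only the files whose sizes collide - roughly looks like this. A sketch of the technique only, not Everything's actual implementation:

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(paths):
    """Return groups of paths with identical content, hashing only size collisions."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    dupes = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                     # unique size: provably not a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(p)
        dupes.extend(group for group in by_hash.values() if len(group) > 1)
    return dupes
```

The size pass is why dupe:size is instant while dupe:size;sha256 has to do real I/O - but only on the (usually small) subset of files that share a size.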



ReFS supports CRC64

I'll look into adding this property to Everything.
jimspoon
Posts: 161
Joined: Tue Apr 26, 2011 11:39 pm

Re: identifying duplicates from checksums

Post by jimspoon »

That's great, Void! Is the sha256 calculation persistent so that it doesn't have to be recalculated if that search is repeated?
void
Developer
Posts: 15351
Joined: Fri Oct 16, 2009 11:31 pm

Re: identifying duplicates from checksums

Post by void »

The sha256 calculation will last until you close the search window.

You can change the search and instantly go back to your previously completed dupe:size;sha256 search.



I recommend using sha256sum to create a .sha256 file in each folder.
Everything can quickly pull the sha256 values from the .sha256 file with the sha256sum SHA-256 property.

dupe:size;sha256sum-sha256
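Creating a sha256sum-compatible sidecar per folder can be sketched like this. `write_sha256_sidecar` and the `folder.sha256` file name are assumptions for illustration; the output format (`<hash> *<name>`, one line per file) is the standard one that sha256sum -c understands.

```python
import hashlib
import os

def write_sha256_sidecar(folder):
    """Write a sha256sum-compatible .sha256 file listing each regular file in folder."""
    lines = []
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if not os.path.isfile(full) or name.endswith(".sha256"):
            continue                                   # skip subdirs and old sidecars
        h = hashlib.sha256()
        with open(full, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        lines.append(f"{h.hexdigest()} *{name}")       # '*' marks binary mode
    out = os.path.join(folder, "folder.sha256")        # hypothetical sidecar name
    with open(out, "w", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
    return out
```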
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

Is it possible to get a one hash for an entire folder, rather than the individual files?
void
Developer
Posts: 15351
Joined: Fri Oct 16, 2009 11:31 pm

Re: identifying duplicates from checksums

Post by void »

I'll consider a property to do this.

Thank you for the suggestion.

For now, please consider 7zip to calculate the folder CRC/SHA.
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

Thanks, a folder hash would be useful.
jimspoon
Posts: 161
Joined: Tue Apr 26, 2011 11:39 pm

Re: identifying duplicates from checksums

Post by jimspoon »

harryray2 wrote: Fri Nov 04, 2022 10:41 am Is it possible to get a one hash for an entire folder, rather than the individual files?
Harryray, do you mean a single checksum calculated from the data in all the files in the folder? If so, how can that be used? I guess it just tells you that no file in the folder has been added / deleted / changed? Or maybe you mean a single checksum file that contains multiple checksums, one for each file in the folder?
harryray2
Posts: 1050
Joined: Sat Oct 15, 2016 9:56 am

Re: identifying duplicates from checksums

Post by harryray2 »

Various reasons, for example, if I want to copy a folder and make sure it's been copied correctly, or just to check whether the folders are identical.
therube
Posts: 4610
Joined: Thu Sep 03, 2009 6:48 pm

Re: identifying duplicates from checksums

Post by therube »

Is it possible to get a one hash for an entire folder, rather than the individual files?
Be careful with that - as the way a directory is parsed will make a difference.
If you always use the same tool, & that tool does not change, it shouldn't matter. Otherwise...
if I want to copy a folder and make sure it's been copied correctly, or just to check whether the folders are identical
Without thinking too much into that, I'm not sure there is a good enough reason for doing it that way as compared to comparing the hashes of the individual files within (which obviously you can write to a singular check file, & with that you only need to compare that one check file [against another output check file, or against the copied output files]).

Code: Select all

T:\K-ORSAIR\Log>777 h * -scrcSHA1

7-Zip 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning
8 files, 1319415 bytes (1289 KiB)

SHA1                                              Size  Name
---------------------------------------- -------------  ------------
ee86395d36c52773a7ae0aa6a13bcbdcc3068bf0       1313166  20221115-195929-0.log
01750351dca6f91b37bf305744d90dfbdad9dbf5           979  20221116-001646-0.log
11ecb2eca08a228ef468868fc99352395b1c81f8          1239  20221116-001744-0.log
6297de9cff646c0be16080423f00d8b6d8efa626          1239  20221116-002009-0.log
236396fa87738d0be7f85d67a6acd6442507e93b          1327  20221116-002103-0.log
72dd99c5993e1e575217c6af727e827fa9dc38a8           388  Log.sha1
4d2875229f5de8e627f38de486560d015e4f2b23           385  Log2.sha1
411282674d0f66f3bd79bc78d59fdcb918646b71           692  log3.sha1
---------------------------------------- -------------  ------------
85faf581c17aaa65d3b4b04364ca46471884477e-00000004       1319415

Files: 8
Size: 1319415

SHA1   for data:              85faf581c17aaa65d3b4b04364ca46471884477e-00000004
SHA1   for data and names:    4ccf490bdad59382671d3b883a3ba545b73bfea9-00000004

Everything is Ok

T:\K-ORSAIR\Log>dir /b | sha1sum
77b804d9ebcb71e3660cfce17f09b3c630f1e815 *-

T:\K-ORSAIR\Log>
so, just what is dir /b | sha1sum returning
does the /ordering/ of the files make a difference ?

ah, it does!
so depending on how a particular utility /lists/ a directory
output may differ !!!

Code: Select all

T:\K-ORSAIR\Log>dir /b | sha1sum
105be8f554fd4cc877d06cf327dc37d5f0b7b0fe *-

T:\K-ORSAIR\Log>dir /b /os | sha1sum
26de98c63110a755bf17107e635b2970a861cf1f *-
see...

Code: Select all

T:\K-ORSAIR\Log\X>777 h -scrcsha1

7-Zip 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15

Scanning
1 file, 334 bytes (1 KiB)

SHA1                                              Size  Name
---------------------------------------- -------------  ------------
a7cd0d7356f553cec274181707afc78a1bfe29f4           334  x
---------------------------------------- -------------  ------------
a7cd0d7356f553cec274181707afc78a1bfe29f4           334

Size: 334

SHA1   for data:              a7cd0d7356f553cec274181707afc78a1bfe29f4

Everything is Ok

T:\K-ORSAIR\Log\X>dir/b | sha1sum
b57bc04cd1994067ff0f49c776e5f7553d3aeed4 *-

T:\K-ORSAIR\Log\X>dir
 Volume in drive T is TOSH_8TB
 Volume Serial Number is 6044-BACB

 Directory of T:\K-ORSAIR\Log\X

11/16/2022  09:30 AM    <DIR>          .
11/16/2022  09:30 AM    <DIR>          ..
11/16/2022  09:30 AM               334 x
               1 File(s)            334 bytes
in this last example, the "directory" hash returned by 7-zip is the same as that computed by sha1sum - because the directories were parsed the same way. easy to do with only one file in there ;-). but throw a second file in the directory & the results returned by the two utilities may or may not be the same.

and also as you can see, a "directory" hash of a directory with only a single file is the same hash as the file itself.
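The ordering problem above is easy to demonstrate: hashing a listing is order-sensitive, and sorting the listing before hashing removes the sensitivity. `dir_hash` here is an illustrative stand-in for what `dir /b | sha1sum` effectively does (hash the newline-joined name listing):

```python
import hashlib

def dir_hash(names):
    """Hash a newline-joined listing, the way piping a dir listing to sha1sum does."""
    return hashlib.sha1("\n".join(names).encode()).hexdigest()

listing_a = ["alpha", "beta", "gamma"]
listing_b = ["beta", "gamma", "alpha"]   # same files, different traversal order

order_sensitive = dir_hash(listing_a) != dir_hash(listing_b)              # True
order_fixed = dir_hash(sorted(listing_a)) == dir_hash(sorted(listing_b))  # True
```

So any tool that claims a reproducible "directory hash" has to normalize the traversal order (e.g. sort by name) before hashing - otherwise two listings of the same data can hash differently.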
Post Reply