Find all PDFs that are not searchable (image-based, non-OCRd)?

spedinfargo
Posts: 2
Joined: Thu Oct 21, 2021 1:51 pm

Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by spedinfargo »

I scanned a bunch of documents before I figured out that I could have it OCR them automatically. I know that I can go back and OCR them so they are content-searchable, but I'm wondering if there is a way in Everything to be able to give me a list of all of the PDFs that are like this.

Any ideas? If not in Everything, any other tools out there?

Thanks.
horst.epp
Posts: 1351
Joined: Fri Apr 04, 2014 3:24 pm

Re: Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by horst.epp »

In XYplorer I have a script that uses the Xpdf tools to do that.
It looks like this and returns an S if a PDF is searchable (that is, if pdftotext extracts any text from it).

$tool = "C:\Tools\xpdf-tools\pdftotext.exe";
$output = trim(runret("""$tool"" -simple -nopgbrk ""<cc_item>"" -", %TEMP%, 65001), <crlf>, "R");
if ($output) { return "S"; }
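
For anyone without XYplorer, the same idea can be sketched in Python. This is only a rough sketch, not the script above: it assumes pdftotext lives at the same path, the function names (has_text, extract_text, find_image_only_pdfs) are made up for illustration, and it leaves out -simple because that flag may not exist in every pdftotext build.

```python
import subprocess
from pathlib import Path

# Assumed install location, copied from the XYplorer script; adjust as needed.
PDFTOTEXT = r"C:\Tools\xpdf-tools\pdftotext.exe"


def has_text(extracted: str) -> bool:
    """Return True if the extracted text holds anything besides whitespace."""
    return bool(extracted.strip())


def extract_text(pdf: Path, tool: str = PDFTOTEXT) -> str:
    """Run pdftotext on one file; the trailing "-" sends the text to stdout."""
    result = subprocess.run(
        [tool, "-nopgbrk", str(pdf), "-"],
        capture_output=True,
        check=False,  # a failed extraction simply yields empty output here
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_image_only_pdfs(root: Path, tool: str = PDFTOTEXT) -> list[Path]:
    """List PDFs under root for which pdftotext returns no text at all."""
    return [
        pdf
        for pdf in sorted(root.rglob("*.pdf"))
        if not has_text(extract_text(pdf, tool))
    ]
```

Calling find_image_only_pdfs(Path(r"D:\Scans")) (a hypothetical folder) would return the candidates for OCR. Note that encrypted or damaged PDFs also produce no text, so they would show up in the list too.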
NotNull
Posts: 5298
Joined: Wed May 24, 2017 9:22 pm

Re: Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by NotNull »

raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by raccoon »

If we assume that every readable PDF contains the letter "e", but image PDFs do not, then this search term should do ya. Seems to work on my end.

*.pdf !content:"e"

There's more to read about content indexing in Everything 1.5 Alpha to speed things up across multiple queries.

(This probably won't work in Windows 7 with no PDF iFilter. I don't know where to get a PDF iFilter in Windows 7.)

((I looked through several sample PDFs at versions 1.3, 1.4, 1.5 and 1.6 for a catch-all marker in the specification that identifies the presence of printable text, but I could find none, even staring at a hex editor. Still, your non-OCR'd and post-OCR'd PDFs can probably be told apart by other signatures left behind by the authoring software. Try *.pdf ansicontent:%PDF-1.3 and *.pdf ansicontent:%PDF-1.4 to find PDF files of different format versions.))
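
The version check can also be done outside Everything by reading the %PDF-x.y header at the start of each file. A minimal sketch, with hypothetical function names, assuming the header appears within the first 1024 bytes:

```python
import re
from pathlib import Path
from typing import Optional

# Matches the header comment, e.g. b"%PDF-1.4".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version from the PDF header, e.g. "1.4", or None if absent."""
    match = HEADER.search(data[:1024])  # 1024 is an assumed search window
    if match is None:
        return None
    return f"{match.group(1).decode()}.{match.group(2).decode()}"


def group_by_version(root: Path) -> dict:
    """Map each header version found under root to the PDFs that declare it."""
    groups: dict = {}
    for pdf in sorted(root.rglob("*.pdf")):
        with pdf.open("rb") as handle:
            version = pdf_version(handle.read(1024))
        groups.setdefault(version, []).append(pdf)
    return groups
```

The header version says nothing about whether a file has a text layer. It only helps if the scanner and the OCR software happen to write different versions.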

(((You could also look at the date-created or date-modified times to determine if you created this PDF before or after you started using OCR software.)))
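
The date idea is easy to try in a few lines of Python as well. A small sketch, assuming the last-modified time is a good enough proxy and that you know roughly when you switched OCR on; the function name and the cutoff are placeholders:

```python
from datetime import datetime
from pathlib import Path


def pdfs_modified_before(root: Path, cutoff: datetime) -> list[Path]:
    """Return PDFs under root whose last-modified time is earlier than cutoff."""
    limit = cutoff.timestamp()
    return [
        pdf
        for pdf in sorted(root.rglob("*.pdf"))
        if pdf.stat().st_mtime < limit
    ]
```

For example, pdfs_modified_before(Path(r"D:\Scans"), datetime(2021, 10, 1)) would list everything last touched before October 2021; both the folder and the date are made up here.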
Last edited by raccoon on Fri Oct 22, 2021 9:02 am, edited 5 times in total.
horst.epp
Posts: 1351
Joined: Fri Apr 04, 2014 3:24 pm

Re: Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by horst.epp »

raccoon wrote: Fri Oct 22, 2021 8:09 am If we assume that every readable PDF contains the letter "e", but image PDFs do not, then this search term should do ya. Seems to work on my end.

*.pdf !content:"e"

There's more to read about content indexing in Everything 1.5 Alpha to speed things up across multiple queries.
Works perfectly together with content indexing :D
spedinfargo
Posts: 2
Joined: Thu Oct 21, 2021 1:51 pm

Re: Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by spedinfargo »

Thanks all! I'll give these ideas a shot.