Find all PDFs that are not searchable (image-based, non-OCRd)?

Post by **spedinfargo** » Thu Oct 21, 2021 1:56 pm

I scanned a bunch of documents before I figured out that I could have it OCR them automatically. I know that I can go back and OCR them so they are content-searchable, but I'm wondering if there is a way in Everything to get a list of all the PDFs that are like this.

Any ideas? If not in Everything, any other tools out there?

Thanks.

Post by **horst.epp** » Thu Oct 21, 2021 9:42 pm

In XYplorer I have a script which uses the xpdf tools to do that.
It looks like this and returns an S if a PDF is searchable.

$tool = "C:\Tools\xpdf-tools\pdftotext.exe";
$output = trim(runret("""$tool"" -simple -nopgbrk ""<cc_item>"" -", %TEMP%, 65001), <crlf>, "R");
if ($output) { return "S"; }
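For anyone without XYplorer, the same check could be sketched in Python roughly as follows. This is only an illustration under assumptions: the tool path and the `-simple -nopgbrk` flags are copied from the script above, while the function names `extract_text` and `unsearchable_pdfs` are made up for this example. A PDF counts as non-searchable here when pdftotext produces nothing but whitespace.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List

# Assumed install location, copied from the XYplorer script; adjust as needed.
PDFTOTEXT = r"C:\Tools\xpdf-tools\pdftotext.exe"


def extract_text(pdf: Path, tool: str = PDFTOTEXT) -> str:
    """Run pdftotext with "-" as the output file so the text goes to stdout."""
    result = subprocess.run(
        [tool, "-simple", "-nopgbrk", str(pdf), "-"],
        capture_output=True,
        check=False,  # a failed run yields empty output and is treated as "no text"
    )
    return result.stdout.decode("utf-8", errors="replace")


def unsearchable_pdfs(
    pdfs: Iterable[Path],
    extractor: Callable[[Path], str] = extract_text,
) -> List[Path]:
    """Return the PDFs for which the extractor yields only whitespace."""
    return [pdf for pdf in pdfs if not extractor(pdf).strip()]
```

It could be called as `unsearchable_pdfs(Path("D:/Scans").rglob("*.pdf"))`, where the folder is a placeholder. Note that a PDF pdftotext cannot open at all would also land in the list, so the result is a list of candidates rather than a verdict.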

Post by **NotNull** » Thu Oct 21, 2021 10:18 pm

XYplorer forum thread: https://www.xyplorer.com/xyfc/viewtopic.php?f=3&t=22803
Total Commander forum thread: https://www.ghisler.ch/board/viewtopic.php?t=73928
Everything forum thread: viewtopic.php?f=5&t=9621

Post by **raccoon** » Fri Oct 22, 2021 8:09 am

If we assume that every readable PDF contains the letter "e", but image PDFs do not, then this search term should do ya. Seems to work on my end.

*.pdf !content:"e"

There's more to read about content indexing in Everything 1.5 Alpha to speed things up across multiple queries.

(This probably won't work in Windows 7 with no PDF iFilter. I don't know where to get a PDF iFilter in Windows 7.)

((I looked through several sample PDFs of versions 1.3, 1.4, 1.5 and 1.6 for any catchall verb in the specification that identifies the presence of printable text, but I could find none, even staring at a hex editor. But your non-OCR'd PDFs probably have something in common that your post-OCR'd PDFs lack, so they could be told apart by signatures left behind by the authoring software. Try *.pdf ansicontent:%PDF-1.3 and *.pdf ansicontent:%PDF-1.4 to find PDF files of different format versions.))
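Outside Everything, the version check from the aside above could be sketched in Python like this. It is a rough illustration, not a full PDF parser: it only looks for the `%PDF-x.y` header in the first kilobyte, and the function names are made up for this example.

```python
import re
from pathlib import Path
from typing import Optional

# A PDF begins with a header such as "%PDF-1.4".
HEADER = re.compile(rb"%PDF-(\d\.\d)")


def version_from_header(head: bytes) -> Optional[str]:
    """Return the x.y version found in the leading bytes, or None if absent."""
    match = HEADER.search(head)
    return match.group(1).decode("ascii") if match else None


def pdf_version(path: Path) -> Optional[str]:
    """Read the start of the file and extract the header version."""
    with open(path, "rb") as fh:
        return version_from_header(fh.read(1024))
```

Grouping a folder of scans by `pdf_version` would show whether the pre-OCR and post-OCR files happen to differ in version. That depends entirely on the software that wrote them, so it may not separate them at all.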

(((You could also look at the date-created or date-modified times to determine if you created this PDF before or after you started using OCR software.)))
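The date idea could be tried with a few lines of Python as well. This is a minimal sketch assuming you know roughly when you switched OCR on. The cutoff date and folder in the usage note are placeholders, and the function names are made up for this example.

```python
from datetime import datetime
from pathlib import Path
from typing import List


def is_older_than(mtime: float, cutoff: datetime) -> bool:
    """True if a modification timestamp falls before the cutoff."""
    return mtime < cutoff.timestamp()


def pdfs_modified_before(root: Path, cutoff: datetime) -> List[Path]:
    """List PDFs under root whose date-modified is earlier than the cutoff."""
    return sorted(
        p for p in root.rglob("*.pdf")
        if is_older_than(p.stat().st_mtime, cutoff)
    )
```

For example, `pdfs_modified_before(Path("D:/Scans"), datetime(2021, 6, 1))` would list everything last modified before June 2021. Files that were re-saved or OCR'd later will have a newer date-modified, so the date only helps if the scans were left untouched.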

Post by **horst.epp** » Fri Oct 22, 2021 8:25 am

raccoon wrote (Fri Oct 22, 2021 8:09 am):
If we assume that every readable PDF contains the letter "e", but image PDFs do not, then this search term should do ya. Seems to work on my end.

*.pdf !content:"e"

There's more to read about content indexing in Everything 1.5 Alpha to speed things up across multiple queries.

Works perfectly together with content indexing.

Post by **spedinfargo** » Fri Oct 22, 2021 1:46 pm

Thanks all! I'll give these ideas a shot.

voidtools forum
