Need javascript code to identify non-searchable PDF

Rojoj · May 13, 2010, 7:15pm

Hi,

Can someone give me JavaScript code or a macro that, when run on a PDF file, can identify whether the PDF is searchable or non-searchable?

KittyZheng · May 17, 2010, 8:42am

PDF documents do not have an attribute that prevents them from being searched. However, the content is often encoded or encrypted in a way that makes plain-text searches impossible.

The easiest way to decode a document is to use a third-party PDF component. I use ABCpdf from webSupergoo, which supports C#, VB.NET, VBScript, ASP and ASP.NET. It also provides a COM interface for interoperation with other languages.

Given the complexity of PDF files, it is very unlikely you will find JavaScript code that’ll do this for you.

The following VBScript example shows how to decompress the content streams in a PDF using ABCpdf. Copy the code into a text file and change the file extension from ‘.txt’ to ‘.vbs’.

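' Usage: drop a PDF onto this .vbs file, or pass its path as the first argument.
If WScript.Arguments.Count = 0 Then
  MsgBox "Drop a PDF file onto this script."
  WScript.Quit
End If
theFile = WScript.Arguments.Item(0)

Set theDoc = CreateObject("ABCpdf7.Doc")
theDoc.Read theFile
' Object count as reported by ABCpdf
theCount = theDoc.GetInfo(0, "Count")
' Ask ABCpdf to decompress each object in turn
For i = 1 To theCount
  theDoc.GetInfo i, "Decompress"
Next
' Save an unlinearized copy alongside the original
theDoc.SaveOptions.Linearize = False
theDoc.Save theFile & "_dec.pdf"

MsgBox "Done"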

Simply drop your PDF file onto the VBScript file to decompress it.
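A full PDF parser is out of reach for a short script, but once streams are readable a rough check for text is possible. The sketch below is a heuristic for Node.js, not ABCpdf code: it inflates Flate-compressed streams with the built-in zlib module and looks for the text-showing operators Tj and TJ together with a BT marker. The name looksSearchable is made up for illustration. It assumes an unencrypted file that uses only Flate compression, and it can misreport files whose fonts have no Unicode mapping.

```javascript
const zlib = require("zlib");

// Heuristic sketch: true if some stream appears to draw text.
// Assumptions: unencrypted PDF, streams either raw or Flate-compressed.
function looksSearchable(pdfBuffer) {
  // latin1 maps bytes to characters one-to-one, so no data is lost.
  const raw = pdfBuffer.toString("latin1");
  const streamRe = /\bstream\r?\n([\s\S]*?)\r?\n?endstream/g;
  let match;
  while ((match = streamRe.exec(raw)) !== null) {
    let body = Buffer.from(match[1], "latin1");
    try {
      // Z_SYNC_FLUSH tolerates data whose last bytes were trimmed.
      body = zlib.inflateSync(body, { finishFlush: zlib.constants.Z_SYNC_FLUSH });
    } catch (err) {
      // Not Flate data (already decompressed or another filter): scan as-is.
    }
    const text = body.toString("latin1");
    // A string or array operand followed by Tj/TJ, plus a BT (begin text) marker.
    if (/\bBT\b/.test(text) && /[)\]>]\s*(?:Tj|TJ)\b/.test(text)) {
      return true;
    }
  }
  return false;
}
```

To try it, something like looksSearchable(require("fs").readFileSync("scan.pdf")) should do, where scan.pdf is a placeholder file name. A result of false suggests an image-only (scanned) document; a result of true only means text operators are present, not that every search will succeed.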

Be aware that text in PDF files is typically broken into short, arbitrary fragments and might require additional work to reconstruct. ABCpdf has a GetText function that simplifies this task.
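To illustrate the fragment problem, here is a small JavaScript sketch that joins the pieces of a single TJ operand as it might appear in a decompressed content stream. It is not how GetText works internally. The name joinTJ and the gap threshold of -200 are assumptions chosen for the example, and it handles only simple literal strings (no nested unescaped parentheses, no hex strings, no font encoding lookups).

```javascript
// Joins the string fragments of one TJ operand,
// e.g. "[(Hel) 20 (lo) -300 (world)]".
// Assumption: an adjustment below gapThreshold marks a word gap.
// (In TJ arrays, negative numbers move the next glyph further right.)
function joinTJ(operand, gapThreshold = -200) {
  // Match either a literal string "(...)" or a number.
  const tokenRe = /\((?:\\.|[^\\()])*\)|-?\d*\.?\d+/g;
  let out = "";
  for (const tok of operand.match(tokenRe) || []) {
    if (tok[0] === "(") {
      // Strip the parentheses and unescape \( \) and \\.
      out += tok.slice(1, -1).replace(/\\([()\\])/g, "$1");
    } else if (parseFloat(tok) < gapThreshold) {
      out += " ";
    }
  }
  return out;
}

console.log(joinTJ("[(Hel) 20 (lo) -300 (world)]")); // prints "Hello world"
```

Real documents need more care than this, since the right threshold depends on the font and size, and the bytes inside the strings are glyph codes that may not be plain ASCII.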

The contents of a PDF document can also be protected by encryption. ABCpdf supports a number of encryption standards and can decrypt these files for you, if you know the password. See the documentation on the Encryption object for further details.
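If you only need to know whether a file is encrypted before trying to read it, a quick JavaScript check is to look for an /Encrypt entry, which an encrypted PDF carries in its trailer dictionary. This is a plain byte scan rather than a real parse, and the name looksEncrypted is made up for the example, so treat a positive result as a hint only.

```javascript
// Heuristic sketch: true if the file appears to declare an /Encrypt entry,
// either as an indirect reference ("/Encrypt 5 0 R") or an inline dictionary.
function looksEncrypted(pdfBuffer) {
  const raw = pdfBuffer.toString("latin1");
  return /\/Encrypt\s+\d+\s+\d+\s+R|\/Encrypt\s*<</.test(raw);
}
```

It does not tell you which encryption standard is used or whether a password is required to open the file; for that you still need a PDF component such as the one described above.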