Extracting text from PDF

shanor · January 19, 2003, 11:05am

I’ve been trying to solve this for quite some time. I’ve searched the web and read the RFCs (1950, 1951) and the Adobe documentation, but I still can’t figure it out.
Any help will be greatly appreciated.

Well, I have a PDF file.
In the file itself, among all the other objects, you have the content object.
The content object, and in fact many other objects in a PDF document, are compressed using the “FlateDecode” encoding, which as I understand it is some kind of open-source compression that is somewhat better than LZW.
Now, I have done some reading and I know that the compressed data contains a tree of lengths, where the most common character is the one with the shortest length from the tree root, and so on.
What I don’t understand is how, from a specific chunk of compressed data that looks like lots of garbage, you can extract the tree and the characters.

This is a copy-pasted section.
Length: 71 bytes.

X…-Æ1
€@ À~_±/ÈmrÈ A{!o°,ü£…L3ÖBÛEgp·Ì¤>S¦ÞƒcJ `Ý˜åK]Ø
a^¡ õ

How can I, using FlateDecode, extract from this the tree, the characters, and the actual data that is encoded there?
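
For readers following along: a FlateDecode stream is normally a zlib stream (RFC 1950) wrapping DEFLATE data (RFC 1951), so the first few bytes already say whether a Huffman tree is stored at all. The sketch below is only an illustration, and `inspectZlibStream` is a hypothetical helper name rather than part of PHP or any library. It reads the two-byte zlib header and the first DEFLATE block header, using nothing beyond core PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): inspects the zlib header
// (RFC 1950) and the first DEFLATE block header (RFC 1951).
function inspectZlibStream(string $data): array
{
    if (strlen($data) < 3) {
        throw new InvalidArgumentException('Need at least 3 bytes.');
    }
    $cmf = ord($data[0]); // compression method and window size
    $flg = ord($data[1]); // check bits, preset-dictionary flag, level

    // If FDICT (bit 5 of FLG) is set, a 4-byte dictionary id follows.
    $hasDict = ($flg & 0x20) !== 0;
    $offset = $hasDict ? 6 : 2;
    if (strlen($data) <= $offset) {
        throw new InvalidArgumentException('Stream is too short.');
    }

    // DEFLATE packs bits starting from the least significant bit:
    // bit 0 is BFINAL, bits 1-2 are BTYPE.
    $first = ord($data[$offset]);
    $types = array('stored', 'fixed Huffman', 'dynamic Huffman', 'reserved');

    return array(
        'method'     => $cmf & 0x0F,        // 8 means DEFLATE
        'windowBits' => ($cmf >> 4) + 8,    // log2 of the window size
        'checkOk'    => (($cmf << 8) | $flg) % 31 === 0,
        'presetDict' => $hasDict,
        'finalBlock' => ($first & 0x01) === 1,
        'blockType'  => $types[($first >> 1) & 0x03],
    );
}
```

If the block type is "fixed Huffman", no tree is stored in the data; the code lengths are predefined in RFC 1951. If it is "dynamic Huffman", the code lengths that define the trees follow right after the block header, and only after those come the encoded literals and length/distance pairs.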

Now for the problem: I can’t install anything that doesn’t come with the basic PHP package, so no zlib and no other compression library.
Anyone?

Thanks in advance,
Shanor.

tuppas2 · January 5, 2008, 2:36am

Did you ever get an answer on this post?

silentcollision · January 5, 2008, 3:35am

Whoa. Old thread.

Are you wanting to extract text from PDF files?

I’ve been trying to find a solution for a while now. Nothing concrete. There’s the function below, which works with version 1.4 PDFs but nothing else.

function pdf2string($sourcefile) {
    /*
    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);
    */
    $content = file_get_contents($sourcefile);
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;
    while ($pos !== false && $pos2 !== false) {
        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);
        if ($pos !== false && $pos2 !== false) {
            // The data starts after the 'stream' keyword and its
            // end-of-line marker (CRLF or a lone LF). Compare against
            // strings: comparing a character with 0x0d never matches.
            $start = $pos + strlen($searchstart);
            if (substr($content, $start, 2) === "\r\n") {
                $start += 2;
            } else if (substr($content, $start, 1) === "\n") {
                $start++;
            }
            // Drop the end-of-line marker that precedes 'endstream'.
            $end = $pos2;
            if (substr($content, $end - 2, 2) === "\r\n") {
                $end -= 2;
            } else if (substr($content, $end - 1, 1) === "\n") {
                $end--;
            }
            $textsection = substr($content, $start, $end - $start);
            // Streams that are not zlib-compressed yield false here and
            // are skipped by pdfExtractText().
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);
            $startpos = $pos2 + strlen($searchend) - 1;
        }
    }
    return preg_replace('/(\\s)+/', ' ', $pdfText);
}

There are quite a few more functions here, but most of them don’t work.

If you happen to come across a solution, please let me know.

Edit: Have you seen this thread?

lorenw · January 5, 2008, 1:36pm

I use this and it does the job.


function pdf2string($sourcefile) {

    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);

    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;

    while ($pos !== false && $pos2 !== false) {

        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);

        if ($pos !== false && $pos2 !== false) {

            // The data starts after the 'stream' keyword and its
            // end-of-line marker (CRLF or a lone LF). Compare against
            // strings: comparing a character with 0x0d never matches.
            $start = $pos + strlen($searchstart);
            if (substr($content, $start, 2) === "\r\n") {
                $start += 2;
            } else if (substr($content, $start, 1) === "\n") {
                $start++;
            }

            // Drop the end-of-line marker that precedes 'endstream'.
            $end = $pos2;
            if (substr($content, $end - 2, 2) === "\r\n") {
                $end -= 2;
            } else if (substr($content, $end - 1, 1) === "\n") {
                $end--;
            }

            $textsection = substr($content, $start, $end - $start);
            // Streams that are not zlib-compressed yield false here and
            // are skipped by pdfExtractText().
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);
            $startpos = $pos2 + strlen($searchend) - 1;

        }
    }

    return preg_replace('/(\\s)+/', ' ', $pdfText);

}

function pdfExtractText($psData){

    if (!is_string($psData)) {
        return '';
    }

    $text = '';

    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\\]', '##ENDSBRACKET##', $psData);

    preg_match_all(
        '/(T[wdcm*])[\\s]*(\\[([^\\]]*)\\]|\\(([^\\)]*)\\))[\\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < sizeof($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\\(([^)]*)\\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }

    // Translate special characters and put back brackets.
    $trans = array(
        '...'             => '&#8230;',
        '\\205'           => '&#8230;',
        '\\221'           => chr(145),
        '\\222'           => chr(146),
        '\\223'           => chr(147),
        '\\224'           => chr(148),
        '\\226'           => '-',
        '\\267'           => '&#8226;',
        '\\('             => '(',
        '\\['             => '[',
        '##ENDBRACKET##'  => ')',
        '##ENDSBRACKET##' => ']',
        chr(133)          => '-',
        chr(141)          => chr(147),
        chr(142)          => chr(148),
        chr(143)          => chr(145),
        chr(144)          => chr(146),
    );
    $text = strtr($text, $trans);

    return $text;

}
$sourcefile = 'February.pdf';
$get = pdf2string($sourcefile);
echo $get;

silentcollision · January 5, 2008, 9:55pm

lorenw:

I use this and it does the job.



I can only get that working for PDFs using version 1.4. No other version will work.
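
To check which version a given file declares, the header line can be read directly. This is a minimal sketch, assuming the file begins with the usual `%PDF-x.y` header; `pdfHeaderVersion` is a hypothetical helper name. It only reports the declared version and does not by itself explain why a particular file fails to parse.

```php
<?php
// Hypothetical helper: returns the version from the "%PDF-x.y" header,
// or null when no header is found near the start of the data.
function pdfHeaderVersion(string $content): ?string
{
    // Some files carry a few stray bytes before the header, so look
    // in the first 1024 bytes rather than only at offset 0.
    if (preg_match('/%PDF-(\d+\.\d+)/', substr($content, 0, 1024), $m)) {
        return $m[1];
    }
    return null;
}
```

For example, `pdfHeaderVersion(file_get_contents('February.pdf'))` could be logged next to each file that `pdf2string()` returns nothing for, to see whether the failures really line up with the version.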

lightweaver · November 19, 2008, 7:32pm

I am using the function posted by ‘lorenw’, and need to be able to generate the plain-text string with word breaks (so as to make it possible to search the string for the occurrence of phrases). Any help would be greatly appreciated!!

lorenw · November 20, 2008, 1:58am

I posted that and have never actually used it in production.

Doesn’t echo $get; give you word breaks? I have an accompanying function that echoes out the first forty words and relies on word breaks (spaces).

That script is probably 2 years old by now and worked last time I checked.

Anyway, $get should give you a text string.
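
The accompanying function itself is not shown in the thread. As a rough idea of what a "first forty words" helper could look like, here is a sketch; `firstWords` is a hypothetical name, and this is not the original function.

```php
<?php
// Hypothetical helper: returns the first $limit whitespace-separated
// words of $text, joined by single spaces.
function firstWords(string $text, int $limit = 40): string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $limit));
}
```

It only works if the extracted string contains spaces in the first place, which is why it depends on word breaks.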

lightweaver · November 20, 2008, 8:26am

Thanks for the quick reply! $get gives me a string with all the characters in the pdf file without any spaces (word breaks) - I’ve tried it on several different files, and all yield similar results. Could you post that second function that you mentioned??

Thanks for the help.
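
One possible cause of the missing spaces, offered as a guess rather than a diagnosis of these particular files: many PDFs do not store space characters at all, and instead express word gaps as numeric adjustments inside TJ arrays, for example `[(Hello) -250 (World)] TJ`. The numbers are in thousandths of a text-space unit, and a negative value moves the next glyph further right. pdfExtractText() above discards those numbers. The sketch below treats a large negative adjustment as a word break; `tjArrayToText` and the threshold of 200 are assumptions for illustration, and the right threshold depends on the font.

```php
<?php
// Hypothetical helper: converts the body of a TJ array (the text
// between [ and ]) to plain text, inserting a space wherever the
// positioning adjustment is at or below -$gapThreshold.
function tjArrayToText(string $arrayBody, float $gapThreshold = 200.0): string
{
    $text = '';
    // Each item is either a (string), which may contain backslash
    // escapes, or a number.
    preg_match_all(
        '/\(((?:[^()\\\\]|\\\\.)*)\)|(-?\d*\.?\d+)/s',
        $arrayBody,
        $items,
        PREG_SET_ORDER
    );
    foreach ($items as $item) {
        if (isset($item[2]) && $item[2] !== '') {
            // Numeric adjustment: a large negative value is a word gap.
            if ((float) $item[2] <= -$gapThreshold) {
                $text .= ' ';
            }
        } else {
            // String operand: append it as-is (escapes are not decoded).
            $text .= $item[1];
        }
    }
    return $text;
}
```

In pdfExtractText(), the array body is what ends up in `$matches[3][$i]`, so one way to try this would be to append `tjArrayToText($matches[3][$i])` there instead of concatenating the sub-matches.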

frank1 · November 20, 2008, 9:56am

Well, this was very helpful.

I will have a look into it.

Anyway,
the best use of this kind of thing I have seen is here…

Anyway, I am not able to convert those (big) PDFs to text and define those sizes.
(The other parts are easy: permission management, SEO things, rewrites and all.)
It’s just that part.
I have a general idea. If any experts are ready to work on that part commercially, PM me.

Well, I feel it is not against SitePoint’s TOS to say so. Actually, I want some expert to assist me or do the hard part, but I don’t feel people will do it if I ask for it for free.

Thanks

lorenw · November 20, 2008, 2:49pm

I just tried reading a number of PDFs. Some could not be read; however, all of the PDFs that could be read did have spaces between the words.

Just did a Google search for
“php” “pdf to text”

You may find a quick answer there. It seems to be a popular topic.