bar338 — 2010-06-12T17:12:56-04:00 — #1
I've done some research and I'm not sure this is possible in PHP, but I thought I would ask the experts.
In its most basic form, I want to be able to upload a document and extract all the text from it, which will then be parsed.
Is this possible with PHP, either through a built-in library or a third-party library?
What other languages better suit this task (I would like to keep it web-based if possible)?
felgall — 2010-06-12T17:36:13-04:00 — #2
Anything that you feed through any OCR needs to be proofread to correct all the errors before you can use it. There is no way to automate that part of the process.
anthonysterling — 2010-06-12T17:33:41-04:00 — #3
Excellent advice, although I'd probably opt for a CLI-compatible OCR app; I don't think PHP has a place here.
scallioxtx — 2010-06-12T17:30:44-04:00 — #4
The process is called OCR (Optical Character Recognition).
You could take a look at phpOCR.
I've never tried it myself, but it's worth a shot.
By the way, I fiddled around with OCR in the past (using desktop apps, not PHP) and found that they're not flawless. For example, they easily confuse a c for an o and vice versa. Overall the results are not bad, but the output looks a bit like it was written by someone who made some typos here and there.
wackyjoe — 2010-06-12T21:30:57-04:00 — #5
Since you're on a Linux box (I'm assuming), you should have a look at Ocropus. It's not PHP, but of course you can call it via exec (=( ).
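For what it's worth, here is a rough sketch of what calling a command-line OCR tool from PHP via exec might look like. The exact Ocropus commands aren't given in this thread, so the sketch uses Tesseract's basic `tesseract <image> <outputbase>` invocation as a stand-in; that choice, the `OCR_BINARY` constant, and the function names `buildOcrCommand` and `extractText` are all assumptions made for illustration, not anything the tools themselves provide. Adjust the command line to whatever OCR program is actually installed.

```php
<?php
// Assumption: a CLI OCR binary that takes an image path and an output base
// name and writes "<outputbase>.txt" (Tesseract's basic usage). Swap in the
// command for whichever OCR tool is installed on the server.
const OCR_BINARY = 'tesseract';

// Hypothetical helper: build the shell command with every argument escaped,
// so an uploaded file name cannot inject shell syntax.
function buildOcrCommand(string $imagePath, string $outputBase): string
{
    return sprintf(
        '%s %s %s 2>&1',
        escapeshellcmd(OCR_BINARY),
        escapeshellarg($imagePath),
        escapeshellarg($outputBase)
    );
}

// Hypothetical helper: run the OCR tool on one image and return its raw text.
function extractText(string $imagePath): string
{
    // tempnam() creates an empty placeholder file; the OCR tool is expected
    // to write its result next to it as "<placeholder>.txt".
    $outputBase = tempnam(sys_get_temp_dir(), 'ocr');
    if ($outputBase === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }

    exec(buildOcrCommand($imagePath, $outputBase), $lines, $status);

    $textFile = $outputBase . '.txt';
    $text = ($status === 0 && is_file($textFile))
        ? file_get_contents($textFile)
        : false;

    // Clean up both the placeholder and the result file.
    @unlink($outputBase);
    @unlink($textFile);

    if ($text === false) {
        throw new RuntimeException("OCR failed:\n" . implode("\n", $lines));
    }
    return $text;
}
```

Something like `$text = extractText($_FILES['scan']['tmp_name']);` would then hand the recognised text to the parser (the `scan` field name is made up too). As noted earlier in the thread, the result still needs proofreading before you rely on it.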