How can I digitize a mountain of paper files?

My question covers a lot of territory, but I think this is the best forum for starters, as it’s going to involve purchasing some new hardware and software.

A few weeks ago, my neighbor’s apartment burned down, nearly taking mine along with it. When I got home from work in the middle of the night, it looked like my apartment had vanished at first. It made me think: What if I had lost all the research I’ve done over a lifetime, much of it in the form of letters, documents and books not even stored in a computer? In fact, I have more than two dozen boxes full of papers that do little more than take up space.

So I decided it’s time to digitize them…I think that’s the right word; I want to essentially scan everything into a computer, but I need some guidance.

For starters, there are at least three ways to transform a paper document into a computer file that I’m aware of…

  1. Scan it as an image.
  2. Scan it as a PDF file.
  3. Use optical character recognition (OCR) software to scan it as text.

In most cases, I’d prefer to scan it as text, not just for the smaller file size but because it would be a much more useful format; I could search it, edit the text, etc.

However, I won’t be able to use OCR software with files that are faded, smudged, etc. Also, there are many images I’ll want to scan.

So let’s start with files that can’t be copied with OCR software. Should I scan them as images or PDF files? (If I scan it as a graphic, which format should I use - TIFF, GIF, JPEG, etc.?) I’ve never worked with PDF files before, but I think you scan a document into a PDF format, right? If so, I assume it’s a smaller file size than a graphic.

Can I search for text on PDF files? In other words, if I search my computer for the word “jaguar,” will it search PDF files or only text files?

Do images come out nice in PDF files, or would it be better to scan them into an image format, like JPG?


Next question - What hardware and software should I buy?

I recently upgraded to a new MacBook Pro running Lion. I think I can get a good scanner for $150, maybe even less than $100 - right? Are there any particular models you recommend?

Do I need to buy special software to create PDF files?

What OCR software do you recommend?

Also, most of my files can be scanned with a regular scanner. (I think the term is flatbed scanner.) However, I have some thicker books that are going to be harder to scan. I’ve seen ads for little handheld scanners. Do they work well, and can you recommend any particular models?


Sorry to cram so many questions into one post. I think my project is pretty simple, really, or it will be after I figure out a few things - like PDF vs images vs OCR. Merely purchasing a scanner will probably be 90% of the project, especially if I can find one that comes with OCR software pre-installed. :wink:

Thanks!

The best way to scan documents is into PDF file format. If you have computer-generated (printed) text, then OCR can recognize it and a search will be able to find your search words. If you have hand-written text or poor-quality text, then it usually can’t be recognized and you cannot use search.

Usually you will get software with the scanner that will include OCR and the different forms of output. When you’re looking at different models, make sure you check out the functionality of the software, and if possible get a demonstration so you know how easy and quick it is to use.

(Note: the scanning-to-text that I’ve done has been on quite old kit, so things might have moved on since then)

If you scan to PDF, it will scan the entire page as an image, and that’s how it will be stored. You can then run the OCR tool on areas of text, and it will figure out what the text is but usually won’t change the image, so it doesn’t neaten it up at all.

A common option is to scan to Word, where it will replace the OCR’d bits with actual text. This obviously looks a lot neater, and if you have mostly text then it can lead to smaller file sizes.

Scanning to image isn’t something I would recommend for whole page documents. It’s fine when what you want is actually an image, but not great for areas of text, and it’s much harder to then manipulate.

In terms of searching, even if the PDF file contains text (whether that’s generated from a source file or OCR’d from a scan), a Windows search can’t find it (at least not on XP; newer versions may be better), and I don’t know how a Mac search works. However, you can do a search within Adobe Reader that will trawl all PDF files within a given folder for a text string.
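If the built-in search doesn’t look inside your scans, one workaround is to keep the OCR output as a plain-text file next to each scan and search those instead. Here is a minimal Python sketch of that idea. The sidecar .txt layout and the function name `find_word` are my own assumptions for illustration, not something any particular scanner software produces by default.

```python
from pathlib import Path


def find_word(folder, word):
    """Return the names of .txt files under `folder` that contain `word`.

    Assumes (hypothetically) that each scan has its OCR text saved as a
    .txt file somewhere under `folder`. The match is case-insensitive.
    """
    needle = word.lower()
    hits = []
    for path in Path(folder).rglob("*.txt"):
        # Ignore undecodable bytes so one odd file doesn't stop the search.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(path.name)
    return sorted(hits)
```

Calling `find_word("scans", "jaguar")` would list every text file under a (hypothetical) `scans` folder that mentions the word, whatever the capitalization.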

You should take a look at PDF document scanners, rather than standard flatbed scanners.

I’ve had good results with Xerox models that come with PaperPort software.

Be prepared to spend $150-$300, though.

Thanks for all the tips!

Force Flow: Can a PDF document scanner ALSO do OCR work, or would I have to buy TWO scanners - a PDF scanner and a regular scanner for scanning text and images?

Thanks.

It’s the software that does the OCR, and that’s independent of the type of scanner you’ve got. (Heck, you can even get software that will OCR images and PDFs from your hard drive without even having a scanner installed!)
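To illustrate OCR running on files that are already on the hard drive, here is a rough Python sketch that walks a folder of scanned images and hands each one to the open-source Tesseract command-line tool. Tesseract isn’t mentioned anywhere in this thread; it’s just one example I’m assuming is installed, and the function names and the English language setting are illustrative choices.

```python
import shutil
import subprocess
from pathlib import Path

# File types treated as scanned images (an illustrative choice).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_commands(folder):
    """Build one Tesseract command line per image file in `folder`."""
    commands = []
    for image in sorted(Path(folder).iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            # Tesseract takes an output base name and adds ".pdf" itself.
            out_base = image.with_suffix("")
            commands.append(
                ["tesseract", str(image), str(out_base), "-l", "eng", "pdf"]
            )
    return commands


def run_ocr(folder):
    """Run the commands; requires the tesseract program on the PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on this machine")
    for command in ocr_commands(folder):
        subprocess.run(command, check=True)
```

The trailing `pdf` argument asks Tesseract for a PDF that keeps the page image and adds a searchable text layer, which matches the scan-to-PDF-then-OCR workflow described earlier in the thread.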

On the hardware side: many scanners have a rated volume; that is, they will do so many scans and then wear out. (LOL, one utility company found that Xerox copiers actually handled the load better than a scanner built just for scanning.) Also, your basic OCR text scanner does not need to be high resolution. For images where you want good image quality, you will need a high-resolution scanner. Basic scanning doesn’t take long; OCR conversion and high-res scans take longer, so that is another consideration.
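To put rough numbers on the resolution trade-off, here is a small Python sketch that estimates the uncompressed size of one scanned page. The 300 and 600 dpi figures and the US Letter page size are my own illustrative assumptions, and real files will be smaller once they are compressed.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw size of one scanned page, before any compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# A US Letter page (8.5 x 11 inches) under three illustrative settings.
text_page = uncompressed_scan_bytes(8.5, 11, 300, 1)     # black and white
color_page = uncompressed_scan_bytes(8.5, 11, 300, 24)   # full color
photo_page = uncompressed_scan_bytes(8.5, 11, 600, 24)   # high-res color

print(text_page)   # 1051875   (about 1 MB)
print(color_page)  # 25245000  (about 25 MB)
print(photo_page)  # 100980000 (about 101 MB)
```

Doubling the resolution quadruples the pixel count, which is a big part of why high-res scans take longer and eat more disk space.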

If you have a whole lot of stuff and a budget, there are services that will digitize all your paper documents for you. My CPA used one; they did a room full of old documents, but it was expensive. Now he has a Fujitsu PDF scanner with a document feeder, which works well for standard-size paper, but books, odd-size paper, etc. are labor-intensive to scan.

There’s an online system called DocShelf designed for this exact purpose. You download a little app, and it reads documents from your scanner and uploads them to their system. You can then log in through their web interface and look at your documents. My sister’s dental office uses it to store their records. It’s more targeted to businesses, but they also have a cheap personal account for individuals.

I’m sure that you can hand it all to your local FedEx Kinkos and they’ll do it for you.

~TehYoyo