Wednesday, July 6, 2011

Scanning: The saga continues

When last we left our hero he was trying to scan in one of his stories from a science fiction magazine so he'd have a text file to edit to reprint the story. You know, the job that people always tell you is "easy".

In a pig's eye, said the little bird.

The scanning itself went fine. I've been scanning contracts and such for long enough I pretty much know the drill. The fun began when I tried to put the .pdf file from the scan through the OCR software to get a text file.

The first problem was I couldn't get the program to start. Every time I gave it a file name to upload the text copy to, the program rejected it with a message saying there was an error in the path name. Finally I called HP tech support. It cost me $20 because the printer was out of warranty, and a lot of time on the phone, but the tech was able to locate the problem. It turned out I couldn't enter a path name without hitting the "browse" button. Talk about counter-intuitive and no hint in the documentation.

Okay, problem solved. Now to let the OCR software create the text file. Tell it to go and it starts scanning -- and then stops halfway into the first page. The system hangs, eventually crashes and does something bad to the system that requires a power-off re-start to fix.

Rinse and repeat a couple of times.

The obvious diagnosis is that the software stubbed its toe on something, wandered off into the weeds and did something funky somewhere it shouldn't. But what do I know? So another call to HP tech support. This time they don't charge me $20, but it takes an inordinate amount of time messing with settings and such. Finally in the course of this I repeat (for the second or third time) casually to the tech that this is a really big file.

How big? he asks not so casually. 40 pages or so, I tell him. Aha! A quick check of task manager reveals that the file is about 20 MB. Way too big for the program to handle. After a little messing the tech determines that there's no way the software will read a file that size. So it's either re-scan the file in smaller chunks (ugh!) or find other software that can handle it (less ugh, but expensive).

I decide to see what I can find in the way of new software. After looking, I come across a recommendation for an Open Source program called "FreeOCR", which is, guess what?, free.

What the heck. I'm on deadline and I need something NOW. So I download FreeOCR and try it.

And it works. Not perfectly. There are a couple of annoying little details, like having to select each column on the page separately to input, but it's fast and it will read a file that size without even breathing hard. (The trick is that FreeOCR operates on a page at a time, no matter how large the file is. So instead of choking on 20 MB of stuff, it takes it in smaller bites.) True, it doesn't do as good a job at character recognition, missing maybe one character in 200, but even at 40 pages that's good enough for me.
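
For the programmers in the audience, here is a rough Python sketch of that page-at-a-time idea. To be clear, this is not FreeOCR's actual code, and I haven't seen how it's written. The names ocr_in_pages, fake_ocr and demo_pages are made up for illustration, and fake_ocr is a stand-in that just upper-cases some bytes so the example runs without any OCR software installed.

```python
from typing import Callable, Iterable, Iterator


def ocr_in_pages(pages: Iterable[bytes],
                 ocr_page: Callable[[bytes], str]) -> Iterator[str]:
    """Yield the recognized text of each page, one page at a time.

    Only the current page is held while it is being recognized, so the
    working set depends on the size of one page, not the whole document.
    """
    for page in pages:
        yield ocr_page(page)


def fake_ocr(image: bytes) -> str:
    # Hypothetical stand-in for a real OCR engine (assumption, not a real API).
    return image.decode("ascii").upper()


def demo_pages() -> Iterator[bytes]:
    # Hypothetical page source: produces pages lazily, like reading a
    # multi-page scan one page at a time instead of loading all of it.
    for n in range(1, 4):
        yield f"page {n}".encode("ascii")


text = "\n".join(ocr_in_pages(demo_pages(), fake_ocr))
print(text)
```

The point of the generator is that nothing asks for page 2 until page 1 is done, which is the "smaller bites" approach in miniature.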

So a 'simple' job of scanning ended up taking two days of clock time because scanning isn't so simple at all. A few years ago I was talking to Eric Flint, who is head of the Baen Free Library. The Library is a program to provide free Baen e-books online. (Usually the first book in a series, which hooks you nicely and then they sell you the other books.) He was talking about how difficult it was to get books up for the library, especially the ones that were printed before delivering books as word processor files became popular.

Someone suggested that Eric use volunteers to scan in the books. Eric reacted negatively -- overly so, I thought.

However, after this week's experiences, I think I owe Eric an apology.


1 comment:

cracki said...

Volunteer reporting in. If you have clean color/grayscale/mono scans in 300-600 dpi jpg/tiff/pdf, I can work with that.

I scan and OCR all the letters I get. Sometimes even a book, though it takes ages to flip and scan the pages in front of the scanner. I know all about regex massages (and the occasional Python script) to make the scans come out nicely formatted, paragraphs, dashes, quotes, and all.