Much of the world of book scanning is devoted to turning paper-and-ink books into files fit for a Kindle. These files let users hotlink from the table of contents and change the typeface and font size. However, one side effect of these conversions to AZW (Kindle’s native format) and Epub (the universal electronic publishing format) is that the actual formatting of the book is in a state of flux. As convenient as these features are for general reading, they are extremely problematic for academic use. For example, depending on the device, typeface, or font size, a quote could land on page 23 or page 123, which can make following along with a lecture or a bibliographic reference nearly impossible. For more on this issue, and the completely useless work-around that both MLA and Chicago have implemented, check out this article.
For academic purposes, I need books that look like the originals. If the absent-minded professor at the front of the class says “pull out the red book and turn to page 23,” I don’t want to be left guessing. So .pdf is a perfect choice: each page is in essence a photo of the original, covers included, while behind the visible layer lies a text version of the page that allows for searching and cut-and-paste. Better still, if a user later wishes to convert these files to AZW or Epub, they are readily convertible.
Most scanners these days have an easy straight-to-PDF mode wherein the pages are scanned, run through an optical character recognition (OCR) phase to create that hidden text layer, and saved as very small .pdf files. While straight-to-PDF modes can be great for an office environment with small documents, for book scanning they are a false economy. They allow neither missed pages to be pasted back in without rescanning the whole book nor the scans to be cleaned up before they are assembled. Instead, I prefer to keep each stage of the process independent so as to allow for the highest quality product with the least effort.
Step 1: Scan the covers. I scan my covers before the book is cut out of its binding so that the book retains its original look as much as possible. I have found that 600 dpi .tif files are sufficient both for nice covers and for accurate recognition by the OCR software. The covers are touched up in Photoshop Express and saved until the bindings are removed.
Step 2: Scan the pages. Ideally, a full duplexing scanner like Fujitsu’s ScanSnap fi-6010N would make quick work of a book, scanning both sides of each page simultaneously at twenty-five pages per minute. However, at $2,300 Fujitsu’s duplexing scanner is well out of my reach, though still cheaper than a commercial non-destructive scanner. After cutting off the bindings, my Epson GT-1500’s simplex scanning creates forty odd-numbered pages labeled 001b, 002b, 003b, etc. and then, after flipping the stack, forty even-numbered pages to drop in, labeled 002, 003, 004, etc. The level settings in the scanning software are the most powerful tool for creating attractive, easy-to-OCR files. The white level needs to be set aggressively enough to burn out the paper’s texture, and the black level set high enough to thicken up the often fainter ascenders and descenders (the upward stroke of a d or the tail of a p, for example).
Step 3: Repeat and Collate. Most automatic document feeders have an upper limit on the number of pages that can be loaded. For my Epson, it is forty to forty-five, depending on paper thickness. By necessity, then, the pages are separated into forty-page bundles, which results in files numbered 001b, 002, through to 040b, 041. The next bundle starts at 041b and continues until you either come to another bundle or to your previously scanned back cover, named 999 or something equally unlikely to be overwritten by actual pages.
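Given this naming scheme, the expected file list for any bundle can be generated mechanically. Here is a minimal Python sketch; the function name and the three-digit zero-padding are my assumptions, so adjust to match what your scanner actually produces:

```python
def expected_names(sheets: int, start: int = 1) -> list[str]:
    """Expected scan filenames for a bundle of `sheets` sheets whose
    first front side is numbered `start`.  Fronts carry a 'b' suffix;
    the back of sheet k sorts in as the plain number k + 1."""
    names = []
    for k in range(start, start + sheets):
        names.append(f"{k:03d}b")      # front of sheet k (e.g. 001b)
        names.append(f"{k + 1:03d}")   # back of sheet k  (e.g. 002)
    return names

# First forty-sheet bundle: 001b, 002, 002b, 003, ..., 040b, 041
# The second bundle picks up at 041b: expected_names(40, start=41)
```

Comparing this list against a directory listing makes it obvious at a glance whether a bundle came through intact.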
Step 4: Clean up. If your files don’t follow the above pattern, then the automatic document feeder (ADF) probably skipped a page or two. It is easiest to find the place where the numbering got out of sync and delete the files after that point so that you can rescan from there. Alternatively, you can rename the existing files and insert the missed pages.
For most books very little post-processing is necessary. However, I periodically get a book with damage, printing-press artifacts (e.g. faint or heavy text), margin notes, or underlining that I want to address in Photoshop before sending the files to the OCR software. This can be extraordinarily time-consuming; I would rather have a clean copy from the start, but sometimes online booksellers forget to mention defects, or I get greedy and order another “$0.01 + shipping” book. Cleaning up a book in Photoshop is not, strictly speaking, necessary to make it usable, but it results in more accurate OCR processing and avoids some of the worst problems with academic book scanning helpfully highlighted here.
Step 5: Optical Character Recognition Processing. I use ABBYY FineReader Express for Mac, which seems to have been utterly abandoned by ABBYY and therefore lacks many of the features available in its Windows-based products. Many scanners come bundled with OCR software that varies widely in quality and capability. However, there is precious little information available on the net about these variables, which might necessitate a catch-and-release strategy (buying, trying, and returning scanners) until one finds the best choice.
Once you have found your favorite combination of scanner hardware and OCR software, this stage is easy. Point the OCR software at your book’s folder of TIF files and let it go. ABBYY FineReader Express for Mac corrects document skew to a microscopic degree, recreates tables, and ignores photos, all while creating that invisible layer of text that makes searching and copying possible. Most books I scan spend one to three hours getting their OCR makeover before a dialog box allows the finished .pdf file to be saved.
Step 6: Final check, downsizing, and archiving. The finished .pdf can be checked a final time in Adobe’s .pdf reader to make sure all the pages are present and there are no additional issues to address in Photoshop. Making changes requires that the OCR phase be repeated and the old .pdf deleted, but that part of the task doesn’t require any human input anyway. Once you are satisfied with the final PDF, the saved file will probably be between 150 MB and 275 MB, which is a little too large for most portable uses. At this point, you can enlist .pdf processing software to decrease the quality of the JPEG images in the .pdf and shrink the file size. I use Mac’s Automator with “Apply Quartz Filter to PDF”, setting “Image Sampling” to 50% (shrinking the 600 dpi scans to 300 dpi) and “Image Compression” to JPEG with the quality slider right in the middle. This usually halves the size of the finished .pdf; more aggressive sampling and compression settings would shrink the file still further.
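For those not on a Mac, Ghostscript can do roughly the same downsampling from a script. This is my own substitution, not part of the Automator workflow above; the sketch below only builds the command line, and actually running it (e.g. via subprocess.run) requires Ghostscript installed as `gs`:

```python
def ghostscript_downsample_cmd(src: str, dst: str, dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that downsamples a PDF's images to
    `dpi` and recompresses them -- a rough stand-in for the Quartz
    filter settings described above.  Requires `gs` on the PATH."""
    return [
        "gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-dDownsampleColorImages=true",
        f"-dColorImageResolution={dpi}",   # target resolution for page images
        "-dColorImageDownsampleType=/Bicubic",
        f"-sOutputFile={dst}",
        src,
    ]

# e.g. subprocess.run(ghostscript_downsample_cmd("book.pdf", "book-small.pdf"), check=True)
```

As with the Quartz filter, lowering the target dpi shrinks the file further at the cost of image quality.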
(NOTE for iPad users: iBooks seems to be allergic to large file sizes in my experience. For this reason I recommend iAnnotate, which processes larger files with greater success. The lower processing power of the iPad 1 caused some books to crash iAnnotate when it attempted to process them, but I have never had any issues with the iPad 2.)
The final step in book scanning is archiving the scans. I do this for two reasons. First, the whole point of book scanning is to create a digital library that will preserve my books for the rest of my life and beyond. If I save only the final .pdf or the lower-quality .pdf, then the book is locked into a format that might or might not be around in ten years. In contrast, TIF files are the most basic (least compressed) and fundamental image format. Whatever comes next will have to be able to process these basic image files, so I will always be able to return to my earliest books and recreate new versions as new file types become available.
Second, this process creates a great book to read on my iPad, but I need a permanent archive of books that isn’t (as) susceptible to loss or corruption. One of my old jobs was photographic retouching for professional photographers. In that capacity, I quickly learned that burned DVDs and even hard drives filled with image files often become unstable or corrupted after just a few years. While solid-state (flash-based) drives are the obvious solution to this problem, they are still very expensive. Until that changes, I use a RAID array of matching hard drives to archive my library, now well over a thousand volumes strong. While this might seem like overkill, I am unwilling to trust my electronic library to anything less.
So, it is possible to create high quality electronic books from modest hardware and keep them (relatively) safe for years to come. I welcome contributions from fellow book scanners and those interested in de-pulping their own libraries in the comment section.