Print Books to Ebooks
Here's how we convert your print books into ebooks:
First we get a copy of your book, which we generally acquire from a book seller or from you.
We then unbind the book. (Unless it's a rare copy; then we scan it gently.) Unbinding is basically removing the cover, cutting the spine off, trimming the pages to be very square, and manually ensuring the pages aren't stuck together. (The glue on books tends to seep in further than you think, so we have to be sure the pages are well separated, or else pages will get missed or jam in the scanner. They have to be very square or they scan crooked, which causes major problems in text recognition.)
We then scan the pages, being sure as we do that none have been missed. (Pages that are stuck together may go through the scanner as one sheet, silently omitting the text on the two facing sides, so we watch very carefully for that.) If any pages have been missed, we rescan them and insert their images where they belong. Sometimes pages that were only slightly stuck together jam, which often crinkles or tears them so they can't be scanned, in which case we hand-type those pages.
We run the scanned results through Optical Character Recognition (OCR) software, to convert the page images into text we can work with.
We scan the pages multiple times, using different settings, and run the images through the different OCR systems we have, to determine which combination of scanner settings + OCR settings + phase of the moon is giving us the most accurate text. (This differs depending on the font used in the printed book, the font size, the mix of fonts used in the book, the formatting of the headers/footers (some OCR systems & settings are better at detecting and removing those), and the kind of paper [its color, age, etc. determine how much the ink smudges as it moves through the scanner, and whether dirt or fibers in the paper look like periods or commas, or make one letter look like another to the software, such as a 'c' read as an 'e']. By trying different combinations of software and settings we find the one that works best for that book.)
Next we look for and fix errors introduced by the scanning process. OCR software that claims to be 99% accurate still makes one error per hundred characters, or about one wrong word per 17 words (figuring roughly six characters per word, counting the space). 99.9% accurate software still leaves one wrong word out of about 170 words, or a couple of errors per printed page.
That's completely unacceptable to us if we're going to put our name on it and publish it. Readers justifiably will post reviews on Amazon/etc. saying a book has a poor conversion job, which will deter others from buying it.
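The accuracy arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope illustration: the six characters per word (counting the space) and the 350 words per printed page are assumptions, and it assumes each bad character spoils a different word.

```python
# Assumptions for illustration only: about 6 characters per word (counting the
# trailing space) and about 350 words on a printed page.
CHARS_PER_WORD = 6
WORDS_PER_PAGE = 350


def words_per_error(char_accuracy: float) -> float:
    """Average number of words between errors, if each bad character ruins one word."""
    chars_per_error = 1 / (1 - char_accuracy)
    return chars_per_error / CHARS_PER_WORD


def errors_per_page(char_accuracy: float) -> float:
    """Expected number of wrong words on one printed page."""
    return WORDS_PER_PAGE / words_per_error(char_accuracy)


for accuracy in (0.99, 0.999):
    print(f"{accuracy:.1%} accurate: one wrong word per "
          f"{words_per_error(accuracy):.0f} words, "
          f"about {errors_per_page(accuracy):.1f} wrong words per page")
```

With these assumptions, even the better software leaves roughly two wrong words on every page, which adds up to hundreds over a whole novel.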
Unfortunately, finding the errors is a laborious process. The errors will be a mix of obvious ones that a spell checker might find and errors that software can't detect that are only visible to a human.
It's actually a pretty good Turing test. (Testing whether an entity is a human or not.) In fact, many of the "Are you human?" tests used on the web for filling out forms use exactly this method: Reading some text and figuring out the words, since computers have a hard time with it.
In our case, the problem is even harder: It's not just looking for words in a spell check dictionary, it's...
- Looking for words that scanned as other words, because of look-alike characters. For example, a 'cl' could look like a 'd'; or an 'rn' could look like an 'm'. Thus, which is the correct word, "c l o w n" or "d o w n"? "y a r n" or "y a m"?
All of those pass a spell check. Only a human proofreader can catch the problem, by noticing that the sentence doesn't sound right. ("The cat played with a ball of yam.") :) Even then the proofreader has to be alert and compare against the original. E.g. if it said, "I sent that clown to Earth", that still parses as grammatically fine, but so does, "I sent that down to Earth". A human can say, hmmm, "clown" doesn't sound appropriate for how that character speaks, and then check the original book to see what it said. (Or, since we sometimes find errors in the original books, we fix those if they're obvious, or check with you.)
- Missing or incorrect punctuation. Periods are small, and can be missed by the OCR software This results in unintentionally run on sentences. Or on the flip side, a speck of dirt can look like a period, so sometimes they get added where they. don't belong. Or. a comma that didn't print well might look like a period.
A human reader notices these (since they're very irritating), but they do require reading the whole piece carefully.
- Fixing the paragraphing. Page breaks often split
sentences, and the OCR systems don't always put them back together correctly, and may insert paragraphs where they don't belong, or join paragraphs together that shouldn't be. A spellchecker doesn't notice this; our human readers do.
- Fixing hyphenation. Print books often have hyphenated words, which the OCR systems can try to re-move, but not al-ways correctly. They may remove, or, leave in, dashes that create in-correct words that pass a spell check, but aren't correct per the original. A human reader will likely notice these since they don't read right. Thus we read it first so your readers won't complain.
- Fixing the spacing. OCR systems might misinterpret the extra spaces be tween letters or words in justified text, and add spaces (or remove them) when they shouldn't. This can lead to ebooks where the spacing isn't right, which looks disturbingly wrong and of poor quality to readers.
- Fixing the formatting. OCR systems aren't as good as they should be at identifying paragraph styles correctly. For example, chapter headings, poetry, inserted newspaper headlines, etc. may be formatted wrong.
Or main body text may come out e.g. centered when it shouldn't. A human can tell something doesn't read correctly and check the original, but a spell-checker can't.
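To make the hyphenation problem in the list above concrete, here is a minimal sketch of how line-end hyphens might be rejoined automatically, with the doubtful cases set aside for a human. It is an illustration, not a description of any particular OCR product; the `KNOWN_WORDS` set is a hypothetical stand-in for a real dictionary, and the function name is made up for this example.

```python
import re

# Hypothetical stand-in for a real dictionary; a real word list is far larger.
KNOWN_WORDS = {"remove", "always", "well-known", "resign", "re-sign"}

# A word fragment, a hyphen at the end of a line, then the rest of the word.
LINE_END_HYPHEN = re.compile(r"(\w+)-\n(\w+)")


def rejoin_hyphenated(text: str, known_words: set[str]) -> tuple[str, list[str]]:
    """Rejoin words split across lines; also return the cases a human should check."""
    needs_review = []

    def fix(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = f"{left}-{right}"
        joined_ok = joined.lower() in known_words
        hyphen_ok = hyphenated.lower() in known_words
        if joined_ok and not hyphen_ok:
            return joined          # the hyphen was only a line-break artifact
        if hyphen_ok and not joined_ok:
            return hyphenated      # the hyphen belongs to the word
        # Both forms are words, or neither is known: software cannot decide.
        needs_review.append(hyphenated)
        return hyphenated

    return LINE_END_HYPHEN.sub(fix, text), needs_review


sample = "They re-\nmove hyphens, but not al-\nways for a well-\nknown name like Zor-\nblax, who may re-\nsign."
fixed, review = rejoin_hyphenated(sample, KNOWN_WORDS)
print(fixed)
print("Check against the original:", review)
```

In this sample the invented name and the "re-sign"/"resign" case both end up on the review list, which is exactly the kind of decision that needs someone to look at the printed page.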
To avoid, detect, and fix these problems we do a number of things:
We typically take the best two, sometimes three, scans that used different settings and compare them against each other using software we developed. Wherever the scans disagree, at least one of them is wrong, so this finds scanning errors a spellcheck can't.
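As a rough idea of what comparing two scans can look like, here is a minimal sketch using Python's standard `difflib` module. It is not the software described above, just one simple way to list the places where two OCR outputs of the same page disagree; the function name and sample sentences are made up for illustration.

```python
import difflib


def find_disagreements(scan_a: str, scan_b: str) -> list[tuple[int, list[str], list[str]]]:
    """Return (word position in scan A, words in A, words in B) wherever the scans differ."""
    words_a, words_b = scan_a.split(), scan_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((a1, words_a[a1:a2], words_b[b1:b2]))
    return disagreements


scan_a = "The cat played with a ball of yam."
scan_b = "The cat played with a ball of yarn."
for position, in_a, in_b in find_disagreements(scan_a, scan_b):
    print(f"word {position}: {in_a} vs {in_b}")
```

Both "yam" and "yarn" pass a spellcheck, but the disagreement tells a human exactly where to look in the original. If both scans make the same mistake, nothing is flagged, which is one reason a full read-through is still needed.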
We spellcheck the results using multiple spellchecking systems. The problem with spellcheckers is that they don't know all the specific words that are correct in a document — names of characters and other proper nouns, arcane words, invented words in science fiction and fantasy, etc. Thus many of the flagged words turn out to be correct per the original. To determine which, a human has to look at the word and manually locate the same place in the original to see how the original had it.
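A small sketch shows why so many flagged words turn out to be fine, and how a list of words already confirmed against the print book cuts down the noise. The word lists, names, and function here are all hypothetical, chosen only to illustrate the idea.

```python
import re

# Hypothetical tiny word lists; a real spellchecker's dictionary is far larger.
DICTIONARY = {"the", "wizard", "rode", "to", "castle", "at", "dawn"}
CONFIRMED_IN_ORIGINAL = {"zorblax", "eldrath"}  # names a human checked in the print book


def flag_words(text: str, dictionary: set[str], confirmed: set[str]) -> list[str]:
    """List each word that is in neither the dictionary nor the confirmed list."""
    flagged = []
    for word in re.findall(r"[A-Za-z']+", text):
        lower = word.lower()
        if lower not in dictionary and lower not in confirmed and word not in flagged:
            flagged.append(word)
    return flagged


text = "The wizard Zorblax rode to Castle Eldrath at dawm."
print(flag_words(text, DICTIONARY, set()))
print(flag_words(text, DICTIONARY, CONFIRMED_IN_ORIGINAL))
```

The first call flags the two invented names along with the real typo; once the names are confirmed, only "dawm" remains for the proofreader to fix.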
We then run AI software we've developed in-house to look for common scanning mistakes. A human reader looks at these flagged cases, compares to the original, and fixes as needed.
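For a sense of what looking for common scanning mistakes can mean, here is a much simpler rule-based sketch (not the in-house software mentioned above). It flags any word that would become a different real word if a look-alike character group were swapped, using the 'cl'/'d' and 'rn'/'m' examples from earlier. The `DICTIONARY` set and function names are hypothetical.

```python
# Look-alike groups from the examples above: (what the OCR read, what may have been printed).
CONFUSABLE = [("d", "cl"), ("cl", "d"), ("m", "rn"), ("rn", "m")]

# Hypothetical stand-in for a real dictionary.
DICTIONARY = {"clown", "down", "yarn", "yam", "cat", "ball", "played"}


def alternatives(word: str, dictionary: set[str]) -> set[str]:
    """Other dictionary words reachable by swapping one look-alike character group."""
    found = set()
    for seen, printed in CONFUSABLE:
        start = word.find(seen)
        while start != -1:
            candidate = word[:start] + printed + word[start + len(seen):]
            if candidate in dictionary and candidate != word:
                found.add(candidate)
            start = word.find(seen, start + 1)
    return found


def flag_ambiguous(text: str, dictionary: set[str]) -> dict[str, set[str]]:
    """Map each suspicious word in the text to the words it might really be."""
    flags = {}
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'")
        options = alternatives(word, dictionary)
        if options:
            flags[word] = options
    return flags


print(flag_ambiguous("The cat played with a ball of yam.", DICTIONARY))
```

A flag like this only says "a human should check this spot." Whether the book really said "yam" or "yarn" still has to be settled by looking at the original page.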
As noted above, software simply can't catch a lot of the subtle errors that can irritate a reader.
Ultimately, the only way to catch these errors is for a human to read the book word by word, so that's what we do. This is obviously a time-consuming process. Not only does the reader have to read it carefully (thus, more slowly than for pleasure), they also have to check problems against the original and fix what needs repair.
At this point we have a Word .doc file that has all the right words in it, and the correct paragraph splits, but generally not a usable format for making ebooks.
The next step is to correct the stylistic formatting problems that the OCRing has introduced. For example, the OCR software might mysteriously have right-justified a paragraph here or there, incorrectly italicized text, changed the font face or font size of the letters in the middle of a word, or applied inconsistent paragraph indents, etc.
We also check for correct scene breaks at this point. The OCR output might have joined scenes together.
After this, we clean up the Word .doc file. There are a lot of internal formatting codes that aren't visible to the eye but that will prevent a book from being accepted by a distributor. For example, if a .doc file has certain "Style" conventions in it, it will result in .EPUB files that fail to pass the "epubcheck" test that many distributors require. You can't see these by reading the .doc file, or even by looking at the list of styles. We inspect the insides of the .doc file with software we've written and fix these problems up. (Or, worst case, we do it by hand, which can be extremely time-consuming.) Even that is generally better than the solution one distributor recommends: removing all formatting, then reading through the original text to locate every instance of italics, etc. and re-italicizing/etc. Yuck!
Complicating this is that each ebook format (.EPUB, .MOBI, .LRF, etc.) and each distributor (Amazon, B&N, Apple, Smashwords, Ingram, etc.) has its own requirements for the format. Smashwords, for example, has a whole book one needs to read, understand, and apply for a file to meet their requirements. Some distributors are lax, and may accept a file that has formatting problems inside, and you the author find out only when the readers write negative reviews about the poor conversion. We ultimately produce a very clean and simple Word .doc file that serves as the basis for creating all the ebook files for the distributors, but it isn't a simple process.
In parallel with this, we're commissioning new cover art. Rarely does an author have rights to use the cover art from the print book. Unfortunately artists aren't necessarily inexpensive. We also make sure the cover art files meet the various technical requirements of the distributors.
During this time we're also tracking down the rights owners of any forewords or afterwords written by other authors, to see if we can include them.
Next we begin converting the Word .doc file into the various ebook formats and tweaking them for the various distributors. (For example, Smashwords has specific licensing terminology you have to insert that isn't applicable to others.)
There aren't any standard means to do all this converting, so we've written our own systems that mostly automate the conversions. Nonetheless, this phase typically involves opening up the ebook files and peering inside at the code to see why they aren't converting exactly as they should. If the file is failing "epubcheck" for example, we have to figure out what the rather cryptic error messages mean (such as, 'attribute "name" not allowed here; expected attribute "accesskey"'. Huh?!?), reverse engineer the conversion process to see how the original resulted in the wrong output, and fix it at the source.
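To give a flavor of the kind of rule a validator enforces, here is a minimal sketch that checks two container-level requirements of the EPUB format: the archive's first entry must be an uncompressed file named `mimetype` containing `application/epub+zip`, and a `META-INF/container.xml` file must be present. This is only an illustration written for this page; the real epubcheck tool tests far more than this, and the function names and sample files below are made up.

```python
import io
import zipfile


def check_epub_container(epub_bytes: bytes) -> list[str]:
    """Return a list of container-level problems; an empty list means none were found."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        entries = archive.infolist()
        if not entries or entries[0].filename != "mimetype":
            problems.append("'mimetype' must be the first entry in the archive")
        else:
            if entries[0].compress_type != zipfile.ZIP_STORED:
                problems.append("'mimetype' must be stored uncompressed")
            if archive.read("mimetype") != b"application/epub+zip":
                problems.append("'mimetype' must contain exactly 'application/epub+zip'")
        if "META-INF/container.xml" not in archive.namelist():
            problems.append("missing META-INF/container.xml")
    return problems


def build_sample(mimetype_first: bool) -> bytes:
    """Build a tiny in-memory archive for the demonstration (not a complete ebook)."""
    files = [("mimetype", "application/epub+zip"),
             ("META-INF/container.xml", "<container/>")]
    if not mimetype_first:
        files.reverse()
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files:
            kind = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            archive.writestr(name, content, compress_type=kind)
    return buffer.getvalue()


print(check_epub_container(build_sample(mimetype_first=True)))
print(check_epub_container(build_sample(mimetype_first=False)))
```

Nothing about a misplaced `mimetype` entry shows up when you read the book on screen, which is why problems like this have to be hunted down inside the file itself.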
Once we have the files ready, we can begin uploading them to the distributors. Each distributor has assorted forms to fill out to record information about the book ("metadata"), which usually takes longer than expected to work through.
Which categories on each distributor's site should the book be listed under? Each distributor has a different, extensive list, and a different number of allowable entries.
Then there are descriptions and marketing blurbs to write, and keywords to determine, with regard to what's most applicable to the book. We're grateful when authors provide this, but typically they do not.
Next we monitor the status of the upload process. It often takes several days or weeks to hear back from the distributors on the status of an upload. Distributors generally don't send email to alert you of problems, so if you don't look for them, the book languishes unpublished until someone does. To avoid that, we check regularly to see when they've done their thing and check the results. If they report some kind of error, we fix it and continue. Lather, rinse, repeat.
Finally the book is visible to readers and ready for purchase! The last part is the best:
We've already established banking and payment relationships with the various distributors, so our final step is to track sales, collect payments (which we will tend to collect faster than individual authors, since distributors usually have minimum payment threshold amounts per month), and — voilà! — distribute the income back to you.
So, that's what we do. We'd love to work with you!