By: Randy Van Ittersum & Erin Spalding, CDIA+ Instructor
Welcome to Day 3 of our 5-Day Mini-Course on document imaging technology. Today's lesson focuses on whether or not you should OCR your documents. For some organizations, this technology is a must. For others, it could render a document management system useless.
Optical Character Recognition (OCR) is the process of converting the text on a scanned image into machine-readable text that can be searched. The decision to OCR a document rather than label it should revolve around the issue of data mining versus document retrieval. In order for documents, or the information contained within them, to be searched, they must be indexed. Some document imaging companies advocate that customers OCR every document for search purposes, simply because it automates the indexing process.
We believe the decision to use one method over the other needs to be made on a document-by-document basis. To understand our position, you must first understand the drawbacks associated with each process.
While full-text search engines (indexes) are easy to use, the results they return are all-or-nothing affairs. When you enter a word that describes what you are looking for, you can be bombarded with endless lists of results that may or may not be relevant. For example, searching for "java" would return documents that describe Java the programming language, java the coffee, and Java the Indonesian island. When doing full-text searches, the greater the number of documents on your server, the greater the number of irrelevant search results.
Labeling documents provides the highest level of accuracy to your search results because it searches only the key words and numbers that you have assigned to a document.
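To make the contrast concrete, here is a minimal Python sketch. The three documents, their text, and their labels are invented for illustration, and the two search functions are hypothetical stand-ins for what a real document management system would do. It simply shows why a full-text search for "java" returns every document that mentions the word, while a label search returns only the document that was deliberately labeled with it.

```python
# Hypothetical three-document collection; the text and labels are invented
# purely for illustration and do not come from any real system.
documents = [
    {"id": 1, "labels": {"invoice", "coffee supplier"},
     "text": "Invoice for 40 lb of java and espresso beans."},
    {"id": 2, "labels": {"travel", "indonesia"},
     "text": "Itinerary for the sales trip to Java and Bali."},
    {"id": 3, "labels": {"software", "java"},
     "text": "Contract for Java developers to maintain the billing system."},
]


def full_text_search(term):
    """Return ids of documents whose body contains the term anywhere."""
    term = term.lower()
    return [d["id"] for d in documents if term in d["text"].lower()]


def label_search(term):
    """Return ids of documents that were assigned the term as a label."""
    term = term.lower()
    return [d["id"] for d in documents if term in d["labels"]]


full_text_hits = full_text_search("java")  # every document mentions the word
label_hits = label_search("java")          # only the document labeled "java"
print("Full-text hits:", full_text_hits)
print("Label hits:", label_hits)
```

With only three documents the difference is small, but the full-text list grows with every document that happens to contain the word, whereas the label list grows only when someone assigns that label.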
OCR is Not 100% Accurate:
Optical Character Recognition (OCR) is, at best, an imperfect process. The idea behind OCR is to enable a "full-text search" of the document using key words and phrases that are known to be included in it.
The OCR process is extremely sensitive to the quality of the image and to differences among the fonts used within the document. As a result, the output from an OCR process is seldom totally correct. If the OCR process claims to be 95% accurate, that means that, on average, one character in 20 is misrecognized. Errors are introduced when characters "bleed" and touch one another or when the scanner picks up "ghost" images from the reverse side of a document. OCR software commonly substitutes "1" for "I" and "e" for "c". Because of the inaccuracy inherent in the OCR process, it requires an operator to manually correct all the suspect characters.
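The arithmetic behind a 95% accuracy claim can be sketched in a few lines of Python. This is only a rough model: the page length of 2,000 characters and the 8-letter keyword are assumptions chosen for illustration, and the calculation assumes that character errors occur independently, which real OCR errors do not strictly do.

```python
def expected_errors(char_count, accuracy):
    """Average number of misrecognized characters in a text of char_count characters."""
    return char_count * (1 - accuracy)


def word_intact_probability(word_length, accuracy):
    """Chance that every character of a word is recognized correctly,
    assuming (as a simplification) that character errors are independent."""
    return accuracy ** word_length


accuracy = 0.95       # the "95% accurate" claim from the text
page_chars = 2000     # assumed length of one typed page
keyword_length = 8    # assumed length of a search keyword

errors_per_page = expected_errors(page_chars, accuracy)
keyword_intact = word_intact_probability(keyword_length, accuracy)

print(f"Expected errors per page: {errors_per_page:.0f}")
print(f"Chance an 8-letter keyword is fully correct: {keyword_intact:.2f}")  # about 0.66
```

Under these assumptions, roughly one occurrence in three of an 8-letter keyword would contain at least one wrong character, so an exact search for that keyword would miss it unless an operator had corrected the text.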
If you intend to use OCR in place of labeling a document because the process is automated, the need to correct suspect characters negates any time savings gained by the automated process. Furthermore, you still have the problem of capturing irrelevant OCR documents in a search because the keyword is included within the document.
You may encounter companies that claim it is not necessary to correct the suspect characters because you will usually find the document you are looking for anyway. They suggest that you use "fuzzy search" technology. This technology expands queries to include terms that sound like or are typographically similar to the term requested. The problem with fuzzy searches is that they produce an even larger number of documents that are irrelevant.
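The trade-off can be illustrated with Python's standard difflib module, which ranks words by typographic similarity. This is a sketch only: the vocabulary below is invented, the 0.75 similarity cutoff is an arbitrary assumption, and commercial fuzzy search engines use their own matching rules rather than difflib.

```python
import difflib

# Hypothetical list of words found in an OCR-generated index.
# "c1aim" stands in for "claim" with the letter l misread as the digit 1.
indexed_words = ["claim", "c1aim", "clam", "climb", "chain", "invoice"]

query = "claim"

# Exact search: only a perfectly recognized occurrence matches.
exact_matches = [w for w in indexed_words if w == query]

# Fuzzy search: accept any word that is typographically close to the query.
# The cutoff of 0.75 is an assumed threshold chosen for this illustration.
fuzzy_matches = difflib.get_close_matches(
    query, indexed_words, n=len(indexed_words), cutoff=0.75
)

print("Exact matches:", exact_matches)
print("Fuzzy matches:", fuzzy_matches)
```

The fuzzy search recovers the misread "c1aim", but it also accepts "clam" and "climb", which are unrelated words. Every document containing those words would be added to the result list.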
When We Recommend OCR:
We recommend you OCR a document when you will need to conduct an intra-document search, that is, when you are looking for specific information contained within a document. An example would be an attorney looking for a statement contained in a deposition or a researcher looking for information contained within a research paper.
By searching within a document, you can jump directly to the specific information you need. You navigate from occurrence to occurrence of the word, starting with the first "hit". Each "hit" is highlighted within the document, which makes the search process much more efficient and enables users to retrieve and use information faster, in fewer steps.
Which method should you use? Should you label or OCR a document? The answer depends on your own particular situation. We suggest that you normally label a document and, when needed, OCR those documents that will be subject to an intra-document search.
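The hit-to-hit navigation described above can be sketched in Python with the standard re module. The deposition excerpt is invented for illustration, and find_hits and highlight are hypothetical helper names, not functions from any particular document imaging product.

```python
import re


def find_hits(text, term):
    """Return (start, end) character offsets of each whole-word
    occurrence of term, in document order, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [match.span() for match in pattern.finditer(text)]


def highlight(text, span, width=20):
    """Show a hit in square brackets with a little surrounding context."""
    start, end = span
    left = max(0, start - width)
    right = min(len(text), end + width)
    return text[left:start] + "[" + text[start:end] + "]" + text[end:right]


# Invented excerpt standing in for the OCR text of a deposition.
deposition = (
    "Q. Where was the contract signed? A. The contract was signed in Detroit. "
    "Q. Who kept the original? A. My attorney kept the contract."
)

hits = find_hits(deposition, "contract")

# Step through the hits in order, as a viewer's "next hit" button would.
for number, span in enumerate(hits, start=1):
    print(f"Hit {number}: ...{highlight(deposition, span)}...")
```

Because the offsets come back in document order, a viewer can start at the first hit and move forward one occurrence at a time, which is the behavior that makes OCR worthwhile for long documents such as depositions.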
Randy Van Ittersum &
Erin Spalding, CDIA+ Instructor