Your Sarech Enigne Can’t Raed This

By: Randy Van Ittersum & Erin Spalding, CDIA+ Instructor

Welcome to Day 3 of our 5-Day Mini-Course on document imaging technology. Today's lesson focuses on whether you should OCR your documents. For some organizations, this technology is a must. For others, it could render the document management system useless.


Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is the process of converting text on a scanned image into text that can be searched. The decision to OCR a document rather than label it should revolve around the issue of data mining versus document retrieval. For documents, or the information contained within them, to be searchable, they must be indexed. Some document imaging companies advocate that customers OCR every document for search purposes, simply because it automates the process.

We believe the decision to use one method over the other needs to be made on a document-by-document basis. To understand our position, you must first understand the drawbacks associated with each process.

Search Results
While search engines (indices) are easy to use, the results they return are all-or-nothing affairs. When you enter a word that describes what you are looking for, you can be bombarded with endless lists of results that may or may not be relevant. For example, searching for "java" would return documents that describe Java the programming language, java the coffee, and Java the Indonesian island. With full-text searches, the greater the number of documents on your server, the greater the number of irrelevant search results.

Labeling documents provides the highest level of accuracy to your search results because it searches only the key words and numbers that you have assigned to a document.
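
To make the "java" example concrete, here is a minimal sketch that compares a full-text search with a label search. The three documents, their labels, and the function names are hypothetical, made up purely for illustration; a real document management system would use a proper index rather than a loop.

```python
# Hypothetical sample data: each document has its full text plus the
# labels an operator assigned to it at scan time.
documents = {
    "doc1": {
        "text": "Java is a programming language used on servers.",
        "labels": {"java", "programming"},
    },
    "doc2": {
        "text": "Our java supplier ships coffee beans monthly.",
        "labels": {"coffee", "supplier"},
    },
    "doc3": {
        "text": "Travel itinerary for Java, Indonesia.",
        "labels": {"travel", "indonesia"},
    },
}


def full_text_search(term):
    """Return every document whose text contains the term anywhere."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, doc in documents.items()
        if term in doc["text"].lower()
    )


def label_search(term):
    """Return only the documents that were labeled with the term."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, doc in documents.items()
        if term in doc["labels"]
    )


print(full_text_search("java"))  # ['doc1', 'doc2', 'doc3']
print(label_search("java"))      # ['doc1']
```

The full-text search returns all three documents because the word appears somewhere in each, while the label search returns only the one document that was deliberately filed under "java".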

OCR is Not 100% Accurate:
Optical Character Recognition (OCR) is at best an inaccurate process. The idea behind OCR is to execute a "full-text search" on the document with key words and phrases that are known to be included in a document.

The OCR process is extremely sensitive to the quality of the image as well as to differences in the fonts used within the document. As a result, the output from an OCR process is seldom totally correct. If the OCR process claims to be 95% accurate, that means that one character in 20 is not recognized correctly. Errors are introduced when characters "bleed" and touch one another or when the scanner picks up "ghost" images from the reverse side of a document. OCR software routinely substitutes "1" for "I" and "e" for "c". Because of the inaccuracy inherent in the OCR process, it requires an operator to manually correct all the suspect characters.
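
The arithmetic behind the 95% figure can be sketched as follows. This is a rough illustration only: it assumes every character is misread independently with the same probability, which real scans do not guarantee, and the page size of 2,000 characters is an assumed value, not a measurement.

```python
def expected_errors(char_count, accuracy):
    """Expected number of misread characters in a block of text."""
    return char_count * (1 - accuracy)


def word_correct_probability(word_length, accuracy):
    """Chance that every character of a word is read correctly,
    assuming errors are independent (a simplifying assumption)."""
    return accuracy ** word_length


accuracy = 0.95    # the accuracy claimed in the example above
page_chars = 2000  # assumed size of one typed page

# About 100 suspect characters on a single page at 95% accuracy.
print(round(expected_errors(page_chars, accuracy)))

# An 8-letter keyword survives intact only about two times in three.
print(round(word_correct_probability(8, accuracy), 3))
```

Under these assumptions, a search for an eight-letter keyword would miss roughly one occurrence in three unless an operator corrects the text first.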

If you intend to use OCR in place of labeling a document because the process is automated, the need to correct suspect characters negates any time savings gained by the automated process. Furthermore, you still have the problem of capturing irrelevant OCR documents in a search because the keyword is included within the document.

Fuzzy Search
You may encounter companies that claim it is not necessary to correct the suspect characters, because you will usually find the document you are looking for anyway. They suggest that you use "fuzzy search" technology. This technology expands queries to include terms that sound like, or are typographically similar to, the term requested. The problem with fuzzy searches is that they produce an even larger number of irrelevant documents.
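
The trade-off can be sketched with Python's standard difflib module, used here as a stand-in for a commercial fuzzy search engine. The word list and the 0.6 similarity cutoff are assumptions chosen for illustration; actual products use their own matching rules.

```python
import difflib

# Hypothetical words found in an OCR index. "invo1ce" is a misread
# "invoice"; the other words are correctly read but unrelated.
indexed_words = ["invo1ce", "involve", "voice", "receipt"]

query = "invoice"

# Exact matching misses the document because of the OCR error.
exact_matches = [word for word in indexed_words if word == query]

# Fuzzy matching accepts any word that is similar enough to the query.
fuzzy_matches = difflib.get_close_matches(
    query, indexed_words, n=10, cutoff=0.6
)

print(exact_matches)
print(fuzzy_matches)
```

In this sketch the exact search finds nothing, while the fuzzy search recovers the misread "invo1ce" but also pulls in look-alike words such as "involve" and "voice", which is the extra noise described above.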


When We Recommend OCR:
We recommend you OCR a document when you will need to conduct an intra-document search. This is especially useful if you are looking for information contained within a document. An example would be an attorney looking for a statement contained in a deposition or the need to find information contained within a research paper.

By searching within a document, you can jump directly to the specific information you need. You navigate from occurrence to occurrence of the word, starting with the first "hit". Each "hit" is highlighted within the document, which makes the search process much more efficient and lets users retrieve and use information faster, in fewer steps.
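
A minimal sketch of this hit-to-hit navigation, assuming the OCR text of the document is already available as a string. The sample sentence and the function names are hypothetical, and square brackets stand in for on-screen highlighting.

```python
import re


def find_hits(text, term):
    """Return (start, end) positions of each whole-word occurrence."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(match.start(), match.end()) for match in pattern.finditer(text)]


def highlight(text, hits):
    """Wrap each hit in square brackets to mimic highlighting."""
    parts = []
    last = 0
    for start, end in hits:
        parts.append(text[last:start])
        parts.append("[" + text[start:end] + "]")
        last = end
    parts.append(text[last:])
    return "".join(parts)


# Hypothetical excerpt from a deposition.
deposition = "The witness saw the contract. The contract was signed on Monday."

hits = find_hits(deposition, "contract")
print(hits)  # [(20, 28), (34, 42)]
print(highlight(deposition, hits))
```

A viewer would step through the list of positions one at a time, scrolling to each in turn, starting with the first.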


Conclusion

Should you label or OCR a document? The answer depends on your particular situation. We suggest that you normally label a document and, when needed, OCR those documents that will be subject to an intra-document search.

Warmest regards,
Randy Van Ittersum &
Erin Spalding, CDIA+ Instructor

P.S. Check out our new healthcare product called LifeKey. LifeKey allows you to carry your medical records with you at all times on a USB Flash drive. It uses the first universal electronic medical record that can be viewed from virtually every computer in the world without the need for proprietary software or hardware. If you know someone with a serious medical condition, please share this information with them. It could save their life.

For more information go to http://www.mylifekey.com

Copyright © 2006 | Document Imaging Solutions, Inc.
