To OCR or Label Electronic Documents for Retrieval
When scanning documents, the decision to use Optical Character Recognition (OCR) versus labeling should revolve around the issues of data mining and document retrieval. For documents, or the information contained within them, to be searchable, electronic documents must be indexed. The imaging industry does its customers a disservice when it advocates OCR as the preferred method of document imaging for search purposes. OCR has been promoted as a way to automate the indexing process. Although our document imaging systems allow you to either label or OCR a document for indexing, the choice of method should be made on a document-by-document basis. Understanding the drawbacks associated with each method will help clarify when OCR or labeling is preferred.
OCR is Not 100% Accurate:
While search engines (indices) are easy to use, the search results are often imprecise and display irrelevant information. For example, searching for java might return documents that describe java the programming language, java the coffee, and Java the Indonesian island. The greater the number of documents on your server, the greater the number of irrelevant search results.
Labeling documents will significantly improve the accuracy of your search and return a higher number of relevant documents. This is especially true for industries retrieving standardized documents such as job files and invoices.
For other organizations, it may be more practical to OCR the document in order to search for a key word or phrase contained within the text. Because of the time required, it is not recommended to OCR every page of a document. Rather, select for OCR only the individual pages from the document that would be relevant to a future search.
OCR is the process of converting text on a scanned image into text that is searchable. One then executes a full-text search on the OCR output using words and phrases known to be included in the document. The OCR process is extremely sensitive to the quality of the image, as well as to font differences within the document. As a result, the output from an OCR process is seldom flawless.
If the OCR process claims to be 95% accurate, then on average one character in twenty is misrecognized. Errors are introduced when characters bleed and touch one another, or when the scanner picks up ghost images from the reverse side of the document. The inaccuracy of the OCR process requires an operator to manually correct the suspect characters. The time spent on these corrections often negates any time gained by using OCR in place of labeling.
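To put the 95% figure in perspective, the following is a minimal sketch of the arithmetic, not a model of any particular OCR engine. It assumes that each character is recognized independently of its neighbors, which is a simplification, and the function names are hypothetical. Under that assumption, a five-letter keyword is read entirely correctly only about 77% of the time.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is recognized correctly.

    Assumes character errors are independent (a simplifying assumption).
    """
    return char_accuracy ** word_length


def expected_char_errors(char_accuracy: float, char_count: int) -> float:
    """Average number of misrecognized characters in a block of text."""
    return (1.0 - char_accuracy) * char_count


if __name__ == "__main__":
    # Longer keywords are less likely to survive OCR intact.
    for length in (3, 5, 8, 12):
        pct = word_accuracy(0.95, length) * 100
        print(f"{length}-character word read correctly: {pct:.1f}%")

    # A page of roughly 2,000 characters (an illustrative figure).
    print(f"Expected errors per page: {expected_char_errors(0.95, 2000):.0f}")
```

Since a full-text search matches whole words, the word-level figure is the one that determines whether a document is found, and it falls quickly as keywords get longer.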
Several companies claim that correcting the suspect characters is unnecessary and that document searches remain accurate without it. They suggest using fuzzy search technology that expands queries to include terms that sound similar or are typographically similar to the term requested. However, fuzzy searches often produce an even larger number of documents that are irrelevant and unnecessary.
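The trade-off with fuzzy searching can be illustrated with a small sketch. It uses edit distance (the number of single-character insertions, deletions, or substitutions) as one common way to measure typographic similarity; this is an assumption for illustration, not a description of how any vendor's fuzzy search works, and the vocabulary and function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_search(query: str, vocabulary: list, max_distance: int) -> list:
    """Return every indexed term within max_distance edits of the query."""
    return [t for t in vocabulary if edit_distance(query, t) <= max_distance]


if __name__ == "__main__":
    # "inv0ice" stands in for an uncorrected OCR error (zero read for "o").
    indexed_terms = ["invoice", "inv0ice", "involve", "voice", "notice"]
    print(fuzzy_search("invoice", indexed_terms, max_distance=0))
    print(fuzzy_search("invoice", indexed_terms, max_distance=2))
```

With a tolerance of two edits, the query for "invoice" recovers the misread "inv0ice", but it also matches "involve" and "voice". Loosening the match enough to absorb OCR errors pulls in unrelated terms as well.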
Appropriate Conditions for OCR:
Our document imaging system allows intra-document searches. This is especially useful if one is data mining or looking for information contained within a document.
One can retrieve a document and then jump directly to the specific information needed within that document. Navigation from occurrence to occurrence of the keyword(s) will start with the first hit. Each hit is visibly highlighted within the document, making the search and retrieval process much more efficient.
Our solution allows you to either label or OCR a document. We suggest labeling most documents and using OCR only on pages that are relevant to a future search. This method will keep your index database to a manageable size, because every word in an OCR'd document is entered into your index database.
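The effect on index size can be sketched with a simple inverted index, which maps each term to the documents that contain it. This is a generic illustration under assumed sample data, not the structure of our index database; the document text, labels, and function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each distinct term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return dict(index)


def posting_count(index: dict) -> int:
    """Total number of (term, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


if __name__ == "__main__":
    # Full OCR text of one page versus the labels an operator might enter.
    ocr_text = {
        "doc1": "Invoice 1042 for Acme Corp dated 3 March for twelve widgets",
    }
    labels = {"doc1": "invoice 1042 acme"}

    print("Entries from OCR text:", posting_count(build_index(ocr_text)))
    print("Entries from labels:  ", posting_count(build_index(labels)))
```

Even for a single short line of text, the full-text index holds several times as many entries as the labeled one, and the gap widens with every page that is processed with OCR.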
Many of our customers find it faster and more accurate to manually assign predetermined keywords, phrases and numbers to a document than to OCR the document and correct the suspect characters. The freedom to use an unlimited number of characters and numbers when labeling documents allows operators to use existing terminology with which employees are already familiar. This facilitates a quick and easy transition to using digital documents in place of paper documents throughout their organizations.