OCR: all about optical character recognition

Text recognition software is now a standard part of everyday work tools such as Google Drive and Adobe Acrobat. The technology behind it is known as OCR, or “Optical Character Recognition”, and it has played an essential role for many decades.

What is optical character recognition (OCR)?

Optical character recognition, or OCR, is an “image-to-text” technology. It helps users extract text from images or scanned documents. In other words, it is software that recognizes the text in an image and translates it into machine-readable text. OCR is used to recognize letters, words, line elements, sentences and patterns. As a result, the time spent on manual document processing is greatly reduced.

OCR is therefore very useful when data has to be processed further, for example in accounting or expense management. It is also used in loyalty marketing campaigns and identity verification.

Very often, OCR solutions are combined with AI (artificial intelligence) and ML (machine learning), in particular to automate certain processes and to increase the accuracy of data extraction.

Who invented optical character recognition (OCR)?

Gustav Tauschek, a self-taught Austrian engineer with more than 200 patents and inventions to his name, created OCR. He patented it in Germany in 1929. Paul Handel then patented an OCR device of his own in 1933, before Tauschek patented his a second time, in the USA, in 1935.

Brief history of optical character recognition

The first forms of OCR, converting images to text, appeared in the 1800s. They were intended to help blind people read. Gustav Tauschek’s goal was to build a machine that transformed images into text with precision and efficiency, and he used it mainly in his punch-card-based calculating machines. From this work came his reading machine, a mechanical device capable of reading characters and numbers in an image and then reproducing them as characters and numbers printed on a sheet of paper. Although several people before Tauschek had proposed similar ideas, he was the first to take the concept off the page and turn it into a real-world device.

Later, in 1974, the American inventor Ray Kurzweil founded Kurzweil Computer Products Inc. The company was inspired by Tauschek’s device to create its omni-font OCR software, which recognized text in almost any font remarkably well.

OCR: Evolution and Importance

Following Tauschek’s creation, many other inventors and engineers took up the idea and created all sorts of new technologies, and OCR evolved significantly over the years. In 1931, for example, optical character recognition was the basis for a text-to-telegraph device. By 1951 this had evolved into a device that handled text in Morse code. Then, in 1966, devices became able to read handwriting and transform it into text. Progress continued, and in 1978 Ray Kurzweil’s omni-font OCR was born.

Then, in the 1980s, optical character recognition came into its own, with barcode scanners in retail stores and Xerox machines in offices and schools. Today there are free online versions of OCR software, offered for example by Google Drive and Adobe Acrobat. These work in over 200 different languages.

How does optical character recognition (OCR) work?

OCR matches the text in an image against a digital database of corresponding letters and numbers. It then reprints or archives the text more clearly and with much greater accuracy. This is a bit like the human ability to read text and recognize patterns and characters, but the process is shorter and the output is cleaner. The process follows four steps.

Step 1: Preprocessing the image

The first thing to do is to improve the quality of the image so that the output data is accurate. The OCR engine therefore looks for errors and problems and corrects them. Four techniques are most commonly used at this stage: de-skewing, binarization, zoning and normalization. De-skewing corrects the angle of the photo. Binarization converts the image to black and white, which separates the text from the background more precisely. Zoning, also known as layout analysis, identifies columns, rows, blocks, captions, paragraphs, tables and other elements. Finally, normalization is a noise reduction process that adjusts each pixel’s intensity value toward the average of the surrounding pixels.
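
To make two of these techniques more concrete, here is a minimal sketch in plain Python of normalization (as a 3x3 mean filter) and binarization (as a fixed threshold). The function names, the 4x4 sample image and the threshold of 128 are assumptions chosen for illustration. Real engines typically use adaptive thresholds and more robust filters, and de-skewing and zoning are left out here.

```python
def smooth(image):
    """Replace each pixel with the mean of its 3x3 neighbourhood (noise reduction)."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbours (and the pixel itself) that fall inside the image.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(window) / len(window))
        result.append(row)
    return result


def binarize(image, threshold=128):
    """Map each grayscale pixel to 1 (ink) if darker than the threshold, else 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


# A tiny 4x4 grayscale "scan" (hypothetical data): 0 is black, 255 is white.
scan = [
    [250, 250, 250, 250],
    [250,  10,  20, 250],
    [250,  15,   5, 250],
    [250, 250, 250, 250],
]

smoothed = smooth(scan)   # each value pulled toward its neighbours' average
binary = binarize(scan)   # dark centre pixels become 1, the border becomes 0
```

In this sketch the dark 2x2 block in the middle of the scan ends up as ink and everything else as background, which is the separation of text from background described above.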

Step 2: Segmentation

The second step is segmentation, which lets the engine work on a whole line of text at a time. It consists of two parts. The first is line and word detection, which identifies the lines of text and the words that belong to them. The second is script recognition, which identifies the script at the level of documents, pages, paragraphs, lines of text, words and characters.
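
Line and word detection can be sketched with a simple projection approach: rows that contain ink form text lines, and within a line, columns that contain ink form words. This is one common textbook technique, not necessarily what any particular OCR engine does. The function names and the sample page are assumptions, and a real engine would require a minimum gap width before splitting words.

```python
def find_runs(flags):
    """Return (start, end) index pairs for each consecutive run of True values."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a run begins here
        elif not flag and start is not None:
            runs.append((start, i - 1))    # the run ended on the previous index
            start = None
    if start is not None:
        runs.append((start, len(flags) - 1))
    return runs


def find_text_lines(binary):
    """Rows that contain at least one ink pixel are grouped into text lines."""
    return find_runs([any(row) for row in binary])


def find_words(binary, line):
    """Within one text line, columns that contain ink are grouped into words."""
    top, bottom = line
    width = len(binary[0])
    column_has_ink = [
        any(binary[y][x] for y in range(top, bottom + 1)) for x in range(width)
    ]
    return find_runs(column_has_ink)


# Hypothetical binarized page: 1 is ink, 0 is background.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
]

lines = find_text_lines(page)                           # row ranges of each text line
words = [find_words(page, line) for line in lines]      # column ranges per line
```

Here the page yields two text lines (rows 1 to 2 and row 4), each split into two column ranges.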

Step 3: Character recognition

The third step is character recognition. The image or document is broken down into parts, sections or areas, and the characters that each one contains are recognized. There are two approaches. The first is matrix matching, which compares each character to a library of character matrices. The second is feature recognition, in which features of a character in the image, such as its shape, height or size, are compared to those in an existing library.
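
As a rough illustration of matrix matching, the sketch below compares a 3x3 glyph with a tiny library of character matrices and picks the one that agrees on the most pixels. The three templates, the scoring rule and the function names are assumptions made for this example. Real libraries hold far larger matrices for many fonts.

```python
# Hypothetical library of 3x3 character matrices ("1" is ink, "0" is background).
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the library character whose matrix agrees with the glyph most."""
    return max(TEMPLATES, key=lambda char: match_score(glyph, TEMPLATES[char]))


noisy_t = ["111", "010", "011"]   # a "T" with one stray ink pixel
best = recognize(noisy_t)         # still closest to the "T" matrix
```

The stray pixel costs the glyph one point against the "T" matrix (8 of 9 pixels agree), but "I" and "L" agree on fewer pixels, so the noisy glyph is still read as "T".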

Step 4: Post-processing of the output

The fourth and final step is post-processing of the output. It covers the techniques used to make the result as accurate as possible. First, errors in the extracted data are detected. Then they are corrected where necessary. Finally, the extracted data is checked grammatically against a character library.
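
One simple way to picture the detect-and-correct idea is to compare each extracted word with a list of known words and replace it with the closest match. The sketch below uses Python's standard `difflib` module for this. It swaps in a word lexicon for the character library mentioned above, and the lexicon, the 0.7 similarity cutoff and the function name are all assumptions.

```python
import difflib

# Hypothetical lexicon of words the documents are expected to contain.
LEXICON = ["invoice", "total", "amount", "date"]


def correct_word(word, lexicon=LEXICON, cutoff=0.7):
    """Return the closest lexicon word, or the word unchanged if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Typical OCR confusions: "l" for "i", "1" for "l", "l" for "t".
raw_output = ["lnvoice", "tota1", "amounl", "xyz"]
corrected = [correct_word(word) for word in raw_output]
# → ["invoice", "total", "amount", "xyz"]
```

Words that are not close to anything in the lexicon, such as "xyz" here, are left untouched rather than forced into a wrong correction.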

What are the limitations of template-based OCR?

Even though OCR has its advantages, it also has limitations. As explained above, traditional OCR was originally designed to help the blind read, and it was never intended as a solution for dynamic data extraction. Here are the five main limitations.

OCR depends on input quality

The quality of the resulting text depends primarily on the quality of the input image, that is, the image passed to the engine. For example, very precise text cannot be extracted from an image whose characters are less than 20 pixels high.
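
To get a feel for the 20-pixel figure, the sketch below estimates how tall printed text becomes once scanned. It relies on the fact that one typographic point is 1/72 of an inch, and treats the font size as the character height. That is an approximation, since actual letters are somewhat smaller than the nominal size. The function names are assumptions.

```python
MIN_HEIGHT_PX = 20  # the threshold quoted in the paragraph above


def glyph_height_px(font_size_pt, dpi):
    """Approximate character height in pixels (1 point = 1/72 inch)."""
    return font_size_pt / 72 * dpi


def tall_enough(font_size_pt, dpi):
    """True if the approximate character height reaches the 20-pixel threshold."""
    return glyph_height_px(font_size_pt, dpi) >= MIN_HEIGHT_PX


# 10-point text at three common scan resolutions.
for dpi in (100, 150, 300):
    height = glyph_height_px(10, dpi)
    print(f"{dpi} dpi: about {height:.1f} px, tall enough: {tall_enough(10, dpi)}")
```

Under these assumptions, 10-point text scanned at 100 dpi comes out at roughly 14 pixels, below the threshold, while 150 dpi (about 21 pixels) and 300 dpi (about 42 pixels) clear it.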

OCR depends on templates and rules

To work properly, traditional OCR requires templates and rules. The right field and row data can only be obtained through strict rules programmed into the engine. As a result, it cannot cope with a wide diversity of documents.

Lack of automation

This limitation is directly related to the previous two. The heavy dependence of traditional OCR software on templates and rules deprives it of many automation possibilities. Extracting structured data from invoices is a typical example: each specific data field requires a new rule, yet there are many invoice styles and formats, so the number of rules multiplies.

The more rules there are, the more data and resources are needed to train the engine, and a huge bottleneck is likely to occur.
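
The invoice problem can be illustrated with a small rule-based extractor. In this sketch, each known invoice layout gets its own hand-written regular expression for the "total" field. Both layouts and the function name are invented for the example, and a real system would need many such rules for every field.

```python
import re

# One hand-written rule per known invoice layout (both layouts are hypothetical).
TOTAL_RULES = [
    re.compile(r"Total:\s*\$(\d+\.\d{2})"),           # layout A: "Total: $42.50"
    re.compile(r"Amount due\s+(\d+\.\d{2})\s*USD"),   # layout B: "Amount due 19.99 USD"
]


def extract_total(text):
    """Try each rule in turn and return the first total found, else None."""
    for rule in TOTAL_RULES:
        match = rule.search(text)
        if match:
            return float(match.group(1))
    return None  # no rule covers this layout


known_a = extract_total("Total: $42.50")            # matched by the first rule
known_b = extract_total("Amount due 19.99 USD")     # matched by the second rule
unknown = extract_total("Grand total 42.50 EUR")    # no rule applies
```

The third invoice contains the same information but returns nothing, because no rule was written for its layout. Supporting it means adding yet another rule, which is exactly how the rule set grows with every new format.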

OCR is expensive

Maximum accuracy means developing more and more rules and algorithms. As a result, traditional OCR can become very expensive.

Moreover, these rules and algorithms do not even fully guarantee high-quality output. The quality of the output depends largely on the input quality of the image.

OCR does not cope well with a wide variety of documents

Data extraction is easy for OCR when the documents are simple and vary little. It becomes complicated for companies that have to process many kinds of documents. The greater the variety of documents, the more difficult extraction becomes, because traditional OCR is trained with templates.

In short, OCR is not perfect, but it is not hopeless either. It has come a long way in meeting the demands of a market whose requirements and expected functionality grow more demanding every year.
