Optical Character Recognition

This section describes features that are coming in version 4.0.

With Optical Character Recognition (OCR), any binary content that you place into Cloud CMS is automatically analyzed to detect characters, words and meaning. This lets you store content that includes scanned documents with printed or handwritten elements, such as forms filled out by customers or clients, or archived documents that your organization wants to digitize.

For example, you might have a scanned form that looks like this:

Using an OCR service, Cloud CMS will scan the image, analyze it and detect meaningful information. The low-level extraction results are made available, along with the extracted text (for full-text indexing) and any detected structured information (such as key/value pairs and table information).

The OCR extraction process will analyze your binary content and detect meaningful content at each of the following levels:

  • characters
  • words
  • sentences
  • paragraphs
  • pages
  • key/value pairs
  • table columns, rows and cells
  • full text

The extracted content is then stored on your original Node so that you can post-process it. In addition, the Cloud CMS user interface uses this extracted content to give your editorial users in-context insight into what was extracted.

Your OCR-analyzed content is automatically indexed for full-text search so that your editorial users can quickly find things based on the extracted text.

Triggering OCR Extraction

There are two ways to trigger OCR extraction: by using the f:auto-ocr-extract feature or by setting up a Rule.

Using the f:auto-ocr-extract feature

When you add the f:auto-ocr-extract feature to a content instance, OCR extraction will run every time you create or update the default attachment of that node. In other words, when you change the binary payload, OCR extraction will automatically run in the background.

For more information on this feature, please check out our formal documentation on the f:auto-ocr-extract feature.
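As a rough illustration, the sketch below adds the feature to a node's JSON before the node is saved. It assumes that features are declared under a `_features` property and that an empty object is an acceptable default configuration; the `with_auto_ocr` helper is hypothetical and is not part of any Cloud CMS driver.

```python
import json


def with_auto_ocr(node: dict) -> dict:
    """Return a copy of the node JSON with f:auto-ocr-extract enabled."""
    updated = dict(node)
    # Copy the feature map so the caller's node is left untouched.
    features = dict(updated.get("_features", {}))
    # Assumption: an empty object means "use the default configuration".
    features.setdefault("f:auto-ocr-extract", {})
    updated["_features"] = features
    return updated


node = with_auto_ocr({"title": "Scanned intake form"})
print(json.dumps(node, indent=2))
```

The resulting JSON would then be written back to Cloud CMS using whichever driver or API call you already use to create or update nodes.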

Using a Rule (Drop Zones)

You can configure a Rule to conditionally execute the Extract With OCR action whenever a content lifecycle policy triggers. This lets you control exactly when an OCR extraction runs.

One thing to consider is a "Drop Zone" folder. A "Drop Zone" is a folder that has a rule on it such that anything that is dropped into that folder will automatically have something done to it. In this case, you might set up a Drop Zone so that any content dropped into the folder has OCR extraction performed on it.

To do so, set up a Rule on a folder that is bound to the p:afterAssociateNode policy. The Rule runs the Extract With OCR action. Now, when a user drops an image (or a PDF or any other content with a binary payload) into that folder, OCR extraction will run.

You might then configure a follow-on action that moves the document to a different folder.
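The Drop Zone flow can be pictured with a small in-memory model. This is only a conceptual sketch: the `Folder` class, the `extract_with_ocr` and `move_to` functions, and the `has_binary` and `ocr_requested` fields are all hypothetical stand-ins and do not reflect how Cloud CMS stores or evaluates Rules. Only the policy name p:afterAssociateNode comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Action = Callable[[dict], None]


@dataclass
class Folder:
    name: str
    # Maps a policy name to the actions that run when that policy fires.
    rules: Dict[str, List[Action]] = field(default_factory=dict)
    children: List[dict] = field(default_factory=list)

    def add_rule(self, policy: str, *actions: Action) -> None:
        self.rules.setdefault(policy, []).extend(actions)

    def associate(self, node: dict) -> None:
        """Simulate dropping a node into this folder."""
        self.children.append(node)
        for action in self.rules.get("p:afterAssociateNode", []):
            action(node)


def extract_with_ocr(node: dict) -> None:
    # Stand-in for the Extract With OCR action: only flags the node.
    if node.get("has_binary"):
        node["ocr_requested"] = True


def move_to(source: Folder, target: Folder) -> Action:
    # Stand-in for a follow-on action that moves the node elsewhere.
    def action(node: dict) -> None:
        source.children.remove(node)
        target.children.append(node)
    return action


drop_zone = Folder("Drop Zone")
processed = Folder("Processed")
drop_zone.add_rule("p:afterAssociateNode", extract_with_ocr,
                   move_to(drop_zone, processed))

scan = {"title": "invoice.pdf", "has_binary": True}
drop_zone.associate(scan)
print(scan, [n["title"] for n in processed.children])
```

The actions run in the order they were registered, so the extraction is requested before the document leaves the Drop Zone.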

Extracted Results

Once extraction has completed, three attachments are added to your node:

  • ocr_extraction - a JSON document containing a full breakdown of all extracted elements. This information is very low-level but is useful from a post-processing perspective to glean meaningful insights into the content of the binary payload.

  • ocr_text - a text/plain document containing all of the extracted text content merged together as a human readable document. This is used to ensure that Cloud CMS will automatically index your OCR extracted content for full-text search and is made available for further use by third parties.

  • ocr_analysis - a JSON document that contains the results of analysis that Cloud CMS performs on the extraction to detect key/value pairs and table row/column/cell information. In effect, this document contains meaningful information that was gleaned from the structure of the results.
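If you want to fetch these attachments yourself, a helper along the following lines can build one request path per attachment. The URI layout shown is an assumption about the Cloud CMS REST API and should be checked against the API reference; the `attachment_paths` function and the sample IDs are hypothetical. Only the three attachment names come from the list above.

```python
OCR_ATTACHMENT_IDS = ("ocr_extraction", "ocr_text", "ocr_analysis")


def attachment_paths(repository_id: str, branch_id: str, node_id: str) -> dict:
    """Build one download path per OCR attachment on a node."""
    # Assumed URI layout; verify against the Cloud CMS API reference.
    base = (
        f"/repositories/{repository_id}"
        f"/branches/{branch_id}"
        f"/nodes/{node_id}/attachments"
    )
    return {name: f"{base}/{name}" for name in OCR_ATTACHMENT_IDS}


# Placeholder IDs for illustration only.
for name, path in attachment_paths("repo1", "master", "node1").items():
    print(name, "->", path)
```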

ocr_extraction

The ocr_extraction attachment contains raw extraction information about every detected element. The response reflects the relationships between elements and includes positional information on the page. Elements with sub-elements (such as a line containing words) are broken out into their individual parts.

Here is an example: ocr_extraction.json

ocr_text

The ocr_text attachment contains a full-text dump of all of the text from the extraction organized sequentially as a reader might expect it to be. The nested structure of related elements is considered when building the full-text representation. It is meant to be human readable and is used internally to ensure that OCR extracted content is indexed within Cloud CMS right away for full-text search.

Here is an example: ocr_text.txt
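To illustrate how a nested structure can be flattened into reading order, here is a minimal sketch. The input shape (pages containing paragraphs, lines and words, each word with a `text` field) is a hypothetical simplification and is not the actual ocr_extraction schema, and `flatten_text` is not how Cloud CMS necessarily builds ocr_text.

```python
def flatten_text(extraction: dict) -> str:
    """Join nested words into lines, lines into paragraphs, paragraphs into pages."""
    pages = []
    for page in extraction.get("pages", []):
        paragraphs = []
        for paragraph in page.get("paragraphs", []):
            lines = [
                " ".join(word["text"] for word in line.get("words", []))
                for line in paragraph.get("lines", [])
            ]
            paragraphs.append("\n".join(lines))
        # A blank line separates paragraphs, and also pages.
        pages.append("\n\n".join(paragraphs))
    return "\n\n".join(pages)


# Hypothetical sample data for illustration only.
sample = {
    "pages": [{
        "paragraphs": [
            {"lines": [{"words": [{"text": "Patient"}, {"text": "Intake"}]}]},
            {"lines": [
                {"words": [{"text": "Name:"}, {"text": "Jane"}]},
                {"words": [{"text": "Date:"}, {"text": "2021-03-04"}]},
            ]},
        ]
    }]
}
print(flatten_text(sample))
```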

ocr_analysis

The ocr_analysis attachment contains an analysis that Cloud CMS performs against the ocr_extraction payload. By walking elements and considering their relationships, the analysis reflects discovered content that adheres to a specific structure. This is used to detect key/value pairs and table elements (columns, rows and cell values).

Here is an example: ocr_analysis.json
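A post-processing step might read the analysis along these lines. The `keyValues`, `tables`, `rows`, `cells` and `text` property names are assumptions made for this sketch; consult the example file for the real structure before relying on any of them.

```python
def key_values(analysis: dict) -> dict:
    """Collect detected key/value pairs into a plain dictionary."""
    return {kv["key"]: kv["value"] for kv in analysis.get("keyValues", [])}


def table_rows(table: dict) -> list:
    """Return a table as a list of rows, each a list of cell strings."""
    return [
        [cell.get("text", "") for cell in row.get("cells", [])]
        for row in table.get("rows", [])
    ]


# Hypothetical analysis payload for illustration only.
analysis = {
    "keyValues": [
        {"key": "Name", "value": "Jane"},
        {"key": "Date", "value": "2021-03-04"},
    ],
    "tables": [{
        "rows": [
            {"cells": [{"text": "Item"}, {"text": "Qty"}]},
            {"cells": [{"text": "Widget"}, {"text": "2"}]},
        ]
    }],
}

print(key_values(analysis))
for table in analysis["tables"]:
    print(table_rows(table))
```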

OCR Providers

The following OCR Service Providers are available: