Auto OCR Extract

This section describes features that are coming in 4.0.

QName: f:auto-ocr-extract

When this feature is applied to a content instance, the instance's default binary attachment is automatically analyzed by an Optical Character Recognition (OCR) service. The extracted text and analysis are then written back to the node, so that the node benefits from automatic full-text search indexing and other tooling.

To use this service, you will first need to set up an OCR Extraction Service. The service can either be configured as the default OCR Extraction Service for your Project or it can be referenced from the f:auto-ocr-extract feature. This allows different feature configurations to use different services if that is your preference.
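
The choice between the project default and a feature-level reference can be pictured with a minimal Python sketch. This is an illustration only, not Cloud CMS code: the function `resolve_ocr_service_id` and the `project_default_service_id` argument are hypothetical names, and the precedence shown (a `serviceId` on the feature wins over the project default) is an assumption drawn from the description above.

```python
from typing import Optional

FEATURE_QNAME = "f:auto-ocr-extract"


def resolve_ocr_service_id(node: dict, project_default_service_id: Optional[str]) -> Optional[str]:
    """Hypothetical helper: pick which OCR Extraction Service a node would use.

    Assumes a serviceId set on the feature takes precedence over the
    project-level default. Returns None if the feature is not on the node.
    """
    features = node.get("_features", {})
    if FEATURE_QNAME not in features:
        # The feature is not applied, so no OCR extraction is expected.
        return None

    feature_config = features[FEATURE_QNAME] or {}
    configured = feature_config.get("serviceId")

    # Fall back to the project default when the feature names no service.
    return configured or project_default_service_id
```

Under these assumptions, a node with an empty feature configuration resolves to the project default, while a node whose feature sets `serviceId` resolves to that value.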

The feature is defined as shown here:

{
    "_qname": "f:auto-ocr-extract",
    "_type": "d:feature",
    "type": "object",
    "title": "Auto OCR Extract",
    "description": "Indicates that node should automatically have its attachment processed through optical character recognition for extraction of text, blocks and geometric shape information",
    "systemBootstrapped": true
}

Configuration

<table>
<thead>
    <tr>
        <th>Property</th>
        <th>Type</th>
        <th>Default</th>
        <th nowrap>Read-Only</th>
        <th>Description</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>serviceId</td>
        <td>text</td>
        <td></td>
        <td></td>
        <td>
            The `_doc` identifier of the OCR Extraction Service configuration to use.
        </td>
    </tr>
</tbody>
</table>

Optical Character Recognition

With this feature in place, an OCR extraction process launches whenever you create or update the default attachment for the node. The process analyzes your binary content and detects the following kinds of content:

  • characters
  • words
  • sentences
  • paragraphs
  • pages
  • key/value pairs
  • table columns, rows and cells
  • full text

The extracted content is then stored on the original node so that you can work with it in post-processing. In addition, the Cloud CMS user interface uses this extracted content to show your editorial users, in context, what was extracted.

For more information, please see our formal documentation on how Cloud CMS works with OCR.

Example

If you have a default OCR Extraction Service configured for your project, all you need to do is this:

{
    "title": "My Document",
    "_features": {
        "f:auto-ocr-extract": {
        }
    }
}

Otherwise, you will need to point to an OCR Extraction Service configuration by setting serviceId to its `_doc` identifier:

{
    "title": "My Document",
    "_features": {
        "f:auto-ocr-extract": {
            "serviceId": "ec0a7bfb2d2f971b262e"
        }
    }
}
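
If you assemble these node documents in code, the two variants above can be produced by one small helper. The following is a minimal Python sketch; `build_ocr_node` is a hypothetical name used for illustration and is not part of any Cloud CMS driver. It only builds the JSON body and does not send it anywhere.

```python
import json
from typing import Optional


def build_ocr_node(title: str, service_id: Optional[str] = None) -> dict:
    """Hypothetical helper: build a node body with f:auto-ocr-extract applied.

    When service_id is omitted, the feature configuration is left empty,
    which relies on a default OCR Extraction Service for the project.
    """
    feature_config = {}
    if service_id is not None:
        # Reference a specific OCR Extraction Service by its _doc identifier.
        feature_config["serviceId"] = service_id

    return {
        "title": title,
        "_features": {
            "f:auto-ocr-extract": feature_config
        }
    }


if __name__ == "__main__":
    # Variant that relies on the project's default service.
    print(json.dumps(build_ocr_node("My Document"), indent=4))
    # Variant that points to a specific service configuration.
    print(json.dumps(build_ocr_node("My Document", "ec0a7bfb2d2f971b262e"), indent=4))
```

Keeping the feature configuration empty unless a service is named mirrors the two examples above, so the same helper covers both project setups.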