Auto OCR Extract
QName: f:auto-ocr-extract
With this feature in place, a content instance automatically has its default binary attachment analyzed by an Optical Character Recognition (OCR) service. The extracted analysis and text are then applied back onto the node, so that the node benefits from automatic full-text search indexing and other tooling.
To use this feature, you first need to set up an OCR Extraction Service. The service can either be configured as the default OCR Extraction Service for your Project or be referenced from the f:auto-ocr-extract feature. This lets different feature configurations use different services if you prefer.
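The choice between a feature-level service and the project default can be sketched as a small lookup. This is only an illustration: the function name `resolve_ocr_service_id` and the `project_default_service_id` parameter are hypothetical, and the precedence shown (a `serviceId` on the feature wins over the project default) is an assumption rather than documented behavior.

```python
from typing import Optional

FEATURE_QNAME = "f:auto-ocr-extract"


def resolve_ocr_service_id(node: dict,
                           project_default_service_id: Optional[str]) -> Optional[str]:
    """Return the OCR Extraction Service id that would apply to a node.

    Hypothetical helper. Assumes a serviceId set on the feature takes
    precedence over the project-level default.
    """
    features = node.get("_features") or {}
    if FEATURE_QNAME not in features:
        # The node does not carry the feature, so no OCR service applies.
        return None
    config = features[FEATURE_QNAME] or {}
    # Fall back to the project default when the feature names no service.
    return config.get("serviceId") or project_default_service_id
```

A node with an empty feature configuration resolves to the project default, while a node without the feature resolves to `None`.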
The feature is defined as shown here:
{
    "_qname": "f:auto-ocr-extract",
    "_type": "d:feature",
    "type": "object",
    "title": "Auto OCR Extract",
    "description": "Indicates that node should automatically have its attachment processed through optical character recognition for extraction of text, blocks and geometric shape information",
    "systemBootstrapped": true
}
Configuration
<table>
<thead>
<tr>
<th>Property</th>
<th>Type</th>
<th>Default</th>
<th nowrap>Read-Only</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>serviceId</td>
<td>text</td>
<td></td>
<td></td>
<td>The `_doc` identifier of the OCR Extraction Service configuration to use.</td>
</tr>
</tbody>
</table>
Optical Character Recognition
With this feature in place, an OCR extraction process launches whenever you create or update the default attachment for the node. The process analyzes your binary content and detects meaningful content consisting of:
- characters
- words
- sentences
- paragraphs
- pages
- key/value pairs
- table columns, rows and cells
- full text
The extracted content is then stored on your original node so that you can post-process it. In addition, the Cloud CMS user interface uses this extracted content to give your editorial users in-context insight into what was extracted.
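The trigger condition described in this section (the feature is present and the default attachment was created or updated) can be sketched as a predicate. This is a minimal sketch, not the platform's implementation: the function name `should_run_ocr` is hypothetical, and it assumes the default attachment is identified by the id `default`.

```python
DEFAULT_ATTACHMENT_ID = "default"  # assumption: id of a node's default attachment


def should_run_ocr(node: dict, changed_attachment_id: str) -> bool:
    """Return True when an attachment change would launch OCR extraction.

    Hypothetical helper mirroring the rule in the text: the node must carry
    the f:auto-ocr-extract feature, and the attachment that was created or
    updated must be the default one.
    """
    features = node.get("_features") or {}
    has_feature = "f:auto-ocr-extract" in features
    return has_feature and changed_attachment_id == DEFAULT_ATTACHMENT_ID
```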
For more information, please see our formal documentation on how Cloud CMS works with OCR.
Example
If you have a default OCR Extraction Service configured for your project, all you need to do is this:
{
    "title": "My Document",
    "_features": {
        "f:auto-ocr-extract": {}
    }
}
Otherwise, you will need to point to an OCR Extraction Service configuration using the serviceId property:
{
    "title": "My Document",
    "_features": {
        "f:auto-ocr-extract": {
            "serviceId": "ec0a7bfb2d2f971b262e"
        }
    }
}
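If you generate node JSON programmatically, both variants above can come from one helper. The following is a sketch only: `build_ocr_node` is a hypothetical name, and the code just assembles the JSON shown in the examples without calling any Cloud CMS API.

```python
import json
from typing import Optional


def build_ocr_node(title: str, service_id: Optional[str] = None) -> dict:
    """Build node JSON that enables the f:auto-ocr-extract feature.

    Hypothetical helper. When service_id is omitted, the feature
    configuration is left empty so that the project's default OCR
    Extraction Service would be used.
    """
    feature_config = {}
    if service_id is not None:
        feature_config["serviceId"] = service_id
    return {
        "title": title,
        "_features": {"f:auto-ocr-extract": feature_config},
    }


if __name__ == "__main__":
    # Print both variants from the examples above.
    print(json.dumps(build_ocr_node("My Document"), indent=4))
    print(json.dumps(build_ocr_node("My Document", "ec0a7bfb2d2f971b262e"), indent=4))
```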