Search

Any content you put into Cloud CMS is automatically indexed for full-text and structured search. This lets your editorial teams instantly search for content and find the things they're looking for.

Under the hood, Cloud CMS uses Elastic Search and makes available to your editorial users and developers the full syntax of the Elastic Search Query DSL. This allows you to execute simple searches as well as more complex queries that take into account term and phrase matching, nested operations, logical constructs, fuzziness, proximity, wildcards and regular expression matching.

Cloud CMS automatically maintains Elastic Search indexes for you on a per-branch basis. Thus, you can create as many branches as you would like and each branch will have a uniquely maintained and runtime-ready index to power your search API calls.

Cloud CMS additionally offers the Find API. The Find API lets you execute concurrent MongoDB and Elastic Search queries and compose them into a single, intersecting record set.

If you're looking for a reference on how to write search queries, we recommend visiting our page on Query Strings.

Automatic Search Indexing

Cloud CMS automatically indexes all of your content. This includes both its JSON structure and any binary attachments that belong to the node.

For example, a node might have 3 JSON metadata fields and 2 binary payloads (let's say, a PDF document and a Word document). Cloud CMS will index the JSON fields first and then also index the PDF document and the Word document. To do so, Cloud CMS performs text extraction on each binary file and loads the extracted tokens onto special fields within Elastic Search for discovery. Thus, all of your content is instantly available for full-text search.

Cloud CMS supports a wide variety of file MIME types, including Microsoft Office formats, PDF, text formats and most common audio and video formats. Depending on the MIME type, different elements are automatically extracted. For some formats, such as audio and video, header information is extracted, whereas other formats (such as PowerPoint or PDF) have their textual elements extracted. Cloud CMS essentially tries to extract as much as it can.

Per-branch Search Indexes

Search indexes are maintained at a branch level. If you're working in the master branch, it will have its own index which represents the tip view of content within that branch. If you fork another branch, it will have its own index. These indexes are automatically maintained for you as you use Cloud CMS.

Searching within Projects

Within the Cloud CMS user interface, a search box is available in every project.

From within a project, you can search for all documents contained within that project. Cloud CMS provides a search screen that gives you the ability to write out the text of your search as well as set up common filters (such as property filters, date/time and more).

By default, search results include scores and a few interesting properties. You may further wish to customize the results list to show custom properties. This can be done by writing custom UI Templates.

Federated Search across Projects

Cloud CMS also lets you perform a unified search across multiple projects. From within the Cloud CMS UI, you simply navigate to your platform and type into the search box. This performs a single search across ALL of your projects.

Search results come back with full node properties loaded and some metadata about the search (including each result's score within the relevant search index).

Permissions

As with everything in Cloud CMS, the search API respects the underlying permissions and authorities that have been granted to the objects that are considered result candidates. Authorities are checked before content is retrieved, which means that two people could execute the same search and get different results.

For example, suppose that there are 10 content items with the term "Pink Floyd" in them. An administrator (who has super authorities and can do just about anything) might run this query and get back 10 results. However, user A might only have CONSUMER authorities against 4 of those content items. When user A performs the search, they get back a result set of size 4.

Permissions are baked into Cloud CMS all the way down to the core. If you need to get back the full set of content objects for purposes of synchronization or anything else, make sure that you have sufficient authorities to do so.
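
The Pink Floyd example can be modelled with a short Python sketch. This is a toy illustration only, not Cloud CMS code: the Item class, the readers set and the is_admin flag are hypothetical names, and real authority checks happen inside the platform.

```python
# Toy model of permission-aware search; all names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Item:
    id: int
    text: str
    # Principals assumed to hold at least CONSUMER authority on this item.
    readers: set = field(default_factory=set)


def search(items, term, principal, is_admin=False):
    """Return items that match the term AND that the principal may read."""
    term = term.lower()
    return [
        item
        for item in items
        if term in item.text.lower() and (is_admin or principal in item.readers)
    ]


# Ten items mention "Pink Floyd"; user A can read only four of them.
items = [Item(i, f"Pink Floyd album {i}") for i in range(10)]
for item in items[:4]:
    item.readers.add("userA")

admin_results = search(items, "Pink Floyd", "admin", is_admin=True)
user_results = search(items, "Pink Floyd", "userA")
print(len(admin_results), len(user_results))
```

The same query yields a different result set depending on who runs it, which is the behavior described above.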

How to Describe Searches

To run a search, you simply pass Cloud CMS a JSON object or some text that you wish to search for.
If you pass some text, then the text is expected to be an Elastic Search Query String. If you pass a JSON object, then the object is expected to conform to the Elastic Search Query DSL.

Using a Query String

A text search involves passing a string that might simply be a keyword, such as:

"joe smith"

This will run a search across all of your content and find any content where the phrase joe smith exists. The search is case-insensitive, so it will also match content containing Joe Smith or JOE SMITH.

You can also pass text that contains a more elaborate Query String. A Query String is part of Elastic Search's Query DSL and must conform to the Elastic Search syntax for Query Strings:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

Using this DSL, you can express a far more complicated query as a bit of text. For example, you might want to find all content of type my:article created in 2018 that contains the text Cloud CMS is awesome using a proximity of 5:

__type:"my:article" AND _system.created_on.year:2018 AND "Cloud CMS is awesome"~5

For a full reference on text searches, please see our reference on Query Strings.
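
If you assemble query strings in code, a small helper can keep the quoting consistent. The following Python sketch rebuilds the example above; the function names quoted and build_query are illustrative and not part of any Cloud CMS driver.

```python
def quoted(value: str) -> str:
    """Wrap a value in double quotes, escaping embedded backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(type_qname: str, year: int, phrase: str, slop: int) -> str:
    """Combine a type clause, a creation-year clause and a proximity phrase with AND."""
    clauses = [
        f"__type:{quoted(type_qname)}",
        f"_system.created_on.year:{year}",
        f"{quoted(phrase)}~{slop}",
    ]
    return " AND ".join(clauses)


query = build_query("my:article", 2018, "Cloud CMS is awesome", 5)
print(query)
```

Quoting the type name keeps the colon in my:article from being read as a field separator.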

Using Query JSON

If you send a JSON object, then the JSON object should be expressed using the Elastic Search Query DSL. This DSL lets you take full control of the search mechanics and perform 100% of the search functionality that Elastic Search offers.

In contrast, a Query String (text) is more limited in its expression. A query string is effectively parsed into its JSON form before being executed. For example, a query string search for "joe smith" ends up looking like the following in JSON form:

{
    "query_string": {
        "query": "joe smith"
    }
}

The JSON DSL is very powerful and gives you full access to everything Elastic Search can do. If you wish to take full advantage of Elastic Search and find that you cannot achieve what you want to achieve using a textual query string, you will eventually want to write your queries as JSON.
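
The relationship between the two forms can be sketched in Python. This is only an illustration of the idea that text is wrapped in a query_string clause while JSON is used as given; the function name to_query_dsl is hypothetical and does not describe how Cloud CMS implements the conversion internally.

```python
import json


def to_query_dsl(search):
    """Normalize a search to Query DSL: wrap text, pass JSON objects through."""
    if isinstance(search, str):
        return {"query_string": {"query": search}}
    if isinstance(search, dict):
        return search
    raise TypeError("search must be a query string or a Query DSL object")


text_form = to_query_dsl("joe smith")
json_form = to_query_dsl({"match": {"title": "joe smith"}})
print(json.dumps(text_form, indent=4))
```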

Search API

There are several methods available on the Cloud CMS REST API to perform searches. All of the REST methods take a repository and a branch, which together identify the index to be searched.

You can perform simple text-based searches using little more than a GET method like this:

GET /repositories/{repositoryId}/branches/{branchId}/nodes/search?text={text}

Or you can perform more elaborate searches using the Elastic Search JSON DSL.

POST /repositories/{repositoryId}/branches/{branchId}/nodes/search

Set your request content type to application/json and pass a payload consisting of the Elastic Search DSL configuration block. For example, you might do this:

POST /repositories/{repositoryId}/branches/{branchId}/nodes/search
{
    "search": {
        "query": {
            "query_string" : {
                "default_field" : "content",
                "query" : "this AND that OR thus"
            }
        }
    }
}
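
As a rough sketch, both calls can be prepared with Python's standard library. The base URL below is a placeholder, the helper names are illustrative, and authentication is omitted entirely; the requests are only constructed here, not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder, not a real endpoint


def nodes_search_url(repository_id: str, branch_id: str) -> str:
    """Build the nodes/search URL for a repository and branch."""
    repo = quote(repository_id, safe="")
    branch = quote(branch_id, safe="")
    return f"{BASE_URL}/repositories/{repo}/branches/{branch}/nodes/search"


def text_search_request(repository_id: str, branch_id: str, text: str) -> Request:
    """GET request carrying the search text in the 'text' query parameter."""
    url = nodes_search_url(repository_id, branch_id) + "?" + urlencode({"text": text})
    return Request(url, method="GET")


def dsl_search_request(repository_id: str, branch_id: str, query: dict) -> Request:
    """POST request carrying a Query DSL block as an application/json payload."""
    body = json.dumps({"search": {"query": query}}).encode("utf-8")
    return Request(
        nodes_search_url(repository_id, branch_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


get_req = text_search_request("repo1", "master", "joe smith")
post_req = dsl_search_request(
    "repo1",
    "master",
    {"query_string": {"default_field": "content", "query": "this AND that OR thus"}},
)
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data.decode("utf-8"))
```

To actually run a search, you would add your credentials and pass either request to urllib.request.urlopen.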

Search Results

Search results are handed back in the API's Collections Envelope format. It may look something like this:

{
    "rows": [
        {
            "title": "My First Article",
            "_search": {
                "score": 10.81413
            }
        },
        {
            "title": "My Second Article",
            "_search": {
                "score": 3.56121
            }
        },
        {
            "title": "My Third Article",
            "_search": {
                "score": 1.01982
            }
        }                
    ],
    "size": 3,
    "total_rows": 1002,
    "offset": 0
}
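
A client can read the envelope to work out how much of the result set it has seen. The Python sketch below assumes that offset is the zero-based index of the first returned row and that size is the number of rows in this page; the page_info helper is illustrative.

```python
envelope = {
    "rows": [
        {"title": "My First Article", "_search": {"score": 10.81413}},
        {"title": "My Second Article", "_search": {"score": 3.56121}},
        {"title": "My Third Article", "_search": {"score": 1.01982}},
    ],
    "size": 3,
    "total_rows": 1002,
    "offset": 0,
}


def page_info(envelope: dict) -> dict:
    """Summarize a page: its titles, how many rows remain, and where the next page starts."""
    seen = envelope["offset"] + envelope["size"]
    remaining = max(envelope["total_rows"] - seen, 0)
    return {
        "titles": [row["title"] for row in envelope["rows"]],
        "remaining": remaining,
        "next_offset": seen if remaining else None,
    }


info = page_info(envelope)
print(info)
```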

Scoring of Search Results

When using the Search REST API, each result that comes back will have a special _search property on it. This property is populated with information provided by Elastic Search. It will contain the score that Elastic Search computed (as shown above).

Higher scores indicate a better or more relevant match.
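
Since the score travels with each row, a client can apply its own relevance cutoff after the search returns. Here is a minimal Python sketch; the threshold value and the best_matches helper are arbitrary choices for illustration.

```python
rows = [
    {"title": "My Second Article", "_search": {"score": 3.56121}},
    {"title": "My Third Article", "_search": {"score": 1.01982}},
    {"title": "My First Article", "_search": {"score": 10.81413}},
]


def best_matches(rows: list, min_score: float) -> list:
    """Keep rows scoring at least min_score, ordered from most to least relevant."""
    scored = [(row.get("_search", {}).get("score", 0.0), row) for row in rows]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in kept]


top = best_matches(rows, min_score=3.0)
print([row["title"] for row in top])
```

Scores are relative to a given query, so a fixed cutoff that suits one search may be too strict or too loose for another.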

Limiting Fields in Search Results

To limit the fields that come back in your search, use the _fields subobject. This subobject provides key/value pairs where each key is the name of a field to include.

Suppose we have articles with the following schema:

{
    ...,
    "properties": {
        "title": {
            "type": "string"
        },
        "description": {
            "type": "string"
        },
        "category": {
            "type": "string"
        }
    }
}

We could run a search and hand back only the title and the category like this:

POST /repositories/{repositoryId}/branches/{branchId}/nodes/search
{
    "search": {    
        "query": {        
            "query_string" : {
                "query" : "hello world"
            }
        }
    },
    "_fields": {
        "title": 1,
        "category": 1
    }
}

And the results might look like this:

{
    "rows": [
        {
            "title": "My First Article",
            "category": "blue"
        },
        {
            "title": "My Second Article",
            "category": "red"
        },
        {
            "title": "My Third Article",
            "category": "blue"
        }                
    ],
    "size": 3,
    "total_rows": 101,
    "offset": 0
}
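
If you build these payloads programmatically, the _fields block can be derived from a list of field names. The following Python sketch reproduces the request body shown above; the search_payload helper is illustrative, not a Cloud CMS API.

```python
import json


def search_payload(query_text: str, fields=None) -> dict:
    """Build a search body, optionally limiting the returned fields."""
    payload = {"search": {"query": {"query_string": {"query": query_text}}}}
    if fields:
        # Each requested field name becomes a key with the value 1.
        payload["_fields"] = {name: 1 for name in fields}
    return payload


payload = search_payload("hello world", fields=["title", "category"])
print(json.dumps(payload, indent=4))
```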

Analyzers

Cloud CMS automatically configures your Elastic Search indexes with mappings and settings for your content model. Included in that configuration is a set of analyzers that provide some sensible default search behavior.

The following analyzers are provided:

  • default - applied by default. Uses the standard tokenizer with the lowercase and asciifolding filters. This supports case-insensitive search.
  • basic - uses the standard tokenizer and the lowercase filter only. Because it does not fold characters to ASCII, this analyzer distinguishes between diacritic characters and their plain equivalents.

Diacritic Characters

Cloud CMS supports the indexing and search of text that includes diacritic characters. Diacritic characters are characters that carry a mark (a diacritic) added to a basic, often Latin-derived letter. These include characters such as ë, ñ, č and many others.

Diacritic characters are indexed as-is, allowing for exact matches. They are also indexed using ASCII folding such that those characters are mapped to their reduced equivalents. For example:

  • ë -> e
  • ñ -> n
  • č -> c

Suppose you had two content items:

{
    "title": "España is beautiful in the summer"
}

and

{
    "title": "Spain is often misspelled as Espana"
}

If you were to search for España, you would get two results since the ASCII folding implementation will simplify the term to give you more search results. Bear in mind that the scoring of the search results will reflect the precision of the match and you can use that, after the fact, to remove results that you do not wish to keep.

If you were to search for Espana, you would also get two results, for the same reason as stated above.

If you wanted to search for an exact match on España, you can do so by using a non-default analyzer with your query. For example, the basic analyzer does not include the asciifolding filter and will produce exact matches.

You could run the following:

{
    "query": {
        "query_string": {
            "query": "España",
            "analyzer": "basic"
        }
    }
}

And you will get back 1 result.
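
The behavior described in this section can be approximated with a small Python model. It is a simplification and not the Elastic Search implementation: folding is imitated with Unicode decomposition, tokenizing with a regular expression, and the fold, analyze and search names are illustrative.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Approximate ASCII folding by dropping combining marks (ñ -> n, č -> c)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str, analyzer: str) -> list:
    """'basic' lowercases tokens; 'default' lowercases and folds them."""
    tokens = [t.lower() for t in re.findall(r"\w+", unicodedata.normalize("NFC", text))]
    if analyzer == "default":
        tokens = [fold(t) for t in tokens]
    return tokens


def index_tokens(text: str) -> set:
    """Content is indexed both as-is and folded, as described above."""
    return set(analyze(text, "basic")) | set(analyze(text, "default"))


def search(docs: list, term: str, analyzer: str = "default") -> list:
    """Return docs whose indexed tokens contain every analyzed query token."""
    query_tokens = analyze(term, analyzer)
    return [d for d in docs if all(t in index_tokens(d["title"]) for t in query_tokens)]


docs = [
    {"title": "España is beautiful in the summer"},
    {"title": "Spain is often misspelled as Espana"},
]

print(len(search(docs, "España")))           # default analyzer
print(len(search(docs, "Espana")))           # default analyzer
print(len(search(docs, "España", "basic")))  # exact match only
```

In this model the default analyzer folds the query term, so it matches both items, while the basic analyzer keeps the ñ and matches only the item that contains it.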

Special / Escapable Characters

Elastic Search requires that you escape certain special characters when using them within a query. These characters include:

+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

To include these characters in a search, you will need to escape them with a backslash (\) as a prefix. When the query is embedded in a JSON string, the backslash itself must be escaped, so it is written as \\.

For example, suppose you wanted to search for [What's the Frequency, Kenneth?]. This has several special characters in it including [, ] and ?.

To search for this from within a JSON query, you'd escape the search term to:

\\[What's the Frequency, Kenneth\\?\\]

For more information, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_reserved_characters
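
Escaping by hand is error-prone, so a helper is often useful. The Python sketch below is one possible approach and the escape_query_text name is illustrative. It prefixes each reserved character with a backslash and, as a design choice, drops < and > altogether, because the Elastic Search documentation notes that those two cannot be escaped.

```python
import json

# Characters that can be escaped with a backslash prefix.
ESCAPABLE = set('+-=&|!(){}[]^"~*?:\\/')
# Characters that cannot be escaped and are removed instead.
UNESCAPABLE = set("<>")


def escape_query_text(text: str) -> str:
    """Escape reserved query string characters in user-supplied text."""
    out = []
    for ch in text:
        if ch in UNESCAPABLE:
            continue
        if ch in ESCAPABLE:
            out.append("\\")
        out.append(ch)
    return "".join(out)


escaped = escape_query_text("[What's the Frequency, Kenneth?]")
as_json = json.dumps({"query": escaped})
print(escaped)   # one backslash before each reserved character
print(as_json)   # JSON encoding doubles each backslash
```

Serializing the escaped text with a JSON library takes care of the second level of escaping, so you only need to apply the query-level escaping yourself.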

Examples

Please take a look at the Search Examples page for examples of searches using both the full-text search approach and the Elastic Search JSON DSL.

Further Reading