Bulk Import

Cloud CMS provides a bulk import tool that makes it easy to load content into Cloud CMS from a variety of external file formats and data sources. The bulk import tool ingests this data and writes it into a Cloud CMS branch within a single transaction so that you don't suffer from partial imports due to a failure along the way.

The Cloud CMS bulk import tool is designed to help you migrate existing content into Cloud CMS. This may include desktop files or structured data from legacy content management systems. Cloud CMS customers have used the bulk import tool to migrate content from XML, JSON, CSV files, SQL databases and other CMS products including Adobe Experience Manager, Alfresco, Contentful and Drupal.

The bulk import tool is delivered as a Node.js module known as the Cloud CMS Packager.

The Cloud CMS Packager provides an easy way for you to programmatically create a new archive on disk (i.e. the package) and populate it with your content types and content instances. When you're done, the archive is written to disk as a ZIP file.

The ZIP file is a Cloud CMS Archive that conforms to the Cloud CMS transfer format. You can therefore upload this ZIP file into a Cloud CMS Vault and import it into your branches.

In this way, the Cloud CMS Packager gives you an offline way to build Cloud CMS compatible archives that can be transactionally imported into your Cloud CMS projects.

Getting Started with Packager

To use the Cloud CMS Packager, you simply write a bit of Node.js code that pulls in the Packager module. You can then use this module to create a Packager.

Here is an example of some packager code that imports a single "Hello World" node:

var PackagerFactory = require("cloudcms-packager");
PackagerFactory.create(function(err, packager) {

    if (err) {
        return console.error(err);
    }

    // let's put things into the package...
    packager.addNode({
        "title": "Hello World"
    });

    // package up the archive
    packager.package(function(err, info) {

        if (err) {
            return console.error(err);
        }

        console.log("All done - wrote file: " + info.filename);
    });

});

The create method creates a new Packager instance. This allocates some temp space on disk where your objects will be stored as you add them. If something goes wrong while creating the Packager, an err is handed back.

The Packager instance exposes a number of useful methods that you can use to add content into the instance. These methods let you explicitly add objects (as shown above) and they also let you do things like consume content from directories on disk or add binary attachments. More on this below.

Finally, once you've populated the Packager instance with your content, you call the package method to ZIP up your content and store it on disk.

The following output is what you might see if you ran one of the Packager examples:

New package, working dir: /var/folders/zv/r_xmfn113hs5nnq9p7rkrjqr0000gn/T/packager118727-70500-1yj7ud0.on08k
[1535386678601] Resolving 737 aliases
[1535386678601] Resolved: 0 of 737 aliases
[1535386678603] Resolved: 737 aliases
[1535386678604] Resolution of aliases complete
[1535386678604] Compilation references and aliases resolved successfully
[1535386678604] Binding attachments
[1535386678605] Completed binding of attachments
[1535386678605] Starting cleanup of objects
[1535386678608] Completed cleanup of objects
[1535386678608] Starting verification of objects
[1535386678609] Completed verification of objects
[1535386678610] Total number of records to write: 737
[1535386678610] Writing out: 1 parts of size: 50000 each
[1535386678610] Writing single archive
[1535386678610] Using temp location: /var/folders/zv/r_xmfn113hs5nnq9p7rkrjqr0000gn/T/packager118727-70500-1yj7ud0.on08k/package
[1535386678617] Wrote: 1 of 737 objects
[1535386678876] Wrote: 737 of 737 objects
[1535386678876] Handled attachments for: 1 of 737 objects
[1535386678887] Handled attachments for: 737 of 737 objects
[1535386678897] Writing manifest
[1535386678923] Completed commit of objects to disk
[1535386678923] Creating archive file
[1535386680164] Wrote [group=worldcup2014squads, name=example1, version=1535386678553]: archives/worldcup2014squads-example1-1535386678553.zip

This created an archive on disk at the path archives/worldcup2014squads-example1-1535386678553.zip.

The archive has the following properties:

group: worldcup2014squads
name: example1
version: 1535386678553

Uploading

Once the archive has been created on disk, you can upload it to your Cloud CMS project / repository branch. You can either do this by hand or via the command line.

To do this via the command line, you will first need to install the Cloud CMS Command Line Tool.

You will then need to add a gitana.json file to your directory. This tells the command line tool how to connect to your Cloud CMS installation.

And then you can simply run:

cloudcms archive upload --group <group> --artifact <name> --version <version>

For example, using the output file from above, you might run:

cloudcms archive upload --group worldcup2014squads --artifact example1 --version 1535386678553

The tool uploads the given archive to your vault and lets you know once it has completed.

At that point, you can import your package to your project's repository branch. You'll need a repositoryId and branchId for this. To get these, go to a piece of content in the desired project and branch and click 'View in API'. The url for the page you are sent to will be of the form: /proxy/repositories/{repositoryId}/branches/{branchId}/nodes/{nodeId}.
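If you are scripting this step, the IDs can be pulled out of the 'View in API' URL with a small helper. The following is a minimal sketch that relies only on the URL form described above. The function name parseApiUrl, the host name and the node ID in the sample URL are made up for illustration.

```javascript
// Hypothetical helper: extracts the IDs from a 'View in API' URL of the form
// /proxy/repositories/{repositoryId}/branches/{branchId}/nodes/{nodeId}
function parseApiUrl(url) {
    var match = /\/repositories\/([^\/]+)\/branches\/([^\/]+)\/nodes\/([^\/?#]+)/.exec(url);
    if (!match) {
        // the URL does not have the expected shape
        return null;
    }
    return {
        repositoryId: match[1],
        branchId: match[2],
        nodeId: match[3]
    };
}

// The host and node ID below are placeholders
var ids = parseApiUrl("https://example.cloudcms.net/proxy/repositories/38e31f2e53d196d5647a/branches/38cb30d7fd13a6d420bd/nodes/0123456789abcdef0123");
console.log(ids.repositoryId, ids.branchId);
```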

Once you have this, to import the archive, the command is:

cloudcms branch import --group <group> --artifact <name> --version <version> --repository <repositoryId> --branch <branchId>

So for our example, you might run:

cloudcms branch import --group worldcup2014squads --artifact example1 --version 1535386678553 --repository 38e31f2e53d196d5647a --branch 38cb30d7fd13a6d420bd
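Because the flags map directly onto the group, name and version of the archive, both commands can be generated from those three values. Here is a small sketch of that idea. The helper name buildCommands is hypothetical, and the values are interpolated as-is, so it assumes they contain no spaces or shell-special characters.

```javascript
// Hypothetical helper: builds the upload and import command lines shown above
// from an archive's group, name and version.
function buildCommands(archive, repositoryId, branchId) {
    var coordinates = "--group " + archive.group +
        " --artifact " + archive.name +
        " --version " + archive.version;
    return {
        upload: "cloudcms archive upload " + coordinates,
        branchImport: "cloudcms branch import " + coordinates +
            " --repository " + repositoryId + " --branch " + branchId
    };
}

var commands = buildCommands({
    "group": "worldcup2014squads",
    "name": "example1",
    "version": "1535386678553"
}, "38e31f2e53d196d5647a", "38cb30d7fd13a6d420bd");

console.log(commands.upload);
console.log(commands.branchImport);
```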

For more information on the command line tool, we recommend visiting the Cloud CMS Command Line Tool page.

Examples

We've put together a few examples to give you a better idea of what is possible and also to give you a head start in terms of building out your own migration scripts.

The following examples are available:

Aliases

When you add content to the packager, you may specify an _alias field to provide a unique ID that you may use to reference the content across objects. The _alias serves as a temporary ID for the object during the packaging. It does not get persisted to the backend once the content is imported.
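Since an _alias is only useful if it is unique within the package, it can be worth checking your source data for duplicates before you add anything. The following is a minimal sketch of such a check. The function findDuplicateAliases is hypothetical and is not part of the Packager API.

```javascript
// Hypothetical pre-flight check: returns every _alias value that appears
// on more than one of the given node objects.
function findDuplicateAliases(nodes) {
    var seen = new Set();
    var duplicates = new Set();
    nodes.forEach(function(node) {
        var alias = node._alias;
        if (!alias) {
            return; // nodes without an alias cannot collide
        }
        if (seen.has(alias)) {
            duplicates.add(alias);
        }
        seen.add(alias);
    });
    return Array.from(duplicates);
}

var duplicateAliases = findDuplicateAliases([
    { "title": "Daenerys Targaryen", "_alias": "dt" },
    { "title": "Jon Snow", "_alias": "js" },
    { "title": "Daenerys (duplicate)", "_alias": "dt" }
]);
console.log(duplicateAliases);
```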

Associations

In addition to content instances, you may manually define associations. Associations are nodes (in their own right) that follow a specific structure. The easiest way to add an association is via the addAssociation() method as described in the API section below.

You may wish to use the _alias values for the source node and the target node to connect your associations.

Suppose you had two nodes like this:

{
    "title": "Daenerys Targaryen",
    "firstName": "Daenerys",
    "lastName": "Targaryen",
    "_alias": "dt"
}

And:

{
    "title": "Jon Snow",
    "firstName": "Jon",
    "lastName": "Snow",
    "_alias": "js"
}

You could add an association using the addAssociation() method like this:

packager.addAssociation("dt", "js", {
    "_type": "a:linked",
    "familyMembers": true
});

Here is another example with a simple parent/child relationship:

packager.addNode({
    "title": "The Parent Node",
    "_alias": "n1"
});
packager.addNode({
    "title": "The Child node",
    "_alias": "n2"
});
packager.addAssociation("n1", "n2", {
    "_type": "a:child"
});
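When migrating a hierarchy from another system, the same pattern can be applied in a loop. The sketch below assumes a hypothetical list of legacy rows, each with an id, a title and an optional parentId, and derives each _alias from the legacy id. The function addTree and the "legacy-" prefix are illustrative choices and not part of the Packager API. Nodes are added before associations simply to mirror the example above.

```javascript
// Hypothetical helper: adds legacy rows as nodes and links each child
// to its parent with an a:child association.
function addTree(packager, rows) {
    // add every row as a node, using the legacy id to build the _alias
    rows.forEach(function(row) {
        packager.addNode({
            "title": row.title,
            "_alias": "legacy-" + row.id
        });
    });

    // connect each row that has a parent to that parent
    rows.forEach(function(row) {
        if (row.parentId !== undefined && row.parentId !== null) {
            packager.addAssociation("legacy-" + row.parentId, "legacy-" + row.id, {
                "_type": "a:child"
            });
        }
    });
}

// Usage, given a packager created as shown in Getting Started:
// addTree(packager, [
//     { "id": 1, "title": "The Parent Node" },
//     { "id": 2, "title": "The Child node", "parentId": 1 }
// ]);
```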

Relator Properties

The bulk import process also provides support for Relator Properties. Relator Properties are special properties that connect one or more nodes together. They automatically compute and manage graph associations in the background for you.

To populate a relator property, you can use the special __related_node__ field.

Suppose, for example, that you have an author who looks like this:

{
    "title": "Daenerys Targaryen",
    "firstName": "Daenerys",
    "lastName": "Targaryen",
    "_alias": "dt"
}

This author has an _alias defined with the value dt.

You might then have an article with a relator property named authoredBy. You can populate it like this:

{
    "title": "Bend the Knee",
    "body": "How to bend the knee",
    "authoredBy": {
        "__related_node__": "dt"
    }
}

When the content is imported, the relator property will be connected for you automatically.

Here is another example shown in one fell swoop:

packager.addNode({
    "title": "The Related Node",
    "_alias": "related1"
});
packager.addNode({
    "title": "The Relating node",
    "points-to": {
        "__related_node__": "related1"
    }
});

Existing (Auto-Merge)

When the importer runs, it will naturally check for collisions between importing content and existing content based on the following properties:

_doc
_qname
The content path

This is how Cloud CMS works by default, so imported content will merge with any existing content that matches on one of these properties. However, you may also wish to specify a custom match. You can do this using the _existing parameter.

Suppose a piece of content already exists like this:

{
    "title": "Two-Handle Bathroom Faucet",
    "model": "TS6925",
    "description": "A soft-stream water flow, Doux is the perfect addition to any bathroom interior."
}

We could replace that content with imported content if our imported content specifies _existing like this:

{
    "title": "Doux Chrome Two-Handle High Arc Bathroom Faucet",
    "model": "TS6925",
    "description": "A graceful arc and unique, soft-stream water flow, make Doux the perfect addition to any bathroom interior.",
    "_existing": {
        "model": "TS6925"
    }
}
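If many of your items should match on the same fields, the _existing block can be derived from the item itself rather than written by hand. Here is a minimal sketch. The helper withExisting is hypothetical, and it assumes a simple equality match on the listed fields, as in the example above.

```javascript
// Hypothetical helper: returns a copy of the node with an _existing block
// built from the node's own values for the given match fields.
function withExisting(node, matchFields) {
    var existing = {};
    matchFields.forEach(function(field) {
        existing[field] = node[field];
    });
    return Object.assign({}, node, { "_existing": existing });
}

var faucet = withExisting({
    "title": "Doux Chrome Two-Handle High Arc Bathroom Faucet",
    "model": "TS6925"
}, ["model"]);
console.log(JSON.stringify(faucet._existing));
```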

Large Data Sets

If you're importing a particularly large data set, you may need to increase the amount of memory allocated to your Node process. You can do this by setting the max_old_space_size option to a sufficiently high value in MB.

export NODE_OPTIONS=--max_old_space_size=4096
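To confirm that the setting has taken effect, you can print the heap limit from inside your script using Node's built-in v8 module. This is a small sketch; the reported figure is the total heap limit, so expect it to be somewhat higher than the old space size you configured.

```javascript
var v8 = require("v8");

// heap_size_limit is reported in bytes; convert it to MB for readability
var heapLimitMb = Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
console.log("Heap size limit: " + heapLimitMb + " MB");
```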

API Reference

To create a Packager instance, call the create method on the Packager Factory.

`packagerFactory.create([config], callback)`

The config argument is optional. It lets you configure global parameters for the packager. If not provided, the following defaults are used:

{
    "outputPath": "./archives",
    "archiveGroup": "packager",
    "archiveName": "import",
    "archiveVersion": new Date().getTime()
}

This results in an archive being created in the archives directory relative to the working directory. The output file path will be of the form ./archives/packager-import-1535391537589.zip, where the version segment is a timestamp.
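To see how a partial config combines with these defaults, consider the following sketch. It only illustrates the documented behavior and is not the module's implementation. The functions resolveConfig and archivePath are hypothetical, and the group-name-version.zip naming is inferred from the sample output shown earlier.

```javascript
var path = require("path");

// Hypothetical illustration: values supplied by the caller win over the defaults
function resolveConfig(config) {
    var defaults = {
        "outputPath": "./archives",
        "archiveGroup": "packager",
        "archiveName": "import",
        "archiveVersion": new Date().getTime()
    };
    return Object.assign({}, defaults, config || {});
}

// Hypothetical illustration: assumes files are named <group>-<name>-<version>.zip
function archivePath(config) {
    var fileName = [config.archiveGroup, config.archiveName, config.archiveVersion].join("-") + ".zip";
    return path.join(config.outputPath, fileName);
}

var resolved = resolveConfig({
    "archiveGroup": "worldcup2014squads",
    "archiveName": "example1",
    "archiveVersion": 1535386678553
});
console.log(archivePath(resolved));
```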

The callback signature receives the instantiated packager or an error if something went wrong.

packagerFactory.create(function(err, packager) {

    if (err) {
        return console.error(err);
    }

    // let's start packaging!

});

`packager.addNode(json)`

Adds a content item to the packager. The incoming JSON provides the properties for the content item.

The JSON should specify the _type attribute. If _type is not provided, n:node will be assumed.

Here is an example where we add the esteemed scientist Carl Sagan to our packager:

packager.addNode({
    "title": "Carl Sagan",
    "_type": "my:author"
});

`packager.addAssociation(source, target, json)`

Adds a content association to the packager. The incoming JSON provides the properties for the association.

The JSON should specify the _type attribute. If _type is not provided, a:linked will be assumed.

The source and target should be the respective _alias values for the source and target nodes in the association.

Here is an example:

packager.addAssociation("sourceAlias", "targetAlias", {
    "_type": "my:association-type",
    "quality": 3
});

`packager.addAttachmentFromDisk(_doc, attachmentId, attachmentSource)`

Adds an attachment to a node or an association.

The _doc may either be the node ID or the _alias of the node.

Here is an example:

packager.addAttachment("dt", "default", "images/daenerys.jpg");

`packager.addFromDisk(filePath, [typeQName])`

Parses a JSON file from disk and packages the content from that JSON file as nodes of the given type.

The JSON may either be a JSON object ({}) or a JSON array ([]) of objects.

The typeQName argument is optional. If it isn't provided, the value n:node is assumed.

Suppose the following file exists on disk at data/nodes.json:

[{
    "title": "Carl Sagan"
}, {
    "title": "Stephen Hawking"
}]

We could import it with the following command:

packager.addFromDisk("data/nodes.json", "my:scientist");
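If you would like to preview what a file contains before handing it to the packager, the description above can be approximated in a few lines. This is only a sketch of the documented behavior and not the Packager's implementation. The function loadNodesFromDisk is hypothetical, and it assumes that the type argument is applied to every item in the file.

```javascript
var fs = require("fs");

// Hypothetical preview helper: reads a JSON file holding either one object
// or an array of objects and returns node objects of the given type.
function loadNodesFromDisk(filePath, typeQName) {
    var parsed = JSON.parse(fs.readFileSync(filePath, "utf8"));

    // accept either a single object or an array of objects
    var items = Array.isArray(parsed) ? parsed : [parsed];

    return items.map(function(item) {
        // assumption: the type argument applies to every item, defaulting to n:node
        return Object.assign({}, item, { "_type": typeQName || "n:node" });
    });
}

// Usage:
// var nodes = loadNodesFromDisk("data/nodes.json", "my:scientist");
// console.log(nodes.length + " nodes found");
```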

`packager.addDirectory(directoryName)`

Adds the contents of a directory (and any subdirectories) to the packager. The directory is scanned for the following files:

node.json
association.json

These files are added to the packager. The packager then looks for any sub-folders that may contain additional information related to these nodes/associations. These folders include:

attachments
forms
translations

Here is an example of a directory structure that could be consumed:

    folder/
        carl-sagan/
            node.json
            attachments/
                default.jpg
                avatar.jpg
            translations/
                translation1.json
        my_author/
            node.json
            forms/
                default.json

See the examples for a further reference on this.
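Before pointing the packager at a large directory tree, it can help to list the descriptor files that a scan would come across. The following sketch walks a directory recursively and collects every node.json and association.json it finds. The function findDescriptors is hypothetical and is not how the Packager itself is implemented.

```javascript
var fs = require("fs");
var path = require("path");

// Hypothetical pre-flight helper: recursively collects the paths of all
// node.json and association.json files under the given directory.
function findDescriptors(directoryName) {
    var found = [];
    fs.readdirSync(directoryName, { withFileTypes: true }).forEach(function(entry) {
        var fullPath = path.join(directoryName, entry.name);
        if (entry.isDirectory()) {
            // descend into sub-folders
            found = found.concat(findDescriptors(fullPath));
        } else if (entry.name === "node.json" || entry.name === "association.json") {
            found.push(fullPath);
        }
    });
    return found;
}

// Usage:
// findDescriptors("folder").forEach(function(file) {
//     console.log(file);
// });
```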

`packager.getNodesWithType(type)`

Looks across all of the nodes that are currently added to the packager and hands back a map of any nodes (or associations) that have the given type. The map is keyed by node ID.

Example:

var results = packager.getNodesWithType("my:author");
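Because the result is a map rather than an array, you iterate over its keys to work with the matches. The sketch below collects the titles of every node of a given type. The helper titlesByType is hypothetical, and it assumes that each value in the map is the JSON object that was added for that node.

```javascript
// Hypothetical helper: returns the titles of all nodes of the given type.
// Assumes each map value is the node's JSON, keyed by node ID.
function titlesByType(packager, type) {
    var results = packager.getNodesWithType(type);
    return Object.keys(results).map(function(nodeId) {
        return results[nodeId].title;
    });
}

// Usage, given a populated packager:
// console.log(titlesByType(packager, "my:author"));
```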

`packager.package([config], callback)`

Packages up the contents of the packager into a ZIP file.

The config argument is optional. It lets you configure global parameters for the packager. This config is identical to the packagerFactory.create method's config argument. It allows you to specify config at package time should you prefer.

The callback argument takes an error as the first argument and the successful package information as the second argument.

packager.package(function(err, info) {

    if (err) {
        return console.error(err);
    }

    // success!
});

The info object looks like this:

{
    "group": "worldcup2014squads",
    "name": "example1",
    "version": "1535386678553",
    "filename": "archives/worldcup2014squads-example1-1535386678553.zip"
}
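Since the callback follows the usual Node.js error-first convention, the call is easy to wrap in a Promise if your migration script uses async/await. Here is a minimal sketch. The wrapper packageAsync is hypothetical and is not provided by the Packager module.

```javascript
// Hypothetical wrapper: exposes packager.package as a Promise that resolves
// with the info object or rejects with the error.
function packageAsync(packager, config) {
    return new Promise(function(resolve, reject) {
        var callback = function(err, info) {
            if (err) {
                return reject(err);
            }
            resolve(info);
        };

        // config is optional, so only pass it along when it was supplied
        if (config) {
            packager.package(config, callback);
        } else {
            packager.package(callback);
        }
    });
}

// Usage, given a populated packager:
// packageAsync(packager).then(function(info) {
//     console.log("All done - wrote file: " + info.filename);
// }).catch(console.error);
```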

Tips and Tricks

The _key property may be used to provide a unique key for your content items upon import. If a _key field is provided, it will be used to auto-generate an _existing field for collision detection. This key will be compared with existing content items. If collisions are found, the existing content items will be overwritten. This allows you to run your import multiple times successively.

The _alias property allows you to define a unique tag that you can use to refer to your imported element from within other elements. This is very helpful in terms of relational modeling.

A single ZIP archive of your package will be imported transactionally. That is to say, all of your content will import in one fell swoop (or none of it will).

If you have more than 50,000 items in your package, the package may split across multiple archives. You will see a suffix added to your archive files, such as:

archives/org.mycompany-MyContent-02-05-2018-1.zip
archives/org.mycompany-MyContent-02-05-2018-2.zip
archives/org.mycompany-MyContent-02-05-2018-3.zip

This has no impact on the importer mechanics.

Uploading will upload multiple archives and importing will import multiple archives.

The archives will still be imported within a single, transactional commit.
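As a rough guide to how many archive files to expect, you can divide your item count by the part size. The sketch below assumes a fixed part size of 50,000, which is the figure reported in the sample log output ("Writing out: 1 parts of size: 50000 each"). The function countArchiveParts is hypothetical, and the packager's actual splitting logic may differ.

```javascript
// Part size taken from the sample log output; treated here as an assumption
var PART_SIZE = 50000;

// Hypothetical estimate of how many archive files a package will produce
function countArchiveParts(totalRecords) {
    return Math.max(1, Math.ceil(totalRecords / PART_SIZE));
}

console.log(countArchiveParts(737));
console.log(countArchiveParts(120000));
```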