Repository Compression

Cloud CMS content is stored within a Repository. A Repository differs from other types of data stores in that it provides Copy-On-Write mechanics using Changeset-driven versioning.

Every time you create, update or delete content within a repository, those adjustments are written onto a new Changeset. Changesets are layered automatically and provide a stack of differences that, over time, allow you to scroll back to any moment in time to see a perfect capture of every modification made. It's a bit like a movie in that Cloud CMS lets you rewind and fast forward, frame-by-frame, to see everything that has happened.

This versioning model makes some very powerful capabilities possible, including:

  • The ability to replicate or copy content between Cloud CMS instances across the globe and between data centers
  • The ability to gain insight into exactly what changes were made historically and by whom
  • The ability to work in multiple branches using forks and merges so that your editorial team can work on long tasks and release them in one fell swoop
  • The ability to provide an audit record around your content and user changes

However, this approach has one downside: it takes more disk space.

Imagine that you create a node and then update it 4 times. In effect, this would transition the node between 5 different states:

N1                          create
N2, N3, N4, N5              update

Suppose all of the states consisted of content that was more or less the same size (X). If you only stored the latest, then the storage footprint would be X. However, with a Cloud CMS Repository, the storage footprint is 5X since all states along the way are retained.

Similarly, suppose that you created a node, updated it 3 times and then deleted it. The state transitions would be:

N1                          create
N2, N3, N4                  update
N5                          delete

Once again, suppose the content is the same size across all of these (X). If you only stored the latest, then the storage footprint would be 0 (since we delete at the end). However, with a Cloud CMS Repository, the storage footprint is still 5X since all states, even deletes, are retained along the way.
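The arithmetic in the two scenarios above can be checked with a small model. This is only an illustrative sketch: the function names, the (operation, size) representation and the value chosen for X are assumptions made for the example, not part of Cloud CMS.

```python
# Illustrative model only: each state is reduced to an (operation, size) pair.

def retained_footprint(history):
    """Total size when every state, including a delete, is retained."""
    return sum(size for _operation, size in history)


def latest_only_footprint(history):
    """Size if only the final state were stored (0 when it ends in a delete)."""
    if not history:
        return 0
    operation, size = history[-1]
    return 0 if operation == "delete" else size


X = 10  # hypothetical size of one state, in arbitrary units

# One create followed by four updates.
scenario_1 = [("create", X)] + [("update", X)] * 4
# One create, three updates, then a delete.
scenario_2 = [("create", X)] + [("update", X)] * 3 + [("delete", X)]

print(retained_footprint(scenario_1), latest_only_footprint(scenario_1))  # 50 10
print(retained_footprint(scenario_2), latest_only_footprint(scenario_2))  # 50 0
```

In both cases the retained footprint is 5X, while the latest-only footprint is X in the first case and 0 in the second.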

Saving Space with Compression

For most users, the extra growth in database size will be negligible. However, depending on how you use Cloud CMS, you could find yourself in situations where the database is growing quickly or is larger than you would like. Fortunately, Cloud CMS provides a way to reduce a Repository's size by running a "compression" tool against it.

Compressing a Repository means that Cloud CMS will look at the changeset history and find states that could be removed without harming the net data structure. In doing this, Cloud CMS will intentionally purge historical data that is no longer necessary in representing the final data. This results in a smaller Repository size but also results in some "historical knowledge" being cleaned out.

A Repository after compression is smaller but it is also less accurate in terms of describing who did what and at what time.

In the first scenario above (one create followed by four updates), the original changeset history looked like this:

N1                          create
N2, N3, N4, N5              update

The compressed changeset history would look like this:

N5                          create

And in the second scenario above (a create, three updates and then a delete), the original changeset history looked like this:

N1                          create
N2, N3, N4                  update
N5                          delete

The compressed changeset history would be empty. The node would be completely removed.
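The effect of compression on a single node's history can be sketched as follows. This is a simplified model that reproduces the two results above, assuming one node on one branch. It is not the algorithm Cloud CMS actually runs, and the function name and data layout are made up for the example.

```python
def compress_history(history):
    """Collapse one node's history to the minimum that yields the same final result.

    history is a list of (state_name, operation) pairs in changeset order.
    """
    if not history:
        return []
    final_state, final_operation = history[-1]
    if final_operation == "delete":
        # The node no longer exists, so nothing needs to be retained.
        return []
    # Only the final state survives, recorded as if it had been created that way.
    return [(final_state, "create")]


scenario_1 = [("N1", "create"), ("N2", "update"), ("N3", "update"),
              ("N4", "update"), ("N5", "update")]
scenario_2 = [("N1", "create"), ("N2", "update"), ("N3", "update"),
              ("N4", "update"), ("N5", "delete")]

print(compress_history(scenario_1))  # [('N5', 'create')]
print(compress_history(scenario_2))  # []
```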

Branches

One of the most powerful features of Cloud CMS is its ability to work with multiple branches of content at once. Branches fork off from previous branches and may then merge back at some future point in time.

Cloud CMS Repository Compression works nicely with branches. Branch fork and merge changesets are handled specially so that the compressed changeset history is still 100% accurate and compatible with future merges and forks.

Suppose you had two branches (M and N) and node create/update operations spread out across changesets like this:

             M1 - M2 - M3 - M4
            /                 \
N1 - N2 - N3 - N4 - N5 - N6 - N7 - N8 - N9

After a compression, the repository might look like this:

             -------------- M4
            /                 \
--------- N3 ---------------- N7 ------ N9
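One way to read the diagrams above is that the changesets anchoring the branch structure survive compression: the fork point, the merge target and the tip of each branch. The sketch below encodes that reading. It is an interpretation of this one example rather than a documented rule, and the data layout and function name are assumptions made for illustration.

```python
# Changesets per branch, in order, matching the diagram.
branches = {
    "N": ["N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8", "N9"],
    "M": ["M1", "M2", "M3", "M4"],
}
forks = {"M": "N3"}   # branch M forks off from changeset N3
merges = {"M": "N7"}  # branch M merges back in at changeset N7


def structural_changesets(branches, forks, merges):
    """Return the changesets that anchor forks, merges and branch tips."""
    keep = set(forks.values()) | set(merges.values())
    keep |= {changesets[-1] for changesets in branches.values() if changesets}
    return sorted(keep)


print(structural_changesets(branches, forks, merges))  # ['M4', 'N3', 'N7', 'N9']
```

The result matches the four changesets left in the compressed diagram.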

Usage

Fundamentally, Repository compression is available via the Cloud CMS REST API. However, you can also run it via our Cloud CMS command line client.

REST API

You can compress an entire repository like this:

POST /repositories/{repositoryId}/compress

Or compress specific nodes by identifying them in the JSON payload:

POST /repositories/{repositoryId}/compress
{
    "nodeIds": ["nodeId1", "nodeId2", ...]
}
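As a rough sketch, the two calls above could be built from Python's standard library like this. The base URL, the bearer-token header and the function name are assumptions for illustration. Check your own installation for the real API address and authentication scheme. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_compress_request(base_url, repository_id, node_ids=None, token=None):
    """Build (but do not send) a POST to /repositories/{repositoryId}/compress."""
    url = f"{base_url.rstrip('/')}/repositories/{repository_id}/compress"
    headers = {}
    body = None
    if node_ids:
        # Restrict compression to specific nodes via the JSON payload.
        body = json.dumps({"nodeIds": list(node_ids)}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if token:
        headers["Authorization"] = f"Bearer {token}"  # assumed auth scheme
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder host; substitute the address of your own API server.
request = build_compress_request(
    "https://api.example.com", "48a3f83668a69517bba1", ["nodeId1", "nodeId2"]
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it. Omitting node_ids produces a request with no body, which corresponds to compressing the entire repository.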

Command Line Client

The Cloud CMS command line tool also makes it easy for you to compress repositories:

cloudcms compress {repositoryId}

Compression Operations and Results

Compression may take some time to run. You should watch the API logs for more information on its status. In addition, the REST API call will give you back a Job ID that you can use to query on the progress and status of the compression job as it runs. The command line tool will wait and let you know when compression has finished.
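Waiting on the returned Job ID amounts to a polling loop. The sketch below keeps the actual lookup abstract, because this page does not show the job endpoint: fetch_state is a hypothetical callable you would supply, and the state names "FINISHED" and "ERROR" are assumptions rather than confirmed values.

```python
import time


def wait_for_job(fetch_state, job_id, done_states=("FINISHED", "ERROR"),
                 interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch_state(job_id) until it reports a terminal state or time runs out."""
    waited = 0.0
    while True:
        state = fetch_state(job_id)
        if state in done_states:
            return state
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still {state!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Demonstration with a fake job that finishes on the third poll.
fake_states = iter(["RUNNING", "RUNNING", "FINISHED"])
result = wait_for_job(lambda job_id: next(fake_states), "job-123",
                      sleep=lambda seconds: None)
print(result)  # FINISHED
```

Injecting the sleep function keeps the loop easy to test. In real use you would leave it at its default.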

When Compression finishes, your Repository will be smaller. However, the allocated size of any DB files will NOT be smaller since MongoDB does not release space on-the-fly. To recover disk space, you will need to either:

  • repair your MongoDB database in the background
  • dump and restore your MongoDB database

Let's look at an example of how the second option is achieved.

Example: Compress a Repository and Recover Disk Space

Suppose we have an existing Cloud CMS installation. When we log in to the MongoDB server, we see that our /data/db directory has gotten quite big and, when we dig in a bit further, we notice that the repository-48a3f83668a69517bba1 folder is rather large. We'd like to compress this repository. First, we note:

  • The Repository ID is 48a3f83668a69517bba1
  • The Database name is repository-48a3f83668a69517bba1

To compress the repository, we first open up a terminal window and then go into the MongoDB bin directory (where we can run one or more of the MongoDB command line tools).

Step 1 - Take a backup

Very important! You should back up your repository before doing anything else. You can do so like this:

./mongodump --db=repository-48a3f83668a69517bba1 --out backup

The backup files will be placed into the backup directory.

Step 2 - Compress the repository

Use the Cloud CMS command line tool to compress the repository:

cloudcms compress 48a3f83668a69517bba1

This will take some time but will let you know when it has finished.

Step 3 - Dump the DB to disk

This will export the database to disk (placing it in the temp directory).

./mongodump --db=repository-48a3f83668a69517bba1 --out temp

Step 4 - Drop the DB

Now let's drop the live DB altogether, clearing it out.

./mongo repository-48a3f83668a69517bba1 --eval "db.dropDatabase()"

Step 5 - Restore the DB

Finally, let's restore the DB from the temp directory.

./mongorestore temp

That's it. When we now look at the disk, we'll see that the repository-48a3f83668a69517bba1 database is in place but is smaller. It was compressed and then exported to temp. Upon re-import, everything was recreated and the net result is a smaller DB since MongoDB has less to allocate.
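The five steps above can be collected into one script. The sketch below uses only the commands shown in this example and prints them by default instead of running them. The function names and the dry-run behavior are assumptions added for illustration, and the script expects to be run from the MongoDB bin directory with the cloudcms tool on the PATH.

```python
import shlex
import subprocess


def recovery_plan(repository_id, backup_dir="backup", temp_dir="temp"):
    """Return the five commands from the example as argument lists, in order."""
    database = f"repository-{repository_id}"
    return [
        ["./mongodump", f"--db={database}", "--out", backup_dir],  # Step 1: backup
        ["cloudcms", "compress", repository_id],                   # Step 2: compress
        ["./mongodump", f"--db={database}", "--out", temp_dir],    # Step 3: dump
        ["./mongo", database, "--eval", "db.dropDatabase()"],      # Step 4: drop
        ["./mongorestore", temp_dir],                              # Step 5: restore
    ]


def run_plan(plan, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in plan:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops at the first failure, so the database is
            # never dropped after a failed backup or dump.
            subprocess.run(command, check=True)


plan = recovery_plan("48a3f83668a69517bba1")
run_plan(plan)  # dry run: prints the commands without executing them
```

Review the printed commands first, then call run_plan(plan, dry_run=False) to execute them.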