Repository Compression
Cloud CMS content is stored within a Repository.
A Repository differs from other types of data stores in that it provides Copy-On-Write
mechanics using Changeset-driven versioning.
Every time you create, update or delete content within a repository, those adjustments are written onto a new Changeset.
Changesets are layered automatically and provide a stack of differences that, over time, allow you to scroll back to any moment in time
to see a perfect capture of every modification made. It's a bit like a movie in that Cloud CMS lets you rewind and fast forward,
frame-by-frame, to see everything that has happened.
This versioning model makes some very powerful capabilities possible, including:
- The ability to replicate or copy content between Cloud CMS instances across the globe and between data centers
- The ability to gain insight into exactly what changes were made historically and by whom
- The ability to work in multiple branches using forks and merges so that your editorial team can work on long tasks and release them in one fell swoop
- The ability to provide an audit record around your content and user changes
However, this approach also has one downside: it takes more disk space.
Imagine that you create a node and then update it 4 times (call this Scenario #1). In effect, this would
transition the node between 5 different states:
N1 create
N2, N3, N4, N5 update
Suppose all of the states consisted of content that was more or less the same size (X). If you only stored the latest,
then the storage footprint would be X. However, with a Cloud CMS Repository, the storage footprint is 5X since all
states along the way are retained.
Similarly, suppose that you created a node, updated it 3 times and then deleted it (call this Scenario #2). The state transitions would be:
N1 create
N2, N3, N4 update
N5 delete
Once again, suppose the content is the same size across all of these (X). If you only stored the latest, then the
storage footprint would be 0 (since we delete at the end). However, with a Cloud CMS Repository, the storage footprint
is still 5X since all states, even deletes, are retained along the way.
Saving Space with Compression
For most users, the extra growth in database size will be negligible. However, depending on how you use
Cloud CMS, you could find yourself in situations where the database is growing quickly or is larger than you would
desire. Fortunately, Cloud CMS provides a way to reduce a Repository's size by running a "compression" tool against it.
Compressing a Repository means that Cloud CMS will look at the changeset history and find states that could be removed
without harming the net data structure. In doing this, Cloud CMS will intentionally purge historical data that
is no longer necessary in representing the final data. This results in a smaller Repository size but also results
in some "historical knowledge" being cleaned out.
A Repository after compression is smaller but it is also less accurate in terms of describing who did what and
at what time.
In Scenario #1, above, the original changeset history looked like this:
N1 create
N2, N3, N4, N5 update
The compressed changeset history would look like this:
N5 create
And in Scenario #2, above, the original changeset history looked like this:
N1 create
N2, N3, N4 update
N5 delete
The compressed changeset history would be empty; the node would be removed completely.
Branches
One of the most powerful features of Cloud CMS is its ability to work with multiple branches of content at once.
A branch forks off from a previous branch and may then merge back at some future point in time.
Cloud CMS Repository Compression works nicely with branches. Branch fork and merge changesets are handled specially
so that the compressed changeset history is still 100% accurate and remains compatible with future merges and forks.
Suppose you had two branches (M and N) and node create/update operations spread out across changesets like this:
           M1 - M2 - M3 - M4
          /                 \
N1 - N2 - N3 - N4 - N5 - N6 - N7 - N8 - N9
After a compression, the repository might look like this:
           ---------------- M4
          /                   \
--------- N3 ----------------- N7 ------ N9
Usage
Fundamentally, Repository compression is available via the Cloud CMS REST API.
However, you can also run it via our Cloud CMS command line client.
REST API
You can compress an entire repository like this:
POST /repositories/{repositoryId}/compress
Or compress specific nodes by identifying them in the JSON payload:
POST /repositories/{repositoryId}/compress
{
"nodeIds": ["nodeId1", "nodeId2", ...]
}
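For example, a compression request might look like the following. This is a minimal sketch: the API host shown
is the Cloud CMS hosted endpoint and the bearer access token is an assumption, so substitute the host and the
authentication details that apply to your own installation.
# Compress two specific nodes within a repository (sketch; fill in real values)
curl -X POST "https://api.cloudcms.com/repositories/{repositoryId}/compress" \
  -H "Authorization: Bearer {accessToken}" \
  -H "Content-Type: application/json" \
  -d '{"nodeIds": ["nodeId1", "nodeId2"]}'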
Command Line Client
The Cloud CMS command line tool also makes it easy for you to compress repositories:
cloudcms compress {repositoryId}
Compression Operations and Results
Compression may take some time to run. You should watch the API logs for more information on its status. In addition,
the REST API call will give you back a Job ID that you can use to query the progress and status of the compression
job as it runs. The command line tool will wait and let you know when compression has finished.
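For instance, assuming a job status endpoint of the form GET /jobs/{jobId} (an assumption; check the API
reference for your version for the exact path), you could poll the job like this:
# Poll the compression job using the Job ID returned by the compress call
curl "https://api.cloudcms.com/jobs/{jobId}" \
  -H "Authorization: Bearer {accessToken}"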
When Compression finishes, your Repository will be smaller. However, the allocated size of any DB files will NOT be
smaller since MongoDB does not release space on-the-fly. To recover disk space, you will need to either:
- repair your MongoDB database in the background
- dump and restore your MongoDB database
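For the first option, MongoDB's repair facility rebuilds the database files and releases unused space. Its exact
behavior and invocation vary across MongoDB versions, so treat the following as a sketch only; it assumes the
/data/db data directory used in the example below and requires that the mongod process be stopped first:
# Rebuild the database files in place (server must be stopped; check your
# MongoDB version's documentation before running this)
./mongod --dbpath /data/db --repair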
Let's look at an example of how the second option is achieved.
Example: Compress a Repository and Recover Disk Space
Suppose we have an existing Cloud CMS installation. When we look at MongoDB, we see that our /data/db
directory has gotten quite big and, when we dig in a bit further, we notice that the
repository-48a3f83668a69517bba1 folder is quite large. We'd like to compress this repository. First, we note:
- The Repository ID is 48a3f83668a69517bba1
- The Database name is repository-48a3f83668a69517bba1
To compress the repository, we first open up a terminal window and then go into the MongoDB bin directory (where we can
run one or more of the MongoDB command line tools).
Step 1 - Take a backup
Very important! You should back up your repository before doing anything else. You can do so like this:
./mongodump --db=repository-48a3f83668a69517bba1 --out backup
The backup files will be placed into the backup directory.
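If you want to sanity-check the backup, you can list the dump directory; mongodump writes a .bson file plus a
metadata file for each collection:
# List the dumped collections for this repository's database
ls backup/repository-48a3f83668a69517bba1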
Step 2 - Compress the repository
Use the Cloud CMS command line tool to compress the repository:
cloudcms compress 48a3f83668a69517bba1
This will take some time but will let you know when it has finished.
Step 3 - Dump the DB to disk
This will export the database to disk (placing it in the temp directory).
./mongodump --db=repository-48a3f83668a69517bba1 --out temp
Step 4 - Drop the DB
Now let's drop the live DB altogether, clearing it out.
./mongo repository-48a3f83668a69517bba1 --eval "db.dropDatabase()"
Step 5 - Restore the DB
Finally, let's restore the DB from the temp directory.
./mongorestore temp
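If you prefer to restore only this database explicitly, rather than everything found under temp, mongorestore
also accepts a target database name and the path to that database's dump directory:
# Restore just the repository database from its dump directory
./mongorestore --db=repository-48a3f83668a69517bba1 temp/repository-48a3f83668a69517bba1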
That's it. When we now look at disk, we'll see that the repository-48a3f83668a69517bba1 database is in place
but is smaller. It was compressed and then exported to temp. Upon re-import, everything was recreated and the
net result is a smaller DB since MongoDB has less to allocate.
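If you'd like to confirm the savings, a quick check is to compare the database's reported size before and after
the procedure using the MongoDB shell (db.stats() reports sizes in bytes by default):
# Print storage statistics for the repository database
./mongo repository-48a3f83668a69517bba1 --eval "db.stats()"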