Health Check / Load Balancers

In a production configuration of Cloud CMS, you will want a load balancer (LB) in front of each of your clusters:
one load balancer for the API cluster and one load balancer for the UI cluster.

Each load balancer is responsible for receiving requests from the outside world and efficiently distributing those
requests to the N servers that make up each cluster.

Load balancers use a number of strategies to try to determine which servers are most readily "available" to handle the
next request. To figure this out, load balancers often make HTTP calls to each server to get back a status code
(200) that indicates the server is healthy and available. More advanced load balancers may make further use of
response information to determine just "how healthy" a server is.

If a load balancer receives negative information about a server (such as non-200 status codes) for a period of time,
it may elect to pull that server out of service. Cloud CMS API and UI servers are stateless, so they can be pulled out
of service, and new servers can be added to their respective clusters, on the fly.
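
The "period of time" behavior described above can be sketched as a counter of consecutive probe results. This is a
minimal illustration and not how any particular load balancer is implemented. The class name ServerHealth and both
threshold values are hypothetical; real load balancers expose their own settings for this.

```python
from dataclasses import dataclass


@dataclass
class ServerHealth:
    """Tracks consecutive probe results for one server (illustrative only)."""

    unhealthy_threshold: int = 3  # assumed: failures in a row before removal
    healthy_threshold: int = 2    # assumed: successes in a row before return
    in_service: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, status_code: int) -> bool:
        """Record one probe result and return whether the server is in service."""
        if status_code == 200:
            self._successes += 1
            self._failures = 0
            if not self.in_service and self._successes >= self.healthy_threshold:
                self.in_service = True
        else:
            self._failures += 1
            self._successes = 0
            if self.in_service and self._failures >= self.unhealthy_threshold:
                self.in_service = False
        return self.in_service
```

With these assumed thresholds, a server is removed after three non-200 responses in a row and returned to service
after two 200 responses in a row. A single failed probe does not remove a server.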

API

When setting up a load balancer with the Cloud CMS API, we recommend the following HTTP endpoint:

GET /healthcheck

This will return a 200 if the server is healthy. The calculation of whether the server is healthy takes
into account a number of internal metrics, including memory usage, CPU utilization, available disk storage and
open file handles. If any of these metrics is deemed to be failing (for example, if memory usage stays high over
a long period of time and is not being released), a non-200 status code may be returned.
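
For testing the endpoint by hand, the check a load balancer performs can be approximated with a small probe. This is
a sketch using only the Python standard library. The function name probe, the base_url argument, the timeout value
and the convention of returning 0 for an unreachable server are all assumptions of this example, not part of
Cloud CMS.

```python
import urllib.error
import urllib.request


def probe(base_url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of GET <base_url>/healthcheck.

    Returns 0 (a convention of this sketch) if the server cannot be reached.
    """
    url = base_url.rstrip("/") + "/healthcheck"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout and similar.
        return 0
```

A caller would treat a return value of 200 as healthy and anything else as unhealthy.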

In addition, this endpoint returns a JSON object with information about the computed metrics. Here is
an example of what that may look like:

{
    "healthy": true,
    "process-cpu-usage": 0.003007125105603858,
    "system-cpu-usage": 0.036103610361036105,
    "memory-usage-percentage": 0.1568696691605085,
    "max-file-handles": 10240,
    "open-file-handles": 701,
    "disk-usage-percentage[0]": 0.9851858949946486,
    "initialized": true,
    "ok": true
}

Where:

  • healthy - whether the server is healthy
  • process-cpu-usage - CPU utilization for Cloud CMS processes (across all CPUs/cores)
  • system-cpu-usage - CPU utilization for Cloud CMS + OS + other processes (across all CPUs/cores)
  • memory-usage-percentage - Memory utilization for Cloud CMS processes
  • max-file-handles - The maximum number of file handles configured
  • open-file-handles - The number of open file handles
  • disk-usage-percentage[x] - Disk/volume utilization for Cloud CMS volume mounts (indexed)
  • initialized - Whether Cloud CMS is running (true) or still starting up (false)
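
For monitoring beyond the status code, the JSON body can be reduced to a few derived values. The sketch below
assumes the field names shown in the example above. The function name summarize and the keys of its result are
hypothetical, and the values are passed through exactly as reported, with no assumption about their units.

```python
import json


def summarize(payload: str) -> dict:
    """Reduce a /healthcheck JSON body to a few derived values (illustrative)."""
    data = json.loads(payload)
    # Disk metrics are indexed, e.g. "disk-usage-percentage[0]"; collect them all.
    disks = [
        value
        for key, value in data.items()
        if key.startswith("disk-usage-percentage[")
    ]
    return {
        "healthy": bool(data.get("healthy", False)),
        "initialized": bool(data.get("initialized", False)),
        # Share of the configured file handle limit currently in use.
        "file_handle_ratio": data["open-file-handles"] / data["max-file-handles"],
        # Highest reported disk value across all volumes, or None if absent.
        "max_disk_usage": max(disks, default=None),
    }
```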

UI

When setting up a load balancer with the Cloud CMS UI, we recommend the following HTTP endpoint:

GET /healthcheck

This will return a 200 if the server is healthy. Unlike the API case, no additional metrics such as disk space,
memory or CPU utilization are taken into account. This is a straightforward check: a 200 is returned if the
server is healthy and a non-200 is returned if it is not.