Gitana 4 Roadmap – Job queue performance and management
Jul 13
With the early arrival of Gitana 4.0, we’ve delivered a number of important improvements to our customers. The user interface has been enhanced to provide a better editorial experience, the publishing engine now uses deltas and differencing for faster releases with smaller payloads, and we’ve baked both generative and discerning AI services into the foundation of our product, just to name a few.
In this article, I’d like to provide some insight into our future direction. Specifically, I’d like to highlight our active investments into our Distributed Job Engine.
Distributed Job Engine
The Distributed Job Engine is a cluster-wide service that spawns and coordinates workers to execute long-running tasks or jobs. These are sometimes thought of as background tasks. The Job Engine is used to coordinate publishing and deployment operations, perform real-time replication, integrate with AI services, index content for full-text search, transform MIME type payloads from one format to another, extract text and metadata from files, classify and tag documents into taxonomies and more.
While the Job Engine is efficient and horizontally scalable in 4.0, we have identified avenues for improvement that are truly exciting. These include scheduling improvements, the introduction of fast lane assignment, dynamic reallocation, predictive routing and enhancements to reporting, events and notifications.
Scheduling
Our support for priority queues will improve to allow for a configurable, rules-based assignment of resource requirements and limits to individual jobs. This will allow the scheduler to allocate jobs based not only on priority but also on required service levels and resource needs. As a result, the scheduler will be able to place workers for higher priority jobs onto pods that guarantee a required service level and affinity (i.e. adequate CPU and memory to tackle the task at hand).
When all is said and done, developers will be able to launch jobs that schedule with higher or lower priority and execute with much less variation in quality of service.
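To make the idea concrete, here is a minimal sketch of rules-based scheduling that considers both priority and resource needs. It is not Gitana's API: the names `Pod`, `Job`, `RULES`, `apply_rules` and `Scheduler`, along with the sample job types and numbers, are assumptions made up for illustration.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Pod:
    name: str
    free_cpu: float        # CPU cores still available on the pod
    free_memory_mb: int    # memory still available on the pod


@dataclass
class Job:
    job_id: str
    job_type: str
    priority: int = 0      # higher values are scheduled first
    cpu: float = 0.5       # CPU cores the job requires
    memory_mb: int = 256   # memory the job requires


# Hypothetical rules: job type -> priority and resource requirements.
RULES = {
    "publish": {"priority": 10, "cpu": 2.0, "memory_mb": 4096},
    "index": {"priority": 1, "cpu": 0.5, "memory_mb": 512},
}


def apply_rules(job_id, job_type):
    """Build a job, filling in priority and requirements from the rules."""
    return Job(job_id, job_type, **RULES.get(job_type, {}))


class Scheduler:
    def __init__(self, pods):
        self.pods = pods
        self._heap = []
        self._seq = count()  # tie-breaker keeps equal priorities in FIFO order

    def submit(self, job):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-job.priority, next(self._seq), job))

    def assign_next(self):
        """Place the highest priority job on the first pod that can hold it.

        Returns (job, pod), or None if the queue is empty or no pod fits.
        """
        if not self._heap:
            return None
        _, _, job = self._heap[0]
        for pod in self.pods:
            if pod.free_cpu >= job.cpu and pod.free_memory_mb >= job.memory_mb:
                heapq.heappop(self._heap)
                pod.free_cpu -= job.cpu
                pod.free_memory_mb -= job.memory_mb
                return job, pod
        return None  # the job stays queued until capacity frees up
```

In this sketch the developer submits only a job type; the rules decide how urgent the job is and how much CPU and memory its pod must offer.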
Fast Lanes / Multiple Queues
Our scheduling improvements will also include a simpler model: the notion of “fast lanes”. In effect, these are separate queues whose parameters are specified in the queue configuration itself. This frees developers from having to assign those parameters at the time that a job is submitted.
Customers will be able to separate out “fast lane” queues that automatically allocate to pods with more memory and more available resources. Some queues can be configured to take priority over others. This makes it easy for customers to monitor the quality of service of their executing jobs and make adjustments at the queue level to accommodate variations in demand.
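As a rough illustration of queue-level configuration, the sketch below keeps priority and pod requirements on the queue rather than on each job. The `QueueConfig` fields and the `LaneRouter` class are hypothetical and are not taken from the Gitana configuration format.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueueConfig:
    name: str
    priority: int            # queues with higher priority are drained first
    min_pod_memory_mb: int   # only pods with at least this much memory serve the queue


class LaneRouter:
    def __init__(self, configs):
        self.configs = {c.name: c for c in configs}
        self.queues = {c.name: deque() for c in configs}

    def submit(self, queue_name, job_id):
        # The caller only names a queue; no per-job parameters are needed.
        self.queues[queue_name].append(job_id)

    def next_job(self, pod_memory_mb):
        """Return (queue_name, job_id) for a pod asking for work, or None."""
        eligible = sorted(
            (c for c in self.configs.values()
             if pod_memory_mb >= c.min_pod_memory_mb),
            key=lambda c: c.priority,
            reverse=True,
        )
        for cfg in eligible:
            queue = self.queues[cfg.name]
            if queue:
                return cfg.name, queue.popleft()
        return None
```

Adjusting quality of service then becomes a matter of editing one `QueueConfig` instead of changing every call site that submits jobs.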
Dynamic Reallocation
Workers that execute in the cluster can transition jobs into different states. They can even pause jobs, interrupt them or reschedule them. However, when priority work arrives, long-running and lower priority jobs sometimes need to be not only paused, but reallocated onto different pods running in the cluster.
Dynamic reallocation provides the ability for a job to be paused, have its state passivated and then be remounted onto a new pod running elsewhere in the cluster, either immediately or at a later point in time.
While this ability exists for some job types in Gitana 4.0, we will be extending it to all job types. This will support some of our improvements to priority scheduling by allowing the scheduler to query, interpret and potentially reallocate jobs that are already in-flight.
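The pause, passivate and remount cycle can be sketched as follows. This is only an illustration under assumptions: the `RunningJob` shape, the state names and the use of JSON for the passivated snapshot are invented here and say nothing about how Gitana stores job state.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RunningJob:
    job_id: str
    pod: str
    state: str = "RUNNING"
    progress: dict = field(default_factory=dict)  # checkpoint data needed to resume


def passivate(job):
    """Pause the job and serialize its checkpoint so it can leave its pod."""
    job.state = "PAUSED"
    return json.dumps({"job_id": job.job_id, "progress": job.progress})


def remount(snapshot, target_pod):
    """Restore a passivated job on another pod and resume it."""
    data = json.loads(snapshot)
    return RunningJob(data["job_id"], target_pod, "RUNNING", data["progress"])
```

Because the snapshot is plain data, it can be remounted right away or held until a suitable pod becomes available.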
Predictive Routing
With additional metrics being gathered for jobs and executing workers, we will see increased usage of predictive artificial intelligence models to make determinations about optimal scheduling.
These models use historical information about the past performance of jobs to make future decisions on how best to allocate jobs onto worker pods. These decisions incorporate predictions about a job that take into account factors like potential execution time, memory and CPU consumption.
For jobs that execute on content in branches, these predictive services will also aid scheduling decisions that are predicated on branch locks, the number of content items being operated upon and more. These factors will play an important role in increasing parallelism and improving throughput for operations that would otherwise block based on branch-level locking.
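As a simplified illustration of using past runs to guide placement, the sketch below substitutes a plain historical average for a real predictive model and ignores branch locks entirely. `HistoryEstimator`, `choose_pod` and the sample metrics are assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean


class HistoryEstimator:
    def __init__(self):
        # job type -> list of (duration in seconds, peak memory in MB)
        self._runs = defaultdict(list)

    def record(self, job_type, seconds, peak_memory_mb):
        self._runs[job_type].append((seconds, peak_memory_mb))

    def estimate(self, job_type):
        """Return (expected seconds, expected peak memory), or None if unseen."""
        runs = self._runs.get(job_type)
        if not runs:
            return None
        # Average duration, but the worst observed memory use to stay safe.
        return mean(r[0] for r in runs), max(r[1] for r in runs)


def choose_pod(estimate, pods):
    """Pick the smallest pod whose free memory covers the estimate.

    pods maps a pod name to its free memory in MB.
    """
    if estimate is None:
        return None
    _, memory_mb = estimate
    fitting = [(free, name) for name, free in pods.items() if free >= memory_mb]
    return min(fitting)[1] if fitting else None
```

Choosing the smallest pod that fits leaves larger pods free for jobs that are predicted to need them.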
Reporting, Events and Notifications
The additional metrics collected will be available via the API and from within the user interface. Customers will be able to inspect individual jobs, as they can now, but they’ll also be able to inspect queues to understand and validate the intended quality of service for any individual queue.
Each queue will allow for custom limits and event handlers to be configured. When a queue’s measured quality of service crosses one of those limits, an event is raised and triggers the configured event handler.
Customers can use this feature to send notifications (such as an email, Slack notification or SMS message). Or they can configure automated actions or even server-side scripted code that runs and handles the event as they see fit.
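A queue limit with an attached event handler might look something like the sketch below. The `QueueMonitor` class, the `wait_seconds` metric and the event fields are hypothetical; the handler stands in for whatever sends the email, Slack or SMS notification.

```python
from dataclasses import dataclass, field


@dataclass
class QueueMonitor:
    queue_name: str
    max_wait_seconds: float                       # the configured limit
    handlers: list = field(default_factory=list)  # callables invoked on each event

    def on_limit_exceeded(self, handler):
        self.handlers.append(handler)

    def observe(self, wait_seconds):
        """Check an observed wait time; raise an event if it exceeds the limit.

        Returns True if an event was raised.
        """
        if wait_seconds <= self.max_wait_seconds:
            return False
        event = {
            "queue": self.queue_name,
            "metric": "wait_seconds",
            "value": wait_seconds,
            "limit": self.max_wait_seconds,
        }
        for handler in self.handlers:
            handler(event)
        return True


# Usage: collect a message instead of sending a real notification.
sent = []
monitor = QueueMonitor("fast", max_wait_seconds=30.0)
monitor.on_limit_exceeded(lambda e: sent.append(f"{e['queue']} waited {e['value']}s"))
monitor.observe(45.0)
```

A handler is just a callable, so the same hook could trigger an automated action such as adding workers to the queue.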
Summary
We’re really excited at Gitana about these features. These improvements to scheduling will result in increased throughput and even better performance for our customers. We’re also excited to give our customers more control and visibility into their job executions.