
How to automate Day-2 actions in live environments

August 16, 2024

Day-2 operations are critical to maintaining the user experience and security of production environments, but carrying them out without disrupting the performance and stability of a live environment is difficult.

Not surprisingly, balancing these considerations and implementing Day-2 operations often requires extensive manual work from the DevOps and Site Reliability Engineers responsible for maintaining the application environment.

This article will show how Quali Torque users automate complex Day-2 actions to streamline these efforts and improve performance, stability, and security of production environments—all without pausing or otherwise disrupting the operation of those environments.

By the end of this article, you’ll be able to ensure continuous optimization of your production environments.

Step 1. Create your Environment as Code

The Environment as Code approach provides the scalability and ease of use required to automate the complex steps involved in Day-2 actions.

To create an Environment as Code, Torque users need to:

  • Connect their Torque account to the git repositories that contain the application resources needed to generate the environment, including Infrastructure as Code (IaC) modules and other services, by providing the public URL for each repository.
  • Add those resources to the Asset Library in their Torque account so they can be used to define the Environment as Code template. Torque automatically discovers the IaC modules and other resources in the repositories and “normalizes” them by wrapping them in YAML, so they can interact with one another more seamlessly.
  • Using Torque’s AI infrastructure orchestration tool, submit natural-language AI prompts describing the cloud resources and configurations needed to generate the application environment. Torque automatically creates the Environment as Code template—in a new YAML file—containing those resources and defining the parameters and dependencies needed to provision the environment.
  • Review, modify (if needed), and save the Environment as Code template. From there, the user can launch the environment using Torque’s native UI and choose to export and operationalize it via their git repositories.
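
To make the idea of "parameters and dependencies" in a template more concrete, here is a minimal Python sketch of how a provisioning order can be derived from declared dependencies. It is an illustration only, not Torque's implementation or its YAML template format; the resource names and the `provision_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical environment: each resource maps to the set of
# resources that must exist before it can be provisioned.
ENVIRONMENT = {
    "vpc": set(),
    "database": {"vpc"},
    "app_server": {"vpc", "database"},
    "load_balancer": {"app_server"},
}


def provision_order(resources: dict[str, set[str]]) -> list[str]:
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(resources).static_order())


if __name__ == "__main__":
    for step, name in enumerate(provision_order(ENVIRONMENT), start=1):
        print(f"{step}. provision {name}")
```

If the declared dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`, which is one reason a single declarative definition of the environment is useful: inconsistencies surface before anything is provisioned.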

Here’s an image of an example

Quali Torque leverages the user’s IaC modules to define complete application environments in code, which can be deployed, monitored, and maintained continuously.

Defining your Environment as Code provides a single source of truth for all infrastructure and services powering your application.

This makes it easier to perform actions on those resources.

For a brief intro to AI-driven orchestration of Environments as Code, watch this demo:

Step 2. Define your Day-2 actions as code

To start automating your Day-2 actions, you first need to define the action in code. This is accomplished through Workflows in Torque.

Similar to Environment as Code templates, Workflows are defined in YAML and can be synced with the user’s git repositories.

The Workflow defines an action that can be automated. For example:

  • Reboot Cloud Instance: Restart cloud instances automatically, which can enable automated recovery procedures and maintain consistency for reboots across all environments.
  • Generate Temporary Token: Bolster security while simplifying temporary access management by automating the creation of secure, time-limited access tokens for various services.
  • Service Health Check: Improve stability and performance with an automated health check on services in your environment, which can detect issues disrupting performance and trigger automated remediation actions.
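
As a rough illustration of the kind of logic a "Generate Temporary Token" action might wrap, the following Python sketch issues a random token with an expiry time and checks whether it is still valid. It assumes nothing about how Torque or any particular service issues tokens; `TemporaryToken`, `generate_temporary_token`, and `is_valid` are hypothetical names.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporaryToken:
    value: str
    expires_at: float  # Unix timestamp, in seconds


def generate_temporary_token(ttl_seconds: int, now: Optional[float] = None) -> TemporaryToken:
    """Create a random token that expires ttl_seconds after it is issued."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    issued_at = time.time() if now is None else now
    return TemporaryToken(
        value=secrets.token_urlsafe(32),  # 32 random bytes, URL-safe text
        expires_at=issued_at + ttl_seconds,
    )


def is_valid(token: TemporaryToken, now: Optional[float] = None) -> bool:
    """A token is valid strictly before its expiry time."""
    current = time.time() if now is None else now
    return current < token.expires_at
```

The optional `now` parameter exists only so the expiry logic can be tested deterministically; in normal use both functions fall back to the system clock.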

You can check out more example Workflows in the Torque documentation.

To create a Workflow, a user with admin-level permissions in Torque only needs to create a YAML file in the platform’s native UI that defines the scope. By setting the scope, the Torque administrator can automate actions on an entire application environment or on individual resources within the environment.
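
The effect of scope can be sketched in a few lines of Python. This is a conceptual model only: the scope values `"environment"` and `"resource"`, the `Environment` class, and `select_targets` are assumptions made for illustration, not Torque's YAML keywords or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Environment:
    name: str
    resources: list[str] = field(default_factory=list)


def select_targets(env: Environment, scope: str,
                   resource_names: Optional[list[str]] = None) -> list[str]:
    """Return the resources an action should run against for a given scope."""
    if scope == "environment":
        # Environment scope: the action applies to every resource.
        return list(env.resources)
    if scope == "resource":
        # Resource scope: the action applies only to the named resources.
        wanted = set(resource_names or [])
        unknown = wanted - set(env.resources)
        if unknown:
            raise ValueError(f"unknown resources: {sorted(unknown)}")
        return [r for r in env.resources if r in wanted]
    raise ValueError(f"unsupported scope: {scope!r}")
```

For example, `select_targets(env, "resource", ["web-1"])` would limit a reboot action to a single instance, while `select_targets(env, "environment")` would apply it to everything in the environment.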

Here is an example of a Workflow set to automatically attach Amazon EBS volumes to an active Amazon EC2 instance (note—admins can choose to run separate Workflows to detach EBS volumes automatically as well).

Quali Torque users can define actions that can be performed on live environments on an ad-hoc basis and automated in response to custom triggers.

Step 3. Set triggers to automate the Day-2 actions in your Workflows

Administrators can also set the triggers for when the action in a Workflow is initiated.

This includes setting recurring schedules based on cron jobs set by the admin, for example initiating a service health check once per day at a specified time. These schedules can be made overridable so that users can deactivate an action that would disrupt what they are working on.
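
The following Python sketch shows how a cron-style schedule and a user override could combine. It is deliberately simplified and is not how Torque evaluates schedules: it supports only `*` and single integers in each of the five cron fields (no ranges, lists, or steps) and requires every field to match. The `is_due` and `should_run` helpers and the `disabled_by_user` flag are hypothetical.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match a single cron field that is either '*' or one integer."""
    return field == "*" or int(field) == value


def is_due(cron_expr: str, when: datetime) -> bool:
    """Check whether a five-field cron expression matches the given minute."""
    minute, hour, day, month, weekday = cron_expr.split()
    # Cron counts Sunday as 0; datetime.weekday() counts Monday as 0.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day, when.day)
        and _field_matches(month, when.month)
        and _field_matches(weekday, cron_weekday)
    )


def should_run(cron_expr: str, when: datetime, disabled_by_user: bool = False) -> bool:
    """A scheduled action runs only if it is due and has not been overridden."""
    return not disabled_by_user and is_due(cron_expr, when)
```

With this sketch, the expression `0 6 * * *` corresponds to "once per day at 06:00", and passing `disabled_by_user=True` models a user switching off a scheduled action that would interrupt their work.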

Admins can also automate the execution of a Workflow in response to an event, such as:

  • Drift Detected: If a configuration in an application environment drifts from its intended state.
  • Updates Detected: If resources in an environment have been updated.
  • Approval Request Approved/Denied: If an attempted action in Torque requires approval from an administrator—say, for example, someone attempts to run an unapproved cloud instance size—this trigger will enact a Workflow in response to the approval or denial of that attempt.
  • Environment Active with Error: If a resource required to run an application environment encounters an error.
  • Environment Ending Failed: If the commands executed to terminate the resources in an environment fail. A Workflow run in response can help ensure the resources do not sit idle.
  • Environment Idle: Torque monitors all resources deployed via the user’s Environment as Code templates and identifies those that are “idle,” meaning they are not actively supporting a workload. A Workflow run in response can take automatic action, such as the “power-off VM” action.
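
Conceptually, event-based triggers map an event name to the Workflows that should run when it occurs. The Python sketch below models that mapping with a small registry. The event names come from the list above, while the `WorkflowTriggers` class and the `power_off_vm` function are hypothetical stand-ins, not Torque's API.

```python
from collections import defaultdict
from typing import Callable

Workflow = Callable[[str], str]  # takes an environment name, returns a status message


class WorkflowTriggers:
    """Registry that maps event names to the workflows they trigger."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Workflow]] = defaultdict(list)

    def on(self, event: str, workflow: Workflow) -> None:
        """Register a workflow to run whenever the event occurs."""
        self._handlers[event].append(workflow)

    def emit(self, event: str, environment: str) -> list[str]:
        """Run every workflow registered for the event, in registration order."""
        return [workflow(environment) for workflow in self._handlers.get(event, [])]


def power_off_vm(environment: str) -> str:
    # Placeholder: a real action would call the cloud provider here.
    return f"power-off VM requested for {environment}"


if __name__ == "__main__":
    triggers = WorkflowTriggers()
    triggers.on("Environment Idle", power_off_vm)
    print(triggers.emit("Environment Idle", "staging"))
```

An event with no registered Workflow simply produces an empty list, so new triggers can be added without changing the code that emits events.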

In addition to cron- and event-based triggers, Workflows can also be executed manually by any user with access to the environment. This lets users initiate ad hoc Day-2 actions with the click of a button, and it provides an option for actions that do not need to be scheduled or triggered by an event but that can still be executed more efficiently as a Workflow than by hand.

Automating Day-2 actions can have a substantial impact not only on the health of an application but also on DevOps productivity and engineer burnout.

To learn more about how to automate DevOps processes with Quali Torque, watch this brief demo video: