One of the most common and frustrating challenges DevOps teams face is keeping cloud VMs from running when no one needs them.
Otherwise known as zombie infrastructure, idle cloud VMs are those that run for prolonged periods when no one is using them. These often rank among the biggest sources of wasted cloud budget.
Testing, staging, and sales demo environments are common examples. Often needed for only a brief period of time, these environments may be left running for days, racking up cloud costs in the process.
Here we’ll show how to use Quali Torque to detect, terminate, and prevent idle cloud resources.
Pre-scheduling launch and termination of cloud VMs
In Quali Torque, administrators can set workflows to schedule automatic start and stop times for all VMs that a specific team uses.
Since developers and engineers deploy environments via Quali Torque’s self-service catalog, the platform can automate the launch and termination of the cloud resources supporting those environments based on the workflow.
For instance, workflows can turn on VMs at the beginning of the workday (e.g. 9 AM) and off again at the end of the day (6 PM) from Monday to Friday. This approach ensures that resources are active when needed while preventing unnecessary costs.
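As a rough illustration of the schedule described above, the decision a workflow makes at any given moment could be sketched as follows. This is plain Python, not Quali Torque's workflow syntax; the constant and function names are hypothetical, and the hours are simply the 9 AM to 6 PM, Monday to Friday example.

```python
from datetime import datetime, time

# Hypothetical values taken from the example schedule; not Torque settings.
WORKDAY_START = time(9, 0)    # 9 AM
WORKDAY_END = time(18, 0)     # 6 PM
WORKDAYS = {0, 1, 2, 3, 4}    # datetime.weekday(): Monday=0 ... Friday=4


def should_be_running(now: datetime) -> bool:
    """Return True if the team's VMs should be powered on at local time `now`."""
    in_workweek = now.weekday() in WORKDAYS
    in_work_hours = WORKDAY_START <= now.time() < WORKDAY_END
    return in_workweek and in_work_hours
```

A scheduler would evaluate a check like this at the start and stop boundaries and power the VMs on or off accordingly.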
Detecting and terminating idle cloud VMs
Many of our users have embraced these workflows to prevent unnecessary costs when ephemeral environments like testing and staging are left running overnight or on the weekends, when no one needs them.
But scheduling alone does not catch everything, and some of our users have extended these workflows to eliminate resources that sit idle during the workday as well.
Setting a workflow to power on all cloud VMs at the beginning of the workday ensures uptime for every environment that developers may need, but it can still create idle cloud resources for teams with a large number of environments. Any environment that goes unused incurs costs for as long as its cloud VMs are running.
The first step is to review all actively deployed environments. Quali Torque provides a view into all environments your team runs, allowing you to see those that have been terminated and those that are still running.
By sorting all environments based on “last access,” you can identify and terminate those that may be running when no one is using them. You can also see the owner and any other collaborators for each environment, so you can ask them when they might need it again.
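The manual review above amounts to filtering for running environments and sorting them by last access. A minimal sketch, assuming each environment is exported as a simple record (the `Environment` fields and the `idle_candidates` function are hypothetical and do not reflect Torque's data model):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Environment:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    owner: str
    status: str            # "running" or "terminated"
    last_access: datetime


def idle_candidates(envs, now, idle_after=timedelta(days=1)):
    """Return running environments not accessed within `idle_after`, stalest first."""
    stale = [
        e for e in envs
        if e.status == "running" and now - e.last_access >= idle_after
    ]
    return sorted(stale, key=lambda e: e.last_access)
```

Each result carries its owner, so the list doubles as a set of people to check with before terminating anything.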
The next step is to automate this process.
By adjusting the workflow that deploys cloud VMs at the beginning of the workday, you can exclude environments that have been inactive for a specific period of time. For example, setting this threshold to 1 day ensures the platform deploys cloud VMs only for environments that have been active recently, while preventing unnecessary costs from idle resources.
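The inactivity condition can be pictured as a filter applied before the morning power-on. The sketch below is only an illustration of that logic in plain Python; `plan_morning_start` and its inputs are hypothetical, and the 1-day threshold mirrors the example above.

```python
from datetime import datetime, timedelta


def plan_morning_start(last_access_by_env, now, max_inactive=timedelta(days=1)):
    """Split environments into those to power on and those to skip.

    `last_access_by_env` maps an environment name to its last access time.
    An environment is skipped once it has been inactive for `max_inactive` or longer.
    """
    to_start, to_skip = [], []
    for name, last_access in last_access_by_env.items():
        if now - last_access < max_inactive:
            to_start.append(name)
        else:
            to_skip.append(name)
    return sorted(to_start), sorted(to_skip)
```

Skipped environments are not deleted in this sketch; they simply stay powered off until someone asks for them.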
Day-2 actions to run cloud VMs and request extensions to schedules
So, what happens if a user needs to run an environment that had been identified as inactive and was not deployed along with the others?
In the platform, the user can manually deploy each cloud VM defined in the template for the environment.
Because the administrator configured the environment, and the workflow will still decommission its VMs at the end of the workday, the developer can deploy it without risking misconfigured or zombie infrastructure.
If, at the end of the workday, the user needs to run an environment outside of normal work hours, they can manually delay the automated workflow execution and adjust it according to their specific needs.
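Conceptually, an extension just pushes back the scheduled shutdown for one environment. A minimal sketch of that idea, using hypothetical function names rather than any Torque API:

```python
from datetime import datetime, timedelta
from typing import Optional


def shutdown_time(scheduled: datetime, extension: Optional[timedelta] = None) -> datetime:
    """Return the effective shutdown time, applying an optional user-requested delay."""
    if extension is None:
        return scheduled
    if extension < timedelta(0):
        raise ValueError("extension must not be negative")
    return scheduled + extension


def is_due_for_shutdown(now: datetime, scheduled: datetime,
                        extension: Optional[timedelta] = None) -> bool:
    """Return True once the (possibly extended) shutdown time has been reached."""
    return now >= shutdown_time(scheduled, extension)
```

Because the extension is added to the existing schedule rather than replacing it, the environment still shuts down automatically once the extra time runs out.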
This flexibility allows teams to align cloud costs with actual day-to-day needs.
Learn more about optimizing cloud operations and costs with Quali Torque here.