◈ Torque Operate

Your infrastructure doesn’t stop at deployment.
Neither does the intelligence watching it.

Operate is the Day 2 intelligence layer that continuously monitors everything your organization has deployed, reasons about what it finds, and acts autonomously within the boundaries your team defines. Not scheduled scans. Not rigid workflows that fork on conditions. Genuine operational intelligence that evaluates context, determines the right action, and executes, before problems become incidents and waste becomes a budget crisis.

Works with your existing monitoring and cloud tools

Everyone talks about Day 2.
Almost no one actually solves it.
Operate does.

Day 0 is planning. Day 1 is deployment. Day 2 is everything after, and it never ends. Configuration drifts silently. Idle resources accumulate cost no one approved. Security gaps widen between environments and their IaC definitions. Compliance audits surface problems that have existed for months. Operate gives SREs, platform engineers, and FinOps teams continuous, real-time visibility and automated intelligence across every environment they are responsible for, from the moment it deploys to the moment it terminates.

Three Day 2 realities that compound silently and surface at the worst possible moment

Configuration drift goes undetected until it causes an outage or a failed audit
  • Every environment changes over time. Patches applied manually. Settings tweaked under pressure. Temporary fixes become permanent
  • Without continuous monitoring, the gap between what an environment is supposed to be and what it actually is grows silently
  • No one notices until the environment behaves unexpectedly in production, or the auditor asks why the running configuration does not match the IaC definition in Git
Cloud waste accumulates invisibly because no one has real-time visibility across every running environment
  • Idle GPU clusters, forgotten staging environments, orphaned resources from terminated projects each represent a small cost
  • At scale, across hundreds of environments and dozens of teams, the total is substantial
  • Without real-time cost attribution by environment, team, and user, nobody is accountable, and the monthly cloud bill arrives long after the waste occurred
Out-of-band changes bypass governance, creating security and compliance gaps nobody tracks
  • Someone needs to make a change quickly, so they log directly into the cloud console, make the change, and move on
  • The IaC definition is not updated. The change is not logged in the platform. The environment now runs a configuration that was never approved
  • Without an approval and audit layer catching these changes, they accumulate until a breach or a compliance failure makes them visible

One control plane. Complete visibility, real-time intelligence, and agents that act rather than alert.

  • Operate monitors every environment Curate has inventoried and Self-Service has deployed. It detects drift the moment it occurs, attributes cost to the team responsible, and gives the AI Copilot the context to act rather than alert.
  • Operate is the control plane your SREs, platform engineers, and FinOps teams work from. Every environment, every cost anomaly, every compliance gap, in one real-time view. It surfaces what matters, to the right team, before it becomes a problem.
  • The agents within Operate don’t follow a fixed decision tree. They reason about context, evaluate options, and determine the right action within the policy boundaries your platform team defined. That is the difference between automation and autonomy.

Three capabilities that turn Day 2 from a reactive struggle into a proactive discipline

01
Drift detection and remediation

The moment an environment deviates from its blueprint, Operate knows and can act
Operate continuously compares every live environment against its IaC specification. When something changes outside the governed process (a console action, a manual patch, or an out-of-band script), the deviation is detected in real time, with full detail on what changed, what it should be, and what depends on it.

  • Continuous comparison of live state against IaC blueprint, with instant dashboard surfacing
  • Auto-remediation re-aligns the environment, or routes to an approval workflow for human sign-off
  • Dependency mapping shows every environment affected by a drifted resource

02
Real-time cost intelligence

Every dollar attributed, every waste source identified, before the bill arrives
Cost attribution is applied at the point of provisioning: every environment is tagged to the team, project, and business unit that deployed it. Operate surfaces real-time spend data across the full estate continuously, not in a monthly review.

  • Idle and zombie resource detection flags underutilized environments before waste accumulates
  • Cost policy enforcement blocks deployments that would exceed defined spend limits at launch
  • AI workload anomaly detection catches unexpected GPU utilization spikes before they become runaway costs

03
Approval workflows and compliance audit trail

Every change governed, every action logged, every audit ready without manual effort
Out-of-band changes trigger approval workflows before they enter the platform record. Every action on every environment (who launched it, modified it, extended its TTL, or destroyed it) is logged automatically. The compliance record is structural, not dependent on human discipline.

  • Production blueprints require designated approver sign-off with configurable SLA timers
  • Complete, timestamped audit trail generated by the platform, not assembled manually
  • Tag enforcement blocks deployment until required metadata is applied
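
The two launch-time gates described above (cost policy enforcement and tag enforcement) can be pictured as a single pre-deployment check. The sketch below is illustrative only and is not Torque's API: the function name `deployment_violations`, the `REQUIRED_TAGS` tuple, and the budget parameters are hypothetical. The tag names follow the team, project, and business unit attribution described on this page.

```python
# Hypothetical pre-deployment gate; names and parameters are assumptions.
REQUIRED_TAGS = ("team", "project", "business_unit")


def deployment_violations(
    tags: dict[str, str], estimated_cost: float, remaining_budget: float
) -> list[str]:
    """Return the reasons a launch should be blocked; an empty list means it may proceed."""
    # Tag enforcement: every required tag must be present and non-empty.
    violations = [
        f"missing required tag: {name}" for name in REQUIRED_TAGS if not tags.get(name)
    ]
    # Cost policy enforcement: the launch must fit within the remaining budget.
    if estimated_cost > remaining_budget:
        violations.append(
            f"estimated cost {estimated_cost:.2f} exceeds remaining budget {remaining_budget:.2f}"
        )
    return violations
```

Returning every violation at once, rather than stopping at the first, lets the requester fix all missing metadata in one pass.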

What Operate delivers

Every environment you deploy needs to be monitored, optimized, and kept compliant after launch. Operate closes the gap between provisioning and production reality, automatically.

01
DRIFT DETECTION AND REMEDIATION

The moment an environment deviates from its blueprint, Operate knows and acts.

  • Operate continuously compares every live environment against its IaC specification, detecting configuration drift, manual changes, and out-of-band edits the moment they occur
  • Auto-remediation re-aligns the environment or routes it to an approval workflow for human sign-off, based on the policy your team defined
  • Every environment affected by a shared resource change is identified and surfaced immediately, with full context on what changed and what depends on it

The AI Copilot has full visibility into drift state. When it advises on a deviation, it reasons about the specific deviation, its impact across dependent environments, and the appropriate action for that particular context. The same type of drift in two different environments may warrant different responses. The Copilot understands the difference.

02
REAL-TIME COST INTELLIGENCE

Every dollar attributed, every waste source identified, before the bill arrives.

  • Cost attribution is applied at the point of provisioning. Every environment is tagged to the team, project, and business unit that deployed it, with real-time spend visibility, not monthly review
  • Cost policy enforcement blocks deployments that would exceed defined spend limits at launch
  • The AI agent monitors continuously, evaluates the cost and utilization of each resource, and flags idle and zombie resources before they generate unnecessary spend

The AI Copilot does not execute a cost optimization script on a schedule. It monitors continuously, evaluates the cost and usage profile of each idle resource in context, and recommends the appropriate action based on what it knows about that environment and the team that owns it. Governed boundaries define the limits of what it can do. Within those limits, the decision is the agent’s.

03
APPROVAL WORKFLOWS AND COMPLIANCE AUDIT TRAIL

Every change governed, every action logged, every audit ready without manual effort.

  • Out-of-band changes trigger approval workflows before they enter the platform record. Every action on every environment (who launched it, modified it, extended its TTL, or destroyed it) is logged automatically
  • Production blueprints require designated approver sign-off with configurable SLA timers, so nothing bypasses review
  • A complete, timestamped audit trail is generated by the platform, not assembled manually, and is available for compliance review at any time

Drift detected, impact mapped, remediation proposed, in under 60 seconds

Most teams discover configuration drift when something breaks, or when an auditor asks. This video shows how Operate detects a live deviation from IaC specification, surfaces the affected environment and its dependencies, and proposes a remediation action, before the problem has any impact.

How it works

From deployment to continuous governance, across every environment your organization runs

Six operational capabilities. From real-time drift detection to AI-driven cost optimization, running continuously across your entire infrastructure estate.

01
Unified operations dashboard

Every running environment, every team, every space, visible from a single screen

The Operate dashboard surfaces four estate-wide metrics at a glance: active deployments, drifted environments, environments with pending blueprint updates, and IaC asset inventory. Platform engineers and SREs see every running environment across all spaces and teams, with status, owner, cost, and TTL. Quick filters surface errors only, environments approaching expiry, or environments by team. The view is real-time. There are no scheduled refreshes and no stale data. If something changes in a running environment, Operate reflects it immediately.

02
Continuous drift detection

Not just “something changed,” but exactly what changed, what it should be, and what it affects

Operate compares every live environment against its IaC specification continuously. When a deviation is detected, the platform identifies the specific configuration item that changed, the expected value from the blueprint, the current live value, and every other environment that depends on the affected resource. Drift is surfaced immediately in the dashboard with a drifted environments counter. SREs are not hunting through logs to understand what happened. The platform tells them precisely what has diverged and by how much.

The AI Copilot reviews drift events in context, maps downstream impact, and proposes the specific remediation action. For known drift patterns, it can remediate automatically within the policy boundaries the platform team configured.
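
To make the comparison concrete, here is a minimal sketch of the idea, assuming both the blueprint and the live environment can be flattened into key/value configuration maps. It is not Operate's implementation: `DriftItem`, `detect_drift`, and `affected_environments` are hypothetical names, and a real comparison would also need to handle nested resources and provider-specific defaults.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DriftItem:
    key: str        # configuration item that diverged
    expected: Any   # value declared in the blueprint (None if absent)
    actual: Any     # value observed in the live environment (None if absent)


def detect_drift(expected: dict[str, Any], live: dict[str, Any]) -> list[DriftItem]:
    """Return one DriftItem per key whose live value differs from the blueprint."""
    items = []
    # Walk the union of keys so added and removed settings are both reported.
    for key in sorted(expected.keys() | live.keys()):
        want, have = expected.get(key), live.get(key)
        if want != have:
            items.append(DriftItem(key, want, have))
    return items


def affected_environments(resource: str, depends_on: dict[str, set[str]]) -> list[str]:
    """List environments that declare a dependency on the drifted resource."""
    return sorted(env for env, deps in depends_on.items() if resource in deps)
```

Each `DriftItem` carries the three pieces of detail described above (what changed, what it should be, and what it is now), while the dependency lookup covers the "what it affects" part.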

03
Real-time cost attribution

Every environment tagged at provisioning. Every dollar visible by team, project, and user.

Cost attribution is applied at the moment an environment is deployed, not as a manual tagging exercise after the fact. Every environment carries the team, project, user, and business unit it belongs to as structured metadata. Operate surfaces this as real-time cost data: current daily spend per environment, month-to-date totals by team, and potential savings identified from idle or undersized resources. FinOps teams can drill from a global cost summary down to an individual environment’s spend and usage history. The data is always current. The accountability is always clear.
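
As a rough illustration of why structured metadata matters, the sketch below rolls up daily spend by any attribution tag. The record shape (`EnvironmentCost`) and the function `spend_by` are assumptions made for this example and are not part of Operate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentCost:
    name: str
    team: str
    project: str
    daily_spend: float  # currency units per day


def spend_by(environments: list[EnvironmentCost], attribute: str) -> dict[str, float]:
    """Sum daily spend grouped by one attribution attribute, e.g. 'team' or 'project'."""
    totals: dict[str, float] = defaultdict(float)
    for env in environments:
        totals[getattr(env, attribute)] += env.daily_spend
    return dict(totals)
```

Because every environment carries its tags from the moment it is deployed, the same records can be grouped by team, by project, or by any other tag without a separate tagging pass.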

04
Out-of-band change governance

Changes made outside the governed process are caught, reviewed, and either accepted or remediated

When a change is made directly to a running environment outside the Torque platform, whether through the cloud console, a direct SSH session, or an ad hoc script, Operate detects the deviation and routes it through a configured approval workflow. The change is visible in the platform, attributed to the user who made it, and held pending review. Designated approvers can accept the change and update the IaC baseline to reflect it, or reject it and trigger automatic remediation to restore the environment to its approved state. Nothing bypasses governance unnoticed.
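
The accept-or-reject outcome can be sketched as a small state transition. This is an assumption-laden illustration, not the platform's workflow engine: `Environment`, `Decision`, and `review_change` are hypothetical names, and configuration is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


@dataclass
class Environment:
    baseline: dict  # approved configuration (the IaC baseline)
    live: dict      # configuration currently running


def review_change(env: Environment, decision: Decision) -> None:
    """Resolve a detected out-of-band change so baseline and live state agree again."""
    if decision is Decision.ACCEPT:
        # Accept: the out-of-band change becomes the new approved baseline.
        env.baseline = dict(env.live)
    else:
        # Reject: remediation restores the environment to its approved state.
        env.live = dict(env.baseline)
```

Either way, the review ends with the baseline and the live state back in agreement, so no change stays unreviewed.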

05
Blueprint update management

Every environment running an outdated configuration is flagged before it becomes a problem

When a blueprint is updated, every environment still running an older version is automatically flagged with a Pending Update indicator in the operations dashboard. Platform engineers can see the full list of outdated environments across all teams and spaces, along with what changed in the new version and the risk of remaining on the old one. Environments can be updated individually or in bulk. Users running the affected environments receive notifications proactively, with a one-click update option. Currency across the estate is managed at scale, without manual tracking.
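
A minimal sketch of the flagging step, assuming blueprint versions are comparable integers and each running environment records the version it was launched from. `RunningEnvironment` and `pending_updates` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningEnvironment:
    name: str
    blueprint: str
    blueprint_version: int  # assumed: versions are comparable integers


def pending_updates(
    environments: list[RunningEnvironment], latest_versions: dict[str, int]
) -> list[str]:
    """Return names of environments running an older blueprint version than the latest."""
    return sorted(
        env.name
        for env in environments
        # Blueprints with no known latest version are treated as current.
        if env.blueprint_version < latest_versions.get(env.blueprint, env.blueprint_version)
    )
```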

06
Agentic Day 2 operations

Agents that reason about what they find and act accordingly, within governed boundaries, without waiting to be asked

Workflows are powerful when every scenario can be anticipated. Operational infrastructure rarely works that way. Torque’s approach to Day 2 is agentic: SRE, FinOps, and compliance agents that evaluate the actual state of the estate, reason about what they find in context, and determine the right action without waiting to be triggered. An agent responding to a degraded environment does not pick a branch. It assesses what is wrong, considers what it knows about that specific environment, and acts accordingly. Every agent is built around the same principles: informed decisions, safe actions, and strict governance boundaries that define the blast radius without constraining the reasoning. The boundaries are rigid. The thinking within them is not.

This is where Operate connects to AI & Agentic. The agents running here are not a bolt-on. They are the operational expression of the AI Copilot capability, with the same governance model, the same audit trail, and the same policy-enforced boundaries that govern every other Torque action. The difference is that here, they are acting, not just advising.

Frequently Asked Questions

What is Day 2, and why does it matter?

Day 0 is planning and design. Day 1 is initial deployment. Day 2 is everything that happens after: keeping environments running correctly, managing configuration changes, controlling costs, maintaining compliance, and ensuring environments stay aligned with what was intended when they were deployed. Day 2 is where most infrastructure problems occur and where most infrastructure cost accumulates. It is also where most organizations have the weakest tooling, relying on manual processes, scheduled scans, and reactive incident response. Operate replaces that with continuous, automated, real-time governance.

How does Operate detect configuration drift?

Operate continuously compares the live state of every running environment against the IaC specification stored in Git and indexed by Curate. When a configuration item changes outside the governed process, whether through a direct cloud console action, a manual modification, or an out-of-band script, the deviation is detected in real time and surfaced in the operations dashboard immediately. The detection is not based on scheduled scans. It is continuous. The dashboard reflects drift the moment it occurs, with specific detail on what changed, what the expected value is, and what depends on the affected resource.

Does Operate remediate drift automatically, or does it require approval?

Both options are configurable. Platform teams can configure automatic remediation for known, low-risk drift patterns, where Operate detects the deviation and immediately restores the environment to its approved configuration without human intervention. For higher-risk changes, or environments where human sign-off is required by policy, drift triggers an approval workflow that routes to a designated approver. The approver can accept the change (updating the IaC baseline to reflect it) or reject it (triggering automatic restoration). The choice between automatic and approval-gated remediation is set at the blueprint and environment tier level.
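
One way to picture a policy set at the environment tier level is a lookup from tier to the drift risk levels that may be remediated without sign-off. The tier names, risk labels, and the `AUTO_REMEDIATE_POLICY` table below are assumptions for illustration and do not describe Operate's actual configuration format.

```python
from enum import Enum


class Action(Enum):
    AUTO_REMEDIATE = "auto_remediate"
    REQUIRE_APPROVAL = "require_approval"


# Hypothetical policy: risk levels each tier may remediate without human sign-off.
AUTO_REMEDIATE_POLICY: dict[str, set[str]] = {
    "development": {"low", "medium"},
    "staging": {"low"},
    "production": set(),  # every production drift event goes to an approver
}


def remediation_action(tier: str, risk: str) -> Action:
    """Choose automatic or approval-gated remediation for a drift event."""
    # Unknown tiers fall back to approval, the safer default.
    if risk in AUTO_REMEDIATE_POLICY.get(tier, set()):
        return Action.AUTO_REMEDIATE
    return Action.REQUIRE_APPROVAL
```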

How does cost attribution work?

Cost attribution is applied structurally at the point of provisioning, not as a manual tagging exercise. Every environment deployed through Self-Service is automatically tagged with the team, project, user, and business unit it belongs to. This metadata is carried through the entire environment lifecycle and is the basis for all cost reporting in Operate. Platform administrators see cost data across all teams and spaces. Team leads see cost data for their team. Individual users see cost data for their own environments. FinOps teams have access to structured cost reports by space, team, project, blueprint type, and user, with CSV export for external reporting.

Are Torque’s operational agents limited to SRE and FinOps roles?

Torque’s operational agents are not defined by two roles. SRE and FinOps are examples of the kinds of operational responsibilities agents are designed to cover, but the model extends to any area where informed, contextual decisions need to be made continuously and at scale, including security posture, compliance validation, capacity management, and more. What every agent shares is the same underlying design: they reason about the current state of what they are responsible for, evaluate the available options in context, and act within strictly defined governance boundaries. Each agent operates with a specific permission scope, a defined blast radius, and a full audit trail. An agent responsible for cost cannot modify infrastructure configurations. An agent responsible for remediation cannot make financial decisions. The boundaries are platform-enforced. Within them, the decision is the agent’s, not a condition in a script.
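
The permission-scope idea can be sketched as an allow-list checked before any action runs. The names here (`AgentScope`, `ScopeViolation`, `authorize`, and the action strings) are hypothetical and only illustrate the principle that a cost agent cannot touch configuration.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when an agent attempts an action outside its permission scope."""


@dataclass(frozen=True)
class AgentScope:
    name: str
    allowed_actions: frozenset[str]


def authorize(scope: AgentScope, action: str) -> None:
    """Allow the action only if it is inside the agent's scope; otherwise raise."""
    if action not in scope.allowed_actions:
        raise ScopeViolation(f"{scope.name} agent may not perform '{action}'")


# Hypothetical scopes mirroring the separation described above.
COST_AGENT = AgentScope("cost", frozenset({"flag_idle_resource", "recommend_shutdown"}))
REMEDIATION_AGENT = AgentScope("remediation", frozenset({"restore_configuration"}))
```

Keeping the check outside the agent's own reasoning is what makes the boundary platform-enforced: the agent decides what to attempt, and the scope decides whether it is permitted.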

Can Operate be used for production environments?

Yes, and production is where Operate is most valuable. The drift detection, cost attribution, approval workflows, and compliance audit trail are all designed to operate at the governance level required for production workloads. Production blueprints can be configured with stricter approval requirements, mandatory tag enforcement, and more conservative auto-remediation policies than development or staging environments. The operational dashboard gives platform teams a unified view across all environment types, with the ability to filter and act on each tier appropriately. Operate does not distinguish between pre-production and production, but it respects and enforces whatever policy distinctions the platform team defines.

How is an agent different from a workflow?

A workflow is a decision tree. It evaluates conditions and selects from pre-defined branches. It is powerful for deterministic processes where every scenario can be anticipated and scripted in advance. But infrastructure operations are not fully deterministic. The same type of event in two different environments, at two different times, with two different histories, may warrant two different responses. A workflow cannot make that distinction because it was not written to. An agent can, because it reasons about context rather than matching conditions to branches. Torque’s SRE and FinOps agents evaluate the actual state of the environment, not just the triggering event. They consider what they know about the affected resource, the team that owns it, the history of similar issues, and the potential consequences of different responses. They then determine the appropriate action and execute it within the policy boundaries the platform team defined. Those boundaries are strictly enforced. Within them, the decision is the agent’s. This is what makes agentic operations genuinely different from sophisticated workflow automation — and why it handles real operational complexity in ways that workflows, however well-designed, fundamentally cannot.

How does Operate support compliance audits?

Every action taken on every environment is logged automatically by the platform with full detail: who deployed it, who modified it, what changed and when, who approved or rejected out-of-band changes, who extended the TTL, and who destroyed it, all with precise timestamps. This log is generated by Operate as a structural output of normal operations, not as a separate compliance process that requires additional configuration. There is nothing to enable and nothing to maintain. When an auditor asks for the change history of a specific environment, the complete record is available immediately, in a structured, exportable format.
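
For readers who want a mental model of a structural audit record, the sketch below shows an append-only log that timestamps each action and exports one environment's history as CSV. It is an illustration under assumed names (`AuditEvent`, `AuditLog`) and does not reflect Operate's actual schema or export format.

```python
import csv
import io
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

FIELDS = ["timestamp", "environment", "actor", "action"]


@dataclass(frozen=True)
class AuditEvent:
    timestamp: str  # ISO 8601, UTC
    environment: str
    actor: str
    action: str


class AuditLog:
    """Append-only record of actions; events are never edited or removed."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, environment: str, actor: str, action: str,
               when: Optional[datetime] = None) -> AuditEvent:
        when = when or datetime.now(timezone.utc)
        event = AuditEvent(when.isoformat(), environment, actor, action)
        self._events.append(event)
        return event

    def history(self, environment: str) -> list[AuditEvent]:
        """Return the change history of one environment, in the order recorded."""
        return [e for e in self._events if e.environment == environment]

    def to_csv(self, environment: str) -> str:
        """Export one environment's history as CSV text with a header row."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=FIELDS)
        writer.writeheader()
        for event in self.history(environment):
            writer.writerow(asdict(event))
        return buffer.getvalue()
```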

Try it yourself

See Operate running against a live infrastructure estate

No installation. No configuration. Connect to a pre-loaded environment where drift has been introduced, costs are accumulating, and pending updates are waiting, and work through the full Operate response.

Live drift scenario with a pre-introduced configuration deviation, showing detection, impact mapping, and the AI Copilot remediation proposal

Real cost attribution data across multiple teams and environments, with idle resource flags and savings recommendations active

Pending update example with an outdated environment and the full update workflow available to explore

Approval workflow demo showing an out-of-band change caught, routed for review, and either accepted or remediated

Ready to stop discovering problems after they happen?

See how Operate continuously monitors your infrastructure estate, catches drift the moment it occurs, and gives your SRE and FinOps teams the visibility and automation to act, not just react, in a live session tailored to your environment.