The proliferation of generative AI in recent years has unleashed a revolution of sorts in the use of Machine Learning applications to execute complex functions supporting advanced research and data science teams.
Behind this revolution is a significant amount of overhead. ML applications require substantial upkeep to ensure they are trained and operating correctly to support the business functions that rely on them while maintaining cost efficiency at scale.
Given these challenges, a successful ML application requires effective and efficient ML ops.
The data scientists and engineers responsible for the success of an ML app are tasked with a wide range of actions to ensure the accuracy and performance of an AI model, such as:
- Checking for issues like data drift or model degradation, which can reduce accuracy over time
- Data preprocessing, feature engineering, and tuning model parameters to retrain AI models in response to new data and changing conditions
- Maintaining documentation, managing version control, and ensuring compliance with regulatory requirements
- Debugging and addressing issues such as bias, errors, or unexpected behavior
- Optimizing the model’s deployment environment for cost and performance, including resource allocation and scaling decisions
This article will show how Quali Torque users can automate the execution of these types of actions to reduce the manual work required to optimize AI models.
How Does Quali Torque Automate Actions on AI Applications?
To accelerate the deployment and management of application environments, Torque defines the user’s Environment as Code.
Known as blueprints in Torque, these Environment as Code templates contain the code needed to provision all infrastructure and services to deliver the application itself.
This allows users to click-and-launch applications, while also providing a single source of truth for the environment—allowing users to monitor the state continuously, perform actions on the blueprint as needed, and set up custom triggers to execute those actions automatically.
To create an Environment as Code blueprint, the user connects the repositories containing their existing cloud resource configurations and other application services.
Once the repository is connected, the user can submit natural-language AI prompts describing the resource they need. Torque automatically creates the Environment as Code blueprint based on those prompts, then allows users with access to the blueprint to deploy the environment and interact with the environment.
To automate actions in Torque, users similarly define the action as code.
These are known as Workflows in Torque, and they allow users to perform the action by executing the code. Users can find pertinent Workflows directly within the UI where they access their application environments, and can initiate the action via a single click in the platform.
Users can also set custom triggers to automate the execution of these actions in response to custom events or recurring schedules.
This brief video shows this process supports AI applications:
Here are some example actions that can optimize AI models automatically via Quali Torque.
1. Adversarial testing for Gen AI models
Adversarial attacks against AI models are deliberate attempts to manipulate input data in a way that causes the model to make a mistake or behave unexpectedly. These attacks exploit the vulnerabilities in machine learning models, particularly in neural networks, by subtly altering inputs—often in ways that are imperceptible to humans—to cause the model to produce incorrect outputs.
To automate adversarial robustness testing, Torque users can create a Workflow defining the script required to execute the test and the model endpoints which will be tested.
Once defined, Torque users can find a quick-link for the Workflow and execute it with a single click in Torque’s native UI.
Torque also allows users to automate the execution of the Workflow in response to events or on a recurring basis using cron jobs.
This allows users to automate recurring adversarial testing while also providing the flexibility to execute it rapidly on an ad hoc basis.
2. Data Quality Assurance
Maintaining data quality assurance in AI models involves several key practices:
- Ensuring data relevance, diversity, and provenance during collection
- Implementing thorough data cleaning and preprocessing, such as handling missing data, removing duplicates, and normalizing features
- Validating data through automated checks and manual reviews
- Performing robust feature engineering to avoid data leakage.
In executing these actions, high-quality annotation and labeling are essential, supported by consistent labeling and regular error analysis.
Continuous monitoring is also needed to detect data drift and bias, while documentation and version control help track changes so domain experts can identify potential issues and execute synthetic data testing to ensure model robustness.
Torque users can create a Workflow to check the quality and integrity of the data fed into the model during inference automatically.
This workflow is a critical step to ensure that the input data is clean, consistent, and adheres to the expected format and distribution, as any issues with the data can negatively impact the model’s performance.
3. Monitoring for Model Drift
Model drift, or changes in the underlying data or environment supporting an AI model, can lead to decreased accuracy, increased errors, and potential biases.
Model drift occurs when the real-world data that the model encounters during deployment begins to differ significantly from the data it was trained on.
Whether to identify and correct data drift or concept drift, Torque users can create Workflows that monitor the performance of a deployed model over time to detect any significant deviations or degradation in its accuracy, precision, or other relevant metrics.
Automating model drift can help identify when the model needs to be retrained or updated to maintain optimal performance, which can be used to trigger additional Torque Workflows to initiate the training or updating of the model.
4. Monitor Inference Accuracy
Inference inaccuracy in AI models can arise from several factors, which generally fall into categories related to data quality, model design, or environmental changes.
Understanding these causes is essential for diagnosing and mitigating issues that lead to poor model performance.
In Torque, users can detect performance risks proactively by creating Workflows to pull the latest AI model endpoints and monitor accuracy automatically.
The ability to leverage event-based triggers, cron jobs for recurring execution, and manual ad-hoc execution provides multiple layers for users to evaluate inference accuracy.
5. Resource Utilization Monitoring
AI workloads can be unpredictable and difficult to optimize, with demand spikes leading to higher costs if not managed efficiently. Meanwhile, the inherent costs of deploying AI models across multiple regions to meet global demand can drive up costs as well.
Continuously monitoring and optimizing the cloud costs for AI workloads requires advanced expertise and sophisticated tracking tools, which themselves can be expensive to implement and manage.
Torque Workflows monitor the computational resources (CPU, GPU, memory, etc.) consumed by the model during inference, which can help with manual intervention and paired with additional workflows to optimize resource allocation and ensure cost-effective deployment of the model in production environments without disrupting performance.
6. AI Model Explainability & Interpretability
The ability to demonstrate the ethical use and decision-making of an AI model is critical to maintaining user trust and complying with increasingly strict regulatory standards.
Using scripts to define and interpret the reasoning behind the model’s predictions or decisions, Torque users can automate the execution of a Workflow to identify potential biases, inconsistencies, or unexpected behaviors, which can be addressed through model refinement or additional training.