April 17, 2026 · 10 min read · Claw Mart Team

Automate Infrastructure Drift Detection: Build an AI Agent That Reports Changes

Most infrastructure teams don't discover drift during a calm Tuesday morning review. They discover it during an incident, an audit, or when a deploy mysteriously fails because someone changed a security group in the console three weeks ago and never told anyone.

The gap between what your Infrastructure as Code says should exist and what actually exists in your cloud environment is infrastructure drift, and it's one of the most expensive, boring, and persistent problems in DevOps. Not because people don't care, but because the process of detecting and fixing it is so tedious that teams either skip it or do it so infrequently that the results are overwhelming when they finally look.

Here's the thing: about 70–80% of the drift detection and analysis workflow can be automated right now with AI agents. Not in some speculative future. Today. The remaining 20–30% genuinely needs a human brain. The trick is knowing which is which and building automation around the right parts.

This post walks through exactly how to do that with OpenClaw.


The Manual Drift Detection Workflow (And Why Nobody Does It Consistently)

Let's be honest about what drift detection actually looks like at most companies. Not the idealized version. The real one.

Step 1: Trigger the scan. Someone runs terraform plan against production, kicks off a CloudFormation drift detection job, or fires up driftctl. This takes 5–30 minutes depending on the size of your environment, and that's when it works. State lock conflicts, expired credentials, and provider API rate limits mean you're often troubleshooting before you even get results.

Step 2: Stare at the output. A medium-sized AWS environment might return hundreds of drifted resources. You're reading raw plan output or JSON diffs, trying to figure out which changes matter. Did someone open a security group to 0.0.0.0/0 for debugging and forget to close it? Or did AWS update a managed policy's ARN as part of a routine service update? The tool doesn't know the difference. You do, theoretically, if you squint hard enough.

Average time for this triage step alone: 4–12 hours per scan for a mid-sized environment, per reports from teams using Spacelift and Env0.

Step 3: Root cause analysis. Now you need to figure out who changed what and why. This means cross-referencing CloudTrail logs, Git history, Jira tickets, Slack messages, and—most reliably—walking over to someone's desk and asking. A significant drift incident eats 2–8 hours of an engineer's time just on investigation.

Step 4: Decide what to do. Three options for every drifted resource: revert to the IaC-defined state, update IaC to match reality, or document an exception. This is the step that actually requires human judgment—business context, risk tolerance, downstream dependencies. But it's also the step where people are already mentally exhausted from steps 2 and 3, so they make worse decisions.

Step 5: Remediate and validate. Write the code changes, run terraform apply or import the new state, re-scan to confirm, and update whatever compliance documentation your auditors require. Another 4–20 hours depending on complexity.

Total realistic time cost: 15–50 engineer-hours per week for a team managing a non-trivial cloud footprint. One fintech company publicly shared on the HashiCorp blog that they were spending roughly 15 engineer-hours weekly on drift before they improved their tooling. An insurance company featured in a Spacelift case study cut remediation time from 18 hours to 3 hours per week with better automation—but triage still needed humans.

Most teams don't spend this time. They just don't do it, or they do it monthly, or quarterly, or never. A 2026 Env0 survey found that 61% of companies run drift detection less than once per week, with 22% doing it monthly or less. And 68% of Terraform users in HashiCorp's 2023 survey report drift as a "significant" or "severe" problem.

The math doesn't add up. The problem is severe, but the process is too painful, so people avoid it.


What Makes This Painful (Beyond the Obvious)

Time cost is the headline number, but the real damage is subtler.

Noise kills motivation. When your drift report has 400 items and 380 of them are expected or irrelevant, people stop reading drift reports. This is the "boy who cried wolf" problem, and it's the primary reason drift detection programs fail at most organizations.

Lack of context makes triage impossible for junior engineers. Senior folks can glance at a drifted resource and know from experience whether it matters. Junior engineers can't. This creates a skill bottleneck where your most experienced people are spending their time on the most tedious work.

Delayed detection compounds risk. A security group opened to the internet for a debugging session on Monday is annoying. Discovering it three weeks later during an audit is a compliance finding. Discovering it because an attacker exploited it is a disaster. The value of drift detection degrades exponentially with delay.

Compliance isn't optional. In regulated industries—finance, healthcare, government—undetected drift isn't just a technical debt problem. It's a SOX, HIPAA, or PCI finding. One major bank discovered that over 40% of their production resources had drifted when they implemented automated detection for the first time. Most were "emergency" changes during incidents that were never codified.

Shadow infrastructure is invisible. Resources created entirely outside IaC don't show up in terraform plan at all. They're ghosts. You need tools like driftctl or Cloud Asset Inventory comparisons to even know they exist, and those tools generate even more noise.


What AI Can Handle Right Now

Not everything. But a lot more than most teams realize.

AI agents excel at exactly the parts of drift detection that humans find most painful: processing large volumes of data, correlating across multiple sources, classifying signals vs. noise, and generating clear summaries. Here's the breakdown:

High-confidence automation targets:

  • Continuous monitoring with intelligent filtering. Instead of periodic scans, an AI agent watches for events (CloudTrail, webhook-triggered plan runs) and only alerts when something looks anomalous based on historical patterns.
  • Noise reduction and classification. Train on your team's past triage decisions. "This type of drift on this type of resource has been marked as 'expected' 47 out of 50 times. Auto-classifying as expected."
  • Root cause correlation. Cross-reference drifted resources against CloudTrail events, recent pull requests, Jira tickets, and incident timelines. "This security group change correlates with incident INC-1934 and a Slack message from @sarah at 2:14am on March 3rd."
  • Impact assessment. Analyze the blast radius: what depends on this resource? Is this a security-sensitive change? Does it affect cost? Does it violate any compliance policies?
  • Natural language summarization. Turn 400 lines of terraform plan output into: "3 high-risk drifts detected. 1 security group opened to 0.0.0.0/0 (likely debugging session, correlates with INC-2041). 1 RDS instance class changed from db.r5.large to db.r5.xlarge (no matching PR found—possible manual scaling). 1 S3 bucket policy modified (matches PR #4582, already approved)."
  • Auto-ticketing with context. Create Jira or Linear tickets that include the drift details, probable cause, affected systems, recommended remediation, and relevant links—not just "drift detected."

This isn't theoretical. Teams are already doing this with LLMs analyzing terraform plan output. The difference is doing it in a structured, repeatable way rather than pasting output into a chat window.
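
The noise-reduction pattern in particular is simple to sketch. Here's a minimal, hedged version: assume `past_decisions` is a log of your team's triage labels, and note that the fingerprint shape (resource type plus attribute) and the 90% threshold are illustrative choices, not OpenClaw APIs:

from collections import Counter

def classify_from_history(drift: dict, past_decisions: list[dict],
                          min_samples: int = 20, threshold: float = 0.9):
    """Auto-classify a drift finding when past human triage is near-unanimous."""
    fingerprint = (drift['resource_type'], drift['attribute'])
    labels = [d['label'] for d in past_decisions
              if (d['resource_type'], d['attribute']) == fingerprint]
    if len(labels) < min_samples:
        return None  # not enough history -- leave it to the agent and a human
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= threshold:
        return label  # e.g. "expected" in 47 of 50 past decisions
    return None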


Step by Step: Building the Automation with OpenClaw

Here's how to build an infrastructure drift detection agent on OpenClaw that handles the automated portion of this workflow.

Architecture Overview

The agent needs three capabilities:

  1. Scan — Trigger and collect drift detection results on a schedule or via events
  2. Analyze — Classify, correlate, and assess drift findings
  3. Report — Generate summaries, create tickets, and escalate high-risk items

OpenClaw handles the AI reasoning, tool orchestration, and workflow management. You provide the integrations to your specific infrastructure.
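
In plain Python, the control flow is just a loop over workspaces. The sketch below is deliberately framework-agnostic — the three stage functions are stubs standing in for the tools and prompts built in the steps that follow, not OpenClaw APIs:

def run_scan(workspace: dict) -> list[dict]:
    """Stage 1 -- trigger terraform plan and collect drifted resources."""
    raise NotImplementedError  # tool backend, built in Step 4

def analyze_findings(findings: list[dict]) -> list[dict]:
    """Stage 2 -- classify severity, correlate root cause, assess impact."""
    raise NotImplementedError  # driven by the system prompt in Step 2

def report_results(workspace: dict, triaged: list[dict]) -> None:
    """Stage 3 -- Slack summary, Jira tickets, on-call escalation."""
    raise NotImplementedError  # tool backends, defined in Step 1

def drift_sweep(workspaces: list[dict]) -> None:
    """One end-to-end sweep across every configured workspace."""
    for ws in workspaces:
        report_results(ws, analyze_findings(run_scan(ws)))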

Step 1: Define the Agent's Tools

Your OpenClaw agent needs access to the data sources it'll analyze. Define these as tools:

# Tool definitions for the OpenClaw drift detection agent

tools = [
    {
        "name": "run_terraform_plan",
        "description": "Runs terraform plan against a specified workspace and returns the drift output",
        "parameters": {
            "workspace": {"type": "string", "description": "Terraform workspace name"},
            "target_dir": {"type": "string", "description": "Directory containing the Terraform config"}
        }
    },
    {
        "name": "query_cloudtrail",
        "description": "Searches CloudTrail events for changes to a specific resource within a time window",
        "parameters": {
            "resource_arn": {"type": "string"},
            "hours_back": {"type": "integer", "default": 168}
        }
    },
    {
        "name": "search_jira_tickets",
        "description": "Searches Jira for tickets related to a resource or change",
        "parameters": {
            "query": {"type": "string"}
        }
    },
    {
        "name": "check_git_history",
        "description": "Searches recent Git commits and PRs for references to a resource",
        "parameters": {
            "resource_identifier": {"type": "string"},
            "repo": {"type": "string"},
            "days_back": {"type": "integer", "default": 30}
        }
    },
    {
        "name": "create_jira_ticket",
        "description": "Creates a Jira ticket with drift findings and recommended actions",
        "parameters": {
            "summary": {"type": "string"},
            "description": {"type": "string"},
            "priority": {"type": "string", "enum": ["Critical", "High", "Medium", "Low"]},
            "labels": {"type": "array", "items": {"type": "string"}}
        }
    },
    {
        "name": "send_slack_alert",
        "description": "Sends a formatted drift summary to a Slack channel",
        "parameters": {
            "channel": {"type": "string"},
            "message": {"type": "string"},
            "severity": {"type": "string"}
        }
    }
]
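
These definitions tell the agent what it can call; each name still needs a Python backend (Step 4 builds one out in full). One simple wiring pattern is a decorator-based registry — a sketch that assumes the runtime hands back a tool name plus keyword arguments; the exact tool-call shape depends on OpenClaw's runtime:

from typing import Callable

TOOL_BACKENDS: dict[str, Callable] = {}

def tool(fn: Callable) -> Callable:
    """Register a function as the backend for the tool of the same name."""
    TOOL_BACKENDS[fn.__name__] = fn
    return fn

@tool
def search_jira_tickets(query: str) -> dict:
    """Placeholder backend -- swap in a real Jira API call."""
    return {'query': query, 'tickets': []}

def dispatch(tool_call: dict) -> dict:
    """Execute one tool call, assumed shape: {'name': ..., 'arguments': {...}}."""
    return TOOL_BACKENDS[tool_call['name']](**tool_call['arguments'])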

Step 2: Build the Agent's System Prompt

This is where you encode your team's institutional knowledge. The system prompt turns a general-purpose AI agent into your drift analyst:

You are an infrastructure drift detection agent. Your job is to:

1. Run terraform plan against each configured workspace
2. Parse the output to identify all drifted resources
3. For each drifted resource, classify its severity:
   - CRITICAL: Security-sensitive changes (security groups, IAM policies, 
     encryption settings, public access)
   - HIGH: Compute/database configuration changes with no matching PR
   - MEDIUM: Changes that correlate with a known incident or approved PR
   - LOW: Known expected drift patterns (AWS-managed policy ARN updates, 
     auto-scaling group desired count changes, etc.)
4. For HIGH and CRITICAL items, investigate root cause by checking CloudTrail, 
   Git history, and Jira tickets
5. Generate a summary report with:
   - Total drifted resources by severity
   - For each HIGH/CRITICAL item: what changed, probable cause, 
     recommended action, affected downstream systems
   - For MEDIUM/LOW items: grouped summary only
6. Create Jira tickets for CRITICAL and HIGH items with full context
7. Send a Slack summary to #infrastructure-drift

KNOWN EXPECTED DRIFT PATTERNS (do not alert on these):
- aws_autoscaling_group.*.desired_capacity (changes with scaling events)
- aws_ecs_service.*.desired_count (changes with scaling events)
- aws_lambda_function.*.last_modified (updates on every invocation)
- aws_db_instance.*.latest_restorable_time (updates continuously)
- Any resource tagged with "drift-exception: true"

ESCALATION RULES:
- Any security group change allowing 0.0.0.0/0: CRITICAL, page on-call
- Any IAM policy change not matching a merged PR: CRITICAL
- Any S3 bucket policy change: HIGH minimum
- Any RDS/Aurora configuration change: HIGH minimum
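
Escalation rules like these are worth enforcing in code as well as in the prompt, so a model misclassification can never downgrade a critical finding. A sketch — the `type`, `change`, and `matching_pr` fields are assumptions about how your agent represents a drifted resource:

LEVELS = ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL']

def floor_severity(drift: dict, model_severity: str) -> str:
    """Apply hard minimums from the escalation rules, whatever the model said."""
    floor = 'LOW'
    if drift['type'] == 'aws_security_group' and '0.0.0.0/0' in str(drift['change']):
        floor = 'CRITICAL'
    elif drift['type'].startswith('aws_iam_') and not drift.get('matching_pr'):
        floor = 'CRITICAL'
    elif drift['type'] in ('aws_s3_bucket_policy', 'aws_db_instance', 'aws_rds_cluster'):
        floor = 'HIGH'
    return max(model_severity, floor, key=LEVELS.index)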

Step 3: Configure the Schedule and Workflow

Set your OpenClaw agent to run on a schedule. For most teams, a scan every 4–6 hours on weekdays is the sweet spot between timeliness and API cost:

# OpenClaw workflow configuration
workflow = {
    "name": "drift-detection-sweep",
    "schedule": "0 */4 * * 1-5",  # Every 4 hours, weekdays
    "agent": "drift-detector-v1",
    "workspaces": [
        {"name": "production-networking", "dir": "terraform/prod/networking"},
        {"name": "production-compute", "dir": "terraform/prod/compute"},
        {"name": "production-data", "dir": "terraform/prod/data"},
        {"name": "staging-all", "dir": "terraform/staging"}
    ],
    "on_failure": {
        "action": "send_slack_alert",
        "channel": "#infrastructure-alerts",
        "message": "Drift detection scan failed. Manual investigation needed."
    }
}

Step 4: Implement the Tool Backends

Each tool defined in Step 1 needs an actual implementation. Here's the CloudTrail investigation tool as an example:

import json
import boto3
from datetime import datetime, timedelta

def query_cloudtrail(resource_arn: str, hours_back: int = 168) -> dict:
    """Query CloudTrail for recent changes to a specific resource."""
    client = boto3.client('cloudtrail')
    
    response = client.lookup_events(
        LookupAttributes=[
            {
                'AttributeKey': 'ResourceName',
                'AttributeValue': resource_arn
            }
        ],
        StartTime=datetime.utcnow() - timedelta(hours=hours_back),
        EndTime=datetime.utcnow(),
        MaxResults=50
    )
    
    events = []
    for event in response.get('Events', []):
        # CloudTrailEvent is a JSON-encoded string, not a dict --
        # parse it before reading the source IP and user agent.
        detail = json.loads(event.get('CloudTrailEvent', '{}'))
        events.append({
            'time': event['EventTime'].isoformat(),
            'event_name': event['EventName'],
            'username': event.get('Username', 'unknown'),
            'source_ip': detail.get('sourceIPAddress', 'unknown'),
            'user_agent': detail.get('userAgent', 'unknown')
        })
    
    return {
        'resource': resource_arn,
        'events_found': len(events),
        'events': events
    }
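
The plan runner follows the same pattern. Here's a minimal sketch assuming the Terraform CLI is on PATH and the workspace already exists; -detailed-exitcode makes drift machine-detectable (exit code 0 means no changes, 2 means changes are pending):

import json
import subprocess

def run_terraform_plan(workspace: str, target_dir: str) -> dict:
    """Run terraform plan and return the drift output as structured JSON."""
    subprocess.run(['terraform', 'workspace', 'select', workspace],
                   cwd=target_dir, check=True, capture_output=True)
    plan = subprocess.run(
        ['terraform', 'plan', '-out=drift.tfplan', '-detailed-exitcode',
         '-input=false', '-lock-timeout=5m'],
        cwd=target_dir, capture_output=True, text=True
    )
    if plan.returncode not in (0, 2):  # 1 means the plan itself errored
        raise RuntimeError(f"terraform plan failed: {plan.stderr}")
    # Render the saved plan file as machine-readable JSON
    show = subprocess.run(
        ['terraform', 'show', '-json', 'drift.tfplan'],
        cwd=target_dir, capture_output=True, text=True, check=True
    )
    return {
        'workspace': workspace,
        'has_drift': plan.returncode == 2,
        'plan': json.loads(show.stdout)
    }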

Step 5: Test with a Known Drift Scenario

Before trusting this in production, create a deliberate drift:

  1. Manually modify a non-critical resource (add a tag to a test security group via the console, or script it as shown after this list)
  2. Run your OpenClaw agent
  3. Verify it detects the change, correctly classifies it, finds your CloudTrail event, and generates a coherent report
  4. Check that the Jira ticket includes the right context
  5. Gradually expand to more workspaces as confidence grows
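
If you'd rather script step 1 than click around the console, a couple of boto3 lines will do it — the security group ID below is a placeholder for your own test resource:

import boto3

ec2 = boto3.client('ec2')

# Tag a non-critical test security group outside Terraform to create drift.
ec2.create_tags(
    Resources=['sg-0123456789abcdef0'],  # placeholder -- use a test resource
    Tags=[{'Key': 'drift-test', 'Value': 'manual-change'}]
)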

Step 6: Train the Noise Filter

This is the part that makes the agent actually useful over time. After each scan, review the results and feed corrections back into the system prompt's known patterns:

# After reviewing scan results, update the agent's known patterns:

ADDITIONAL EXPECTED DRIFT PATTERNS (added 2026-03-15):
- aws_elasticache_cluster.*.engine_version_actual 
  (AWS auto-patches minor versions)
- aws_eks_cluster.*.platform_version 
  (EKS updates platform version automatically)
- aws_cloudwatch_log_group.*.retention_in_days where current > configured 
  (compliance team sets minimum retention via SCP)

Every false positive you classify makes the next scan cleaner. Within 2–4 weeks, most teams find their noise drops by 60–80%.


What Still Needs a Human

Automating the detection, correlation, and triage saves enormous time. But some decisions genuinely require human judgment, and you shouldn't try to automate them:

"Should we accept this drift or revert it?" When your on-call engineer scaled up an RDS instance during a 2am incident, that change might be the right call. Reverting it automatically could cause another outage. The agent can surface the context—the incident ticket, the CloudTrail event, who made the change—but a human needs to decide.

Architectural trade-offs. Sometimes drift reveals that the IaC definition was wrong in the first place. Updating the code to match reality is the right move, but it requires understanding the system's design intent.

Risk acceptance in regulated environments. Compliance requires a named human to accept risk. Your agent can prepare the documentation, but someone with authority needs to sign off.

Complex dependency chains. When drift in a VPC peering connection affects routing tables, which affects service discovery, which affects three downstream applications—the agent can map the blast radius, but reasoning about the full business impact requires knowledge the agent doesn't have.

The goal isn't to remove humans from the loop. It's to make sure humans spend their time on judgment calls instead of staring at JSON diffs.


Expected Time and Cost Savings

Based on the patterns we're seeing from teams that have built this kind of automation:

Metric | Before | After | Improvement
------ | ------ | ----- | -----------
Scan frequency | Weekly or less | Every 4–6 hours | ~20x more frequent
Triage time per scan | 4–12 hours | 15–30 minutes (review AI summary) | 85–95% reduction
Root cause investigation | 2–8 hours per incident | Automated (human review only) | 80–90% reduction
Ticket creation | 30–60 min per ticket (manual context gathering) | Automated with full context | ~95% reduction
Total weekly engineer time | 15–50 hours | 3–8 hours | 70–85% reduction
Mean time to detect critical drift | Days to weeks | Hours | 80–95% faster

The real win isn't just the time savings—it's that detection actually happens consistently. A process that runs every 4 hours and takes 15 minutes to review will actually get done. A process that takes 12 hours and runs weekly will get skipped the first time the team is busy, and then it never catches up.


Getting Started

You don't need to build the full system on day one. Start here:

  1. Pick your noisiest Terraform workspace. The one with the most drift. Run terraform plan and save the output.
  2. Build a minimal OpenClaw agent with just the plan parser (sketched after this list) and Slack notification. No Jira integration, no CloudTrail correlation yet. Just classify and summarize.
  3. Run it for two weeks. Collect the false positives and expected patterns. Update the system prompt.
  4. Add CloudTrail correlation. This is where the agent starts providing real root cause value.
  5. Add ticketing. Once you trust the classification, let it create tickets automatically for high-severity items.
  6. Expand to more workspaces.
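
For step 2, the parser stays small if you work from terraform show -json output (as the Step 4 scan backend does) instead of raw plan text. A sketch:

def drifted_resources(plan_json: dict) -> list[dict]:
    """Extract drifted and changed resources from terraform show -json output."""
    drifted = []
    # 'resource_drift' holds refresh-detected drift; 'resource_changes' holds
    # pending changes. Entries in both share the same structure.
    for rc in plan_json.get('resource_drift', []) + plan_json.get('resource_changes', []):
        actions = rc['change']['actions']
        if actions not in (['no-op'], ['read']):
            drifted.append({
                'address': rc['address'],  # e.g. aws_security_group.web
                'type': rc['type'],
                'actions': actions         # e.g. ['update'] or ['delete', 'create']
            })
    return drifted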

The infrastructure drift problem isn't going away. Cloud environments get more complex every quarter, teams move faster, and manual changes during incidents are inevitable. The question isn't whether to automate drift detection—it's how quickly you can get an agent handling the grunt work so your engineers can focus on the decisions that actually need their expertise.

If you want to skip the build-from-scratch approach and get a drift detection agent running faster, check out the pre-built infrastructure monitoring agents on Claw Mart. The marketplace has OpenClaw-native agents and tool integrations that handle the common patterns out of the box—Terraform plan parsing, CloudTrail correlation, Slack and Jira integrations—so you can customize from a working baseline instead of starting from zero.

If you've already built something like this internally and want to help other teams skip the pain, consider Clawsourcing it. Package your agent, list it on Claw Mart, and let other infrastructure teams benefit from what you've learned. The community gets better tooling, and you get compensated for the expertise you've already invested. Visit claw-mart.com to learn more about listing your agents.
