Cloud Custodian Security
Building Event-Driven AWS Security Remediation with Cloud Custodian
How to automatically detect and fix the most critical AWS security misconfigurations within 90 seconds of occurrence, deployed in under 5 minutes with a single CloudFormation command and zero ongoing maintenance.
Why this matters
In cloud environments, misconfigurations are the leading cause of breaches. Not zero-days. Not sophisticated attackers. A developer creates an S3 bucket and forgets to block public access. A security group gets opened for a quick test and never closed. An IAM user gets granted AdministratorAccess as a shortcut and nobody notices for months.
The average time to detect a cloud misconfiguration is 197 days. By the time a human reviews a weekly compliance report, the window for exploitation has been open for months. Auto-remediation collapses that window to 90 seconds. This is not about replacing security teams. It is about removing the entire class of risk that does not require human judgment.
How the pipeline works in under 5 minutes
The entire system deploys with one CloudFormation command and creates 14 Lambda functions, 14 EventBridge rules, one SNS topic, and one IAM role. From zero to fully automated remediation in under 5 minutes.
Every AWS API call is captured by CloudTrail and delivered to Amazon EventBridge. When a resource is created or modified, EventBridge matches the event against 14 rules and invokes the corresponding Lambda function automatically. The Lambda checks the resource, applies the fix, and sends an email alert. No polling. No schedules. No human intervention.
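As a rough illustration of what each function receives, the sketch below pulls the affected resource out of a CloudTrail event delivered by EventBridge. The `detail`, `eventSource`, `eventName`, and `requestParameters` fields follow the standard CloudTrail event layout. The helper names and the sample event are assumptions made for this example, not the code deployed by the stack.

```python
from typing import Any, Dict, Optional


def extract_bucket_name(event: Dict[str, Any]) -> Optional[str]:
    """Return the bucket name from an S3 CloudTrail event, or None if absent."""
    detail = event.get("detail") or {}
    if detail.get("eventSource") != "s3.amazonaws.com":
        return None
    params = detail.get("requestParameters") or {}
    return params.get("bucketName")


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Hypothetical Lambda entry point: identify the resource, then hand off."""
    bucket = extract_bucket_name(event)
    if bucket is None:
        return {"status": "ignored"}
    # A real policy would inspect and remediate the bucket at this point.
    return {
        "status": "matched",
        "bucket": bucket,
        "eventName": event["detail"].get("eventName"),
    }


# Trimmed sample of an "AWS API Call via CloudTrail" event (illustrative values).
SAMPLE_EVENT = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "s3.amazonaws.com",
        "eventName": "CreateBucket",
        "requestParameters": {"bucketName": "example-bucket"},
    },
}

if __name__ == "__main__":
    print(handler(SAMPLE_EVENT))
```

Keeping the parsing in its own function makes it easy to unit test each policy against recorded events without touching AWS.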
The key design principle is one Lambda per policy. Each function is small, focused, and independently deployable. A failure in one policy never affects the others. The stack has an EnforceMode parameter that controls all 14 policies simultaneously. Set it to notify during rollout, flip to enforce when confident.
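One way the notify/enforce switch could be wired into each function is sketched below. It assumes the CloudFormation parameter reaches the Lambda as an environment variable named `ENFORCE_MODE`. That variable name and the helper functions are hypothetical, chosen for illustration only.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the mode from the environment, falling back to the safe 'notify'."""
    env = os.environ if env is None else env
    mode = env.get("ENFORCE_MODE", "notify").strip().lower()
    return mode if mode in ("notify", "enforce") else "notify"


def apply_policy(
    finding: str,
    remediate: Callable[[], None],
    notify: Callable[[str], None],
    mode: str,
) -> str:
    """Remediate only in enforce mode; always send a notification."""
    if mode == "enforce":
        remediate()
        notify(f"REMEDIATED: {finding}")
        return "remediated"
    notify(f"DETECTED (notify only): {finding}")
    return "notified"


if __name__ == "__main__":
    result = apply_policy(
        "security group sg-123 allows SSH from 0.0.0.0/0",
        remediate=lambda: print("revoking rule"),
        notify=print,
        mode=resolve_mode(),
    )
    print(result)
```

Defaulting to `notify` on a missing or unrecognized value means a configuration mistake degrades to alerting rather than to unexpected changes.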
The 14 policies
Nine policies auto-remediate immediately. Five are notify-only because the fix either requires human judgment, cannot be applied to a running resource, or needs root console access.
| Policy | Trigger | Mode |
|---|---|---|
| 🪣 S3 public block | Any S3 API call | Enforce |
| 📦 S3 versioning | Any S3 API call | Enforce |
| 🔒 S3 encryption | Any S3 API call | Enforce |
| 👤 IAM admin policy | AttachUserPolicy | Enforce |
| 🔑 IAM unused credentials | IAM API call | Notify |
| 🛡 EC2 open security groups | AuthorizeSecurityGroupIngress | Enforce |
| 💾 EBS encryption | CreateVolume | Notify |
| 📋 CloudTrail monitoring | CloudTrail API call | Notify |
| 🌐 Root MFA | IAM API call | Notify |
| λ Lambda public access | AddPermission | Enforce |
| 🐳 ECR scan-on-push | CreateRepository | Enforce |
| 🔀 VPC flow logs | CreateVpc | Enforce |
| 🔐 Secrets rotation | Secrets Manager API call | Notify |
| 👁 GuardDuty activation | GuardDuty API call | Enforce |
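
To make one of the enforce policies concrete, here is a minimal sketch of how the EC2 open security group check might work. It assumes the function is handed the group's `IpPermissions` list in the shape returned by the EC2 API, and that `ec2_client` is a boto3 EC2 client (or any object exposing `revoke_security_group_ingress`). The function names and the choice of ports 22 and 3389 are assumptions for this example.

```python
from typing import Any, Dict, List

SENSITIVE_PORTS = (22, 3389)  # SSH and RDP
OPEN_V4 = "0.0.0.0/0"
OPEN_V6 = "::/0"


def open_admin_rules(ip_permissions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the world-open parts of rules that cover SSH or RDP."""
    offending = []
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":  # all protocols, all ports
            covers = True
        elif proto in ("tcp", "6"):
            low, high = perm.get("FromPort"), perm.get("ToPort")
            covers = (
                low is not None
                and high is not None
                and any(low <= port <= high for port in SENSITIVE_PORTS)
            )
        else:
            covers = False
        if not covers:
            continue
        v4 = [r for r in perm.get("IpRanges", []) if r.get("CidrIp") == OPEN_V4]
        v6 = [r for r in perm.get("Ipv6Ranges", []) if r.get("CidrIpv6") == OPEN_V6]
        if not (v4 or v6):
            continue
        rule: Dict[str, Any] = {"IpProtocol": proto}
        if proto != "-1":
            rule["FromPort"] = perm["FromPort"]
            rule["ToPort"] = perm["ToPort"]
        if v4:
            rule["IpRanges"] = [{"CidrIp": OPEN_V4}]
        if v6:
            rule["Ipv6Ranges"] = [{"CidrIpv6": OPEN_V6}]
        offending.append(rule)
    return offending


def remediate(ec2_client: Any, group_id: str, ip_permissions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Revoke the world-open admin rules and return what was revoked."""
    rules = open_admin_rules(ip_permissions)
    if rules:
        ec2_client.revoke_security_group_ingress(GroupId=group_id, IpPermissions=rules)
    return rules
```

Only the world-open CIDR entries are revoked, so a rule that also allows SSH from a corporate range keeps that narrower access.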
Testing it live
The test was run entirely from AWS CloudShell with no local setup. Vulnerable resources were created intentionally and the before state was captured. Ninety seconds later, the after state confirmed that every policy had fired.
An S3 bucket was created with its public access block removed, versioning suspended, and encryption deleted. A security group was created with SSH and RDP open to 0.0.0.0/0. An IAM user was granted AdministratorAccess. An ECR repository was created without scan-on-push. A VPC was created with no flow logs. All five misconfigured resources existed simultaneously for less than 90 seconds before every one was automatically corrected.
Results
The 90-second window is CloudTrail delivery latency, not Lambda execution time. The Lambda itself runs in under 50 milliseconds. Every remediation also generates a CloudWatch log entry and an SNS email alert, giving a complete audit trail of every action the system took.
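The audit trail could be produced along these lines. This is a sketch, assuming `sns_client` is a boto3 SNS client (or anything with a compatible `publish` method) and that log records emitted inside Lambda end up in CloudWatch Logs. The alert fields, function names, and logger name are illustrative rather than taken from the deployed stack.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, Optional

logger = logging.getLogger("remediation")


def build_alert(policy: str, resource_id: str, action: str,
                when: Optional[datetime] = None) -> Dict[str, str]:
    """Assemble a structured record of what the policy found and did."""
    when = when or datetime.now(timezone.utc)
    return {
        "policy": policy,
        "resource": resource_id,
        "action": action,
        "timestamp": when.isoformat(),
    }


def record_and_alert(sns_client: Any, topic_arn: str, alert: Dict[str, str]) -> str:
    """Log the alert, publish it to SNS, and return the subject line used."""
    body = json.dumps(alert, sort_keys=True)
    logger.info(body)
    # SNS email subjects are limited to 100 characters, so truncate defensively.
    subject = f"[{alert['action']}] {alert['policy']}: {alert['resource']}"[:100]
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=body)
    return subject
```

Logging the same JSON body that goes to SNS keeps the email and the log entry easy to correlate later.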
What is Cloud Custodian?
Cloud Custodian is an open source cloud security and governance tool built by Capital One and donated to the Cloud Native Computing Foundation. It lets you define security and compliance policies in simple YAML files and execute them against your cloud resources. Think of it as a rules engine for your AWS account: you write what you want to be true, and Custodian enforces it.
Without Custodian, enforcing security rules at scale means writing individual Lambda functions, wiring EventBridge rules by hand, building your own logging and alerting, and maintaining all of it separately. With Custodian, you write one YAML policy and a single custodian run command handles everything: it packages the Lambda, creates the EventBridge rule, wires the target, and sets up CloudWatch logging automatically.
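To show the general shape of such a policy, the sketch below builds one as a Python dictionary that mirrors the YAML structure (a top-level `policies` list, each entry with `name`, `resource`, `mode`, `filters`, and `actions`). The policy name and role ARN are placeholders. The `check-public-block` filter and `set-public-block` action are given as I understand Custodian's S3 resource, so treat them as assumptions and confirm them with `custodian schema aws.s3` before relying on them.

```python
import json

# Mirrors the structure of a Custodian YAML policy file; values are illustrative.
policy_file = {
    "policies": [
        {
            "name": "s3-public-block-enforce",  # hypothetical policy name
            "resource": "aws.s3",
            "mode": {
                # Event-driven mode: run when CloudTrail records the listed events.
                "type": "cloudtrail",
                # Placeholder role ARN; substitute the role the Lambda should assume.
                "role": "arn:aws:iam::123456789012:role/custodian-remediation",
                "events": ["CreateBucket"],
            },
            # Assumed semantics: match buckets with any public-block setting off.
            "filters": [{"type": "check-public-block"}],
            # Assumed semantics: turn every public-block setting on.
            "actions": [{"type": "set-public-block"}],
        }
    ]
}

if __name__ == "__main__":
    # Printed as JSON only so the structure can be inspected without extra dependencies.
    print(json.dumps(policy_file, indent=2))
```

Running `custodian validate` against the equivalent YAML file is a quick way to catch a misspelled filter or action before deploying anything.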
Why it matters in an enterprise: Most mature cloud security programs have dozens or hundreds of policies running across multiple accounts and regions. Writing and maintaining individual Lambda functions for each would be unmanageable. Custodian gives you a consistent policy language that works across AWS, Azure, and GCP, with built-in dry-run mode, output reporting, and a library of over 300 supported resource types out of the box.
In this post we used Custodian to author the policies and test them in CLI mode, then deployed the same logic as standalone Lambda functions via CloudFormation for production use. The two approaches are complementary: Custodian for authoring and iteration, CloudFormation for repeatable infrastructure deployment.
Glossary
If you are new to AWS security or cloud infrastructure, here is a plain-language breakdown of every service and term used in this post.
| Term | What it means |
|---|---|
| Cloud Custodian | Open source policy engine for cloud security. Write rules in YAML, Custodian enforces them. Supports AWS, Azure, and GCP. |
| CloudTrail | AWS service that logs every API call made in your account. Who did what, when, from where. Required for event-driven remediation because it is the source of truth for all AWS activity. |
| EventBridge | AWS event routing service. Receives events from CloudTrail, matches them against rules you define, and triggers actions like invoking a Lambda function. The traffic cop of the pipeline. |
| Lambda | AWS serverless compute. Code that runs in response to an event with no server to manage. Each remediation policy in this post is a Lambda function that runs in under 50 milliseconds and costs fractions of a cent per execution. |
| CloudFormation | AWS infrastructure-as-code service. Define your entire infrastructure in a YAML or JSON template and deploy it with one command. Used in this post to create all 14 Lambdas, 14 EventBridge rules, the IAM role, and the SNS topic in a single deployment. |
| SNS | Simple Notification Service. AWS messaging service used here to send email alerts every time a policy detects or remediates a misconfiguration. |
| IAM | Identity and Access Management. Controls who and what can do what in your AWS account. The IAM role in this post gives the Lambda functions exactly the permissions they need and nothing more. |
| S3 | Simple Storage Service. AWS object storage. One of the most commonly misconfigured services: public buckets, missing encryption, and disabled versioning are among the top findings in every AWS security assessment. |
| Security Group | Virtual firewall for EC2 instances and other AWS resources. Controls inbound and outbound network traffic. An open security group with port 22 or 3389 exposed to 0.0.0.0/0 means anyone on the internet can attempt to connect. |
| GuardDuty | AWS threat detection service. Uses machine learning to identify malicious activity and unauthorized behavior in your account. Disabling it is a known attacker technique to prevent detection during a breach. |
| ECR | Elastic Container Registry. AWS service for storing Docker container images. Scan-on-push means every image is scanned for known CVEs as soon as it is pushed, so vulnerabilities can be caught before they reach production. |
| VPC Flow Logs | Network traffic logs for your Virtual Private Cloud. Captures accepted and rejected connections. Essential for forensics and incident response: without them you cannot trace what happened on your network during a security event. |
| EBS | Elastic Block Store. Persistent disk storage attached to EC2 instances. Unencrypted EBS volumes expose data at rest if the underlying hardware is ever accessed outside AWS. |
| Secrets Manager | AWS service for storing and rotating secrets like API keys, database passwords, and certificates. Secrets that never rotate are a major risk: a compromised credential stays valid indefinitely. |
| EnforceMode | The CloudFormation parameter in this stack that controls all 14 policies at once. Set to notify to alert only, set to enforce to auto-remediate. One parameter change redeploys the entire stack. |
| Misconfiguration | A cloud resource that is not set up according to security best practices. Not a hack or an exploit, just a setting that was left wrong. The leading cause of cloud data breaches, responsible for more incidents than malware or zero-days. |
| Auto-remediation | Automatically fixing a security misconfiguration without human involvement. As opposed to alerting, which tells a human there is a problem and waits for them to fix it. The difference between a 90-second resolution time and a 197-day one. |
The bottom line
The 197-day detection gap is not a people problem. It is an architecture problem. Security teams reviewing weekly reports will always lose the race against misconfiguration at cloud scale. Event-driven auto-remediation removes that race entirely.