Automated GCP Infrastructure Auditing with gcloud MCP
Learn how to automatically detect security vulnerabilities and configuration issues in GCP infrastructure using AI agents and gcloud MCP.
Overview
Cloud infrastructure management grows increasingly complex over time: dozens of services, hundreds of resources, and constantly changing configurations. Infrastructure administrators must battle security vulnerabilities, cost waste, and configuration errors daily, yet manual inspections are time-consuming and prone to missing critical issues.
To solve this problem, I built an automated infrastructure auditing system using gcloud MCP (Model Context Protocol) and AI agents. Through a parallel agent architecture, it simultaneously analyzes 16 GCP services, automatically identifying security risks, cost optimization opportunities, and operational issues.
Problem Background
Infrastructure Administrator Challenges
Common problems faced by infrastructure administrators in production environments:
- Service sprawl: Mix of Cloud Run, Cloud Functions, App Engine, Compute Engine, and more
- Security blind spots: API keys exposed in environment variables, overly permissive firewall rules
- Cost leakage: Unused resources, over-provisioned instances
- Technical debt: End-of-life OS, deprecated runtime versions
The traditional approach involves checking each service individually. But this has limitations:
# Traditional approach: sequential service-by-service inspection
gcloud compute instances list
gcloud run services list
gcloud functions list
gcloud sql instances list
# ... manually run dozens of commands
Using this method to inspect the entire infrastructure can take hours or even days.
Limitations of Existing Tools
Google Cloud’s Security Command Center and Cloud Asset Inventory are excellent tools, but they have gaps:
- Detection relies on static, rule-based checks
- Cross-service relationship analysis is limited
- Findings are difficult to prioritize by business context
- Immediate remediation suggestions are lacking
Solution Approach
Introducing gcloud MCP
MCP (Model Context Protocol) is a protocol that enables AI models to interact with external tools. gcloud MCP wraps the Google Cloud CLI as an MCP server, allowing AI agents to directly query and manage GCP resources.
Key advantages:
- Natural language interface: Query using natural language instead of complex gcloud commands
- Context awareness: AI understands and analyzes relationships between resources
- Automated reports: Generates structured analysis results and improvement recommendations
Parallel Agent Architecture
Instead of a single agent sequentially checking all services, I applied a parallel sub-agent pattern:
flowchart TB
subgraph Orchestration["Orchestration Layer"]
Main["Main Orchestrator"]
end
subgraph Agents["Parallel Analysis Agents"]
A1["Agent 1<br/>Compute Engine"]
A2["Agent 2<br/>Cloud Run"]
A3["Agent 3<br/>Cloud Functions"]
A4["..."]
A16["Agent 16<br/>App Engine"]
end
subgraph Analysis["Analysis Layer"]
Expert["Infrastructure<br/>Expert Agent"]
end
subgraph Output["Output"]
Report["Final Report"]
end
Main --> A1 & A2 & A3 & A4 & A16
A1 & A2 & A3 & A4 & A16 --> Expert
Expert --> Report
Each sub-agent independently analyzes a specific service:
| Agent | Service | Analysis Items |
|---|---|---|
| Agent 1 | Compute Engine | VM status, OS version, snapshots |
| Agent 2 | Cloud Run | Service config, env vars, scaling |
| Agent 3 | Cloud Functions | Runtime, triggers, secrets |
| Agent 4 | Cloud SQL | DB version, backups, security |
| … | … | … |
| Agent 16 | App Engine | Version management, domains, resources |
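The assignment table above maps naturally onto a small registry that an orchestrator could iterate over. The following is a minimal sketch rather than the author's actual code: the `AgentSpec` class and `spec_for` helper are hypothetical names, and only the rows spelled out in the table are filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """One sub-agent assignment (hypothetical structure for illustration)."""
    agent_id: int
    service: str
    analysis_items: tuple[str, ...]


# Rows copied from the table above; agents 5-15 are elided in the source
# and are therefore left out here as well.
AGENT_SPECS = [
    AgentSpec(1, "Compute Engine", ("VM status", "OS version", "snapshots")),
    AgentSpec(2, "Cloud Run", ("Service config", "env vars", "scaling")),
    AgentSpec(3, "Cloud Functions", ("Runtime", "triggers", "secrets")),
    AgentSpec(4, "Cloud SQL", ("DB version", "backups", "security")),
    AgentSpec(16, "App Engine", ("Version management", "domains", "resources")),
]


def spec_for(service: str) -> AgentSpec:
    """Look up the agent responsible for a service (case-insensitive)."""
    for spec in AGENT_SPECS:
        if spec.service.lower() == service.lower():
            return spec
    raise KeyError(service)
```

Keeping the assignments in data rather than in code makes it easy to add or remove a service without touching the orchestration logic.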
Implementation Steps
Step 1: Setting Up gcloud MCP
First, configure the gcloud MCP server. It can be used with Claude Desktop or other MCP-compatible clients:
{
  "mcpServers": {
    "gcloud": {
      "command": "npx",
      "args": ["-y", "@google-cloud/gcloud-mcp"],
      "env": {
        "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/credentials.json"
      }
    }
  }
}
Step 2: Define Service-Specific Analysis Agents
Create specialized analysis prompts for each GCP service:
# Compute Engine Analysis Agent
## Goal
Analyze all Compute Engine resources in the project and identify security and operational issues.
## Analysis Items
1. VM instance list and status
2. Machine types and resource allocation
3. OS image versions (EOL status)
4. Disk and snapshot configuration
5. Network interfaces and firewall rules
6. Metadata (SSH keys, startup scripts, etc.)
## Output Format
- Resource summary table
- List of discovered issues (by severity)
- Recommended actions
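Because every service prompt follows the same Goal / Analysis Items / Output Format layout, the prompts can be generated from a template instead of being written by hand. This is a small sketch under that assumption; `build_agent_prompt` is a hypothetical helper, not part of gcloud MCP.

```python
def build_agent_prompt(service: str, analysis_items: list[str]) -> str:
    """Render a service-specific prompt in the layout shown above."""
    lines = [
        f"# {service} Analysis Agent",
        "## Goal",
        f"Analyze all {service} resources in the project and identify "
        "security and operational issues.",
        "## Analysis Items",
    ]
    # Number the analysis items starting from 1, as in the example prompt.
    lines += [f"{i}. {item}" for i, item in enumerate(analysis_items, start=1)]
    lines += [
        "## Output Format",
        "- Resource summary table",
        "- List of discovered issues (by severity)",
        "- Recommended actions",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_agent_prompt("Cloud Run", ["Service config", "Env vars", "Scaling"]))
```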
Step 3: Parallel Execution Orchestration
The main orchestrator runs all sub-agents simultaneously:
# Conceptual code example: Agent, the *_prompt variables, and
# aggregate_results are placeholders to be defined elsewhere.
import asyncio

async def run_infrastructure_audit():
    agents = [
        Agent("compute-engine", compute_prompt),
        Agent("cloud-run", cloud_run_prompt),
        Agent("cloud-functions", functions_prompt),
        # ... 16 agents
    ]
    # Parallel execution: gather() runs every analyze() coroutine concurrently
    results = await asyncio.gather(*[
        agent.analyze() for agent in agents
    ])
    # Aggregate results
    return aggregate_results(results)
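One practical concern with a plain `asyncio.gather` is that a single failing agent aborts the whole audit, and launching every agent at once may hit API rate limits. The sketch below shows one way to address both, assuming a stand-in `run_agent` coroutine (hypothetical; it sleeps instead of querying GCP) and an arbitrary concurrency cap of 4.

```python
import asyncio


async def run_agent(name: str, delay: float, fail: bool = False) -> dict:
    """Stand-in for a real sub-agent call (hypothetical); sleeps instead of querying GCP."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name}: simulated failure")
    return {"service": name, "issues": []}


async def run_all(specs: list[tuple[str, float, bool]], max_concurrency: int = 4) -> dict:
    """Run all agents concurrently, capping parallelism and isolating failures."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(name: str, delay: float, fail: bool) -> dict:
        async with semaphore:  # at most max_concurrency agents run at a time
            return await run_agent(name, delay, fail)

    # return_exceptions=True turns a raised exception into a result value,
    # so one broken agent does not cancel the others.
    outcomes = await asyncio.gather(
        *(guarded(*spec) for spec in specs), return_exceptions=True
    )
    report = {"results": [], "failed": []}
    for (name, _, _), outcome in zip(specs, outcomes):
        if isinstance(outcome, Exception):
            report["failed"].append({"service": name, "error": str(outcome)})
        else:
            report["results"].append(outcome)
    return report
```

Recording failed agents separately lets the final report state which services were not audited instead of silently omitting them.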
Step 4: Result Aggregation and Report Generation
The infrastructure expert agent synthesizes all results to generate a prioritized report:
# Risk Assessment Criteria
## Critical (Immediate Action Required)
- Credentials exposed to the internet
- Fully open firewall rules
- End-of-life OS
## High (Action Within 1 Week)
- API keys in environment variables
- Deletion protection not enabled
- Databases without backups
## Medium (Action Within 1 Month)
- Deprecated runtime versions
- Unused resources
- Inadequate labeling
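The criteria above can be encoded so that aggregated findings are ordered consistently. This is a minimal sketch: the `Finding` class and `prioritize` function are hypothetical, and "within 1 month" is approximated as 30 days.

```python
from dataclasses import dataclass

# Lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2}
# Action deadlines in days, from the criteria above (0 = immediate;
# "1 month" is approximated as 30 days).
DEADLINE_DAYS = {"critical": 0, "high": 7, "medium": 30}


@dataclass
class Finding:
    service: str
    title: str
    severity: str  # one of the SEVERITY_RANK keys


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Sort findings by severity first, then by service and title for stable output."""
    return sorted(findings, key=lambda f: (SEVERITY_RANK[f.severity], f.service, f.title))


if __name__ == "__main__":
    sample = [
        Finding("Cloud Functions", "Deprecated runtime version", "medium"),
        Finding("VPC", "Fully open firewall rule", "critical"),
        Finding("Cloud SQL", "Database without backups", "high"),
    ]
    for f in prioritize(sample):
        print(f"[{f.severity}] {f.service}: {f.title} (act within {DEADLINE_DAYS[f.severity]} days)")
```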
Real-world Examples
Sample Analysis Results
Running the parallel agent system produces reports like this:
Infrastructure Overview
| Category | Service | Resource Count | Status |
|---|---|---|---|
| Compute | Compute Engine VM | 1 | Attention needed |
| Compute | Cloud Run services | 23 | Security review needed |
| Compute | Cloud Functions | 54 | Runtime upgrade needed |
| Database | Cloud SQL | 2 | 1 inactive |
| Storage | Cloud Storage | 27 | 15 with security gaps |
| Networking | VPC | 2 | Firewall review needed |
Major Issues Discovered
Security Vulnerabilities (Critical)
- API Keys Exposed in Environment Variables
  - Location: Multiple Cloud Run/Functions services
  - Risk: Service abuse if credentials are stolen
  - Action: Migrate to Secret Manager immediately
- RDP Port Fully Open
  - Location: default VPC firewall rule
  - Risk: Exposure to brute force attacks
  - Action: Restrict to specific IP ranges
- End-of-Life OS
  - Location: cdp-sftp-prod VM (CentOS 7)
  - Risk: No security patches
  - Action: Migrate to Rocky Linux or Ubuntu LTS
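A finding such as the exposed API keys can also be cross-checked deterministically. The sketch below assumes the JSON layout that `gcloud run services describe --format=json` typically returns (`spec.template.spec.containers[].env[]`) and uses a simple name-based heuristic, so it is illustrative and will miss secrets stored under unremarkable names. `find_plaintext_secrets` is a hypothetical helper.

```python
import re

# Heuristic only: flags variable names that look like credentials.
SUSPICIOUS_NAME = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD)", re.IGNORECASE)


def find_plaintext_secrets(service: dict) -> list[str]:
    """Return names of env vars that look like secrets but hold literal values."""
    flagged = []
    containers = (
        service.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for container in containers:
        for env in container.get("env", []):
            # Variables backed by Secret Manager carry "valueFrom" rather
            # than a literal "value", so only literal values are flagged.
            if "value" in env and SUSPICIOUS_NAME.search(env.get("name", "")):
                flagged.append(env["name"])
    return flagged
```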
Cost Optimization Opportunities
- Stopped MySQL Instance: Only incurring storage costs
- 80+ App Engine Versions: Unused versions need cleanup
- Empty BigQuery Datasets: 10 datasets can be deleted
Auto-Generated Mermaid Diagrams
The system also automatically generates Mermaid diagrams to visualize infrastructure:
graph TB
subgraph Internet
User[Users]
end
subgraph GCP["Google Cloud Platform"]
subgraph Compute["Compute Services"]
AE[App Engine]
CR[Cloud Run x23]
CF[Cloud Functions x54]
VM[Compute Engine]
end
subgraph Data["Data Services"]
SQL[(Cloud SQL)]
BQ[(BigQuery)]
FS[(Firestore)]
GCS[(Cloud Storage)]
end
subgraph Messaging["Messaging"]
PS[Pub/Sub]
SCH[Cloud Scheduler]
end
end
User --> AE
AE --> SQL
CR --> SQL
CF --> PS
PS --> CR
SCH --> CF
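Generating such a diagram is mostly string assembly once the resource relationships are known. A minimal sketch, assuming the relationships have already been extracted into node and edge lists (`to_mermaid` is a hypothetical helper, and subgraphs are omitted for brevity):

```python
def to_mermaid(nodes: dict[str, str], edges: list[tuple[str, str]]) -> str:
    """Render nodes (id -> label) and directed edges as a Mermaid graph."""
    lines = ["graph TB"]
    for node_id, label in nodes.items():
        lines.append(f"    {node_id}[{label}]")
    for src, dst in edges:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(
        {"CF": "Cloud Functions x54", "PS": "Pub/Sub", "CR": "Cloud Run x23"},
        [("CF", "PS"), ("PS", "CR")],
    ))
```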
Automating Regular Scans
The Need for Periodic Auditing
Infrastructure changes daily. New services are deployed, configurations change, and new vulnerabilities are discovered. One-time audits are not enough.
Automation with Cloud Scheduler
Regular infrastructure audits can be automated:
# Weekly infrastructure audit schedule
schedule: "0 9 * * 1" # Every Monday at 9 AM
target:
  type: cloud-function
  function: infrastructure-audit-trigger
notification:
  - email: infra-team@company.com
  - slack: "#infra-alerts"
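To sanity-check what the cron expression `0 9 * * 1` means, the next run time can be computed by hand. The sketch below handles only this weekly pattern, not general cron syntax, and ignores time zones; `next_weekly_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 9) -> datetime:
    """Next occurrence of the given weekday and hour strictly after `now`.

    `weekday` follows Python's convention (Monday == 0), whereas cron
    writes Monday as 1.
    """
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the requested weekday within the current 7-day window.
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed; use next week's.
        candidate += timedelta(days=7)
    return candidate
```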
Change Tracking and Trend Analysis
By storing periodic scan results:
- Track security posture changes over time
- Identify newly emerged and resolved issues
- Analyze infrastructure growth trends
- Maintain compliance audit history
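Identifying new and resolved issues comes down to a set difference between two scans. A minimal sketch, assuming each finding is reduced to a stable string identifier (the identifier format shown is hypothetical):

```python
def diff_scans(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Compare two scans and classify findings as new, resolved, or persisting."""
    return {
        "new": sorted(current - previous),
        "resolved": sorted(previous - current),
        "persisting": sorted(current & previous),
    }


if __name__ == "__main__":
    last_week = {"vpc/default/open-rdp", "cloud-run/my-service/plaintext-api-key"}
    this_week = {"cloud-run/my-service/plaintext-api-key", "compute/cdp-sftp-prod/eol-os"}
    print(diff_scans(last_week, this_week))
```

The comparison only works if identifiers stay the same between runs, so they should be derived from resource names and rule types rather than from free-form agent output.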
Immediate Remediation
Another strength of gcloud MCP is the ability to immediately fix discovered issues.
Example: Secret Manager Migration
Migrating API keys exposed in environment variables to Secret Manager:
# 1. Create secret
gcloud secrets create openai-api-key --replication-policy="automatic"
# 2. Set secret value
echo -n "sk-xxx..." | gcloud secrets versions add openai-api-key --data-file=-
# 3. Update Cloud Run service
gcloud run services update my-service \
--update-secrets=OPENAI_API_KEY=openai-api-key:latest
AI agents can automatically generate these remediation commands and execute them after approval.
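The approval step can be made explicit in code: commands are built as data, shown to a human, and executed only once approved. This is a sketch under stated assumptions; `build_secret_migration` and `execute_if_approved` are hypothetical helpers, and the step that writes the secret value is deliberately left out so the key never appears in a generated plan.

```python
import shlex
import subprocess


def build_secret_migration(service: str, env_var: str, secret: str) -> list[list[str]]:
    """Build the gcloud commands from the example above as argument lists."""
    return [
        ["gcloud", "secrets", "create", secret, "--replication-policy=automatic"],
        ["gcloud", "run", "services", "update", service,
         f"--update-secrets={env_var}={secret}:latest"],
    ]


def execute_if_approved(commands: list[list[str]], approved: bool,
                        runner=subprocess.run) -> list[str]:
    """Return the rendered commands; run them only when approved is True."""
    rendered = [shlex.join(cmd) for cmd in commands]
    if not approved:
        return rendered  # dry run: nothing is executed
    for cmd in commands:
        # check=True stops at the first failing command.
        runner(cmd, check=True)
    return rendered
```

Passing `runner` as a parameter keeps the function testable without invoking gcloud, and a dry run (`approved=False`) produces exactly the text a reviewer would be asked to approve.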
Example: Firewall Rule Hardening
# Delete dangerous RDP rule
gcloud compute firewall-rules delete allow-rdp-all
# Create new rule allowing only specific IPs
gcloud compute firewall-rules create allow-rdp-office \
--allow tcp:3389 \
--source-ranges="203.0.113.0/24" \
--target-tags="windows-server"
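After hardening, it is worth re-checking that no rule still exposes the port to the whole internet. The sketch below assumes the field names that `gcloud compute firewall-rules list --format=json` typically emits (`direction`, `disabled`, `sourceRanges`, `allowed[].IPProtocol`, `allowed[].ports`) and checks IPv4 only; `is_world_open` is a hypothetical helper.

```python
def is_world_open(rule: dict, port: int) -> bool:
    """True if an enabled ingress rule allows TCP `port` from 0.0.0.0/0."""
    if rule.get("direction", "INGRESS") != "INGRESS" or rule.get("disabled", False):
        return False
    if "0.0.0.0/0" not in rule.get("sourceRanges", []):
        return False
    for allowed in rule.get("allowed", []):
        if allowed.get("IPProtocol") not in ("tcp", "all"):
            continue
        ports = allowed.get("ports")
        if not ports:
            # No port list means every port of that protocol is allowed.
            return True
        for spec in ports:
            # Port specs are single ports ("3389") or ranges ("3000-4000").
            low, _, high = spec.partition("-")
            if int(low) <= port <= int(high or low):
                return True
    return False
```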
Conclusion
Combining gcloud MCP with parallel agent architecture enables:
- Time savings: Complete in minutes the audits that took days by hand
- Consistency: Repeatable inspections with the same criteria
- Comprehensiveness: Analyze cross-service relationships
- Immediate action: Auto-generate remediation commands for discovered issues
Infrastructure administrators are freed from repetitive inspection tasks and can focus on more important architectural decisions and strategic work.
Next Steps
- Install gcloud MCP: Start from the GitHub repository
- Customize analysis agents: Adjust to your organization’s security policies and compliance requirements
- Set up regular scans: Configure weekly/monthly automated audits with Cloud Scheduler
- Integrate notifications: Connect with Slack, Email, PagerDuty for immediate response
Start the new paradigm of cloud infrastructure management with AI agents.