Project Overview
The GKE Intelligent Monitoring System is an advanced monitoring solution I developed for the GKE Turns 10 Hackathon that bridges traditional Kubernetes management with AI-driven intelligence. By combining Google's Agent Development Kit (ADK) with the Model Context Protocol (MCP), I created a system that doesn't just monitor clusters—it understands them.
This isn't another metrics dashboard. It's an intelligent assistant that analyzes cluster health, predicts issues before they escalate, and provides actionable remediation—all through natural language interaction.
Why I Built This
Managing Kubernetes clusters at scale is complex. DevOps teams spend countless hours monitoring dashboards, analyzing logs, and troubleshooting issues. I experienced this firsthand and realized that while we have powerful monitoring tools, they lack intelligence—they show you what's happening, but not why or what to do about it.
I wanted to create something different: a system that combines real-time monitoring with AI-driven insights to provide:
- Proactive problem detection before issues impact users
- Intelligent troubleshooting that understands context
- Automated remediation for common failure patterns
- Natural language interaction instead of complex CLI commands
Technical Architecture
System Design
I architected the solution with three interconnected layers:
1. ADK Agent Layer - The AI brain that interprets requests, makes decisions, and orchestrates actions
2. MCP Server Layer - The tooling framework that provides Kubernetes management capabilities
3. GKE Integration Layer - Direct interface with cluster APIs and Google Cloud services
This separation allows the AI to focus on intelligence while the MCP layer handles the technical execution safely and efficiently.
┌─────────────────┐
│ User/Client │
└────────┬────────┘
│
┌────▼────────┐
│ ADK Agent │ ◄─── AI Decision Making
│ (FastAPI) │
└────┬────────┘
│
┌────▼────────┐
│ MCP Server │ ◄─── Tool Orchestration
│ + K8s Tools│
└────┬────────┘
│
┌────▼────────┐
│ GKE Cluster │ ◄─── Kubernetes API
│ Resources │
└─────────────┘
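The layering above can be sketched in a few lines of plain Python. This is a hypothetical mock of the flow (the class names and hard-coded routing are illustrative; the real system uses ADK and FastMCP), but it shows why the separation matters: the agent only ever sees named tools, never the Kubernetes API itself.

```python
# Minimal sketch of the three-layer flow (illustrative names, not the real ADK/MCP APIs).
class GKEIntegration:
    """Bottom layer: talks to the Kubernetes API (stubbed here)."""
    def list_pods(self, namespace: str) -> list:
        return ["checkout-7f9c", "auth-5d2b"]  # stand-in for a real API call

class MCPServerLayer:
    """Middle layer: exposes vetted tools; the AI never touches the API directly."""
    def __init__(self, k8s: GKEIntegration):
        self.tools = {"list_pods": k8s.list_pods}

    def call_tool(self, name: str, **kwargs):
        if name not in self.tools:
            raise ValueError(f"unknown tool: {name}")
        return self.tools[name](**kwargs)

class ADKAgentLayer:
    """Top layer: decides which tool to call for a user request."""
    def __init__(self, mcp: MCPServerLayer):
        self.mcp = mcp

    def handle(self, request: str):
        # A real agent lets the LLM pick the tool; this sketch hard-codes one route.
        if "pods" in request:
            return self.mcp.call_tool("list_pods", namespace="default")
        return "I can only list pods in this sketch."

agent = ADKAgentLayer(MCPServerLayer(GKEIntegration()))
print(agent.handle("show me the pods"))
```

Swapping any layer (a different cluster backend, a different model) only touches that layer's class.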
Technology Stack
- AI Layer: Google ADK with Gemini 2.0 Flash via Vertex AI
- Protocol: Model Context Protocol (MCP) with FastMCP
- Backend: Python 3.11 with async/await patterns
- Kubernetes: Official Python client for K8s API
- Cloud Integration: Google Cloud Monitoring and Logging
- Deployment: Docker containers with Kubernetes manifests
- Security: RBAC-based access control and service accounts
Key Technical Challenges
1. AI-Kubernetes Bridge
Challenge: Creating a safe interface between AI decision-making and Kubernetes operations.
Solution: I implemented the MCP server as a controlled gateway with explicit tools. Each tool has strict input validation, RBAC permission checks, detailed error handling, and audit logging to ensure only authorized, traceable operations.
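A minimal sketch of that gateway pattern, covering the input-validation and audit-logging pieces (the decorator name, regex, and stub tool are illustrative; RBAC checks would sit alongside this in the real server):

```python
# Sketch of the "controlled gateway": every tool call is validated and audited
# before anything reaches the cluster (names here are illustrative, not the real code).
import logging
import re

audit_log = logging.getLogger("mcp.audit")

NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")  # DNS-1123 label shape

def validated_tool(fn):
    """Decorator: reject malformed Kubernetes names and audit every invocation."""
    def wrapper(name: str, namespace: str = "default", **kwargs):
        for value in (name, namespace):
            if not NAME_RE.match(value):
                audit_log.warning("rejected %s(%r, %r)", fn.__name__, name, namespace)
                raise ValueError(f"invalid Kubernetes name: {value!r}")
        audit_log.info("call %s(%r, ns=%r)", fn.__name__, name, namespace)
        return fn(name, namespace, **kwargs)
    return wrapper

@validated_tool
def get_pod_logs(name: str, namespace: str = "default") -> str:
    return f"(logs for {namespace}/{name})"  # stand-in for the real API call
```

Because every tool goes through the same wrapper, a malformed or injected name is rejected and logged before any Kubernetes client call is constructed.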
2. Context-Aware Troubleshooting
Challenge: Providing intelligent troubleshooting that understands cluster context, not just individual metrics.
Solution: Built a multi-stage analysis system:
def suggest_troubleshooting(self, pod_name: str, namespace: str = "default"):
    """AI-powered troubleshooting with contextual analysis"""
    pod_info = self._get_pod_details(pod_name, namespace)
    events = self._get_pod_events(pod_name, namespace)
    logs = self._get_recent_logs(pod_name, namespace)

    context = {
        "status": pod_info.status,
        "conditions": pod_info.conditions,
        "events": events,
        "logs": logs[-100:],  # most recent 100 lines
        "resources": pod_info.spec.containers[0].resources
    }

    analysis = self._analyze_with_ai(context)
    return {
        "diagnosis": analysis.root_cause,
        "recommendations": analysis.remediation_steps,
        "priority": analysis.severity
    }
The system correlates pod status, events, logs, and resource metrics for comprehensive diagnostics.
3. Predictive Issue Detection
Solution: Pattern recognition learning from historical data—analyzes pod restart patterns, monitors resource utilization trends, correlates deployment changes with failures, and identifies cascading failure risks.
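One of those signals, restart-pattern analysis, can be sketched as a sliding-window check over restart timestamps. This is a simplified stand-in for the real pattern recognition (the function name, window, and thresholds are illustrative and would be tuned per cluster):

```python
# Sketch of one predictive signal: flag pods whose restarts cluster inside a
# sliding window (thresholds illustrative, tuned per cluster in practice).
from datetime import datetime, timedelta

def restart_risk(restart_times: list,
                 window: timedelta = timedelta(minutes=30),
                 threshold: int = 3) -> str:
    """Return a coarse risk level from a pod's restart timestamps."""
    if not restart_times:
        return "healthy"
    cutoff = max(restart_times) - window
    recent = [t for t in restart_times if t >= cutoff]
    if len(recent) >= threshold:
        return "crashloop-risk"  # restarts bunched together: likely escalating
    if len(restart_times) >= threshold:
        return "watch"           # restarts exist but are spread out over time
    return "healthy"
```

A "crashloop-risk" pod is surfaced to the user before Kubernetes itself marks it `CrashLoopBackOff`, which is the point of the predictive layer.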
4. Natural Language Interface
Solution: Leveraged ADK's LlmAgent to translate natural language into precise Kubernetes operations:
# User: "Is my checkout service healthy?"
# ADK Agent translates to:
tools = [
    "get_deployment_status(name='checkout-service')",
    "list_pods(label_selector='app=checkout')",
    "get_service_status(name='checkout-service')",
    "get_gke_cluster_metrics()"
]
5. Real-Time Metric Collection
Solution: Asynchronous metric aggregation system:
async def get_gke_cluster_metrics(self):
    """Collect comprehensive cluster metrics"""
    tasks = [
        self._get_node_metrics(),
        self._get_pod_metrics(),
        self._get_network_metrics(),
        self._get_storage_metrics()
    ]
    results = await asyncio.gather(*tasks)
    return {
        "nodes": results[0],
        "pods": results[1],
        "network": results[2],
        "storage": results[3],
        "timestamp": datetime.utcnow()
    }
Core Features
🔍 Intelligent Monitoring – Real-time health tracking with AI-driven insights into node status, pod lifecycle, deployment health, and service endpoints
🤖 AI Troubleshooting – Automated detection and diagnosis with root cause analysis and step-by-step remediation guidance
🛠️ Automated Remediation – Smart fixes for image pull errors, failing pods, scaling, and recovery procedures with safety checks
📊 Predictive Analytics – Forecasts capacity constraints, stability issues, and performance degradation before impact
💬 Natural Language Queries – "Show me all failing pods" • "Why is auth service down?" • "Scale API deployment" • "What's consuming CPU?"
🔧 Advanced Management – Dynamic scaling, log analysis, network testing, resource optimization, YAML management
MCP Tools I Developed
I built 15+ specialized tools that the AI can use to manage clusters:
Monitoring Tools:
- `get_cluster_info` – Cluster health and node status
- `list_pods` – Pod inventory with resource usage
- `get_deployment_status` – Deployment health monitoring
- `get_service_status` – Service endpoint validation
- `get_gke_cluster_metrics` – GKE-specific performance data
Troubleshooting Tools:
- `get_pod_logs` – Log retrieval and analysis
- `describe_pod` – Detailed pod inspection
- `suggest_troubleshooting` – AI-powered diagnostics
- `automate_remediation` – Intelligent problem resolution
- `network_connectivity_test` – Network debugging
Management Tools:
- `scale_deployment` – Dynamic scaling operations
- `exec_pod_command` – Container command execution
- `delete_resource` – Safe resource deletion
- `apply_manifest` – YAML deployment
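The management tools carry the safety checks mentioned earlier. As one example, here is a hedged sketch of the kind of bound check that could sit in front of `scale_deployment` before any API call is made (the helper name and limits are hypothetical, not the actual implementation):

```python
# Sketch of a scaling guard: clamp the requested replica count to sane bounds
# before calling the Kubernetes API (limits are illustrative defaults).
def checked_replicas(current: int, target: int,
                     max_replicas: int = 20, max_step: int = 5) -> int:
    """Return a safe replica target; raise on nonsensical input."""
    if target < 0:
        raise ValueError("replica count cannot be negative")
    if target > max_replicas:
        target = max_replicas                     # hard ceiling for the cluster
    step = target - current
    if abs(step) > max_step:                      # avoid sudden large jumps
        target = current + max_step * (1 if step > 0 else -1)
    return target
```

An AI-requested jump from 3 to 50 replicas would thus be walked up in bounded steps rather than applied at once, which keeps an over-eager remediation from destabilizing the cluster.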
Implementation Highlights
ADK Agent Configuration
from google.adk.agents import LlmAgent
from mcp.client import ClientSession

# Initialize the AI agent with Kubernetes context
agent = LlmAgent(
    model="gemini-2.0-flash-exp",
    system_prompt="""You are an expert Kubernetes administrator
    with deep knowledge of GKE clusters. You help users monitor,
    troubleshoot, and manage their clusters efficiently.""",
    tools=mcp_tools
)

# Enable intelligent conversation flow
session = ClientSession(
    transport="stdio",
    server_params={
        "command": "python",
        "args": ["k8s_mcp_server.py"]
    }
)
MCP Server Setup
from mcp.server.fastmcp import FastMCP
from kubernetes import client, config

# Initialize MCP server with Kubernetes access
mcp = FastMCP("k8s-monitor")

# Load in-cluster config for GKE deployment
config.load_incluster_config()

# Register monitoring tools
@mcp.tool()
def get_cluster_info() -> dict:
    """Get comprehensive cluster information"""
    v1 = client.CoreV1Api()
    nodes = v1.list_node()
    return {
        "cluster_version": nodes.items[0].status.node_info.kubelet_version,
        "node_count": len(nodes.items),
        "nodes": [{
            "name": node.metadata.name,
            "status": node.status.conditions[-1].status,
            "capacity": node.status.capacity
        } for node in nodes.items]
    }
Deployment Architecture
# Kubernetes deployment with RBAC
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gke-monitor
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gke-monitor
  template:
    metadata:
      labels:
        app: gke-monitor   # must match the selector above
    spec:
      serviceAccountName: gke-monitor-sa
      containers:
      - name: adk-agent
        image: gcr.io/project/gke-monitor:latest
        env:
        - name: GCP_PROJECT_ID
          value: "your-project"
        - name: MCP_SERVICE_URL
          value: "http://localhost:8080"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
Performance & Scale
- Response Time: < 2s for most queries
- Concurrent Users: 100+ simultaneous requests
- Cluster Size: Tested with 1000+ pods across 50+ nodes
- Uptime: 99.9% availability with failover
- Resource Usage: ~512MB memory, 0.5 CPU per instance
Getting Started
# Clone and configure
git clone https://github.com/w3sqr/k8s-mcp-and-adk-agent
cd k8s-mcp-and-adk-agent
export GCP_PROJECT_ID="your-project-id"
export GKE_CLUSTER_NAME="your-cluster"
# Deploy
kubectl apply -f k8s-manifests/k8s-mcp-rbac.yaml
kubectl apply -f k8s-manifests/k8s-mcp-deployment.yaml
kubectl apply -f deployment.yaml
# Verify
kubectl get pods -n monitoring
kubectl port-forward svc/adk-agent 8000:8000
Security
- RBAC Configuration – Minimal-privilege service accounts
- Secret Management – Kubernetes secrets, never in code
- Audit Logging – Full context for all AI operations
- Network Policies – Component isolation
Future Enhancements
- Multi-cluster support for fleet management
- Custom ML models for anomaly detection
- Slack/Discord integration
- Cost optimization recommendations
- GitOps integration for automated PRs
Contributing
Open-source project welcoming contributions in: additional MCP tools, monitoring integrations, documentation, performance optimizations, and feature requests.
See the GitHub repository for guidelines.
Built with Google ADK, MCP, and GKE for the GKE Turns 10 Hackathon
