GKE Intelligent Monitoring System

An AI-powered monitoring solution I built for Google Kubernetes Engine, combining ADK agents with MCP servers to deliver intelligent cluster management, automated troubleshooting, and predictive insights.

Samir Saqer / September 17, 2025

GKE · ADK · MCP

Project Overview

The GKE Intelligent Monitoring System is an advanced monitoring solution I developed for the GKE Turns 10 Hackathon that bridges traditional Kubernetes management with AI-driven intelligence. By combining Google's Agent Development Kit (ADK) with the Model Context Protocol (MCP), I created a system that doesn't just monitor clusters—it understands them.

This isn't another metrics dashboard. It's an intelligent assistant that analyzes cluster health, predicts issues before they escalate, and provides actionable remediation—all through natural language interaction.

GKE Intelligent Monitoring in Action

Why I Built This

Managing Kubernetes clusters at scale is complex. DevOps teams spend countless hours monitoring dashboards, analyzing logs, and troubleshooting issues. I experienced this firsthand and realized that while we have powerful monitoring tools, they lack intelligence—they show you what's happening, but not why or what to do about it.

I wanted to create something different: a system that combines real-time monitoring with AI-driven insights to provide:

  • Proactive problem detection before issues impact users
  • Intelligent troubleshooting that understands context
  • Automated remediation for common failure patterns
  • Natural language interaction instead of complex CLI commands

Technical Architecture

System Design

I architected the solution with three interconnected layers:

1. ADK Agent Layer - The AI brain that interprets requests, makes decisions, and orchestrates actions
2. MCP Server Layer - The tooling framework that provides Kubernetes management capabilities
3. GKE Integration Layer - Direct interface with cluster APIs and Google Cloud services

This separation allows the AI to focus on intelligence while the MCP layer handles the technical execution safely and efficiently.

Text
┌─────────────────┐
│   User/Client   │
└────────┬────────┘
         │
    ┌────▼────────┐
    │  ADK Agent  │ ◄─── AI Decision Making
    │  (FastAPI)  │
    └────┬────────┘
         │
    ┌────▼────────┐
    │ MCP Server  │ ◄─── Tool Orchestration
    │ + K8s Tools │
    └────┬────────┘
         │
    ┌────▼────────┐
    │ GKE Cluster │ ◄─── Kubernetes API
    │  Resources  │
    └─────────────┘

Technology Stack

  • AI Layer: Google ADK with Gemini 2.0 Flash via Vertex AI
  • Protocol: Model Context Protocol (MCP) with FastMCP
  • Backend: Python 3.11 with async/await patterns
  • Kubernetes: Official Python client for K8s API
  • Cloud Integration: Google Cloud Monitoring and Logging
  • Deployment: Docker containers with Kubernetes manifests
  • Security: RBAC-based access control and service accounts

Key Technical Challenges

1. AI-Kubernetes Bridge

Challenge: Creating a safe interface between AI decision-making and Kubernetes operations.

Solution: I implemented the MCP server as a controlled gateway with explicit tools. Each tool has strict input validation, RBAC permission checks, detailed error handling, and audit logging to ensure only authorized, traceable operations.
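The validation-and-audit gateway described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the decorator name `guarded_tool`, the allow-list, and the placeholder `get_pod_status` tool are all hypothetical, standing in for the real MCP tool definitions.

```python
import logging
import re
from functools import wraps

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("k8s-mcp-audit")

# RFC 1123 label pattern: what Kubernetes accepts for pod/namespace names
_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def guarded_tool(allowed_tools):
    """Wrap a tool with strict input validation, a permission
    allow-list, and audit logging before it touches the cluster."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(name, namespace="default", **kwargs):
            # Reject anything that isn't a valid Kubernetes resource name
            for value in (name, namespace):
                if not _NAME_RE.match(value):
                    raise ValueError(f"invalid Kubernetes name: {value!r}")
            # Only explicitly allow-listed tools may run
            if fn.__name__ not in allowed_tools:
                raise PermissionError(f"tool {fn.__name__} not permitted")
            # Every call is logged with full context for traceability
            audit_log.info("tool=%s name=%s namespace=%s",
                           fn.__name__, name, namespace)
            return fn(name, namespace, **kwargs)
        return wrapper
    return decorator

@guarded_tool(allowed_tools={"get_pod_status"})
def get_pod_status(name, namespace="default"):
    # Placeholder for a real Kubernetes API read
    return {"pod": name, "namespace": namespace, "status": "Running"}
```

In the real system the wrapped body would call the Kubernetes API, and the RBAC check would defer to the service account's actual permissions rather than an in-process allow-list.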

2. Context-Aware Troubleshooting

Challenge: Providing intelligent troubleshooting that understands cluster context, not just individual metrics.

Solution: Built a multi-stage analysis system:

Python
def suggest_troubleshooting(self, pod_name: str, namespace: str = "default"):
    """AI-powered troubleshooting with contextual analysis"""
    pod_info = self._get_pod_details(pod_name, namespace)
    events = self._get_pod_events(pod_name, namespace)
    logs = self._get_recent_logs(pod_name, namespace)
    
    context = {
        "status": pod_info.status,
        "conditions": pod_info.conditions,
        "events": events,
        "logs": logs[-100:],
        "resources": pod_info.spec.containers[0].resources
    }
    
    analysis = self._analyze_with_ai(context)
    return {
        "diagnosis": analysis.root_cause,
        "recommendations": analysis.remediation_steps,
        "priority": analysis.severity
    }

The system correlates pod status, events, logs, and resource metrics for comprehensive diagnostics.

3. Predictive Issue Detection

Solution: Pattern recognition learning from historical data—analyzes pod restart patterns, monitors resource utilization trends, correlates deployment changes with failures, and identifies cascading failure risks.
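One piece of the restart-pattern analysis can be sketched as a sliding-window check over a pod's restart timestamps. The function name, window size, and threshold below are illustrative defaults, not the system's actual tuning.

```python
from datetime import datetime, timedelta

def detect_crash_loop(restart_times, window_minutes=10, threshold=3):
    """Flag a pod as crash-looping if it restarted `threshold` or
    more times within any `window_minutes`-long sliding window."""
    times = sorted(restart_times)
    window = timedelta(minutes=window_minutes)
    for i, start in enumerate(times):
        # Count restarts falling inside the window that opens at `start`
        in_window = sum(1 for t in times[i:] if t - start <= window)
        if in_window >= threshold:
            return True
    return False
```

The full system layers further signals on top of this (resource-utilization trends and deployment-change correlation) before raising an alert.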

4. Natural Language Interface

Solution: Leveraged ADK's LlmAgent to translate natural language into precise Kubernetes operations:

Python
# User: "Is my checkout service healthy?"
# ADK Agent translates to:
tools = [
    "get_deployment_status(name='checkout-service')",
    "list_pods(label_selector='app=checkout')",
    "get_service_status(name='checkout-service')",
    "get_gke_cluster_metrics()"
]

5. Real-Time Metric Collection

Solution: Asynchronous metric aggregation system:

Python
import asyncio
from datetime import datetime, timezone

async def get_gke_cluster_metrics(self):
    """Collect comprehensive cluster metrics concurrently"""
    tasks = [
        self._get_node_metrics(),
        self._get_pod_metrics(),
        self._get_network_metrics(),
        self._get_storage_metrics()
    ]
    results = await asyncio.gather(*tasks)
    return {
        "nodes": results[0],
        "pods": results[1],
        "network": results[2],
        "storage": results[3],
        # Timezone-aware UTC timestamp (datetime.utcnow is deprecated)
        "timestamp": datetime.now(timezone.utc)
    }

Core Features

🔍 Intelligent Monitoring – Real-time health tracking with AI-driven insights into node status, pod lifecycle, deployment health, and service endpoints

🤖 AI Troubleshooting – Automated detection and diagnosis with root cause analysis and step-by-step remediation guidance

🛠️ Automated Remediation – Smart fixes for image pull errors, failing pods, scaling, and recovery procedures with safety checks

📊 Predictive Analytics – Forecasts capacity constraints, stability issues, and performance degradation before impact

💬 Natural Language Queries – "Show me all failing pods" • "Why is auth service down?" • "Scale API deployment" • "What's consuming CPU?"

🔧 Advanced Management – Dynamic scaling, log analysis, network testing, resource optimization, YAML management

MCP Tools I Developed

I built 15+ specialized tools that the AI can use to manage clusters:

Monitoring Tools:

  • get_cluster_info - Cluster health and node status
  • list_pods - Pod inventory with resource usage
  • get_deployment_status - Deployment health monitoring
  • get_service_status - Service endpoint validation
  • get_gke_cluster_metrics - GKE-specific performance data

Troubleshooting Tools:

  • get_pod_logs - Log retrieval and analysis
  • describe_pod - Detailed pod inspection
  • suggest_troubleshooting - AI-powered diagnostics
  • automate_remediation - Intelligent problem resolution
  • network_connectivity_test - Network debugging

Management Tools:

  • scale_deployment - Dynamic scaling operations
  • exec_pod_command - Container command execution
  • delete_resource - Safe resource deletion
  • apply_manifest - YAML deployment
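To give a feel for the safety checks inside a management tool like scale_deployment, here is a hedged sketch of the validation that could run before any cluster mutation. The helper name `build_scale_patch` and the bounds (`max_replicas`, `max_step`) are illustrative, not the tool's actual limits; in the real tool the returned patch body would be passed to the Kubernetes client's `AppsV1Api.patch_namespaced_deployment_scale`.

```python
def build_scale_patch(replicas, current_replicas, max_replicas=20, max_step=5):
    """Validate a scaling request and build the strategic-merge patch
    body a deployment-scale call would apply."""
    if not 0 <= replicas <= max_replicas:
        raise ValueError(f"replicas must be in [0, {max_replicas}]")
    # Refuse large jumps so a misphrased query can't stampede the cluster
    if abs(replicas - current_replicas) > max_step:
        raise ValueError(f"refusing to change replicas by more than {max_step} at once")
    return {"spec": {"replicas": replicas}}
```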

Implementation Highlights

ADK Agent Configuration

Python
from google.adk.agents import LlmAgent
from mcp.client import ClientSession

# Initialize the AI agent with Kubernetes context
agent = LlmAgent(
    name="gke_monitor",
    model="gemini-2.0-flash-exp",
    instruction="""You are an expert Kubernetes administrator 
    with deep knowledge of GKE clusters. You help users monitor, 
    troubleshoot, and manage their clusters efficiently.""",
    tools=mcp_tools
)

# Enable intelligent conversation flow
session = ClientSession(
    transport="stdio",
    server_params={
        "command": "python",
        "args": ["k8s_mcp_server.py"]
    }
)

MCP Server Setup

Python
from mcp.server.fastmcp import FastMCP
from kubernetes import client, config

# Initialize MCP server with Kubernetes access
mcp = FastMCP("k8s-monitor")

# Load in-cluster config for GKE deployment
config.load_incluster_config()

# Register monitoring tools
@mcp.tool()
def get_cluster_info() -> dict:
    """Get comprehensive cluster information"""
    v1 = client.CoreV1Api()
    nodes = v1.list_node()

    # Node conditions are unordered, so look up the Ready condition explicitly
    def ready_status(node):
        return next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )

    return {
        "cluster_version": nodes.items[0].status.node_info.kubelet_version,
        "node_count": len(nodes.items),
        "nodes": [{
            "name": node.metadata.name,
            "status": ready_status(node),
            "capacity": node.status.capacity
        } for node in nodes.items]
    }

Deployment Architecture

YAML
# Kubernetes deployment with RBAC
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gke-monitor
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gke-monitor
  template:
    metadata:
      labels:
        app: gke-monitor
    spec:
      serviceAccountName: gke-monitor-sa
      containers:
      - name: adk-agent
        image: gcr.io/project/gke-monitor:latest
        env:
        - name: GCP_PROJECT_ID
          value: "your-project"
        - name: MCP_SERVICE_URL
          value: "http://localhost:8080"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"

Performance & Scale

  • Response Time: < 2s for most queries
  • Concurrent Users: 100+ simultaneous requests
  • Cluster Size: Tested with 1000+ pods across 50+ nodes
  • Uptime: 99.9% availability with failover
  • Resource Usage: ~512MB memory, 0.5 CPU per instance

Getting Started

Bash
# Clone and configure
git clone https://github.com/w3sqr/k8s-mcp-and-adk-agent
cd k8s-mcp-and-adk-agent

export GCP_PROJECT_ID="your-project-id"
export GKE_CLUSTER_NAME="your-cluster"

# Deploy
kubectl apply -f k8s-manifests/k8s-mcp-rbac.yaml
kubectl apply -f k8s-manifests/k8s-mcp-deployment.yaml
kubectl apply -f deployment.yaml

# Verify
kubectl get pods -n monitoring
kubectl port-forward svc/adk-agent 8000:8000

ADK UI Interface

Security

  • RBAC Configuration – Minimal-privilege service accounts
  • Secret Management – Kubernetes secrets, never in code
  • Audit Logging – Full context for all AI operations
  • Network Policies – Component isolation

Future Enhancements

  • Multi-cluster support for fleet management
  • Custom ML models for anomaly detection
  • Slack/Discord integration
  • Cost optimization recommendations
  • GitOps integration for automated PRs

Contributing

Open-source project welcoming contributions in: additional MCP tools, monitoring integrations, documentation, performance optimizations, and feature requests.

See the GitHub repository for guidelines.


Built with Google ADK, MCP, and GKE for the GKE Turns 10 Hackathon