Monitoring

LeanCore includes a comprehensive monitoring stack for operational visibility.

Monitoring Stack

| Service | Purpose |
| --- | --- |
| Prometheus | Metrics collection and alerting |
| Grafana | Dashboards and visualization |
| Tempo | Distributed tracing |
| Langfuse | AI pipeline trace capture |
| cAdvisor | Container metrics |
| Node Exporter | Host system metrics |
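Prometheus discovers these services through its scrape configuration. The fragment below is an illustrative sketch only — the job names, ports, and the Spring Boot Actuator metrics path are assumptions, not the shipped configuration:

```yaml
# prometheus.yml -- illustrative fragment; job names and targets are assumptions
scrape_configs:
  - job_name: backend                      # hypothetical backend job
    metrics_path: /actuator/prometheus     # typical Spring Boot / Micrometer path
    static_configs:
      - targets: ["backend:8080"]
  - job_name: cadvisor                     # container metrics
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: node-exporter                # host system metrics
    static_configs:
      - targets: ["node-exporter:9100"]
```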

Grafana Dashboards

Three pre-built dashboards provide full operational visibility:

1. System Overview

Host and container health at a glance:

| Panel | What it Shows |
| --- | --- |
| CPU Usage % | Timeseries with yellow >60%, red >80% |
| Memory Usage % | Timeseries with yellow >70%, red >85% |
| Disk Usage | Gauge showing partition usage, yellow >75%, red >90% |
| Network I/O | RX/TX bytes per second per interface |
| Container Status | Count of healthy containers |
| Container CPU | Per-container CPU usage |
| Container Memory | Per-container memory usage |
| Container Restarts (15m) | Bar gauge, yellow at 1, red at 3 |
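Panels like these are typically backed by standard Node Exporter and cAdvisor queries. The PromQL below is a sketch of the usual form, not the exact dashboard queries:

```promql
# Host CPU usage % (standard Node Exporter idle-time inversion)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Per-container CPU usage from cAdvisor (the name label filter is an assumption)
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
```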

2. Application Performance

JVM and HTTP performance:

| Panel | What it Shows |
| --- | --- |
| JVM Heap | Used vs. committed vs. max memory |
| GC Pause Duration | Average garbage collection pause time |
| HTTP Request Rate | Total req/s and 5xx/s |
| HTTP Latency | p50, p95, p99 response time |
| HTTP Errors | 4xx/5xx breakdown by status code |
| Connection Pool | Active, idle, pending, max connections |
| Thread Count | Live, daemon, peak thread counts |
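For a JVM backend instrumented with Micrometer, latency and error panels are usually built from the default `http_server_requests_seconds` histogram. A sketch of the typical queries (assuming those default metric names):

```promql
# p95 HTTP latency from the Micrometer request histogram
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket[5m])))

# 5xx rate as a fraction of all requests
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count[5m]))
```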

3. MCP Health Grid

Connector status monitoring:

| Panel | What it Shows |
| --- | --- |
| Server Status | UP/DOWN grid (green/red per server) |
| Health Response Time | Scrape duration per server |
| Container Restarts (1h) | Restart count per MCP container |
| Container Memory | Memory usage per connector |
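An UP/DOWN grid like this is usually driven by Prometheus's built-in `up` metric, which is 1 when the last scrape of a target succeeded and 0 otherwise. The `job` label pattern below is hypothetical:

```promql
# 1 = UP, 0 = DOWN, one series per MCP scrape target
up{job=~"mcp-.*"}

# Scrape duration per server (backs the health response-time panel)
scrape_duration_seconds{job=~"mcp-.*"}
```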

Distributed Tracing

Every HTTP request through the backend produces an OpenTelemetry trace:

  • Search traces by service name
  • Filter by duration, status code, HTTP method
  • View full span waterfall for any request
  • Each API call shows: controller -> service -> repository -> external HTTP calls
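With traces stored in Tempo, the filters above map onto TraceQL queries. A sketch, assuming the backend reports as `backend` in `resource.service.name`:

```traceql
{ resource.service.name = "backend" && span.http.status_code >= 500 && duration > 2s }
```

This returns traces from the backend service containing a span that both failed with a 5xx status and took longer than two seconds.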

Alert Rules

14 alert rules are configured for production:

Critical Alerts (Email + WhatsApp)

  • Container down
  • OOM killed
  • Disk usage >90%
  • Memory available <10%
  • Backend API down
  • MCP server down

Warning Alerts (Email)

  • CPU >80%
  • Memory available <20%
  • Disk usage >75%
  • Restart loop detected
  • JVM heap >85%
  • Connection pool >80%
  • HTTP 5xx rate >5%
  • p95 latency >5 seconds
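Rules like these are expressed as Prometheus alerting rules. The fragment below sketches the critical disk-usage alert; the rule name, labels, and `for` duration are assumptions, only the Node Exporter metric names are standard:

```yaml
groups:
  - name: leancore-critical          # hypothetical group name
    rules:
      - alert: DiskUsageCritical
        expr: |
          (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
           - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"})
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} > 0.90
        for: 5m                      # must hold for 5 minutes before firing
        labels:
          severity: critical         # routed to Email + WhatsApp receivers
        annotations:
          summary: "Disk usage above 90% on {{ $labels.instance }}"
```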

AI Pipeline Tracing

Langfuse captures detailed AI pipeline execution data:

  • Every pipeline stage is traced (Coordinator, Expert, Validator, Grounding)
  • Token usage and cost per model
  • Tool call inputs and outputs
  • Quality scores and validation results
  • Outcome scores for response quality
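The per-model cost rollup is the kind of aggregation a trace store computes from individual model calls. A minimal, self-contained sketch of that rollup — the model names, token counts, and prices are illustrative only, not Langfuse's API or real pricing:

```python
from collections import defaultdict

# Hypothetical per-call usage records, shaped like what a pipeline stage
# might report for each model invocation (values are illustrative only).
calls = [
    {"model": "model-a", "input_tokens": 1200, "output_tokens": 300},
    {"model": "model-a", "input_tokens": 800,  "output_tokens": 150},
    {"model": "model-b", "input_tokens": 5000, "output_tokens": 900},
]

# Assumed prices in USD per 1M tokens -- substitute your provider's price sheet.
PRICES = {
    "model-a": {"input": 2.50, "output": 10.00},
    "model-b": {"input": 0.15, "output": 0.60},
}

def cost_per_model(calls):
    """Aggregate token usage and dollar cost per model, as a trace UI would."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0})
    for c in calls:
        price = PRICES[c["model"]]
        t = totals[c["model"]]
        t["input_tokens"] += c["input_tokens"]
        t["output_tokens"] += c["output_tokens"]
        t["cost"] += (c["input_tokens"] * price["input"]
                      + c["output_tokens"] * price["output"]) / 1_000_000
    return dict(totals)
```

Calling `cost_per_model(calls)` yields one entry per model with summed tokens and total cost, matching the per-model breakdown shown in the trace UI.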