# Monitoring
LeanCore includes a comprehensive monitoring stack for operational visibility.
## Monitoring Stack
| Service | Purpose |
|---|---|
| Prometheus | Metrics collection and alerting |
| Grafana | Dashboards and visualization |
| Tempo | Distributed tracing |
| Langfuse | AI pipeline trace capture |
| cAdvisor | Container metrics |
| Node Exporter | Host system metrics |
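As a sketch of how Prometheus might scrape the services above, a minimal configuration could look like the following. Job names, hostnames, ports, and the metrics path are illustrative assumptions, not taken from the LeanCore configuration:

```yaml
# prometheus.yml (sketch; targets and ports are illustrative)
scrape_configs:
  - job_name: cadvisor          # container metrics
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: node-exporter     # host system metrics
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: backend           # JVM/HTTP metrics, e.g. a Micrometer Prometheus endpoint
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["backend:8080"]
```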
## Grafana Dashboards
Three pre-built dashboards provide full operational visibility:
### 1. System Overview
Host and container health at a glance:
| Panel | What it Shows |
|---|---|
| CPU Usage % | Time series with yellow >60%, red >80% |
| Memory Usage % | Time series with yellow >70%, red >85% |
| Disk Usage | Gauge showing partition usage, yellow >75%, red >90% |
| Network I/O | RX/TX bytes per second per interface |
| Container Status | Count of healthy containers |
| Container CPU | Per-container CPU usage |
| Container Memory | Per-container memory usage |
| Container Restarts (15m) | Bar gauge, yellow at 1, red at 3 |
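The container panels above are typically backed by cAdvisor metrics. As one example, per-container CPU usage could be computed with a PromQL query along these lines (metric and label names follow cAdvisor defaults and may differ in this deployment):

```promql
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100
```

The `name!=""` filter drops the aggregate cgroup series so the panel shows one line per named container.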
### 2. Application Performance
JVM and HTTP performance:
| Panel | What it Shows |
|---|---|
| JVM Heap | Used vs committed vs max memory |
| GC Pause Duration | Average garbage collection pause time |
| HTTP Request Rate | Total req/s and 5xx/s |
| HTTP Latency | p50, p95, p99 response time |
| HTTP Errors | 4xx/5xx breakdown by status code |
| Connection Pool | Active, idle, pending, max connections |
| Thread Count | Live, daemon, peak thread counts |
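Assuming the backend exposes Micrometer-style HTTP histograms (the metric name here is an assumption), the latency percentiles could be derived with a query such as:

```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket[5m])))
```

Swapping 0.95 for 0.5 or 0.99 yields the p50 and p99 panels from the same histogram.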
### 3. MCP Health Grid
Connector status monitoring:
| Panel | What it Shows |
|---|---|
| Server Status | UP/DOWN grid (green/red per server) |
| Health Response Time | Scrape duration per server |
| Container Restarts (1h) | Restart count per MCP container |
| Container Memory | Memory usage per connector |
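The UP/DOWN grid maps naturally onto Prometheus's built-in `up` metric. A sketch, assuming the MCP scrape jobs share an `mcp-` name prefix (the prefix is an assumption):

```promql
up{job=~"mcp-.*"}
```

Each resulting series is 1 while the server's metrics endpoint is reachable and 0 when it is down, which Grafana can render as a green/red status grid.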
## Distributed Tracing
Every HTTP request through the backend produces an OpenTelemetry trace:
- Search traces by service name
- Filter by duration, status code, HTTP method
- View full span waterfall for any request
- Each API call shows: controller -> service -> repository -> external HTTP calls
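In Tempo, the searches described above can be expressed in TraceQL. For example, to find backend spans slower than 500 ms (the service name is illustrative):

```traceql
{ resource.service.name = "backend" && duration > 500ms }
```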
## Alert Rules
14 alert rules are configured for production:
### Critical Alerts (Email + WhatsApp)
- Container down
- OOM killed
- Disk usage >90%
- Memory available <10%
- Backend API down
- MCP server down
### Warning Alerts (Email)
- CPU >80%
- Memory available <20%
- Disk usage >75%
- Restart loop detected
- JVM heap >85%
- Connection pool >80%
- HTTP 5xx rate >5%
- p95 latency >5 seconds
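As an illustration, the 5xx-rate warning could be expressed as a Prometheus alerting rule like the one below. The alert name, group name, and metric names are assumptions, not the actual LeanCore rule definitions:

```yaml
groups:
  - name: leancore-warnings
    rules:
      - alert: HighHttp5xxRate       # fires when >5% of requests return 5xx
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m                      # must hold for 5 minutes before firing
        labels:
          severity: warning
```

The `severity` label is what routing (e.g. in Alertmanager) would key on to send warnings by email only.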
## AI Pipeline Tracing
Langfuse captures detailed AI pipeline execution data:
- Every pipeline stage is traced (Coordinator, Expert, Validator, Grounding)
- Token usage and cost per model
- Tool call inputs and outputs
- Quality scores and validation results
- Outcome scores for response quality