
AI Ops Center

Production intelligence powered by LLM reasoning

12 services  |  3 active alerts  |  4.2 min MTTR (AI)  |  847 resolved
INCIDENT #2847 — API Latency Spike (P1)
started: 12:03  |  affected: 3 services

AI Root Cause Analysis (LLM reasoning)

Click Analyze to start AI-powered root cause analysis

Remediation Plan

1. Rollback deploy #4521: git revert abc1234 && git push
2. Apply hotfix, batch the preference query: SELECT * FROM preferences WHERE user_id IN (...) (sketched below)
3. Scale the connection pool temporarily: max_connections: 100 → 200
4. Verify service recovery: p99 latency < 500ms for 5 min (an automated check is sketched after the log stream)
Waiting for AI analysis...
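Step 2 is the classic N+1 fix: the timing-out POST /api/users/preferences handler presumably issues one query per user, holding a pooled connection for each, and batching them into a single IN (...) query relieves pressure on db-primary. A minimal sketch, assuming a psycopg2 connection; only the preferences table and the user_id filter come from the incident data, and the prefs column name is illustrative:

```python
# Sketch of the step-2 hotfix: one batched query instead of N per-user queries.
# Assumes psycopg2; the "prefs" column name is illustrative.
import psycopg2

def fetch_preferences(conn, user_ids):
    """Fetch preferences for many users in one round trip,
    so one pooled connection is held briefly instead of N."""
    if not user_ids:
        return {}
    with conn.cursor() as cur:
        # psycopg2 expands a tuple parameter into IN (a, b, c)
        cur.execute(
            "SELECT user_id, prefs FROM preferences WHERE user_id IN %s",
            (tuple(user_ids),),
        )
        return {row[0]: row[1] for row in cur.fetchall()}
```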

Service Status

api-gateway        2,340ms   critical
user-service         890ms   warning
db-primary                   critical
auth-service          12ms   healthy
payment-service       45ms   healthy
notification-svc      23ms   healthy
[Service dependency diagram: client → api-gw → user-svc → db, plus auth-svc and payment-svc]

Live Log Stream

12:03:24  ERR  api-gateway   upstream timeout after 30000ms on POST /api/users/preferences
12:03:25  WRN  user-service  connection pool utilization at 95% (95/100)
12:03:26  ERR  db-primary    max_connections reached: 100/100 — rejecting new connections
12:03:27  ERR  api-gateway   upstream timeout after 30000ms on GET /api/users/12847
12:03:28  WRN  user-service  connection pool utilization at 98% (98/100)
12:03:29  ERR  db-primary    too many connections for role "app_user"
12:03:30  ERR  api-gateway   circuit breaker OPEN for user-service — 23 failures in 60s
12:03:31  INF  auth-service  health check OK — no anomalies detected
12:03:32  ERR  user-service  connection pool exhausted — 0 available connections
12:03:33  WRN  api-gateway   response time p99 = 28,400ms (threshold: 500ms)
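The last log line is the same signal step 4 of the remediation plan gates on: api-gateway's p99 against the 500ms threshold. A minimal sketch of that recovery check, assuming a Prometheus-style query API; the URL and metric name are illustrative:

```python
# Sketch of the step-4 recovery gate: p99 < 500ms sustained for 5 minutes.
# Assumes a Prometheus-compatible HTTP API; URL and metric name are illustrative.
import time
import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"
QUERY = ('histogram_quantile(0.99, sum(rate('
         'http_request_duration_seconds_bucket{service="api-gateway"}[1m])) by (le))')

def p99_seconds():
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("inf")

def verify_recovery(threshold=0.5, hold_seconds=300, interval=15):
    """Return True once p99 stays under the threshold for the full hold window."""
    start = None
    while True:
        if p99_seconds() < threshold:
            start = start or time.monotonic()
            if time.monotonic() - start >= hold_seconds:
                return True
        else:
            start = None  # any breach resets the 5-minute window
        time.sleep(interval)
```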

Traditional monitoring alerts when thresholds are exceeded — but only an LLM can read unstructured logs, understand code semantics, correlate events across services, and reason about causality to identify the actual root cause.
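That correlation step is simple to wire up: hand the raw log window and the dependency graph to a model and ask for the causal chain. A minimal sketch, assuming the OpenAI Python client; the model name and prompt are illustrative, and in practice the log stream and topology above would be injected as context:

```python
# Sketch of LLM-driven root cause analysis over raw logs and topology.
# Assumes the openai Python package; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_incident(logs: str, topology: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are an SRE. Given raw service logs and a dependency graph, "
                "identify the root cause, the causal chain across services, and "
                "a remediation plan ordered by risk."
            )},
            {"role": "user", "content": f"Topology:\n{topology}\n\nLogs:\n{logs}"},
        ],
    )
    return response.choices[0].message.content
```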