Docs·4ff474d·Updated Mar 14, 2026·43 ADRs

ADR-015: Observability Stack (Grafana/Loki/Prometheus)

Date: 2025-12-29
Status: Accepted
Deciders: Development Team
Related: infrastructure/docker/docker-compose.yml

Context

With 9 distributed microservices, we needed:

  • Centralized logging (debug across services)
  • Metrics collection (performance monitoring)
  • Visualization (dashboards)
  • Alerting (production issues)

Decision

Full observability stack: Grafana + Loki + Prometheus + Promtail.

Architecture

┌─────────────────────────────────────────────┐
│            Grafana (Port 3011)              │
│         (Visualization + Dashboards)        │
└─────────────┬───────────────────────────────┘
              │
    ┌─────────┴─────────┐
    │                   │
┌───▼─────┐      ┌──────▼──────┐
│  Loki   │      │ Prometheus  │
│ (Logs)  │      │  (Metrics)  │
│  3100   │      │    9090     │
└───▲─────┘      └──────▲──────┘
    │                   │
┌───┴─────┐      ┌──────┴──────┐
│Promtail │      │Service /    │
│(Shipper)│      │metrics      │
└─────────┘      └─────────────┘

Components

1. Loki (Port 3100) - Log Aggregation

  • Collects logs from all services
  • Indexed by labels (service, level, request_id)
  • Query with LogQL
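
As a rough illustration of querying by these labels, the sketch below builds a LogQL selector and a URL for Loki's `query_range` HTTP endpoint. The label names `service` and `level` come from the list above; the helper names, the service name, and the base URL are placeholders assumed for the example, not values confirmed by this ADR.

```typescript
// Hypothetical query shape; only `service` and `level` are labels named in the ADR.
interface LogQuery {
  service: string;
  level?: string;
  contains?: string; // optional substring filter applied to the log line
}

// Build a LogQL stream selector with an optional line filter.
function buildLogQL(q: LogQuery): string {
  const matchers = [`service=${JSON.stringify(q.service)}`];
  if (q.level) matchers.push(`level=${JSON.stringify(q.level)}`);
  const selector = `{${matchers.join(', ')}}`;
  return q.contains ? `${selector} |= ${JSON.stringify(q.contains)}` : selector;
}

// Build a URL for Loki's range query endpoint.
// Loki accepts start/end as Unix epoch nanoseconds, so milliseconds get six zeros appended.
function buildLokiRangeUrl(
  baseUrl: string,
  logql: string,
  startMs: number,
  endMs: number,
  limit = 100,
): string {
  const params = new URLSearchParams({
    query: logql,
    start: `${Math.trunc(startMs)}000000`,
    end: `${Math.trunc(endMs)}000000`,
    limit: String(limit),
  });
  return `${baseUrl}/loki/api/v1/query_range?${params.toString()}`;
}
```

For example, `buildLogQL({ service: 'auth-service', level: 'error', contains: 'timeout' })` yields `{service="auth-service", level="error"} |= "timeout"`, which can also be pasted into Grafana's Explore view.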

2. Promtail - Log Shipper

  • Scrapes Docker container logs
  • Adds labels and metadata
  • Ships to Loki

3. Prometheus (Port 9090) - Metrics

  • Scrapes /metrics endpoints
  • Time-series database
  • Query with PromQL
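
To make the PromQL side concrete, here is a minimal sketch that builds a per-route latency quantile expression for a histogram and a URL for Prometheus's instant query endpoint (`/api/v1/query`). The function names and the base URL are assumptions made for illustration; the metric name and the `route` label are taken from the instrumentation example in this ADR.

```typescript
// Build a PromQL expression for a latency quantile per route from a histogram metric.
// `metric` is the histogram's base name; Prometheus exposes its buckets as `<metric>_bucket`.
function latencyQuantileQuery(metric: string, quantile: number, window = '5m'): string {
  if (quantile <= 0 || quantile >= 1) {
    throw new RangeError('quantile must be strictly between 0 and 1');
  }
  return `histogram_quantile(${quantile}, sum by (le, route) (rate(${metric}_bucket[${window}])))`;
}

// Build a URL for Prometheus's instant query HTTP endpoint.
function buildPrometheusQueryUrl(baseUrl: string, promql: string): string {
  const params = new URLSearchParams({ query: promql });
  return `${baseUrl}/api/v1/query?${params.toString()}`;
}
```

Calling `latencyQuantileQuery('http_request_duration_ms', 0.95)` produces an expression that could back a p95 latency panel or alert rule in Grafana.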

4. Grafana (Port 3011) - Visualization

  • Dashboards for logs + metrics
  • Alerts and notifications
  • Explore mode for ad-hoc queries

Service Instrumentation

Logging:

import { logger } from '@karmyq/shared/utils/logger';

// Inside a request handler; startTime is captured when the request arrives
logger.info('Request processed', {
  requestId: req.id,
  userId: req.user.id,
  duration: Date.now() - startTime // milliseconds
});
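
The internals of the shared logger are not shown in this ADR. As a hedged sketch of the kind of output that works well with Promtail and Loki, the hypothetical `formatLogLine` helper below serializes each event as a single JSON line carrying the service name, level, and any extra fields such as `requestId`. The field names are assumptions, not the logger's confirmed format.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface LogFields {
  [key: string]: unknown;
}

// Serialize one log event as a single JSON line (hypothetical format).
// Fixed keys are written after the spread so caller-supplied fields cannot overwrite them.
function formatLogLine(
  service: string,
  level: LogLevel,
  message: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    service,
    level,
    message,
  });
}
```

One JSON object per line keeps each event intact when container stdout is scraped, and lets a shipper pipeline extract `service` and `level` as labels while leaving the remaining fields in the log body.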

Metrics:

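// Prometheus client
import { Histogram } from 'prom-client';

// Values are observed in milliseconds, so buckets are set explicitly:
// prom-client's default buckets are sized for seconds (0.005 to 10).
// The bounds below are example values, not tuned for these services.
const requestDuration = new Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});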
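
The histogram only becomes useful once each request records an observation. The sketch below shows one way that could look as Express-style middleware. It is an assumption about how services might wire this up, not the project's actual code. To stay dependency-free it uses small structural interfaces: `DurationHistogram` mirrors prom-client's `observe(labels, value)` call, and `RequestLike` / `ResponseLike` stand in for the framework's request and response objects.

```typescript
// Structural stand-ins (hypothetical) so the sketch needs no framework imports.
interface DurationHistogram {
  observe(labels: Record<string, string | number>, value: number): void;
}
interface RequestLike {
  method: string;
  route?: { path: string }; // set by the router once a route has matched
}
interface ResponseLike {
  statusCode: number;
  on(event: 'finish', listener: () => void): unknown;
}

// Returns middleware that records request duration in milliseconds when the response finishes.
function metricsMiddleware(
  histogram: DurationHistogram,
  nowMs: () => number = () => Date.now(),
) {
  return (req: RequestLike, res: ResponseLike, next: () => void): void => {
    const start = nowMs();
    res.on('finish', () => {
      histogram.observe(
        {
          method: req.method,
          // Use the matched route pattern rather than the raw URL to keep label cardinality low.
          route: req.route?.path ?? 'unmatched',
          status_code: res.statusCode,
        },
        nowMs() - start,
      );
    });
    next();
  };
}
```

Labeling by route pattern (for example `/users/:id`) instead of the concrete path avoids creating a new time series per user or resource ID.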

Access

  • Grafana: port 3011
  • Prometheus: port 9090
  • Loki: port 3100

Consequences

Positive

  • Centralized Logs: See all services in one place
  • Correlation: Trace requests across services via requestId
  • Performance Monitoring: Identify slow endpoints
  • Production Ready: Alert on errors, high latency
  • Historical Analysis: Query past metrics and logs
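
Correlation via requestId only works if every service reuses the ID it receives and forwards it on outbound calls. The following is a minimal sketch of that propagation. The `x-request-id` header name and both helper functions are assumptions for illustration; the ADR does not say how the ID travels between services.

```typescript
import { randomUUID } from 'node:crypto';

// Assumed header name; not confirmed by the ADR.
const REQUEST_ID_HEADER = 'x-request-id';

type IncomingHeaders = Record<string, string | string[] | undefined>;

// Reuse the caller's request ID when present, otherwise generate a new one.
function resolveRequestId(headers: IncomingHeaders): string {
  const incoming = headers[REQUEST_ID_HEADER];
  const value = Array.isArray(incoming) ? incoming[0] : incoming;
  return value !== undefined && value.trim() !== '' ? value : randomUUID();
}

// Headers to attach to outbound calls so downstream services log the same ID.
function outboundHeaders(
  requestId: string,
  extra: Record<string, string> = {},
): Record<string, string> {
  return { ...extra, [REQUEST_ID_HEADER]: requestId };
}
```

With every service logging the resolved ID, a single LogQL line filter on that value returns the full path of a request across services.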

Negative

  • Resource Overhead: 4 additional containers (Grafana, Loki, Prometheus, Promtail)
  • Storage: Logs and metrics consume disk
  • Learning Curve: PromQL, LogQL syntax
  • Configuration: Initial dashboard setup takes time

Alternatives Considered

Alternative 1: Cloud Services (Datadog, New Relic)

  • Why rejected: Cost prohibitive, vendor lock-in

Alternative 2: ELK Stack (Elasticsearch, Logstash, Kibana)

  • Why rejected: Heavy resource usage, complex setup

Alternative 3: No Observability

  • Why rejected: Debugging across 9 services without centralized logs or metrics is impractical
