
📊 Loki-Stack

🎯 Overview

Loki is a horizontally-scalable, highly-available log aggregation system designed for Kubernetes environments.
It efficiently collects, stores, and indexes logs, making them searchable via Grafana.

Deployment Mode

All Unibeam environments run Loki in SimpleScalable mode, which provides a balance between simplicity and scalability for medium-sized deployments handling up to ~1TB/day of logs.


🏗️ How the Loki Stack Works

In SimpleScalable mode the stack is split into separate write, read, and backend targets that share the same S3 object storage and coordinate through memberlist.

Loki Stack Components

  • Write: Handles log ingestion and writes data to storage (replaces the Distributor and Ingester).
  • Read: Handles log queries and serves data to Grafana (replaces the Querier and Query Frontend).
  • Backend: Runs background tasks such as compaction and index maintenance (replaces the Compactor).
  • Gateway: Optional nginx gateway for routing and authentication (disabled in our setup).
  • Memberlist-KV: Distributed key-value store used for cluster coordination and metadata.

Promtail Integration

Promtail runs as a DaemonSet in the promtail namespace, tails logs from pods, and pushes them to Loki's Write component.
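For context, a minimal sketch of the Promtail client configuration that targets the write path. The service name and namespace below assume the grafana/loki chart defaults rather than being copied from our Promtail values:

# Promtail Helm values (sketch): push tailed logs to Loki's write service
config:
  clients:
    # Assumed service name from the chart defaults; verify against the loki namespace
    - url: http://loki-write.loki.svc.cluster.local:3100/loki/api/v1/push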


🎨 Deployment Architecture

SimpleScalable Mode

The environments use Loki's SimpleScalable deployment mode with three main components:

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    Write    │      │    Read     │      │   Backend   │
│  (3 pods)   │      │  (3 pods)   │      │  (3 pods)   │
├─────────────┤      ├─────────────┤      ├─────────────┤
│ • Ingestion │      │ • Queries   │      │ • Compactor │
│ • WAL       │      │ • Log Fetch │      │ • Retention │
│ • Storage   │      │ • Grafana   │      │ • Cleanup   │
└─────────────┘      └─────────────┘      └─────────────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                    ┌───────▼────────┐
                    │   AWS S3       │
                    │ (ub-global-us- │
                    │  loki-data)    │
                    └────────────────┘

Canary Disabled

The Loki canary component is disabled (lokiCanary.enabled: false). It is an optional component that writes and queries synthetic log entries to verify end-to-end log delivery, and is not required for normal operation.
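The corresponding Helm values entry is:

lokiCanary:
  enabled: false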


📋 Helm Values Configuration

Global Settings

# Deployment mode configuration
deploymentMode: SimpleScalable

# Loki core configuration
loki:
  auth_enabled: false
  
  # Server configuration
  server:
    grpc_server_max_recv_msg_size: 104857600  # 100MB
    grpc_server_max_send_msg_size: 104857600
    http_server_read_timeout: 600s
    http_server_write_timeout: 600s

🔧 Component Configuration

Write Component (Ingestion)

write:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false
  
  # Persistence for WAL (Write-Ahead Log)
  persistence:
    volumeClaimsEnabled: true
    accessModes:
      - ReadWriteOnce
    size: 100Gi
    storageClass: gp3
    enableStatefulSetAutoDeletePVC: false

Write Component Scaling

The write component uses StatefulSets with persistent volumes. When scaling, refer to the Loki scaling documentation for proper WAL handling.

Read Component (Queries)

read:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false
  
  # Legacy read target disabled (using 3-target mode)
  legacyReadTarget: false

Backend Component

backend:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false
  
  # Persistence for backend processing
  persistence:
    volumeClaimsEnabled: true
    accessModes:
      - ReadWriteOnce
    size: 50Gi
    storageClass: gp3
    enableStatefulSetAutoDeletePVC: false

🗂️ AWS S3 Integration for Log Storage

Loki uses AWS S3 as its object storage backend for scalable, durable, and cost-effective log retention.

Storage Configuration

loki:
  storage:
    # S3 bucket names
    bucketNames:
      chunks: ub-global-us-loki-data
      ruler: ub-global-us-loki-data
      admin: ub-global-us-loki-data
    
    # Storage type
    type: s3
    
    # S3 configuration
    s3:
      s3: s3://ub-global-us-loki-data
      region: us-east-1
    
    # Object store configuration
    object_store:
      type: s3
      s3:
        region: us-east-1

Schema Configuration

loki:
  schemaConfig:
    configs:
      - from: 2025-11-01
        store: tsdb              # Time Series Database index
        object_store: aws        # AWS S3 for chunks
        schema: v13              # Latest schema version
        index:
          prefix: index_east_2025_
          period: 24h

Schema Details

  • Store Type: tsdb - Modern time-series database index format
  • Object Store: aws - Utilizes AWS S3 for chunk storage
  • Schema Version: v13 - Latest production-ready schema
  • Index Period: 24h - Daily index rotation

Authentication

Authentication to AWS S3 is managed using IAM roles and Pod Identity:

  • Loki pods use Pod Identity to assume IAM roles
  • IAM policies grant access to the S3 bucket
  • No static credentials required

Security Best Practice

Pod Identity ensures secure, temporary credentials are used without storing long-lived access keys in the cluster.


⚙️ Advanced Configuration

Limits and Performance

loki:
  limits_config:
    # Stream limits
    max_global_streams_per_user: 100000
    max_entries_limit_per_query: 100000
    
    # Retention settings
    reject_old_samples: false
    reject_old_samples_max_age: 9600h  # 400 days
    retention_period: 876000h          # ~100 years (effectively unlimited)
    
    # Volume tracking
    volume_enabled: true
    
    # Query optimization
    max_query_length: 8760h            # 1 year
    query_timeout: 600s                # 10 minutes
    split_queries_by_interval: 12h

Retention Configuration

retention_period is set to an effectively unlimited value. With compactor retention disabled (see Compactor Settings below), actual retention is enforced by S3 lifecycle policies on the bucket.

Query Optimization

loki:
  query_range:
    cache_results: true
    max_retries: 5
    align_queries_with_step: true
    cache_volume_results: true
  
  frontend:
    compress_responses: true
    max_outstanding_per_tenant: 100
    log_queries_longer_than: 30s

Compactor Settings

compactor:
  retention_enabled: false           # Retention handled by S3 lifecycle
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  compaction_interval: 10m

Compaction

The compactor runs every 10 minutes to optimize storage by merging small chunks and managing index files.


🚨 Ruler and Alerting

Ruler Configuration

ruler:
  enabled: true
  alertmanager_url: http://alertmanager-operated.monitoring.svc.cluster.local:9093
  enable_api: true
  enable_alertmanager_v2: true
  enable_alertmanager_discovery: true

loki:
  rulerConfig:
    wal:
      dir: /var/loki/ruler-wal
    storage:
      type: local
      local:
        directory: /rules

Alertmanager Integration

Loki's ruler integrates with Prometheus Alertmanager in the monitoring namespace for centralized alert management.
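For illustration, a rule file in the Prometheus-style format the ruler expects might look like the sketch below. The rule name, namespace selector, and thresholds are invented for the example; with auth_enabled: false, rule files are typically read from the /rules directory configured above under the fake tenant.

# Hypothetical ruler rule file (e.g. /rules/fake/log-alerts.yaml)
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogRate
        # Fire when more than 10 "error" lines per second are seen over 5 minutes
        expr: sum(rate({namespace="sim-service"} |= "error" [5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Elevated error log rate in sim-service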


🔐 Security and Service Accounts

Service Account Configuration

serviceAccount:
  create: true
  name: loki
  automountServiceAccountToken: true
  
write:
  serviceAccount:
    create: false  # Uses shared service account
    
backend:
  serviceAccount:
    create: false  # Uses shared service account

IAM Integration

The shared loki service account is annotated with the IAM role ARN, allowing Loki pods to access the S3 bucket securely via Pod Identity.
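A sketch of what that could look like in the Helm values, assuming the IRSA-style eks.amazonaws.com/role-arn annotation is used; the role ARN below is purely hypothetical, and the real one lives in the environment-specific values file:

serviceAccount:
  create: true
  name: loki
  annotations:
    # Hypothetical role ARN for illustration only
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/loki-s3-access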


🔍 Querying Logs

From Grafana

  1. Navigate to Explore in Grafana
  2. Select the Loki datasource
  3. Use LogQL to query logs, for example:
{namespace="sim-service"} |= "error"
  4. Filter by labels: namespace, app, pod, container
  5. Adjust the time range based on your needs

Common Query Patterns

# All errors in production namespace
{namespace=~"sim|sms|sia"} |= "error"

# High request latency
{namespace="mno-service"} | json | duration > 5s

# Failed authentications
{app="sia-service"} |~ "authentication failed"

Query Performance

  • Use specific label filters to reduce data scanned
  • Limit time ranges for faster queries
  • Leverage indexed labels like namespace and pod

📊 Monitoring and Observability

Metrics

Loki exposes Prometheus metrics on port 3100:

  • /metrics - Standard Prometheus metrics
  • Query rates, error rates, latency
  • Storage metrics (S3 operations, cache hits)
  • Ingestion metrics (bytes received, streams active)
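If these metrics are scraped through the Prometheus Operator (which the alertmanager-operated service above suggests is installed), a ServiceMonitor along these lines could be used; the selector labels and port name here are assumptions, and the chart's own monitoring values may already cover this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
  namespace: loki
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: loki   # assumed label; verify against the Loki Services
  endpoints:
    - port: http-metrics             # assumed name of the 3100 HTTP port
      path: /metrics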

Health Checks

# Check if Loki is ready
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/ready

# Check component status
kubectl exec -n loki loki-backend-0 -- wget -qO- http://localhost:3100/services

🛠️ Troubleshooting

Common Issues

Slow Queries

Query Timeout

If queries are timing out, check:

  • Time range (reduce if too large)
  • Query complexity (simplify filters)
  • Backend pod resources (scale if needed)

# Check backend logs
kubectl logs -n loki -l app.kubernetes.io/component=backend --tail=100

Missing Logs

Ingestion Issues

Check the write component and Promtail:

# Check write component
kubectl logs -n loki -l app.kubernetes.io/component=write --tail=100

# Check Promtail
kubectl logs -n promtail -l app.kubernetes.io/name=promtail --tail=100

# Verify S3 connectivity
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/services

Storage Issues

# Check S3 bucket access
aws s3 ls s3://ub-global-us-loki-data/ --profile production

# Verify IAM role permissions
kubectl describe pod -n loki loki-write-0 | grep -A5 "Service Account"

# Check PVC status
kubectl get pvc -n loki

🔄 Maintenance Operations

Scaling Components

# Scale read replicas for query load
kubectl scale statefulset -n loki loki-read --replicas=3

# Scale write replicas (follow WAL migration process)
# See: https://grafana.com/docs/loki/latest/operations/storage/wal/#how-to-scale-updown

Write Component Scaling

Scaling the write component requires careful WAL (Write-Ahead Log) migration. Always follow the official Loki documentation.

Updating Configuration

# Update values in ArgoCD repository
cd /path/to/argocd/infra-applications/values/grafana-loki/

# Edit environment-specific values
vi values-ub-global-us.yaml

# Commit and push
git add .
git commit -m "Update Loki configuration"
git push

# ArgoCD will automatically sync changes


🔗 External Resources


Last Updated: January 2025
Maintained by: DevOps Team