
📊 Loki-Stack

🎯 Overview

Loki is a horizontally-scalable, highly-available log aggregation system designed for Kubernetes environments.
It efficiently collects, stores, and indexes logs, making them searchable via Grafana.

Deployment Mode

All Unibeam environments run Loki in SimpleScalable mode, which provides a balance between simplicity and scalability for medium-sized deployments handling up to ~1TB/day of logs.


🏗️ How the Loki Stack Works

In SimpleScalable mode the stack is split into separate write, read, and backend targets that share the same S3 object storage and coordinate through memberlist.

Loki Stack Components

  • Write: Handles log ingestion and writes data to storage (replaces the Distributor and Ingester).
  • Read: Handles log queries and serves data to Grafana (replaces the Querier and Query Frontend).
  • Backend: Runs background tasks such as compaction and index maintenance (replaces the Compactor).
  • Gateway: Optional nginx gateway for routing and authentication (disabled in our setup).
  • Memberlist-KV: Distributed key-value store used for cluster coordination and metadata.

Promtail Integration

Promtail runs as a DaemonSet in the promtail namespace, tails logs from pods, and pushes them to Loki's Write component.
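For context, a minimal sketch of the Promtail client configuration that targets the write path. The service name and namespace below assume the grafana/loki chart defaults rather than being copied from our Promtail values:

# Promtail Helm values (sketch): push tailed logs to Loki's write service
config:
  clients:
    # Assumed service name from the chart defaults; verify against the loki namespace
    - url: http://loki-write.loki.svc.cluster.local:3100/loki/api/v1/push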


🎨 Deployment Architecture

SimpleScalable Mode

The environments use Loki's SimpleScalable deployment mode with three main components:

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    Write    │      │    Read     │      │   Backend   │
│  (3 pods)   │      │  (3 pods)   │      │  (3 pods)   │
├─────────────┤      ├─────────────┤      ├─────────────┤
│ • Ingestion │      │ • Queries   │      │ • Compactor │
│ • WAL       │      │ • Log Fetch │      │ • Retention │
│ • Storage   │      │ • Grafana   │      │ • Cleanup   │
└─────────────┘      └─────────────┘      └─────────────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                    ┌───────▼────────┐
                    │   AWS S3       │
                    │ (ub-global-us- │
                    │  loki-data)    │
                    └────────────────┘

Canary Disabled

The Loki canary component is disabled (lokiCanary.enabled: false). It is an optional component that writes and queries synthetic log entries to verify end-to-end log delivery, and is not required for normal operation.
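The corresponding Helm values entry is:

lokiCanary:
  enabled: false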


📋 Helm Values Configuration

Global Settings

# Deployment mode configuration
deploymentMode: SimpleScalable

# Loki core configuration
loki:
  auth_enabled: false
  
  # Server configuration
  server:
    grpc_server_max_recv_msg_size: 104857600  # 100MB
    grpc_server_max_send_msg_size: 104857600
    http_server_read_timeout: 600s
    http_server_write_timeout: 600s

🔧 Component Configuration

Write Component (Ingestion)

write:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false
  
  # Persistence for WAL (Write-Ahead Log)
  persistence:
    volumeClaimsEnabled: true
    accessModes:
      - ReadWriteOnce
    size: 100Gi
    storageClass: gp3
    enableStatefulSetAutoDeletePVC: false

Write Component Scaling

The write component uses StatefulSets with persistent volumes. When scaling, refer to the Loki scaling documentation for proper WAL handling.

Read Component (Queries)

read:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false
  
  # Legacy read target disabled (using 3-target mode)
  legacyReadTarget: false

Backend Component

backend:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false
  
  # Persistence for backend processing
  persistence:
    volumeClaimsEnabled: true
    accessModes:
      - ReadWriteOnce
    size: 50Gi
    storageClass: gp3
    enableStatefulSetAutoDeletePVC: false

🗂️ AWS S3 Integration for Log Storage

Loki uses AWS S3 as its object storage backend for scalable, durable, and cost-effective log retention.

Storage Configuration

loki:
  storage:
    # S3 bucket names
    bucketNames:
      chunks: ub-global-us-loki-data
      ruler: ub-global-us-loki-data
      admin: ub-global-us-loki-data
    
    # Storage type
    type: s3
    
    # S3 configuration
    s3:
      s3: s3://ub-global-us-loki-data
      region: us-east-1
    
    # Object store configuration
    object_store:
      type: s3
      s3:
        region: us-east-1

Schema Configuration

loki:
  schemaConfig:
    configs:
      - from: 2025-11-01
        store: tsdb              # Time Series Database index
        object_store: aws        # AWS S3 for chunks
        schema: v13              # Latest schema version
        index:
          prefix: index_east_2025_
          period: 24h

Schema Details

  • Store Type: tsdb - Modern time-series database index format
  • Object Store: aws - Utilizes AWS S3 for chunk storage
  • Schema Version: v13 - Latest production-ready schema
  • Index Period: 24h - Daily index rotation

Authentication

Authentication to AWS S3 is managed using IAM roles and Pod Identity:

  • Loki pods use Pod Identity to assume IAM roles
  • IAM policies grant access to the S3 bucket
  • No static credentials required

Security Best Practice

Pod Identity ensures secure, temporary credentials are used without storing long-lived access keys in the cluster.


⚙️ Advanced Configuration

Limits and Performance

loki:
  limits_config:
    # Stream limits
    max_global_streams_per_user: 100000
    max_entries_limit_per_query: 100000
    
    # Retention settings
    reject_old_samples: false
    reject_old_samples_max_age: 9600h  # 400 days
    retention_period: 876000h          # ~100 years (effectively unlimited)
    
    # Volume tracking
    volume_enabled: true
    
    # Query optimization
    max_query_length: 8760h            # 1 year
    query_timeout: 600s                # 10 minutes
    split_queries_by_interval: 12h

Retention Configuration

retention_period is set to an effectively unlimited value. With compactor retention disabled (see Compactor Settings below), actual retention is enforced by S3 lifecycle policies on the bucket.

Query Optimization

loki:
  query_range:
    cache_results: true
    max_retries: 5
    align_queries_with_step: true
    cache_volume_results: true
  
  frontend:
    compress_responses: true
    max_outstanding_per_tenant: 100
    log_queries_longer_than: 30s

Compactor Settings

compactor:
  retention_enabled: false           # Retention handled by S3 lifecycle
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  compaction_interval: 10m

Compaction

The compactor runs every 10 minutes to optimize storage by merging small chunks and managing index files.


🚨 Ruler and Alerting

Ruler Configuration

ruler:
  enabled: true
  alertmanager_url: http://alertmanager-operated.monitoring.svc.cluster.local:9093
  enable_api: true
  enable_alertmanager_v2: true
  enable_alertmanager_discovery: true

loki:
  rulerConfig:
    wal:
      dir: /var/loki/ruler-wal
    storage:
      type: local
      local:
        directory: /rules

Alertmanager Integration

Loki's ruler integrates with Prometheus Alertmanager in the monitoring namespace for centralized alert management.
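For illustration, a rule file in the Prometheus-style format the ruler expects might look like the sketch below. The rule name, namespace selector, and thresholds are invented for the example; with auth_enabled: false, rule files are typically read from the /rules directory configured above under the fake tenant.

# Hypothetical ruler rule file (e.g. /rules/fake/log-alerts.yaml)
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogRate
        # Fire when more than 10 "error" lines per second are seen over 5 minutes
        expr: sum(rate({namespace="sim-service"} |= "error" [5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Elevated error log rate in sim-service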


🔐 Security and Service Accounts

Service Account Configuration

serviceAccount:
  create: true
  name: loki
  automountServiceAccountToken: true
  
write:
  serviceAccount:
    create: false  # Uses shared service account
    
backend:
  serviceAccount:
    create: false  # Uses shared service account

IAM Integration

The shared loki service account is annotated with the IAM role ARN, allowing Loki pods to access the S3 bucket securely via Pod Identity.
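A sketch of what that could look like in the Helm values, assuming the IRSA-style eks.amazonaws.com/role-arn annotation is used; the role ARN below is purely hypothetical, and the real one lives in the environment-specific values file:

serviceAccount:
  create: true
  name: loki
  annotations:
    # Hypothetical role ARN for illustration only
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/loki-s3-access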


🔍 Querying Logs

From Grafana

  1. Navigate to Explore in Grafana
  2. Select the Loki datasource
  3. Use LogQL to query logs, for example:
{namespace="sim-service"} |= "error"
  4. Filter by labels: namespace, app, pod, container
  5. Adjust the time range based on your needs

Common Query Patterns

# All errors in production namespace
{namespace=~"sim|sms|sia"} |= "error"

# High request latency
{namespace="mno-service"} | json | duration > 5s

# Failed authentications
{app="sia-service"} |~ "authentication failed"

Query Performance

  • Use specific label filters to reduce data scanned
  • Limit time ranges for faster queries
  • Leverage indexed labels like namespace and pod

📊 Monitoring and Observability

Metrics

Loki exposes Prometheus metrics on port 3100:

  • /metrics - Standard Prometheus metrics
  • Query rates, error rates, latency
  • Storage metrics (S3 operations, cache hits)
  • Ingestion metrics (bytes received, streams active)
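If these metrics are scraped through the Prometheus Operator (which the alertmanager-operated service above suggests is installed), a ServiceMonitor along these lines could be used; the selector labels and port name here are assumptions, and the chart's own monitoring values may already cover this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
  namespace: loki
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: loki   # assumed label; verify against the Loki Services
  endpoints:
    - port: http-metrics             # assumed name of the 3100 HTTP port
      path: /metrics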

Health Checks

# Check if Loki is ready
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/ready

# Check component status
kubectl exec -n loki loki-backend-0 -- wget -qO- http://localhost:3100/services

🛠️ Troubleshooting

Common Issues

Slow Queries

Query Timeout

If queries are timing out, check:

  • Time range (reduce if too large)
  • Query complexity (simplify filters)
  • Backend pod resources (scale if needed)

# Check backend logs
kubectl logs -n loki -l app.kubernetes.io/component=backend --tail=100

Missing Logs

Ingestion Issues

Check the write component and Promtail:

# Check write component
kubectl logs -n loki -l app.kubernetes.io/component=write --tail=100

# Check Promtail
kubectl logs -n promtail -l app.kubernetes.io/name=promtail --tail=100

# Verify S3 connectivity
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/services

Storage Issues

# Check S3 bucket access
aws s3 ls s3://ub-global-us-loki-data/ --profile production

# Verify IAM role permissions
kubectl describe pod -n loki loki-write-0 | grep -A5 "Service Account"

# Check PVC status
kubectl get pvc -n loki

🔄 Maintenance Operations

Scaling Components

# Scale read replicas for query load
kubectl scale statefulset -n loki loki-read --replicas=3

# Scale write replicas (follow WAL migration process)
# See: https://grafana.com/docs/loki/latest/operations/storage/wal/#how-to-scale-updown

Write Component Scaling

Scaling the write component requires careful WAL (Write-Ahead Log) migration. Always follow the official Loki documentation.

Updating Configuration

# Update values in ArgoCD repository
cd /path/to/argocd/infra-applications/values/grafana-loki/

# Edit environment-specific values
vi values-ub-global-us.yaml

# Commit and push
git add .
git commit -m "Update Loki configuration"
git push

# ArgoCD will automatically sync changes


🔗 External Resources


Last Updated: January 2025
Maintained by: DevOps Team