📊 Loki-Stack¶
🎯 Overview¶
Loki is a horizontally-scalable, highly-available log aggregation system designed for Kubernetes environments.
It efficiently collects, stores, and indexes logs, making them searchable via Grafana.
Deployment Mode
All Unibeam environments run Loki in SimpleScalable mode, which provides a balance between simplicity and scalability for medium-sized deployments handling up to ~1TB/day of logs.
🏗️ How the Loki Stack Works¶
Loki Stack Components
| Component | Purpose |
|---|---|
| Write | Handles log ingestion and writing data to storage (replaces Distributor + Ingester). |
| Read | Handles log queries and serves data to Grafana (replaces Querier + Query Frontend). |
| Backend | Manages background tasks like compaction and index maintenance (replaces Compactor). |
| Gateway | Optional nginx gateway for routing and authentication (disabled in our setup). |
| Memberlist-KV | Provides distributed key-value store for cluster coordination and metadata. |
Promtail Integration
Promtail runs as a DaemonSet in the promtail namespace, tails logs from pods, and pushes them to Loki's Write component.
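The relevant part of the Promtail configuration is the client push URL. A minimal sketch, assuming the chart's default loki-write service name in the loki namespace (adjust to the actual release):

```yaml
# Promtail "clients" section (sketch). The service DNS name below assumes the
# default "loki-write" service in the "loki" namespace; adjust to the actual release.
clients:
  - url: http://loki-write.loki.svc.cluster.local:3100/loki/api/v1/push
    # No tenant header is required because auth_enabled is false in our Loki config.
```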
🎨 Deployment Architecture¶
SimpleScalable Mode¶
The environments use Loki's SimpleScalable deployment mode with three main components:
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│    Write    │   │    Read     │   │   Backend   │
│   (3 pods)  │   │   (3 pods)  │   │   (3 pods)  │
├─────────────┤   ├─────────────┤   ├─────────────┤
│ • Ingestion │   │ • Queries   │   │ • Compactor │
│ • WAL       │   │ • Log Fetch │   │ • Retention │
│ • Storage   │   │ • Grafana   │   │ • Cleanup   │
└─────────────┘   └─────────────┘   └─────────────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
                ┌────────▼───────┐
                │     AWS S3     │
                │ (ub-global-us- │
                │   loki-data)   │
                └────────────────┘
```
Canary Disabled
The Loki canary component is disabled (lokiCanary.enabled: false); it is an optional component used only to verify end-to-end log delivery.
📋 Helm Values Configuration¶
Global Settings¶
```yaml
# Deployment mode configuration
deploymentMode: SimpleScalable

# Loki core configuration
loki:
  auth_enabled: false

  # Server configuration
  server:
    grpc_server_max_recv_msg_size: 104857600  # 100MB
    grpc_server_max_send_msg_size: 104857600
    http_server_read_timeout: 600s
    http_server_write_timeout: 600s
```
🔧 Component Configuration¶
Write Component (Ingestion)¶
```yaml
write:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false

  # Persistence for WAL (Write-Ahead Log)
  persistence:
    volumeClaimsEnabled: true
    accessModes:
      - ReadWriteOnce
    size: 100Gi
    storageClass: gp3
    enableStatefulSetAutoDeletePVC: false
```
Write Component Scaling
The write component uses StatefulSets with persistent volumes. When scaling, refer to the Loki scaling documentation for proper WAL handling.
Read Component (Queries)¶
```yaml
read:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false

  # Legacy read target disabled (using 3-target mode)
  legacyReadTarget: false
```
Backend Component¶
```yaml
backend:
  enabled: true
  replicas: 1
  autoscaling:
    enabled: false

  # Persistence for backend processing
  persistence:
    volumeClaimsEnabled: true
    accessModes:
      - ReadWriteOnce
    size: 50Gi
    storageClass: gp3
    enableStatefulSetAutoDeletePVC: false
```
🗂️ AWS S3 Integration for Log Storage¶
Loki uses AWS S3 as its object storage backend for scalable, durable, and cost-effective log retention.
Storage Configuration¶
```yaml
loki:
  storage:
    # S3 bucket names
    bucketNames:
      chunks: ub-global-us-loki-data
      ruler: ub-global-us-loki-data
      admin: ub-global-us-loki-data

    # Storage type
    type: s3

    # S3 configuration
    s3:
      s3: s3://ub-global-us-loki-data
      region: us-east-1

    # Object store configuration
    object_store:
      type: s3
      s3:
        region: us-east-1
```
Schema Configuration¶
```yaml
loki:
  schemaConfig:
    configs:
      - from: 2025-11-01
        store: tsdb            # Time Series Database index
        object_store: aws      # AWS S3 for chunks
        schema: v13            # Latest schema version
        index:
          prefix: index_east_2025_
          period: 24h
```
Schema Details
- Store Type: `tsdb` - modern time-series database index format
- Object Store: `aws` - uses AWS S3 for chunk storage
- Schema Version: `v13` - latest production-ready schema
- Index Period: `24h` - daily index rotation
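If the schema ever needs to change (a new schema version or index prefix), the safe pattern is to append a new entry with a future `from` date rather than editing the existing one, so previously written data stays readable. A sketch with placeholder values for the new period:

```yaml
loki:
  schemaConfig:
    configs:
      - from: 2025-11-01            # existing period, left untouched
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_east_2025_
          period: 24h
      - from: 2026-06-01            # placeholder future date for the new period
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_east_2026_  # placeholder prefix
          period: 24h
```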
Authentication¶
Authentication to AWS S3 is managed using IAM roles and Pod Identity:
- Loki pods use Pod Identity to assume IAM roles
- IAM policies grant access to the S3 bucket
- No static credentials required
Security Best Practice
Pod Identity ensures secure, temporary credentials are used without storing long-lived access keys in the cluster.
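For reference, the association between the Loki service account and its IAM role is typically created once per cluster along these lines (the cluster name, account ID, and role name below are placeholders):

```bash
# Sketch: bind the loki service account to an IAM role via EKS Pod Identity.
# Cluster name, account ID, and role name are placeholders.
aws eks create-pod-identity-association \
  --cluster-name <cluster-name> \
  --namespace loki \
  --service-account loki \
  --role-arn arn:aws:iam::<account-id>:role/<loki-s3-role>
```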
⚙️ Advanced Configuration¶
Limits and Performance¶
```yaml
loki:
  limits_config:
    # Stream limits
    max_global_streams_per_user: 100000
    max_entries_limit_per_query: 100000

    # Retention settings
    reject_old_samples: false
    reject_old_samples_max_age: 9600h  # 400 days
    retention_period: 876000h          # ~100 years (effectively unlimited)

    # Volume tracking
    volume_enabled: true

    # Query optimization
    max_query_length: 8760h            # 1 year
    query_timeout: 600s                # 10 minutes
    split_queries_by_interval: 12h
```
Retention Configuration
While retention_period is set to a high value, actual retention is managed by the compactor settings and S3 lifecycle policies.
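As an illustration of such a lifecycle policy, a rule expiring objects after a set period could be applied to the bucket roughly like this (the 90-day expiration is a placeholder, not our actual policy):

```bash
# Sketch: expire objects in the Loki bucket after 90 days (placeholder value).
aws s3api put-bucket-lifecycle-configuration \
  --bucket ub-global-us-loki-data \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-loki-data",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Expiration": {"Days": 90}
      }
    ]
  }'
```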
Query Optimization¶
```yaml
loki:
  query_range:
    cache_results: true
    max_retries: 5
    align_queries_with_step: true
    cache_volume_results: true

  frontend:
    compress_responses: true
    max_outstanding_per_tenant: 100
    log_queries_longer_than: 30s
```
Compactor Settings¶
```yaml
compactor:
  retention_enabled: false  # Retention handled by S3 lifecycle
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  compaction_interval: 10m
```
Compaction
The compactor runs every 10 minutes to compact and deduplicate index files; retention-based deletion is disabled here because S3 lifecycle policies handle it.
🚨 Ruler and Alerting¶
Ruler Configuration¶
```yaml
ruler:
  enabled: true
  alertmanager_url: http://alertmanager-operated.monitoring.svc.cluster.local:9093
  enable_api: true
  enable_alertmanager_v2: true
  enable_alertmanager_discovery: true

loki:
  rulerConfig:
    wal:
      dir: /var/loki/ruler-wal
    storage:
      type: local
      local:
        directory: /rules
```
Alertmanager Integration
Loki's ruler integrates with Prometheus Alertmanager in the monitoring namespace for centralized alert management.
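Rule files live under the local /rules directory (with auth disabled, Loki reads them from the "fake" tenant subdirectory). A minimal sketch of what an alerting rule file could look like; the rule name, selector, and threshold below are illustrative only, not alerts we actually ship:

```yaml
# Sketch: /rules/fake/loki-alerts.yaml (rule name, selector, and threshold are illustrative)
groups:
  - name: loki-log-alerts
    rules:
      - alert: HighErrorLogRate
        expr: sum(rate({namespace=~"sim|sms|sia"} |= "error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error log rate above threshold in production namespaces"
```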
🔐 Security and Service Accounts¶
Service Account Configuration¶
```yaml
serviceAccount:
  create: true
  name: loki
  automountServiceAccountToken: true

write:
  serviceAccount:
    create: false  # Uses shared service account

backend:
  serviceAccount:
    create: false  # Uses shared service account
```
IAM Integration
The Loki service account is annotated with IAM role ARN for Pod Identity to access AWS S3 securely.
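To confirm the shared service account and its IAM wiring, the following checks can help; the cluster name is a placeholder, and whether the role shows up as an annotation or as a Pod Identity association depends on how the role was attached:

```bash
# Inspect the shared service account used by all Loki components
kubectl get serviceaccount loki -n loki -o yaml

# List EKS Pod Identity associations for the loki namespace (cluster name is a placeholder)
aws eks list-pod-identity-associations --cluster-name <cluster-name> --namespace loki
```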
🔍 Querying Logs¶
From Grafana¶
- Navigate to Explore in Grafana
- Select Loki datasource
- Use LogQL to query logs:
- Filter by labels: `namespace`, `app`, `pod`, `container`
- Time range: adjust based on your needs
Common Query Patterns¶
```
# All errors in production namespace
{namespace=~"sim|sms|sia"} |= "error"

# High request latency
{namespace="mno-service"} | json | duration > 5s

# Failed authentications
{app="sia-service"} |~ "authentication failed"
```
Query Performance
- Use specific label filters to reduce data scanned
- Limit time ranges for faster queries
- Leverage indexed labels like `namespace` and `pod`
📊 Monitoring and Observability¶
Metrics¶
Loki exposes Prometheus metrics on port 3100:
- `/metrics` - standard Prometheus metrics endpoint
- Query rates, error rates, latency
- Storage metrics (S3 operations, cache hits)
- Ingestion metrics (bytes received, streams active)
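A quick way to confirm the endpoint is serving data, using the same exec pattern as the health checks below (`loki_build_info` is a standard metric exposed by Loki):

```bash
# Confirm the metrics endpoint responds and exposes Loki metrics
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/metrics | grep loki_build_info
```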
Health Checks¶
```bash
# Check if Loki is ready
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/ready

# Check component status
kubectl exec -n loki loki-backend-0 -- wget -qO- http://localhost:3100/services
```
🛠️ Troubleshooting¶
Common Issues¶
Slow Queries¶
Query Timeout
If queries are timing out, check:

- Time range (reduce if too large)
- Query complexity (simplify filters)
- Backend pod resources (scale if needed)
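Because `log_queries_longer_than` is set to 30s in the frontend configuration, slow queries are also logged on the read path. A quick way to surface them; the label selector mirrors the write/Promtail examples in the next subsection, and the exact log wording may vary by Loki version:

```bash
# Surface slow-query log lines from the read component pods
kubectl logs -n loki -l app.kubernetes.io/component=read --tail=500 | grep -i slow
```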
Missing Logs¶
Ingestion Issues
Check the write component and Promtail:
```bash
# Check write component
kubectl logs -n loki -l app.kubernetes.io/component=write --tail=100

# Check Promtail
kubectl logs -n promtail -l app.kubernetes.io/name=promtail --tail=100

# Verify S3 connectivity
kubectl exec -n loki loki-write-0 -- wget -qO- http://localhost:3100/services
```
Storage Issues¶
```bash
# Check S3 bucket access
aws s3 ls s3://ub-global-us-loki-data/ --profile production

# Verify IAM role permissions
kubectl describe pod -n loki loki-write-0 | grep -A5 "Service Account"

# Check PVC status
kubectl get pvc -n loki
```
🔄 Maintenance Operations¶
Scaling Components¶
```bash
# Scale read replicas for query load
# (with legacyReadTarget disabled, the read component runs as a Deployment)
kubectl scale deployment -n loki loki-read --replicas=3

# Scale write replicas (follow WAL migration process)
# See: https://grafana.com/docs/loki/latest/operations/storage/wal/#how-to-scale-updown
```
Write Component Scaling
Scaling the write component requires careful WAL (Write-Ahead Log) migration. Always follow the official Loki documentation.
Updating Configuration¶
```bash
# Update values in ArgoCD repository
cd /path/to/argocd/infra-applications/values/grafana-loki/

# Edit environment-specific values
vi values-ub-global-us.yaml

# Commit and push
git add .
git commit -m "Update Loki configuration"
git push

# ArgoCD will automatically sync changes
```
📚 Related Documentation¶
🔗 External Resources¶
Last Updated: January 2025
Maintained by: DevOps Team