📚 DevOps & Platform Overview

Welcome to the Unibeam DevOps & Platform Documentation!
This section contains comprehensive guides for infrastructure, CI/CD, monitoring, troubleshooting, and operational best practices for the Unibeam platform running on AWS EKS.


📖 Table of Contents

🌍 Environments

🏗️ Infrastructure & Architecture

🚀 CI/CD & GitOps

📊 Monitoring & Observability

🛠️ How-To Guides

💡 Tips & Tricks

🔧 Platform Components


🌐 Platform Architecture

Repository Structure

The Unibeam platform is organized across multiple Git repositories, each serving a specific purpose:

| Repository | Purpose | Key Contents |
|---|---|---|
| argocd | GitOps deployment manifests | App-of-apps, values files, application definitions |
| kubernetes | Helm charts & K8s manifests | Service charts, infra charts, Kustomize overlays |
| unibeam-workload-sia-prod-terraform | Production infrastructure | Terraform modules for AWS, networking, security |
| iac | Infrastructure as Code | Additional Terraform, CloudFormation templates |
| troubleshooting-docs | Documentation | This MkDocs site with all guides |

Microservices Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Infrastructure Namespace (monitoring, loki, kafka)         │
│  ├── kube-prometheus-stack  (Prometheus, Grafana)           │
│  ├── loki-stack             (Loki, Promtail)                │
│  ├── kafka                  (Strimzi Kafka)                 │
│  └── ingress-nginx          (Ingress controller)            │
│                                                             │
│  Application Namespaces                                     │
│  ├── sim-service            (SIM provisioning)              │
│  ├── sms-service            (SMS gateway)                   │
│  ├── sia-service            (Authentication)                │
│  ├── mno-service            (MNO integration)               │
│  ├── audit-service          (Audit logging)                 │
│  ├── dashboard-service      (Admin dashboard)               │
│  ├── timer-service          (Scheduled tasks)               │
│  └── scheduled-jobs         (Cron jobs)                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
   MongoDB Atlas      Redis Cloud         AWS S3
   (Private Link)     (Peering)        (Object Storage)

🎯 Key Concepts

GitOps Workflow

All deployments follow a GitOps approach using ArgoCD:

  1. Code Changes → Push to service repository
  2. CI/CD Pipeline → GitHub Actions builds Docker image
  3. Image Push → Push to AWS ECR
  4. Manifest Update → Update ArgoCD values files
  5. Auto Sync → ArgoCD detects changes and deploys
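
Step 4 is the hand-off between CI and ArgoCD. As a rough sketch of what that step might look like, the helper below rewrites the image tag in a Helm values file. The function name `update_image_tag` and the assumption that the tag lives on an indented `tag:` line are illustrative only; the real pipeline and values layout may differ.

```shell
# Hypothetical helper for step 4: rewrite every `tag:` line in a Helm values
# file so ArgoCD sees a new image version on its next sync.
# Assumption: the tag is stored as `tag: <value>` on its own (indented) line.
update_image_tag() {
  local values_file="$1" new_tag="$2" tmp
  tmp="$(mktemp)" || return 1
  # Keep the original indentation and key, replace only the value.
  sed -E "s|^([[:space:]]*tag:[[:space:]]*).*|\\1\"${new_tag}\"|" "$values_file" > "$tmp" \
    && cat "$tmp" > "$values_file"
  rm -f "$tmp"
}

# Usage (path and tag are placeholders):
# update_image_tag path/to/values.yaml "abc123"
```

Committing and pushing the changed values file is what ArgoCD then picks up in step 5.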

Deployment Pattern

  • All infrastructure changes go through Terraform
  • All application changes go through ArgoCD
  • No manual kubectl operations in production

Environment Structure

| Environment | Branch | Purpose | Sync Policy |
|---|---|---|---|
| Development | dev | Feature testing | Auto-sync enabled |
| Demo | demo | Client demonstrations | Auto-sync enabled |
| QA | qa | Quality assurance | Manual sync |
| Production | main | Live workloads | Manual sync with approval |
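
The branch-to-policy mapping above can be encoded in a small lookup, which may be handy in scripts that need to know whether a push deploys on its own. This is only a sketch; the function name `sync_policy_for_branch` is hypothetical and the mapping simply mirrors the table.

```shell
# Maps a branch name from the environment table to its sync policy.
# Prints "auto" or "manual"; returns non-zero for an unknown branch.
sync_policy_for_branch() {
  case "$1" in
    dev|demo) echo "auto" ;;
    qa|main)  echo "manual" ;;
    *)        echo "unknown branch: $1" >&2; return 1 ;;
  esac
}
```

For a manual environment, the sync is then triggered explicitly, for example with `argocd app sync <app-name>`; production additionally requires approval first.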

Infrastructure as Code

  • Terraform manages all AWS resources (VPC, EKS, IAM, S3, RDS, etc.)
  • Helm packages applications with environment-specific values
  • Kustomize provides additional overlays when needed

🚀 Getting Started

Prerequisites

  • AWS CLI configured with SSO credentials
  • kubectl installed and configured
  • Helm 3.x installed
  • ArgoCD CLI (optional but recommended)
  • Terraform 1.5+ (for infrastructure changes)
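
A quick way to confirm the tools above are installed is to check each one on `PATH`. The sketch below uses a hypothetical helper, `missing_tools`; it only checks presence, so the Helm 3.x and Terraform 1.5+ version requirements still need to be verified separately.

```shell
# Prints the name of every tool from the argument list that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: an empty result means everything was found.
# missing_tools aws kubectl helm argocd terraform
```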

Quick Start Guide

1. Connect to EKS Cluster

# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name ub-global-us

# Verify connection
kubectl get nodes

See Terraform Infrastructure for AWS configuration details.

2. Access ArgoCD UI

# Port forward to ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Navigate to https://localhost:8080 and log in as admin with the password printed above.

See ArgoCD Operations for more details.

3. View Monitoring Dashboards

# Port forward to Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80

# Default credentials (kube-prometheus-stack chart default, unless overridden): admin / prom-operator

Navigate to http://localhost:3000.

4. Query Logs in Loki

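# Port forward to Loki (keep this running in a separate terminal)
kubectl port-forward svc/loki-gateway -n loki 3100:80

# Point LogCLI at the forwarded port, then query logs
export LOKI_ADDR=http://localhost:3100
logcli query '{namespace="sim-service"}'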

See Loki Stack for configuration details.
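
Most log searches narrow the namespace selector with a line filter. The sketch below builds such a LogQL query string; `logql_for_namespace` is a hypothetical helper, and it assumes logs carry a `namespace` label as in the query above.

```shell
# Builds a LogQL query: a namespace selector plus an optional line filter.
# Assumption: log streams are labelled with `namespace`.
logql_for_namespace() {
  local namespace="$1" needle="${2:-}"
  if [ -n "$needle" ]; then
    # `|=` keeps only log lines that contain the given text.
    printf '{namespace="%s"} |= "%s"\n' "$namespace" "$needle"
  else
    printf '{namespace="%s"}\n' "$namespace"
  fi
}

# Usage (requires the port-forward and a reachable Loki address):
# logcli query --since=1h --limit=50 "$(logql_for_namespace sim-service error)"
```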


📦 Service Repositories

Core Services

| Service | Repository | Namespace | Port |
|---|---|---|---|
| SIM Service | sim-service | sim | 8080 |
| SMS Service | sms-service | sms | 8080 |
| SIA Service | sia-api | sia | 8080 |
| MNO Service | mno-service | mno | 8080 |
| Audit Service | audit-service | audit | 8080 |
| Dashboard | sia-dashboard | dashboard | 8080 |
| Timer Service | timer-service | timer | 8080 |
| Scheduled Jobs | scheduled-jobs | scheduled-jobs | - |

Simulator Services

| Service | Repository | Purpose |
|---|---|---|
| SMSC Simulator | smsc-java-applet-simulator | SMS gateway testing |
| Applet Simulator | applet-simulator-service | SIM applet testing |
| Demo Server | demo-server | Client demos and POCs |

🔧 Infrastructure Components

AWS Services Used

  • Compute: EKS, EC2, Lambda
  • Networking: VPC, Route53, CloudFront, ALB/NLB
  • Storage: S3, EFS, EBS
  • Database: RDS (if used), MongoDB Atlas (external), Redis Cloud (external)
  • Security: IAM, Secrets Manager, ACM, Security Groups
  • Messaging: MSK (Kafka), SNS, SQS
  • Monitoring: CloudWatch (integration with Prometheus)

Kubernetes Add-ons

  • Cert-Manager: Automatic TLS certificate management
  • Ingress-Nginx: HTTP/HTTPS routing
  • Prometheus Stack: Metrics collection and alerting
  • Loki Stack: Log aggregation and querying
  • Kafka (Strimzi): Event streaming
  • ArgoCD: GitOps continuous delivery

See Terraform Infrastructure for complete infrastructure details.


📊 Monitoring & Logging

Metrics (Prometheus)

  • Namespace: monitoring
  • Components: Prometheus, Alertmanager, Grafana
  • Retention: 15 days (configurable)
  • Scrape Interval: 30 seconds

Key Dashboards:

  • Kubernetes cluster metrics
  • Application performance (RED metrics)
  • Resource utilization (CPU, memory, disk)
  • Kafka metrics
  • MongoDB and Redis metrics

Logs (Loki)

  • Namespace: loki
  • Components: Loki (write, read, backend), Promtail
  • Storage: AWS S3 (ub-global-us-loki-data)
  • Retention: ~100 years (effectively unlimited, managed by S3 lifecycle)

Log Sources:

  • All Kubernetes pod logs
  • Ingress access logs
  • Application logs (stdout/stderr)

See Loki Stack for detailed configuration.

Alerting

Alertmanager Configuration:

  • Slack notifications for critical alerts
  • Email notifications for warnings
  • PagerDuty integration (if configured)

Common Alerts:

  • High error rate
  • Pod restart loops
  • Resource exhaustion
  • Certificate expiration


🐛 Troubleshooting

Common Issues

Pods Not Starting

# Check pod status
kubectl get pods -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
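
In a busy namespace it helps to narrow the pod list to the ones that need attention before describing them. The filter below is a sketch that reads `kubectl get pods` output; `unhealthy_pods` is a hypothetical name, and the default column layout (NAME READY STATUS RESTARTS AGE) is assumed.

```shell
# Reads `kubectl get pods` output on stdin and prints "<pod> <status>" for
# every pod whose STATUS column is neither Running nor Completed.
# Assumption: default column layout, with STATUS as the third column.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
# kubectl get pods -n <namespace> | unhealthy_pods
```

Pods that are Running but not Ready (for example READY 0/1) are not flagged by this filter, so the READY column is still worth a look.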

Service Unreachable

# Check service endpoints
kubectl get endpoints -n <namespace>

# Check ingress
kubectl get ingress -n <namespace>

# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://<service-name>.<namespace>.svc.cluster.local
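
A service with no endpoints usually means no ready pod matches its selector, which is a common cause of an unreachable service. As a sketch, the filter below picks those services out of the `kubectl get endpoints` listing; `services_without_endpoints` is a hypothetical helper name.

```shell
# Reads `kubectl get endpoints` output on stdin and prints the services whose
# ENDPOINTS column is "<none>", i.e. nothing is currently backing the service.
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage:
# kubectl get endpoints -n <namespace> | services_without_endpoints
```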

High Resource Usage

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace>

# Check HPA status
kubectl get hpa -n <namespace>
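
To spot the heaviest pods quickly, the `kubectl top pods` output can be filtered by a memory threshold. The sketch below assumes memory is reported in Mi (for example `512Mi`) and skips rows in any other unit; `pods_over_memory` is a hypothetical helper name.

```shell
# Reads `kubectl top pods` output on stdin and prints pods whose memory usage
# exceeds the given threshold in MiB.
# Assumption: columns are NAME CPU(cores) MEMORY(bytes), memory given in Mi.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 && $3 ~ /^[0-9]+Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)          # strip the unit, leaving the number
    if (mem + 0 > limit + 0) print $1, $3
  }'
}

# Usage:
# kubectl top pods -n <namespace> | pods_over_memory 512
```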

See Kubernetes Tips for more troubleshooting commands.

Getting Help

  • Check existing documentation on this site
  • Review pod logs and events
  • Query Loki for application logs
  • Check Grafana dashboards for metrics
  • Open an issue in the relevant repository

🔐 Security Best Practices

IAM & Authentication

  • ✅ Use Pod Identity for AWS service access (recommended)
  • ✅ Use IRSA for legacy services if Pod Identity isn't available
  • ✅ Apply least privilege IAM policies
  • ❌ Never hardcode AWS credentials
  • ❌ Don't use root credentials

Secrets Management

  • ✅ Store secrets in AWS Secrets Manager
  • ✅ Use external-secrets-operator to sync secrets to K8s
  • ✅ Rotate credentials regularly
  • ❌ Never commit secrets to Git
  • ❌ Don't use plain Kubernetes secrets for sensitive data
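
One inexpensive guard against committing credentials is a pattern check before pushing. The sketch below looks for strings shaped like an AWS access key ID (`AKIA` followed by 16 uppercase letters or digits). The helper name `find_aws_key_ids` is hypothetical, and this coarse check does not replace a dedicated secret scanner.

```shell
# Lists files under a directory that contain something shaped like an AWS
# access key ID. Prints nothing when no match is found.
find_aws_key_ids() {
  grep -rlE 'AKIA[0-9A-Z]{16}' "${1:-.}" 2>/dev/null || true
}

# Usage: scan the current repository checkout.
# find_aws_key_ids .
```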

Network Security

Layered Security Approach:

  1. VPC Level: Public/Private subnet segregation
  2. Firewall Level: Network firewall rules (DMZ/Non-DMZ)
  3. Security Groups: Instance-level filtering
  4. Network Policies: Pod-to-pod communication control
  5. Service Mesh: mTLS between services (if applicable)


📚 Additional Resources

Internal Documentation

External References


🤝 Contributing

We welcome contributions to improve documentation and platform operations!

How to Contribute:

  1. Documentation Updates: Submit PRs to the troubleshooting-docs repository
  2. Infrastructure Changes: Follow the Terraform workflow in unibeam-workload-sia-prod-terraform
  3. Service Updates: Update the respective service repositories
  4. CI/CD Improvements: Update workflows in the CICD or service repositories

Guidelines:

  • Follow existing documentation style and structure
  • Use MkDocs Material theme conventions
  • Include diagrams for complex concepts
  • Test all code examples before submitting
  • Update relevant documentation when making changes

📞 Support & Contact

For questions, issues, or support:

  1. Check Documentation: Search this site first
  2. Review Logs: Check Loki and application logs
  3. Check Metrics: Review Grafana dashboards
  4. Open Issues: Create issues in relevant repositories
  5. Contact Team: Reach out to DevOps team via Slack

Documentation Status

Some links in this document point to files that are being created or migrated.
If you encounter a broken link, please check back later or contact the DevOps team.

Platform Status

  • EKS Cluster: ✅ Operational
  • ArgoCD: ✅ Syncing
  • Monitoring: ✅ Collecting metrics
  • Logging: ✅ Ingesting logs
  • All Services: ✅ Healthy

Last Updated

This documentation is continuously updated as the platform evolves.
Maintained by: Unibeam DevOps Team


For the latest updates, visit the troubleshooting-docs repository.