📚 DevOps & Platform Overview¶

2025-12-17

Welcome to the Unibeam DevOps & Platform Documentation!
This section contains comprehensive guides for infrastructure, CI/CD, monitoring, troubleshooting, and operational best practices for the Unibeam platform running on AWS EKS.

📖 Table of Contents¶

🌍 Environments¶

Server List & Environments - Complete list of all environments
POC & Development Environments
Production Environments
Campaign Deployments

🏗️ Infrastructure & Architecture¶

Terraform Infrastructure - Complete infrastructure as code documentation

AWS SSO Mapping - AWS SSO setup and user mapping
MongoDB Atlas Configuration - MongoDB Atlas setup and TIM preload

🚀 CI/CD & GitOps¶

ArgoCD Repository Structure - Complete ArgoCD repo organization and workflows
ArgoCD Operations - ArgoCD deployment and management
Maven CI/CD Flow - Java service build and deployment pipeline
GitHub Actions Workflows - CI/CD workflow configuration

📊 Monitoring & Observability¶

Loki Stack - Log aggregation with Loki and Promtail
Prometheus HA Setup - High availability Prometheus configuration

Loki Status & Health - Loki operational status checks
Redis Labs Monitoring - Redis Cloud monitoring and alerts

🛠️ How-To Guides¶

How to Deploy Applications - Complete deployment guide
Grafana Backup & Restore - Backup Grafana dashboards and datasources
Helm Chart Upgrades - Upgrade standalone Helm deployments
S3 Mount for Pods - Mount S3 buckets in Kubernetes pods

💡 Tips & Tricks¶

Kubernetes Tips - Essential kubectl commands and best practices

🔧 Platform Components¶

Networking - Mikrotik TIM - Network configuration and routing

🌐 Platform Architecture¶

Repository Structure¶

The Unibeam platform is organized across multiple Git repositories, each serving a specific purpose:

Repository	Purpose	Key Contents
argocd	GitOps deployment manifests	App-of-apps, values files, application definitions
kubernetes	Helm charts & K8s manifests	Service charts, infra charts, Kustomize overlays
unibeam-workload-sia-prod-terraform	Production infrastructure	Terraform modules for AWS, networking, security
iac	Infrastructure as Code	Additional Terraform, CloudFormation templates
troubleshooting-docs	Documentation	This MkDocs site with all guides

Microservices Architecture¶

┌─────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Infrastructure Namespace (monitoring, loki, kafka)         │
│  ├── kube-prometheus-stack  (Prometheus, Grafana)           │
│  ├── loki-stack             (Loki, Promtail)                │
│  ├── kafka                  (Strimzi Kafka)                 │
│  └── ingress-nginx          (Ingress controller)            │
│                                                             │
│  Application Namespaces                                     │
│  ├── sim-service            (SIM provisioning)              │
│  ├── sms-service            (SMS gateway)                   │
│  ├── sia-service            (Authentication)                │
│  ├── mno-service            (MNO integration)               │
│  ├── audit-service          (Audit logging)                 │
│  ├── dashboard-service      (Admin dashboard)               │
│  ├── timer-service          (Scheduled tasks)               │
│  └── scheduled-jobs         (Cron jobs)                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
   MongoDB Atlas      Redis Cloud         AWS S3
   (Private Link)     (Peering)        (Object Storage)

🎯 Key Concepts¶

GitOps Workflow¶

All deployments follow a GitOps approach using ArgoCD:

Code Changes → Push to service repository
CI/CD Pipeline → GitHub Actions builds Docker image
Image Push → Push to AWS ECR
Manifest Update → Update ArgoCD values files
Auto Sync → ArgoCD detects changes and deploys
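
The manifest update step is typically a small scripted edit made by the pipeline. The following is a minimal sketch only, assuming the values file holds the image tag on a single `tag:` line. The function name `set_image_tag` and the file layout are hypothetical and are not taken from the argocd repository:

```shell
# Hypothetical helper: replace the value on every "tag:" line of a values file.
# Assumes the file has a single image tag line such as:   tag: "abc1234"
# The tag must not contain "|" or "&" (fine for Git SHAs and semver tags).
set_image_tag() {
  values_file="$1"
  new_tag="$2"
  tmp_file="$(mktemp)"
  # Keep the original indentation and key, swap only the value.
  sed -E "s|^([[:space:]]*tag:[[:space:]]*).*|\1\"${new_tag}\"|" "$values_file" > "$tmp_file" \
    && cat "$tmp_file" > "$values_file"
  rm -f "$tmp_file"
}

# Example (path is illustrative):
# set_image_tag path/to/values.yaml abc1234
```

After an edit like this, the pipeline commits and pushes the values file, and ArgoCD picks up the change according to the environment's sync policy.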

Deployment Pattern

All infrastructure changes go through Terraform
All application changes go through ArgoCD
No manual kubectl operations in production

Environment Structure¶

Environment	Branch	Purpose	Sync Policy
Development	`dev`	Feature testing	Auto-sync enabled
Demo	`demo`	Client demonstrations	Auto-sync enabled
QA	`qa`	Quality assurance	Manual sync
Production	`main`	Live workloads	Manual sync with approval
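
A pipeline script can encode the table above so that it never assumes auto-sync on a manually synced environment. This is a sketch under that assumption; the function name `sync_mode_for_branch` and the returned labels are hypothetical:

```shell
# Map a Git branch to the sync behavior listed in the environment table.
sync_mode_for_branch() {
  case "$1" in
    dev|demo) echo "auto" ;;
    qa)       echo "manual" ;;
    main)     echo "manual-with-approval" ;;
    *)        echo "unknown"; return 1 ;;
  esac
}

# Example:
# sync_mode_for_branch qa    # prints "manual"
```

For the manual environments, the sync itself is still triggered by a person (in the ArgoCD UI or with `argocd app sync <app-name>`); the helper only reports which mode applies.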

Infrastructure as Code¶

Terraform manages all AWS resources (VPC, EKS, IAM, S3, RDS, etc.)
Helm packages applications with environment-specific values
Kustomize provides additional overlays when needed

🚀 Getting Started¶

Prerequisites¶

AWS CLI configured with SSO credentials
kubectl installed and configured
Helm 3.x installed
ArgoCD CLI (optional but recommended)
Terraform 1.5+ (for infrastructure changes)
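
A quick preflight check can confirm that the tools above are on your PATH before you start. This is a minimal sketch; the function name `missing_tools` is hypothetical, and the executable names (`aws`, `kubectl`, `helm`, `argocd`, `terraform`) are the standard ones for these CLIs:

```shell
# Print the name of each required tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: check the required tools (argocd is optional, so it is left out).
# missing_tools aws kubectl helm terraform
```

The check only covers presence. Version requirements (Helm 3.x, Terraform 1.5+) still need a look at `helm version` and `terraform version`.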

Quick Start Guide¶

1. Connect to EKS Cluster¶

# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name ub-global-us

# Verify connection
kubectl get nodes

See Terraform Infrastructure for AWS configuration details.

2. Access ArgoCD UI¶

# Port forward to ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Navigate to https://localhost:8080 and log in as admin with the password printed above.

See ArgoCD Operations for more details.

3. View Monitoring Dashboards¶

# Port forward to Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80

# Default credentials: admin / prom-operator

Navigate to http://localhost:3000.

4. Query Logs in Loki¶

# Port forward to Loki
kubectl port-forward svc/loki-gateway -n loki 3100:80

# Query logs via LogCLI
logcli query '{namespace="sim-service"}'

See Loki Stack for configuration details.
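
Queries usually combine a namespace selector with a line filter. The sketch below builds such a LogQL string; the function name `logql_for_namespace` is hypothetical, and it assumes your Promtail configuration attaches a `namespace` label, as the query above does. Substitute the namespace name actually used in your cluster:

```shell
# Build a LogQL query: a namespace selector plus an optional line filter.
logql_for_namespace() {
  namespace="$1"
  filter="${2:-}"
  query="{namespace=\"${namespace}\"}"
  if [ -n "$filter" ]; then
    # |= keeps only log lines that contain the given text
    query="${query} |= \"${filter}\""
  fi
  printf '%s\n' "$query"
}

# Usage, with the port-forward above still running
# (logcli talks to http://localhost:3100 unless --addr or LOKI_ADDR is set):
# logcli query --since=1h --limit=50 "$(logql_for_namespace sim-service error)"
```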

📦 Service Repositories¶

Core Services¶

Service	Repository	Namespace	Port
SIM Service	`sim-service`	`sim`	8080
SMS Service	`sms-service`	`sms`	8080
SIA Service	`sia-api`	`sia`	8080
MNO Service	`mno-service`	`mno`	8080
Audit Service	`audit-service`	`audit`	8080
Dashboard	`sia-dashboard`	`dashboard`	8080
Timer Service	`timer-service`	`timer`	8080
Scheduled Jobs	`scheduled-jobs`	`scheduled-jobs`	-

Simulator Services¶

Service	Repository	Purpose
SMSC Simulator	`smsc-java-applet-simulator`	SMS gateway testing
Applet Simulator	`applet-simulator-service`	SIM applet testing
Demo Server	`demo-server`	Client demos and POCs

🔧 Infrastructure Components¶

AWS Services Used¶

Compute: EKS, EC2, Lambda
Networking: VPC, Route53, CloudFront, ALB/NLB
Storage: S3, EFS, EBS
Database: RDS (if used), MongoDB Atlas (external), Redis Cloud (external)
Security: IAM, Secrets Manager, ACM, Security Groups
Messaging: MSK (Kafka), SNS, SQS
Monitoring: CloudWatch (integration with Prometheus)

Kubernetes Add-ons¶

Cert-Manager: Automatic TLS certificate management
Ingress-Nginx: HTTP/HTTPS routing
Prometheus Stack: Metrics collection and alerting
Loki Stack: Log aggregation and querying
Kafka (Strimzi): Event streaming
ArgoCD: GitOps continuous delivery

See Terraform Infrastructure for complete infrastructure details.

📊 Monitoring & Logging¶

Metrics (Prometheus)¶

Namespace: monitoring
Components: Prometheus, Alertmanager, Grafana
Retention: 15 days (configurable)
Scrape Interval: 30 seconds

Key Dashboards:

Kubernetes cluster metrics
Application performance (RED metrics)
Resource utilization (CPU, memory, disk)
Kafka metrics
MongoDB and Redis metrics

Logs (Loki)¶

Namespace: loki
Components: Loki (write, read, backend), Promtail
Storage: AWS S3 (ub-global-us-loki-data)
Retention: ~100 years (effectively unlimited; actual expiry is governed by the S3 lifecycle policy)

Log Sources:

All Kubernetes pod logs
Ingress access logs
Application logs (stdout/stderr)

See Loki Stack for detailed configuration.

Alerting¶

Alertmanager Configuration:

Slack notifications for critical alerts
Email notifications for warnings
PagerDuty integration (if configured)

Common Alerts:

High error rate
Pod restart loops
Resource exhaustion
Certificate expiration

🐛 Troubleshooting¶

Common Issues¶

Pods Not Starting¶

# Check pod status
kubectl get pods -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
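
The STATUS column from `kubectl get pods` usually points to the most useful next command. The mapping below is a sketch of one reasonable triage order, not the team's official runbook; the function name `next_step_for_status` is hypothetical, while the status values and kubectl flags are standard:

```shell
# Suggest a next diagnostic command for a pod status reason.
next_step_for_status() {
  case "$1" in
    CrashLoopBackOff)
      # The current container may have no output yet; read the crashed one.
      echo "kubectl logs <pod-name> -n <namespace> --previous" ;;
    ImagePullBackOff|ErrImagePull)
      # The Events section shows the registry or authentication error.
      echo "kubectl describe pod <pod-name> -n <namespace>" ;;
    Pending)
      # Scheduling problems (resources, taints, volumes) appear as events.
      echo "kubectl get events -n <namespace> --sort-by=.lastTimestamp" ;;
    *)
      echo "kubectl describe pod <pod-name> -n <namespace>" ;;
  esac
}

# Example:
# next_step_for_status CrashLoopBackOff
```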

Service Unreachable¶

# Check service endpoints
kubectl get endpoints -n <namespace>

# Check ingress
kubectl get ingress -n <namespace>

# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://<service-name>.<namespace>.svc.cluster.local
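
The URL in the connectivity test follows the in-cluster DNS pattern `<service>.<namespace>.svc.cluster.local`. A small sketch for building it, assuming the default `cluster.local` cluster domain; the function name `service_url` and the example Service name are hypothetical:

```shell
# Build the in-cluster URL for a Service; the port defaults to 80.
service_url() {
  service="$1"
  namespace="$2"
  port="${3:-80}"
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

# Example (Service name and namespace are illustrative):
# service_url sim-service sim 8080
```

Use the Service port reported by `kubectl get svc -n <namespace>`, which may differ from the container port. Also note that the debug pod above starts in your current namespace unless you pass `-n`, so network policies may affect the result.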

High Resource Usage¶

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace>

# Check HPA status
kubectl get hpa -n <namespace>

See Kubernetes Tips for more troubleshooting commands.
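
When a namespace has many pods, it helps to filter the `kubectl top pods` output instead of reading it by eye. The sketch below assumes the standard three-column layout with a header row (NAME, CPU(cores), MEMORY(bytes)); the function name `pods_over_cpu` and the threshold are hypothetical:

```shell
# Read `kubectl top pods` output on stdin and print the pods whose CPU
# usage exceeds the given threshold in millicores.
pods_over_cpu() {
  threshold_m="$1"
  awk -v limit="$threshold_m" '
    NR == 1 { next }                            # skip the header row
    {
      cpu = $2
      if (cpu ~ /m$/) { sub(/m$/, "", cpu) }    # "250m" -> 250
      else            { cpu = cpu * 1000 }      # whole cores -> millicores
      if (cpu + 0 > limit + 0) print $1
    }
  '
}

# Example: list pods using more than 500m CPU.
# kubectl top pods -n <namespace> | pods_over_cpu 500
```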

Getting Help¶

Check existing documentation in this site
Review pod logs and events
Query Loki for application logs
Check Grafana dashboards for metrics
Open an issue in the relevant repository

🔐 Security Best Practices¶

IAM & Authentication¶

✅ Use EKS Pod Identity for AWS service access (recommended)
✅ Use IRSA (IAM Roles for Service Accounts) for legacy services if Pod Identity isn't available
✅ Apply least privilege IAM policies
❌ Never hardcode AWS credentials
❌ Don't use root credentials

Secrets Management¶

✅ Store secrets in AWS Secrets Manager
✅ Use external-secrets-operator to sync secrets to K8s
✅ Rotate credentials regularly
❌ Never commit secrets to Git
❌ Don't use plain Kubernetes secrets for sensitive data
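
A lightweight check before committing can catch the most recognizable credential shape. The sketch below looks only for AWS access key IDs (the prefix `AKIA` followed by 16 uppercase letters or digits); the function name `contains_aws_key_id` is hypothetical, and a dedicated secret scanner covers far more cases than this single pattern:

```shell
# Succeed if the file contains text shaped like an AWS access key ID.
contains_aws_key_id() {
  grep -Eq 'AKIA[0-9A-Z]{16}' "$1"
}

# Example: refuse to continue if a staged file looks like it holds a key.
# if contains_aws_key_id path/to/file; then
#   echo "Possible AWS access key found, aborting" >&2
#   exit 1
# fi
```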

Network Security¶

Layered Security Approach:

1. VPC Level: Public/Private subnet segregation
2. Firewall Level: Network firewall rules (DMZ/Non-DMZ)
3. Security Groups: Instance-level filtering
4. Network Policies: Pod-to-pod communication control
5. Service Mesh: mTLS between services (if applicable)

📚 Additional Resources¶

Internal Documentation¶

Server List & Environments - All environments overview
ArgoCD Repository Guide - Complete ArgoCD structure

Terraform Infrastructure - Infrastructure as code
Loki Stack Configuration - Log aggregation setup

External References¶

🤝 Contributing¶

We welcome contributions to improve documentation and platform operations!

How to Contribute:

Documentation Updates: Submit PRs to troubleshooting-docs repository
Infrastructure Changes: Follow Terraform workflow in unibeam-workload-sia-prod-terraform
Service Updates: Update respective service repositories
CI/CD Improvements: Update workflows in CICD or service repositories

Guidelines:

Follow existing documentation style and structure
Use MkDocs Material theme conventions
Include diagrams for complex concepts
Test all code examples before submitting
Update relevant documentation when making changes

📞 Support & Contact¶

For questions, issues, or support:

Check Documentation: Search this site first
Review Logs: Check Loki and application logs
Check Metrics: Review Grafana dashboards
Open Issues: Create issues in relevant repositories
Contact Team: Reach out to DevOps team via Slack

Documentation Status

Some links in this document point to files that are being created or migrated.
If you encounter a broken link, please check back later or contact the DevOps team.

Platform Status

EKS Cluster: ✅ Operational
ArgoCD: ✅ Syncing
Monitoring: ✅ Collecting metrics
Logging: ✅ Ingesting logs
All Services: ✅ Healthy

Last Updated

This documentation is continuously updated as the platform evolves.
Maintained by: Unibeam DevOps Team

For the latest updates, visit the troubleshooting-docs repository.