📚 DevOps & Platform Overview
Welcome to the Unibeam DevOps & Platform Documentation!
This section contains comprehensive guides for infrastructure, CI/CD, monitoring, troubleshooting, and operational best practices for the Unibeam platform running on AWS EKS.
📖 Table of Contents
🌍 Environments
- Server List & Environments - Complete list of all environments
- POC & Development Environments
- Production Environments
- Campaign Deployments
🏗️ Infrastructure & Architecture
- Terraform Infrastructure - Complete infrastructure as code documentation
- AWS SSO Mapping - AWS SSO setup and user mapping
- MongoDB Atlas Configuration - MongoDB Atlas setup and TIM preload
🚀 CI/CD & GitOps
- ArgoCD Repository Structure - Complete ArgoCD repo organization and workflows
- ArgoCD Operations - ArgoCD deployment and management
- Maven CI/CD Flow - Java service build and deployment pipeline
- GitHub Actions Workflows - CI/CD workflow configuration
📊 Monitoring & Observability
- Loki Stack - Log aggregation with Loki and Promtail
- Prometheus HA Setup - High availability Prometheus configuration
- Loki Status & Health - Loki operational status checks
- Redis Labs Monitoring - Redis Cloud monitoring and alerts
🛠️ How-To Guides
- How to Deploy Applications - Complete deployment guide
- Grafana Backup & Restore - Backup Grafana dashboards and datasources
- Helm Chart Upgrades - Upgrade standalone Helm deployments
- S3 Mount for Pods - Mount S3 buckets in Kubernetes pods
💡 Tips & Tricks
- Kubernetes Tips - Essential kubectl commands and best practices
🔧 Platform Components
- Networking - Mikrotik TIM - Network configuration and routing
🌐 Platform Architecture
Repository Structure
The Unibeam platform is organized across multiple Git repositories, each serving a specific purpose:
| Repository | Purpose | Key Contents |
|---|---|---|
| argocd | GitOps deployment manifests | App-of-apps, values files, application definitions |
| kubernetes | Helm charts & K8s manifests | Service charts, infra charts, Kustomize overlays |
| unibeam-workload-sia-prod-terraform | Production infrastructure | Terraform modules for AWS, networking, security |
| iac | Infrastructure as Code | Additional Terraform, CloudFormation templates |
| troubleshooting-docs | Documentation | This MkDocs site with all guides |
Microservices Architecture
┌─────────────────────────────────────────────────────────────┐
│                      AWS EKS Cluster                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Infrastructure Namespace (monitoring, loki, kafka)         │
│  ├── kube-prometheus-stack (Prometheus, Grafana)            │
│  ├── loki-stack (Loki, Promtail)                            │
│  ├── kafka (Strimzi Kafka)                                  │
│  └── ingress-nginx (Ingress controller)                     │
│                                                             │
│  Application Namespaces                                     │
│  ├── sim-service (SIM provisioning)                         │
│  ├── sms-service (SMS gateway)                              │
│  ├── sia-service (Authentication)                           │
│  ├── mno-service (MNO integration)                          │
│  ├── audit-service (Audit logging)                          │
│  ├── dashboard-service (Admin dashboard)                    │
│  ├── timer-service (Scheduled tasks)                        │
│  └── scheduled-jobs (Cron jobs)                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
          │                    │                    │
          ▼                    ▼                    ▼
    MongoDB Atlas         Redis Cloud             AWS S3
   (Private Link)          (Peering)        (Object Storage)
🎯 Key Concepts
GitOps Workflow
All deployments follow a GitOps approach using ArgoCD:
- Code Changes → Push to service repository
- CI/CD Pipeline → GitHub Actions builds Docker image
- Image Push → Push to AWS ECR
- Manifest Update → Update ArgoCD values files
- Auto Sync → ArgoCD detects changes and deploys
Deployment Pattern
- All infrastructure changes go through Terraform
- All application changes go through ArgoCD
- No manual kubectl operations in production
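The Manifest Update step of the workflow can be sketched as a small shell helper. This is a minimal illustration, assuming the values file keeps the image tag on a single `tag:` line; the function name `bump_image_tag` and that file layout are hypothetical and not taken from the argocd repository.

```shell
# Hypothetical helper: set the image tag in a Helm values file.
# Assumes one "tag:" line per file and a tag without "|" or "&" characters.
bump_image_tag() {
  local values_file="$1" new_tag="$2"
  [ -f "$values_file" ] || { echo "no such file: $values_file" >&2; return 1; }
  # Keep the original indentation and key; replace only the value.
  sed -i.bak -E "s|^([[:space:]]*tag:[[:space:]]*).*|\1\"${new_tag}\"|" "$values_file"
  rm -f "${values_file}.bak"
}

# Example (the path is illustrative):
#   bump_image_tag values/dev/sim-service.yaml "1.4.2"
```

After the change is committed and pushed to the environment's branch, ArgoCD should pick it up on its next sync for environments with auto-sync enabled.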
Environment Structure
| Environment | Branch | Purpose | Sync Policy |
|---|---|---|---|
| Development | dev | Feature testing | Auto-sync enabled |
| Demo | demo | Client demonstrations | Auto-sync enabled |
| QA | qa | Quality assurance | Manual sync |
| Production | main | Live workloads | Manual sync with approval |
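The branch-to-policy mapping in the Environment Structure table can be encoded in a script that decides whether a manual sync is needed. The sketch below is illustrative only: the function name and the policy labels it prints are assumptions, not values used by ArgoCD itself.

```shell
# Map a Git branch to the sync policy listed in the Environment Structure table.
# The labels "auto", "manual" and "manual-with-approval" are illustrative.
sync_policy_for_branch() {
  case "$1" in
    dev|demo) echo "auto" ;;
    qa)       echo "manual" ;;
    main)     echo "manual-with-approval" ;;
    *)        echo "unknown branch: $1" >&2; return 1 ;;
  esac
}

# Example: trigger a sync only where ArgoCD will not do it on its own.
#   [ "$(sync_policy_for_branch qa)" = "auto" ] || argocd app sync <app-name>
```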
Infrastructure as Code
- Terraform manages all AWS resources (VPC, EKS, IAM, S3, RDS, etc.)
- Helm packages applications with environment-specific values
- Kustomize provides additional overlays when needed
🚀 Getting Started
Prerequisites
- AWS CLI configured with SSO credentials
- kubectl installed and configured
- Helm 3.x installed
- ArgoCD CLI (optional but recommended)
- Terraform 1.5+ (for infrastructure changes)
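A quick way to check the prerequisites above is to test whether each tool is on the PATH. The helper below is a minimal sketch; `missing_tools` is a hypothetical name, and it checks presence only, not the Helm 3.x or Terraform 1.5+ version requirements.

```shell
# Print every named tool that is not on PATH; return non-zero if any is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   missing_tools aws kubectl helm argocd terraform
```

Empty output means every listed tool was found.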
Quick Start Guide
1. Connect to EKS Cluster
# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name ub-global-us
# Verify connection
kubectl get nodes
See Terraform Infrastructure for AWS configuration details.
2. Access ArgoCD UI
# Port forward to ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
Navigate to https://localhost:8080 and login.
See ArgoCD Operations for more details.
3. View Monitoring Dashboards
# Port forward to Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Default credentials: admin / prom-operator
Navigate to http://localhost:3000.
4. Query Logs in Loki
# Port forward to Loki
kubectl port-forward svc/loki-gateway -n loki 3100:80
# Query logs via LogCLI
logcli query '{namespace="sim-service"}'
See Loki Stack for configuration details.
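For narrower searches, the LogQL selector can be combined with a line filter. The helper below is a sketch that only builds the query string; `logql_for_namespace` is a hypothetical name, and the `ERROR` filter is an example rather than a convention of the platform's log format.

```shell
# Build a LogQL query for one namespace, optionally filtered by a substring.
logql_for_namespace() {
  local namespace="$1" needle="${2:-}"
  local query="{namespace=\"${namespace}\"}"
  if [ -n "$needle" ]; then
    # "|=" keeps only log lines that contain the given text.
    query="${query} |= \"${needle}\""
  fi
  printf '%s\n' "$query"
}

# Example (needs the port-forward above and logcli on PATH):
#   logcli query --since=1h --limit=50 "$(logql_for_namespace sim-service ERROR)"
```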
📦 Service Repositories
Core Services
| Service | Repository | Namespace | Port |
|---|---|---|---|
| SIM Service | sim-service | sim | 8080 |
| SMS Service | sms-service | sms | 8080 |
| SIA Service | sia-api | sia | 8080 |
| MNO Service | mno-service | mno | 8080 |
| Audit Service | audit-service | audit | 8080 |
| Dashboard | sia-dashboard | dashboard | 8080 |
| Timer Service | timer-service | timer | 8080 |
| Scheduled Jobs | scheduled-jobs | scheduled-jobs | - |
Simulator Services
| Service | Repository | Purpose |
|---|---|---|
| SMSC Simulator | smsc-java-applet-simulator | SMS gateway testing |
| Applet Simulator | applet-simulator-service | SIM applet testing |
| Demo Server | demo-server | Client demos and POCs |
🔧 Infrastructure Components
AWS Services Used
- Compute: EKS, EC2, Lambda
- Networking: VPC, Route53, CloudFront, ALB/NLB
- Storage: S3, EFS, EBS
- Database: RDS (if used), MongoDB Atlas (external), Redis Cloud (external)
- Security: IAM, Secrets Manager, ACM, Security Groups
- Messaging: MSK (Kafka), SNS, SQS
- Monitoring: CloudWatch (integration with Prometheus)
Kubernetes Add-ons
- Cert-Manager: Automatic TLS certificate management
- Ingress-Nginx: HTTP/HTTPS routing
- Prometheus Stack: Metrics collection and alerting
- Loki Stack: Log aggregation and querying
- Kafka (Strimzi): Event streaming
- ArgoCD: GitOps continuous delivery
See Terraform Infrastructure for complete infrastructure details.
📊 Monitoring & Logging
Metrics (Prometheus)
- Namespace: monitoring
- Components: Prometheus, Alertmanager, Grafana
- Retention: 15 days (configurable)
- Scrape Interval: 30 seconds
Key Dashboards:
- Kubernetes cluster metrics
- Application performance (RED metrics)
- Resource utilization (CPU, memory, disk)
- Kafka metrics
- MongoDB and Redis metrics
Logs (Loki)
- Namespace: loki
- Components: Loki (write, read, backend), Promtail
- Storage: AWS S3 (ub-global-us-loki-data)
- Retention: ~100 years (effectively unlimited, managed by S3 lifecycle)
Log Sources:
- All Kubernetes pod logs
- Ingress access logs
- Application logs (stdout/stderr)
See Loki Stack for detailed configuration.
Alerting
Alertmanager Configuration:
- Slack notifications for critical alerts
- Email notifications for warnings
- PagerDuty integration (if configured)
Common Alerts:
- High error rate
- Pod restart loops
- Resource exhaustion
- Certificate expiration
🐛 Troubleshooting
Common Issues
Pods Not Starting
# Check pod status
kubectl get pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check logs
kubectl logs <pod-name> -n <namespace>
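In a busy namespace it can help to filter the pod list down to the pods that need attention. The function below is a rough sketch that reads the default `kubectl get pods` table on stdin; `unhealthy_pods` is a hypothetical name, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Read `kubectl get pods` output on stdin and print pods whose STATUS is
# neither Running nor Completed, as "name<TAB>status".
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 "\t" $3 }'
}

# Example:
#   kubectl get pods -n <namespace> | unhealthy_pods
```

A pod that is Running but not Ready (for example READY 0/1) is not reported by this filter, so the READY column is still worth a look.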
Service Unreachable
# Check service endpoints
kubectl get endpoints -n <namespace>
# Check ingress
kubectl get ingress -n <namespace>
# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://<service-name>.<namespace>.svc.cluster.local
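The connectivity test above uses the default HTTP port, while the core services are listed with port 8080. A small helper can compose the in-cluster URL with an explicit port. This is a sketch only: `service_url` is a hypothetical name, and the Service name `sim-service` in the example is an assumption (only the namespace `sim` and port 8080 come from the Core Services table).

```shell
# Compose the in-cluster DNS URL for a Service; the port defaults to 80.
service_url() {
  local service="$1" namespace="$2" port="${3:-80}"
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

# Example (Service name is assumed):
#   kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
#     curl -sS -m 5 "$(service_url sim-service sim 8080)"
```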
High Resource Usage
# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check HPA status
kubectl get hpa -n <namespace>
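To spot heavy consumers quickly, the `kubectl top pods` output can be filtered by a memory threshold. The function below is a minimal sketch; `pods_over_memory` is a hypothetical name, and it assumes memory is the third column and is reported in Mi, so rows in other units are skipped.

```shell
# Read `kubectl top pods` output on stdin and print pods whose memory usage
# exceeds the given limit in Mi, as "name<TAB>memory".
pods_over_memory() {
  local limit_mi="$1"
  awk -v limit="$limit_mi" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1 "\t" $3
  }'
}

# Example:
#   kubectl top pods -n <namespace> | pods_over_memory 1024
```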
See Kubernetes Tips for more troubleshooting commands.
Getting Help
- Check existing documentation in this site
- Review pod logs and events
- Query Loki for application logs
- Check Grafana dashboards for metrics
- Open an issue in the relevant repository
🔐 Security Best Practices
IAM & Authentication
- ✅ Use Pod Identity for AWS service access (recommended)
- ✅ Use IRSA for legacy services if Pod Identity isn't available
- ✅ Apply least privilege IAM policies
- ❌ Never hardcode AWS credentials
- ❌ Don't use root credentials
Secrets Management
- ✅ Store secrets in AWS Secrets Manager
- ✅ Use external-secrets-operator to sync secrets to K8s
- ✅ Rotate credentials regularly
- ❌ Never commit secrets to Git
- ❌ Don't use plain Kubernetes secrets for sensitive data
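One way to back up the "never commit secrets" rule is a local scan before pushing. The sketch below looks only for strings shaped like AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits); `scan_for_aws_keys` is a hypothetical name, and the check is not a substitute for a dedicated secret scanner.

```shell
# Print "file:line" for anything that looks like an AWS access key ID.
# Accepts files or directories; directories are searched recursively.
scan_for_aws_keys() {
  grep -rnHE 'AKIA[0-9A-Z]{16}' "$@" | cut -d: -f1,2
}

# Example: scan the working tree before committing.
#   scan_for_aws_keys .
```

Empty output means no match was found. Other secret types (passwords, tokens, private keys) are not covered by this pattern.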
Network Security
Layered Security Approach:
1. VPC Level: Public/Private subnet segregation
2. Firewall Level: Network firewall rules (DMZ/Non-DMZ)
3. Security Groups: Instance-level filtering
4. Network Policies: Pod-to-pod communication control
5. Service Mesh: mTLS between services (if applicable)
📚 Additional Resources
Internal Documentation
- Server List & Environments - All environments overview
- ArgoCD Repository Guide - Complete ArgoCD structure
- Terraform Infrastructure - Infrastructure as code
- Loki Stack Configuration - Log aggregation setup
External References
- Kubernetes Documentation
- ArgoCD Documentation
- Helm Documentation
- Terraform AWS Provider
- Prometheus Documentation
- Grafana Loki Documentation
🤝 Contributing
We welcome contributions to improve documentation and platform operations!
How to Contribute:
- Documentation Updates: Submit PRs to the troubleshooting-docs repository
- Infrastructure Changes: Follow the Terraform workflow in unibeam-workload-sia-prod-terraform
- Service Updates: Update the respective service repositories
- CI/CD Improvements: Update workflows in CICD or service repositories
Guidelines:
- Follow existing documentation style and structure
- Use MkDocs Material theme conventions
- Include diagrams for complex concepts
- Test all code examples before submitting
- Update relevant documentation when making changes
📞 Support & Contact
For questions, issues, or support:
- Check Documentation: Search this site first
- Review Logs: Check Loki and application logs
- Check Metrics: Review Grafana dashboards
- Open Issues: Create issues in relevant repositories
- Contact Team: Reach out to DevOps team via Slack
Documentation Status
Some links in this document point to files that are being created or migrated.
If you encounter a broken link, please check back later or contact the DevOps team.
Platform Status
- EKS Cluster: ✅ Operational
- ArgoCD: ✅ Syncing
- Monitoring: ✅ Collecting metrics
- Logging: ✅ Ingesting logs
- All Services: ✅ Healthy
Last Updated
This documentation is continuously updated as the platform evolves.
Maintained by: Unibeam DevOps Team
For the latest updates, visit the troubleshooting-docs repository.