Unibeam ATnT Prod

2025-12-17

TOC¶

High Level Architecture Overview
Network Architecture Overview
AWS Components
SaaS Components
Unibeam Components
EKS Components
System Wide Applications
Network Detailed Architecture Overview
- Cloudflare:
  - Cloudflare LoadBalancer HealthCheck configurations
- AWS Global Accelerator
  - SIM-Accelerator details
EKS/K8s Integrations with AWS
- AWS Pod Identity Add-on
- Loki/Thanos/Prometheus
Repo Overview
CICD Flow
CLI
general-app/
infra-applications/
services-applications/
Starting debug node
Stress Test Kafka

This document is a high-level technical architecture overview of the ATNT application, a multi-region, active-passive (warm) deployment on AWS using EKS. The architecture combines AWS components, SaaS integrations, and Kubernetes components to provide high availability, scalability, and security.

High Level Architecture Overview¶

The workload is Active/Passive: us-east-1 is the main region and carries most of the load.

Network Architecture Overview¶

☁️ Cloud AWS Components¶

EKS - Managed service that simplifies running Kubernetes on AWS
- Control Plane - AWS manages the Kubernetes control plane, including the API servers and etcd database, ensuring high availability and automatic patching
- Worker - AWS manages the worker nodes, which are the EC2 instances that run containerized applications.
EC2 - Amazon Elastic Compute Cloud
ECR - Amazon Elastic Container Registry
Secrets Manager - AWS service to securely store and manage sensitive information such as API keys, passwords, and other secrets used by applications
S3 - Amazon Simple Storage Service
VPC - Amazon Virtual Private Cloud
TGW - AWS Transit Gateway
Firewall - AWS-managed firewall service that secures VPC ingress and egress traffic
ALB - Application Load Balancer, which routes HTTP/HTTPS traffic to the appropriate targets based on rules
NLB - Network Load Balancer, which handles TCP/UDP traffic
Global Accelerator - AWS networking service that enhances the availability and performance of applications with global users by routing traffic through the AWS global network
Route53 - AWS managed DNS service, used for internal DNS resolution

SaaS Components¶

Atlas - A cloud-based service that provides a fully managed MongoDB database. Access control is based on IAM roles attached to Atlas; each application has its own role and permission set.
- MongoDB runs in Active/Active mode with multiple regionalized private endpoints; in a failure event, no manual operation is needed.
RedisLabs - A cloud-based service that provides a fully managed Redis database. Access control is based on username/password authentication; each application has its own username and password for accessing Redis, stored in AWS Secrets Manager.
- The Redis cluster runs in Active/Active mode, with replication across regions for high availability and low-latency access.
- Continuous backup every couple of hours

[!note] Atlas MongoDB is connected via AWS PrivateLink; only TLS 1.2 is currently supported. Redis is connected via a peering connection and supports TLS 1.2 and 1.3.

Unibeam Components¶

The US-EAST-1 region will run the majority of the workload, including replica sets for each application to handle the increased load. US-WEST-2 will run in "light mode" with fewer workers and will be scaled up once more load is introduced.

Each region will have an IPsec tunnel for SMSC endpoint termination. sms-service will run in every region to ensure that the SMPP bind is available and stays in a "warm" state.

Unibeam workloads will run with HPA (Horizontal Pod Autoscaler), which automatically scales the number of pods based on CPU utilization or other selected metrics, ensuring optimal performance and resource utilization.
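As a rough illustration of the scaling rule, the Kubernetes documentation describes the HPA target as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below reproduces only that arithmetic. The function name, the 10% tolerance default, and the sample numbers are illustrative assumptions and are not taken from this deployment's HPA settings.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation.

    desired = ceil(current_replicas * current_metric / target_metric)
    Scaling is skipped when the ratio is within `tolerance` of 1.0
    (0.1 is an assumed default for this sketch).
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target, no change
    return max(1, math.ceil(current_replicas * ratio))


if __name__ == "__main__":
    # Hypothetical example: 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

A real HPA additionally clamps the result between its configured minReplicas and maxReplicas, which this sketch leaves out.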

Karpenter will be used to automatically provision and manage compute resources (worker nodes) for workloads, ensuring efficient resource utilization and scaling based on demand.
- Monitoring (Unscheduled Pods) - Karpenter automatically provisions new worker nodes to accommodate unscheduled pods, ensuring that workloads are always running and available.

EKS Components¶

Workers - EC2 instances grouped into worker groups, labeled per workload type.
- Unibeam Workers - Running Unibeam workloads
  - SIM, SMS, MNO, API, Dashboard, Timer, Scheduled-Jobs, Audit
- Kafka Workers - Running Kafka workloads (Broker, Coordinator)
- Spot Workers - Running Spot workloads (for increased capacity and cost savings) - for stateless applications
  - Grafana
  - Loki Querier
  - Loki Distributor
  - ArgoCD Components (except Redis)
  - Strimzi
  - AWS Load Balancer Controller
  - Reflector
- Monitor Workers - Running Monitoring workloads and CD
  - Kube-Prometheus-Stack - A monitoring system and time series database that collects metrics from Kubernetes clusters
    - AlertManager - Handles alerts generated by Prometheus and sends notifications based on configured rules
    - Grafana - A visualization tool that provides dashboards and graphs for monitoring metrics collected by Prometheus
    - Prometheus - A time series database that collects and stores metrics from various sources
    - Node Exporter - Collects hardware and OS metrics from the host system
    - Kube State Metrics - Exposes metrics about the state of Kubernetes objects
  - Loki - A log aggregation system that collects and stores logs from various sources
    - Ingester - Processes incoming logs and stores them in a time series database
    - Querier - Provides a query interface for retrieving logs from Loki
    - Distributor - Distributes incoming logs to the appropriate ingester
    - Index Gateway - Manages the indexing of logs for efficient querying
    - Compactor - Compacts and optimizes stored logs for better performance
  - ArgoCD - A continuous delivery tool for Kubernetes that automates the deployment of applications
    - ArgoCD Server - Provides the web UI and API for managing applications
    - ArgoCD Repo Server - Handles Git repository interactions and application definitions
    - ArgoCD Application Controller - Monitors the state of applications and ensures they are in sync with the desired state defined in Git
    - ArgoCD Dex - Provides authentication and authorization for ArgoCD
    - ArgoCD Redis - Used for caching and storing application state information
  - Thanos - designed to provide a highly available, long-term storage solution for Prometheus metrics
    - Sidecar - runs alongside Prometheus to enable Thanos features
    - Store Gateway - provides access to historical metrics stored in object storage (e.g., S3)
    - Querier - provides a unified query interface for accessing metrics from multiple Prometheus instances
    - Compactor - compacts and optimizes stored metrics for better performance
  - Promtail - agent designed to collect, process, and ship logs to Loki
  - Karpenter - designed to automatically provision and manage compute resources (nodes) for workloads

K8s workloads are deployed with nodeSelectors and tolerations so that they run on the appropriate worker groups. This allows for efficient resource utilization and workload management.
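The placement rule above can be sketched as a small matcher: a pod fits a node when every nodeSelector entry matches a node label and every taint on the node is tolerated. This is a simplified model of the scheduler's behavior, and the label and taint names (`workload`, `kafka`) are hypothetical, not the labels used in this cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass
class Toleration:
    key: str
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty string matches any effect


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    taints: list = field(default_factory=list)


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration covers a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key != taint.key:
        return False
    return tol.operator == "Exists" or tol.value == taint.value


def can_schedule(node_selector: dict, tolerations: list, node: Node) -> bool:
    """Simplified check: labels must match and all taints must be tolerated."""
    labels_ok = all(node.labels.get(k) == v for k, v in node_selector.items())
    taints_ok = all(any(tolerates(t, taint) for t in tolerations)
                    for taint in node.taints)
    return labels_ok and taints_ok


# Hypothetical worker group dedicated to Kafka.
kafka_node = Node("kafka-1", {"workload": "kafka"}, [Taint("workload", "kafka")])
kafka_pod_ok = can_schedule({"workload": "kafka"},
                            [Toleration("workload", "Equal", "kafka")],
                            kafka_node)
```

The real scheduler also handles cases such as `PreferNoSchedule` taints and tolerations with an empty key, which are omitted here for brevity.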

[!note] Preferred EC2 workers are t4g instances, which are powered by AWS Graviton2 processors for better performance and cost efficiency.

[!warning] Amazon EC2 Spot Instances are unused AWS compute capacity offered at a discount of up to 90% compared to On-Demand prices. However, AWS can reclaim them with just a 2-minute warning when it needs the capacity back.

System Wide Applications¶

Additional applications without dedicated workers:
- AWS Pod Identity Add-on - Allows Kubernetes pods to assume IAM roles for accessing AWS services securely
- CoreDNS - Provides DNS services for Kubernetes clusters, enabling service discovery and name resolution
- Amazon VPC CNI - Enables pod networking within the cluster, with VPC support
- Kube Proxy - Endpoint Services support
- Amazon EBS CSI Driver - Enables Amazon Elastic Block Store (EBS) support for EKS
- AWS Load Balancer Controller - Manages AWS Elastic Load Balancers for Kubernetes services
- mktxp-exporter - Exports metrics from the mktxp application for monitoring (Mikrotik-IPSEC)
- twistlock-defender - A security tool that provides runtime protection for containerized applications
- reflector - Kubernetes controller that can be used to replicate secrets, configmaps and certificates
- Strimzi - A Kubernetes operator that simplifies the deployment and management of Apache Kafka clusters on Kubernetes
- csi-secrets-store - A Kubernetes CSI driver that allows Kubernetes applications to access secrets stored in external secret management systems like AWS Secrets Manager or HashiCorp Vault
- secrets-store-csi-driver-provider-aws - A provider for the CSI Secrets Store driver that allows Kubernetes applications to access AWS Secrets Manager secrets

[!note] K8s workloads are deployed across these worker groups based on their specific requirements and resource needs. Each worker group is optimized for the workloads it runs, ensuring efficient resource utilization and performance.

Network Detailed Architecture Overview¶

Cloudflare:¶

Cloudflare acts as a global CDN and DDoS protection service, providing a secure entry point for the application. It performs health checks on the Application Load Balancer (ALB) to ensure availability.

Global DNS: api.us.unibeam.com is resolved to the Cloudflare Load Balancer, which then routes traffic to the appropriate regional ALB. Traffic steering controls how requests are distributed across the regional ALBs. Possible steering options:
- Off - Cloudflare will route pools in failover order
- Dynamic steering - Route traffic to the fastest pool based on measured latency from health checks
- Geo steering - Route to specific pools based on the Cloudflare region serving the request
- Proximity steering - Route requests to the closest physical pool, based on user location
- Random steering - Route to a healthy pool at random or weighted random
- Least outstanding requests steering - Route traffic based on pool weights and number of pending requests

[!note] The current setting is Least outstanding requests steering, which optimizes load distribution across the ALBs based on the number of pending requests, with the fallback pool in the us-west-2 region.

graph TD
    A[Cloudflare-Global-DNS] -->|api.us.unibeam.com| B[Cloudflare-LB]
    B -->|us-east-1| C[ALB-API-East:443]
    B -->|us-west-2| D[ALB-API-West:443]

Cloudflare LoadBalancer HealthCheck configurations¶

Accelerator Health Check

Interval - 60 seconds
Timeout - 5 seconds
Health Check Path - /health
Expect Status Code - 200
Protocol - TCP
Response Body - {"status":"UP"}
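To make the pass/fail criteria above concrete, the following sketch evaluates a single probe response against the expected status code and body. It is only an illustration of the listed settings: the function and constant names are hypothetical, and the JSON comparison used here is an assumption that may be stricter than how Cloudflare itself matches the response body.

```python
import json

EXPECTED_STATUS = 200               # "Expect Status Code" from the list above
EXPECTED_BODY = {"status": "UP"}    # "Response Body" from the list above


def is_healthy(status_code: int, body: str) -> bool:
    """Return True when a probe response satisfies the documented expectations."""
    if status_code != EXPECTED_STATUS:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # body is not valid JSON
    return (isinstance(payload, dict)
            and payload.get("status") == EXPECTED_BODY["status"])


if __name__ == "__main__":
    print(is_healthy(200, '{"status":"UP"}'))
```

A check like this can be handy when testing the /health endpoint by hand before changing the monitor settings.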

AWS Global Accelerator¶

AWS Global Accelerator provides static IP addresses that serve as a fixed entry point for the application. It routes traffic to the nearest Network Load Balancer (NLB) based on health checks and routing policies.

SIM-Accelerator details¶

Endpoints are configured with weighted routing policies. The traffic dial is set to 100% for us-east-1 and 20% for us-west-2, so that the majority of traffic is directed to the us-east-1 region while a smaller portion is directed to the us-west-2 region for redundancy and failover.

Static IP addresses:
- 75.2.108.23
- 3.33.243.63
- 2600:9000:a403:180c:6614:11ab:4d5b:1a99
- 2600:9000:a700:38a5:62e0:1fe9:a8b5:4bc8

Health Check configurations:

Accelerator Health Check

Interval - 30 seconds
Timeout - 3 seconds
Health Check Port - 9506
Protocol - TCP

graph TD
    A[AWS-Accelerator] -->|us-east-1| B[AWS-NLB:9506]
    A -->|us-west-2| C[AWS-NLB:9506]
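The traffic dial values described above (us-east-1 at 100%, us-west-2 at 20%) can be read as follows: a dial applies only to traffic that would already be directed to that endpoint group, and the remainder spills over to the other group. The sketch below models that under simplifying assumptions; the function name and the cascading spillover logic are illustrative and not an AWS API.

```python
def effective_split(nearest_region: str, dials: dict, fallback_order: list) -> dict:
    """Estimate where traffic from clients nearest to `nearest_region` ends up.

    Simplified model: each region accepts `dial` percent of the traffic
    offered to it and passes the rest to the next region in the order.
    """
    remaining = 100.0
    split = {}
    order = [nearest_region] + [r for r in fallback_order if r != nearest_region]
    for region in order:
        accepted = remaining * dials.get(region, 0) / 100.0
        if accepted > 0:
            split[region] = accepted
        remaining -= accepted
    return split


# Dial values taken from the section above.
DIALS = {"us-east-1": 100, "us-west-2": 20}
REGIONS = ["us-east-1", "us-west-2"]

west_clients = effective_split("us-west-2", DIALS, REGIONS)
east_clients = effective_split("us-east-1", DIALS, REGIONS)
```

Under this model, clients closest to us-west-2 would see roughly 20% of their traffic served there and the rest in us-east-1, while clients closest to us-east-1 stay entirely in us-east-1.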


EKS/K8s Integrations with AWS¶

AWS Pod Identity Add-on¶

Kubernetes add-on that allows pods to securely access AWS services. It assigns temporary AWS IAM credentials to pods using Kubernetes Service Accounts.

Temporary IAM Credentials
- Pods receive short-lived AWS credentials (via AWS STS AssumeRole calls).
Fine-Grained Permissions
- Assign IAM roles per namespace or workload (least privilege).
No Hardcoded Secrets
- Eliminates the need for AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY.
Works with EKS (Amazon EKS Optimized)
- Native integration with Amazon EKS (but can work with other K8s clusters).

graph TD
    A[Pod] -->|Uses| B[Kubernetes Service Account]
    B -->|Assumes| C[AWS IAM Role]
    C -->|Grants Access to| D[S3, Secrets Manager]

Loki/Thanos/Prometheus¶

Loki/Thanos uses two kinds of storage: EBS and S3.
- EBS - Used for short-term storage of logs, providing fast access and retrieval
- S3 - Used for long-term storage of logs, providing durability and cost-effective storage

Prometheus metrics are stored in EBS. Service integrations include:
- Redis Labs - For caching and fast access to frequently queried metrics
- Atlas MongoDB - For storing and querying metrics data
- CloudWatch - For monitoring and alerting on metrics data

[!note] Thanos stores data in S3 designated buckets, with region replication, providing high availability and durability for metrics and visibility for both regions.

Repo Overview¶

CICD Flow¶

graph LR
    A[ARGOCD_REPO] -->|Application definition| B[HELM_VALUES]
    B -->|kubernetes-repo| C[HELM_Directory]

This repo contains the application definitions for the ArgoCD (infra, apps) components
The branch name is the EKS cluster name as represented in AWS
Every POC ENV should have its own public subnet and private subnet
- This is because some envs have an IPsec tunnel with SMSC termination
- Routing separation ensures that only env_poc-related pods can access the IPsec encryption domain endpoint (SMSC)
- A public subnet is required since the env needs an ALB/NLB
- The ALB is used for dashboard, SIA-API, and MNO
- The NLB exposes SIM-Service on TCP 9506 for applet connectivity, since it's not possible to use DNS in Java applets
  - The sim-service eipalloc (AWS account public IP allocation): this IP should exist in the account before the service is provisioned
```
    service:
      type: LoadBalancer
      port: 9506
      eipallocations: eipalloc-051bfcc9868a99eb6
      loadBalancerIP: {}
```
- ENV_POC should have its own worker nodes with the label env_name: true
- With Helm deployments, the nodeSelector argument can be controlled via the ArgoCD definition like this:
```
      helm:
        valueFiles:
          - $values/infra-applications/mtn-poc/redis/values.yaml
        values: |
          nodeSelector:
            mtn: "true"
```
  Merging of values: inline values (in values) take precedence over valueFiles. If the external values.yaml already specifies a nodeSelector, the inline values here take precedence over it.
ArgoCD uses a deployment key in the repo settings
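The value precedence described above (inline values over valueFiles) can be illustrated with a small recursive merge. This is a sketch of the general idea rather than Helm's or ArgoCD's actual implementation; the function name and the sample values, including `replicaCount` and the initial `mtn: "false"`, are hypothetical.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Merge `override` into `base`; keys set in `override` win.

    Nested maps are merged key by key, while scalars and lists are replaced.
    """
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Stand-in for the content of an external values.yaml (hypothetical).
file_values = {"replicaCount": 1, "nodeSelector": {"mtn": "false"}}
# Stand-in for the inline `values` block of the application definition.
inline_values = {"nodeSelector": {"mtn": "true"}}

effective = merge_values(file_values, inline_values)
```

In this sketch the inline `mtn: "true"` wins over the file's value, while keys that only the file sets (such as `replicaCount`) are kept. It is worth confirming the exact merge behavior against the Helm and ArgoCD versions in use.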

CLI¶

Connecting via the CLI when SSO is enabled:

argocd login --sso argocd.poc.tinyrt.com

`general-app/`¶

This directory contains general configuration files and scripts for setting up ArgoCD.

_secret-template.yaml: Template for GitHub repository secrets.
projects.yaml: Configuration for ArgoCD projects.
setup.sh: Script to manually install ArgoCD with basic configuration.

`infra-applications/`¶

This directory contains infrastructure-related configurations.

global¶

This directory contains cluster-scoped infrastructure configurations.

<env_name>¶

Environment-scoped folders for infrastructure configurations.
kustomization.yaml: Kustomize configuration for infrastructure components.

`services-applications/`¶

This directory contains the ArgoCD application manifests for different environments.

`Starting debug node`¶

apiVersion: v1
kind: Pod
metadata:
  name: debug-tools
  namespace: default
  labels:
    app: debug-tools
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
  restartPolicy: Never

Stress Test Kafka¶

Tool: https://github.com/msfidelis/kafka-stress

Producer:

kafka-stress --bootstrap-servers localhost:9092 --events 30000 --topic kafka-stress
Sent 10000 messages to topic kafka-stress with 0 errors
Tests finished in 2.790912875s. Produce 0 messages with mean time 0.00/s using hash balance algorithm
kafka-stress --bootstrap-servers localhost:9092 --events 500000 --topic kafka-stress
Tests finished in 4.876597833s. Produce 0 messages with mean time 0.00/s using hash balance algorithm
