
๐Ÿ—๏ธ Single-Region Multi-AZ Architecture Strategyยถ

Executive Summary

We are consolidating operations to a Single Region (us-west-2) utilizing Two Availability Zones (AZs).

The Decision Point: Party_A prefers Option 1 (Active/Passive, 100/0) to avoid data locking.
Party_B recommends Option 2 (Active/Warm, 90/10) to eliminate "Cold Standby" failure risks.

The Verdict: Locking is not a risk in Multi-AZ, where inter-AZ latency is under 2 ms. The real risk is a passive environment that silently drifts and fails when you need it most.


📊 Reports & Analysis

Report v2


๐Ÿ›๏ธ Architecture Overviewยถ

We are moving from a complex Multi-Region setup (us-east-1 and us-west-2) to a highly resilient Single-Region Multi-AZ design in us-west-2. Both zones (A and B) are identical, reducing cross-region data transfer costs while maintaining high availability through redundant network paths and security appliances.

Visual Flow

flowchart TD
    %% --- External Entry Points ---
    CF[Cloudflare HTTPS] -->|Traffic| IGW_A
    CF -->|Traffic| IGW_B
    GA[Global Accelerator TCP] -->|Traffic| IGW_A
    GA -->|Traffic| IGW_B

    %% --- Region Scope ---
    subgraph US_West_2 [Region: us-west-2]

        %% --- Workload VPC (Ingress & Compute) ---
        subgraph WL_VPC [WL-VPC]
            
            %% Availability Zone A
            subgraph AZ_A [Availability Zone A - Active]
                IGW_A[Internet Gateway A] -->|All Traffic| FW_Ing_A[Ingress Firewall A]
                FW_Ing_A -->|HTTPS| ALB_A[ALB Zone A]
                FW_Ing_A -->|TCP/9506| NLB_A[NLB Zone A]
                ALB_A --> EKS_A[EKS Cluster A]
                NLB_A --> EKS_A
            end

            %% Availability Zone B
            subgraph AZ_B [Availability Zone B - Passive/Warm]
                IGW_B[Internet Gateway B] -->|All Traffic| FW_Ing_B[Ingress Firewall B]
                FW_Ing_B -->|HTTPS| ALB_B[ALB Zone B]
                FW_Ing_B -->|TCP/9506| NLB_B[NLB Zone B]
                ALB_B --> EKS_B[EKS Cluster B]
                NLB_B --> EKS_B
            end

            %% Shared Data Layer
            subgraph Data["💾 Shared Data Layer"]
                Redis[Redis A/A]
                Mongo[Mongo Atlas Cluster]
            end

            EKS_A & EKS_B <--> Redis & Mongo
        end

        %% --- Transit Gateway ---
        EKS_A -->|Egress| TGW[Transit Gateway]
        EKS_B -->|Egress| TGW

        %% --- Security VPC (Inspection) ---
        subgraph SEC_VPC [SEC-VPC Inspection]
            TGW --> FW_Eg_A[Egress Firewall A]
            TGW --> FW_Eg_B[Egress Firewall B]
        end

        %% --- DMZ VPC (Exit) ---
        subgraph DMZ_VPC [DMZ-VPC Edge]
            FW_Eg_A --> NAT_A[NAT Gateway A]
            FW_Eg_A --> MIK_A[Mikrotik VPN A]
            
            FW_Eg_B --> NAT_B[NAT Gateway B]
            FW_Eg_B --> MIK_B[Mikrotik VPN B]
        end

    end

    %% --- External Destinations ---
    NAT_A --> Internet((Public Internet))
    NAT_B --> Internet
    MIK_A -->|Tunnel| VTC[Remote Site: VTC_SAG]
    MIK_B -->|Tunnel| STC[Remote Site: STC_SAG]

๐Ÿš Traffic Entry Pointsยถ

The architecture utilizes distinct entry points for different traffic types to ensure optimized routing and security:

HTTPS Traffic (API/Dashboard)

  • Entry Point: Cloudflare
  • Flow: Cloudflare resolves to AWS Internet Gateway (IGW) → Ingress Firewall Endpoints (Zone A/B) → Application Load Balancer (ALB)
  • Resilience: Cloudflare performs health checks and steers traffic between healthy AZs

TCP Traffic (SIM-Service)

  • Entry Point: AWS Global Accelerator
  • Flow: Global Accelerator (Static IPs) → Network Load Balancer (NLB) on port 9506
  • Resilience: Traffic is routed over the AWS global network to the nearest healthy endpoint, bypassing public internet congestion


๐ŸŒ Network Resilience Strategyยถ

To eliminate single points of failure, the network infrastructure is split into parallel lanes:

Ingress Path

  • Multi-AZ Firewalls: AWS Network Firewall endpoints are deployed in both AZs. If one zone fails, traffic is automatically routed through the healthy zone's firewall.
  • Parallel Load Balancers: Separate ALB and NLB instances in each AZ ensure no single point of failure for traffic ingestion.

Egress Path

  • Transit Gateway Routing: Traffic from both AZs routes via Transit Gateway to a dedicated Security VPC for centralized inspection.
  • Security VPC Inspection: Egress firewalls in both AZs inspect traffic before it exits through NAT Gateways or VPN tunnels.

VPN & SMSC Connectivity

  • Mikrotik VPN Gateways: Deployed in both AZs within the DMZ-VPC, connecting to remote sites (VTC_SAG & STC_SAG) and ensuring continuous connectivity even if one AZ becomes unavailable.
  • SMSC Routing: Traffic destined for SMSC endpoints routes via Transit Gateway to the Security VPC for inspection before exiting.
  • NAT Gateways: Deployed in both AZs to handle general internet-bound traffic with automatic failover capabilities.

Kafka Topic Separation

Dedicated Kafka topics are created for each availability zone to isolate traffic for SMS-Service processing:

  • sms-service-us-west-2a
  • sms-service-us-west-2b
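
The per-AZ topic naming above can be sketched as a small helper. This is an illustrative sketch only: the `topic_for_az` function and the fallback-to-Zone-A behavior are assumptions, not part of the deployed service.

```python
# Hypothetical helper: derive the zone-local Kafka topic for the SMS-Service
# from the AZ a pod runs in, following the sms-service-<az> convention above.
TOPIC_PREFIX = "sms-service"
KNOWN_AZS = {"us-west-2a", "us-west-2b"}

def topic_for_az(az: str) -> str:
    """Return the per-AZ topic; fall back to Zone A for unknown AZs (assumption)."""
    if az not in KNOWN_AZS:
        az = "us-west-2a"
    return f"{TOPIC_PREFIX}-{az}"

# A producer would then publish to the zone-local topic, e.g.:
#   producer.produce(topic_for_az(my_az), payload)
print(topic_for_az("us-west-2b"))  # sms-service-us-west-2b
```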


🛑 Addressing the "Locking" Fear

The primary concern from leadership is that running Active/Active (or even Active/Warm) will cause database deadlocks or race conditions. Here is why that is technically unfounded in this specific topology:

1. Speed of Light vs. Latency

Latency Comparison

  • Multi-Region (US-East ↔ US-West): 70-90 ms. This causes sync issues and requires complex locking.
  • Multi-AZ (same region): < 2 ms. To Redis and MongoDB, this looks like a local network.
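
To make the gap concrete, here is the back-of-envelope arithmetic for a single worker doing sequential, synchronous round trips (idealized: no processing time, latency figures taken from the comparison above):

```python
# Max sequential synchronous operations per second at a given round-trip time.
def max_sync_ops_per_sec(rtt_ms: float) -> float:
    return 1000.0 / rtt_ms

cross_region = max_sync_ops_per_sec(80.0)  # mid-point of the 70-90 ms range
cross_az = max_sync_ops_per_sec(2.0)       # upper bound for same-region AZs

print(f"cross-region: {cross_region:.1f} ops/s")  # 12.5 ops/s
print(f"cross-AZ:     {cross_az:.1f} ops/s")      # 500.0 ops/s
```

A 40x difference in serialized throughput is why cross-region designs need careful locking strategies while cross-AZ designs behave like a local network.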

2. The Data Layer Handles Consensus

We do not write custom locking logic in the application code. We rely on the enterprise engines:

| Service | Mechanism | Why it won't deadlock |
|---|---|---|
| MongoDB Atlas | Primary/Secondary Election | Only one node is ever the Primary writer, regardless of where traffic comes from. Apps in both Zone A and Zone B write to the same Primary. |
| Redis | Active-Active (CRDTs) | Uses Conflict-Free Replicated Data Types, which mathematically merge writes from both zones. It is designed specifically for this. |

Conclusion

There is no application-level locking risk. The database engines arbitrate consistency faster than the application can perceive it.
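
The CRDT merge idea can be illustrated with the simplest CRDT, a G-Counter: each zone increments only its own slot, and a merge takes the per-slot maximum. This is a teaching sketch, not how Redis Enterprise is implemented internally, which uses richer CRDT types.

```python
# G-Counter sketch: merges are commutative, associative, and idempotent,
# so both zones converge to the same value without any locks.
def merge(a: dict, b: dict) -> dict:
    return {z: max(a.get(z, 0), b.get(z, 0)) for z in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

zone_a = {"zone-a": 5}          # Zone A applied 5 increments locally
zone_b = {"zone-b": 2}          # Zone B applied 2 increments locally
merged = merge(zone_a, zone_b)
assert merge(zone_b, zone_a) == merged  # merge order does not matter
print(value(merged))  # 7
```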


🚦 Traffic Strategy Options

We have three ways to route traffic between Zone A and Zone B.

Option 1: Active / Passive (100% / 0%)

The "Safe" Traditional Choice

How it works

Zone A takes 100% of traffic. Zone B is deployed but idle.

Party_A's Argument: "Simple. No side effects in Zone B."

The Hidden Danger: Zone B is a "Cold Path".

  • Did the Terraform apply correctly? Maybe.
  • Is the cache warm? No.
  • Will the firewall rules allow traffic? We won't know until we failover.

Risk: The False Sense of Security

In a disaster, you flip the switch to Zone B, and it immediately crashes because it hasn't seen a real packet in months.


Option 2: Active / Warm (90% / 10%)

The Party_B Recommendation ✅

How it works

We send 90% of traffic to A, and a trickle (10%) to B.

Why: This validates the Entire Path (Cloudflare → NetFW → ALB → EKS → Mongo) 24/7.

Benefits:

  • Warm Caches: Redis in Zone B is already populated
  • Proven Config: We know deployments worked because 10% of users are using it successfully
  • Blast Radius: If a locking issue did mysteriously exist, it only affects 10% of users, not 100%

The Sweet Spot

This acts as a continuous "Health Check" for your Disaster Recovery plan.
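
A 90/10 split can be made deterministic and sticky per user by hashing a stable request key into buckets. In practice the split is done by Cloudflare / Global Accelerator weights, not in application code; this sketch (with a hypothetical `pick_zone` helper) only shows that a weighted split can stay consistent for a given user:

```python
import hashlib

# Hash a stable key (e.g. a user ID) into 0-99; buckets 0-89 go to Zone A.
def pick_zone(user_id: str, zone_a_pct: int = 90) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "zone-a" if bucket < zone_a_pct else "zone-b"

ids = [f"user-{i}" for i in range(10_000)]
share_a = sum(pick_zone(u) == "zone-a" for u in ids) / len(ids)
print(f"Zone A share: {share_a:.1%}")  # close to 90%
```

Because the hash is deterministic, the same user always lands in the same zone, which keeps caches warm and sessions stable during the canary period.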


Option 3: Active / Active (50% / 50%)

The Nirvana State

How it works

Perfect load balancing.

Why:

  • Maximum infrastructure ROI (no idle servers)
  • Instant failover (just capacity degradation)

Constraint: Requires strict idempotency in all Kafka consumers and APIs.
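
The idempotency constraint boils down to: a redelivered message (common when both zones consume during failover) must be applied exactly once. A minimal sketch, with an in-memory seen-set standing in for what a real consumer would persist in Redis or Mongo:

```python
# Idempotent consumer sketch: side effects run only for unseen message IDs.
class IdempotentConsumer:
    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.balance = 0

    def handle(self, msg_id: str, amount: int) -> bool:
        if msg_id in self.seen:   # duplicate delivery: skip side effects
            return False
        self.seen.add(msg_id)
        self.balance += amount
        return True

c = IdempotentConsumer()
c.handle("m-1", 100)
c.handle("m-1", 100)   # redelivered after a zone failover
print(c.balance)       # 100, not 200
```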


📊 Risk & Value Comparison

| Metric | Option 1 (100/0) | Option 2 (90/10) | Option 3 (50/50) |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Failover Speed | Slow (Minutes) | Fast (Seconds) | Instant |
| Confidence | ⚠️ Low | ✅ High | ✅ High |
| Data Risk | None | Low | Managed |
| Wasted $ | High | Low | None |

๐Ÿ Final Recommendationยถ

Adopt Option 2 (90/10)

It bridges the gap between Party_A's desire for safety and Party_B's need for reliability. It proves the system works without immediately exposing the entire user base to full Active/Active complexity.

Why This Works

  • Safety: 90% of traffic stays in proven Zone A
  • Validation: 10% continuously tests Zone B's readiness
  • Confidence: Real-world traffic validates the entire stack
  • Failover: Zone B is already serving production traffic successfully

๐Ÿ“ Next Stepsยถ

  1. Phase 1: Deploy identical infrastructure to both zones
  2. Phase 2: Configure 90/10 traffic split via Global Accelerator and Cloudflare
  3. Phase 3: Monitor Zone B performance for 2 weeks
  4. Phase 4: Document failover procedures
  5. Phase 5: Run quarterly DR drills with full cutover to Zone B

๐Ÿ“ Reference Architecture Diagramsยถ

Ingress Resilient Architectureยถ

Egress Resilient Architectureยถ

Full Resilient Architectureยถ


SMSC Endpointsยถ

graph TD
    subgraph EKS_Cluster [EKS Cluster SMS-Service Layer]
        SMSPod1[SMS Pod AZ1]
        SMSPod2[SMS Pod AZ2]
    end

    subgraph LoadBalancing [The Magic Layer]
        NLB[Internal AWS NLB TCP 2775]
        TG[Target Group: IP Mode]
    end

    subgraph Routing_Layer [DMZ VPC Routing Logic]
        RT_AZ1[Route Table AZ1]
        RT_AZ2[Route Table AZ2]
    end

    subgraph VPN_Gateways [The Bouncers]
        MikrotikA[Mikrotik A AZ1]
        MikrotikB[Mikrotik B AZ2]
    end

    subgraph Remote_World [The SMSC Targets]
        VTC[VTC SMSC Endpoint]
        STC[STC SMSC Endpoint]
    end

    %% Flows
    SMSPod1 -- "Connects to LB DNS" --> NLB
    SMSPod2 -- "Connects to LB DNS" --> NLB
    
    NLB -- "Distributes Connections" --> TG
    TG -. "Target 1 (IP)" .-> VTC
    TG -. "Target 2 (IP)" .-> STC
    
    %% The Physical Path (Hidden Complexity)
    TG -- "Traffic for VTC" --> RT_AZ1
    TG -- "Traffic for STC" --> RT_AZ2
    
    RT_AZ1 -- "Route: VTC IP -> ENI A" --> MikrotikA
    RT_AZ1 -- "Route: STC IP -> ENI B (Cross AZ)" --> MikrotikB
    
    RT_AZ2 -- "Route: VTC IP -> ENI A (Cross AZ)" --> MikrotikA
    RT_AZ2 -- "Route: STC IP -> ENI B" --> MikrotikB

    MikrotikA <== "IPSec Tunnel A" ==> VTC
    MikrotikB <== "IPSec Tunnel B" ==> STC

    style NLB fill:#ff9900,stroke:#333,stroke-width:2px,color:white
    style MikrotikA fill:#cc0000,stroke:#333,color:white
    style MikrotikB fill:#cc0000,stroke:#333,color:white
    style VTC fill:#009900,color:white
    style STC fill:#009900,color:white
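
The route-table logic in the diagram above can be sketched as a lookup: each AZ's route table maps an SMSC destination IP to a Mikrotik ENI, crossing AZs when the tunnel lives in the other zone. The IPs and ENI names below are made-up placeholders, not the real endpoint addresses.

```python
# Placeholder route tables mirroring the diagram: VTC always exits via
# Mikrotik A's ENI, STC via Mikrotik B's ENI, regardless of source AZ.
ROUTES = {
    "rt-az1": {"10.50.0.10": "eni-mikrotik-a",   # VTC -> tunnel A (local)
               "10.60.0.10": "eni-mikrotik-b"},  # STC -> tunnel B (cross-AZ)
    "rt-az2": {"10.50.0.10": "eni-mikrotik-a",   # VTC -> tunnel A (cross-AZ)
               "10.60.0.10": "eni-mikrotik-b"},  # STC -> tunnel B (local)
}

def next_hop(route_table: str, dest_ip: str) -> str:
    return ROUTES[route_table][dest_ip]

# Traffic for VTC leaves via Mikrotik A no matter which AZ it starts in:
print(next_hop("rt-az1", "10.50.0.10"))  # eni-mikrotik-a
print(next_hop("rt-az2", "10.50.0.10"))  # eni-mikrotik-a
```

Pinning each SMSC behind one tunnel keeps the IPSec session stable; the cost is an occasional cross-AZ hop, which the latency numbers earlier show is negligible.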