Single-Region Multi-AZ Architecture Strategy
Executive Summary
We are consolidating operations to a Single Region (us-west-2) utilizing Two Availability Zones (AZs).
The Decision Point: Party_A prefers Option 1 (Active/Passive 100/0) to avoid data locking.
Party_B recommends Option 2 (Active/Warm 90/10) to eliminate "Cold Standby" failure risks.
The Verdict: Locking is not a risk in Multi-AZ because inter-AZ latency is under 2 ms. The real risk is a passive environment that silently drifts and fails when you need it most.
Reports & Analysis
Report v2
Architecture Overview
We are moving from a complex Multi-Region setup (us-east-1 and us-west-2) to a highly resilient Multi-AZ design within a single region. Both zones (A and B) are identical; consolidating to one region eliminates cross-region data transfer costs, while redundant network paths and security appliances in each zone maintain high availability.
Visual Flow
```mermaid
flowchart TD
    %% --- External Entry Points ---
    CF[Cloudflare HTTPS] -->|Traffic| IGW_A
    CF -->|Traffic| IGW_B
    GA[Global Accelerator TCP] -->|Traffic| IGW_A
    GA -->|Traffic| IGW_B

    %% --- Region Scope ---
    subgraph US_West_2 [Region: us-west-2]
        %% --- Workload VPC (Ingress & Compute) ---
        subgraph WL_VPC [WL-VPC]
            %% Availability Zone A
            subgraph AZ_A [Availability Zone A - Active]
                IGW_A[Internet Gateway A] -->|All Traffic| FW_Ing_A[Ingress Firewall A]
                FW_Ing_A -->|HTTPS| ALB_A[ALB Zone A]
                FW_Ing_A -->|TCP/9506| NLB_A[NLB Zone A]
                ALB_A --> EKS_A[EKS Cluster A]
                NLB_A --> EKS_A
            end
            %% Availability Zone B
            subgraph AZ_B [Availability Zone B - Passive/Warm]
                IGW_B[Internet Gateway B] -->|All Traffic| FW_Ing_B[Ingress Firewall B]
                FW_Ing_B -->|HTTPS| ALB_B[ALB Zone B]
                FW_Ing_B -->|TCP/9506| NLB_B[NLB Zone B]
                ALB_B --> EKS_B[EKS Cluster B]
                NLB_B --> EKS_B
            end
            %% Shared Data Layer
            subgraph Data["Shared Data Layer"]
                Redis[Redis A/A]
                Mongo[Mongo Atlas Cluster]
            end
            EKS_A & EKS_B <--> Redis & Mongo
        end

        %% --- Transit Gateway ---
        EKS_A -->|Egress| TGW[Transit Gateway]
        EKS_B -->|Egress| TGW

        %% --- Security VPC (Inspection) ---
        subgraph SEC_VPC [SEC-VPC Inspection]
            TGW --> FW_Eg_A[Egress Firewall A]
            TGW --> FW_Eg_B[Egress Firewall B]
        end

        %% --- DMZ VPC (Exit) ---
        subgraph DMZ_VPC [DMZ-VPC Edge]
            FW_Eg_A --> NAT_A[NAT Gateway A]
            FW_Eg_A --> MIK_A[Mikrotik VPN A]
            FW_Eg_B --> NAT_B[NAT Gateway B]
            FW_Eg_B --> MIK_B[Mikrotik VPN B]
        end
    end

    %% --- External Destinations ---
    NAT_A --> Internet((Public Internet))
    NAT_B --> Internet
    MIK_A -->|Tunnel| VTC[Remote Site: VTC_SAG]
    MIK_B -->|Tunnel| STC[Remote Site: STC_SAG]
```
Traffic Entry Points
The architecture utilizes distinct entry points for different traffic types to ensure optimized routing and security:
HTTPS Traffic (API/Dashboard)
- Entry Point: Cloudflare
- Flow: Cloudflare resolves to the AWS Internet Gateway (IGW) → Ingress Firewall Endpoints (Zone A/B) → Application Load Balancer (ALB)
- Resilience: Cloudflare performs health checks and steers traffic between healthy AZs
TCP Traffic (SIM-Service)
- Entry Point: AWS Global Accelerator
- Flow: Global Accelerator (Static IPs) → Network Load Balancer (NLB) on port 9506
- Resilience: Traffic is routed over the AWS global network to the nearest healthy endpoint, bypassing public internet congestion
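As a rough sketch of how this TCP entry point could be wired with boto3 (the accelerator and NLB ARNs below are placeholders, and the real environment is presumably managed as infrastructure-as-code): the listener forwards TCP/9506 and the endpoint group registers both zonal NLBs, so Global Accelerator health-checks each NLB and shifts new connections to the healthy zone.

```python
import uuid
import boto3

# The Global Accelerator control-plane API is homed in us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Placeholder ARNs for the accelerator and the two zonal NLBs.
listener = ga.create_listener(
    AcceleratorArn="arn:aws:globalaccelerator::111111111111:accelerator/EXAMPLE",
    Protocol="TCP",
    PortRanges=[{"FromPort": 9506, "ToPort": 9506}],
    IdempotencyToken=str(uuid.uuid4()),
)["Listener"]

ga.create_endpoint_group(
    ListenerArn=listener["ListenerArn"],
    EndpointGroupRegion="us-west-2",
    HealthCheckProtocol="TCP",
    HealthCheckPort=9506,
    EndpointConfigurations=[
        {"EndpointId": "arn:aws:elasticloadbalancing:us-west-2:111111111111:loadbalancer/net/nlb-zone-a/aaa"},
        {"EndpointId": "arn:aws:elasticloadbalancing:us-west-2:111111111111:loadbalancer/net/nlb-zone-b/bbb"},
    ],
    IdempotencyToken=str(uuid.uuid4()),
)
```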
Network Resilience Strategy
To eliminate single points of failure, the network infrastructure is split into parallel lanes:
Ingress Path
- Multi-AZ Firewalls: AWS Network Firewall endpoints are deployed in both AZs. If one zone fails, traffic is automatically routed through the healthy zone's firewall.
- Parallel Load Balancers: Separate ALB and NLB instances in each AZ ensure no single point of failure for traffic ingestion.
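A minimal sketch of the firewall piece, assuming a pre-existing firewall policy and placeholder VPC/subnet IDs (in practice this is likely declared in Terraform): one subnet mapping per AZ gives AWS Network Firewall an endpoint in each zone.

```python
import boto3

nfw = boto3.client("network-firewall", region_name="us-west-2")

# Placeholder IDs. One firewall subnet per AZ means ingress inspection
# keeps working even if an entire zone is lost.
nfw.create_firewall(
    FirewallName="wl-vpc-ingress-fw",
    FirewallPolicyArn="arn:aws:network-firewall:us-west-2:111111111111:firewall-policy/ingress-policy",
    VpcId="vpc-0123456789abcdef0",
    SubnetMappings=[
        {"SubnetId": "subnet-0aaaa1111aaaa1111"},  # firewall subnet in AZ A (placeholder)
        {"SubnetId": "subnet-0bbbb2222bbbb2222"},  # firewall subnet in AZ B (placeholder)
    ],
)
```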
Egress Path
- Transit Gateway Routing: Traffic from both AZs routes via Transit Gateway to a dedicated Security VPC for centralized inspection.
- Security VPC Inspection: Egress firewalls in both AZs inspect traffic before it exits through NAT Gateways or VPN tunnels.
VPN & SMSC Connectivity
- Mikrotik VPN Gateways: Deployed in both AZs within the DMZ-VPC, connecting to the remote sites (VTC_SAG and STC_SAG) and ensuring continuous connectivity even if one AZ becomes unavailable.
- SMSC Routing: Traffic destined for SMSC endpoints routes via Transit Gateway to the Security VPC for inspection before exiting.
- NAT Gateways: Deployed in both AZs to handle general internet-bound traffic with automatic failover capabilities.
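To make the NAT failover behaviour concrete, a minimal sketch with boto3 (route table and NAT gateway IDs are placeholders): each AZ's private route table defaults to its local NAT gateway, and a failover automation (for example a Lambda reacting to health alarms) can repoint the default route to the surviving gateway.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Placeholder IDs. Normal operation: each zone egresses via its own NAT gateway.
ec2.create_route(RouteTableId="rtb-0aaaa1111", DestinationCidrBlock="0.0.0.0/0",
                 NatGatewayId="nat-0aaaa1111")   # Zone A -> NAT A
ec2.create_route(RouteTableId="rtb-0bbbb2222", DestinationCidrBlock="0.0.0.0/0",
                 NatGatewayId="nat-0bbbb2222")   # Zone B -> NAT B

# Failover: if NAT A (or AZ A) is unhealthy, repoint Zone A's default route to NAT B.
ec2.replace_route(RouteTableId="rtb-0aaaa1111", DestinationCidrBlock="0.0.0.0/0",
                  NatGatewayId="nat-0bbbb2222")
```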
Kafka Topic Separation
Dedicated Kafka topics are created for each availability zone to isolate traffic for SMS-Service processing:
- sms-service-us-west-2a
- sms-service-us-west-2b
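A minimal producer-side sketch of how a pod could pick its zonal topic, assuming the zone is exposed as an environment variable (for example via the Kubernetes downward API from the topology.kubernetes.io/zone label) and kafka-python as the client; the variable name and broker address are placeholders.

```python
import os
from kafka import KafkaProducer  # kafka-python (assumed client)

# Assumption: the pod's availability zone is injected as POD_ZONE.
ZONE = os.environ.get("POD_ZONE", "us-west-2a")
TOPIC = f"sms-service-{ZONE}"  # e.g. sms-service-us-west-2a

producer = KafkaProducer(bootstrap_servers=["kafka:9092"])  # placeholder broker

def publish_sms_event(key: bytes, payload: bytes) -> None:
    """Publish to the topic dedicated to this pod's availability zone."""
    producer.send(TOPIC, key=key, value=payload)
    producer.flush()
```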
Addressing the "Locking" Fear
The primary concern from leadership is that running Active/Active (or even Active/Warm) will cause database deadlocks or race conditions. Here is why that is technically unfounded in this specific topology:
1. Speed of Light vs. Latency
Latency Comparison
- Multi-Region (us-east-1 to us-west-2): 70-90 ms of latency between US-East and US-West. This causes sync issues and requires complex locking.
- Multi-AZ (same region): < 2 ms between AZs. To Redis and MongoDB, this looks like a local network.
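A quick way to verify that claim in this environment is to time round trips from a pod in each zone; a minimal sketch with redis-py (the endpoint and sample count are placeholders):

```python
import time
import statistics
import redis  # redis-py

# Placeholder endpoint reachable from both AZs.
r = redis.Redis(host="redis.internal.example", port=6379, socket_timeout=1)

samples = []
for _ in range(100):
    start = time.perf_counter()
    r.ping()
    samples.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"p50={statistics.median(samples):.2f} ms  max={max(samples):.2f} ms")
```

Run from a pod in Zone A and again from Zone B; both should report low single-digit millisecond round trips.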
2. The Data Layer Handles Consensus
We do not write custom locking logic in the application code. We rely on the enterprise engines:
| Service | Mechanism | Why it won't deadlock |
|---|---|---|
| MongoDB Atlas | Primary/Secondary Election | Only one node is ever the Primary writer, regardless of where traffic comes from. Apps in both Zone A and Zone B write to the same Primary. |
| Redis | Active-Active (CRDTs) | Uses "Conflict-Free Replicated Data Types". It mathematically merges writes from both zones. It is designed specifically for this. |
Conclusion
There is no application-level locking risk. The database engines arbitrate consistency faster than the application can perceive it.
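To illustrate the MongoDB row above: a minimal pymongo sketch (the Atlas URI, database, and field names are placeholders) that pods in either zone run unchanged, because the driver discovers the replica set and always sends writes to the current primary.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Placeholder Atlas SRV URI. retryWrites lets the driver transparently retry a
# write if a primary election happens mid-request.
client = MongoClient(
    "mongodb+srv://cluster0.example.mongodb.net/?retryWrites=true&w=majority"
)
messages = client["sms"].get_collection(
    "messages", write_concern=WriteConcern(w="majority")
)

# Pods in Zone A and Zone B both execute this; the single primary serializes the
# writes, so no application-level locking is required.
messages.insert_one({"msisdn": "placeholder", "status": "QUEUED"})
```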
Traffic Strategy Options
We have three ways to route traffic between Zone A and Zone B.
Option 1: Active / Passive (100% / 0%)
The "Safe" Traditional Choice
How it works
Zone A takes 100% of traffic. Zone B is deployed but idle.
Party_A's Argument: "Simple. No side effects in Zone B."
The Hidden Danger: Zone B is a "Cold Path".
- Did the Terraform apply correctly? Maybe.
- Is the cache warm? No.
- Will the firewall rules allow traffic? We won't know until we failover.
Risk: The False Sense of Security
In a disaster, you flip the switch to Zone B, and it immediately crashes because it hasn't seen a real packet in months.
Option 2: Active / Warm (90% / 10%)
The Party_B Recommendation ✅
How it works
We send 90% of traffic to A, and a trickle (10%) to B.
Why: This validates the Entire Path (Cloudflare → NetFW → ALB → EKS → Mongo) 24/7.
Benefits:
- Warm Caches: Redis in Zone B is already populated
- Proven Config: We know deployments worked because 10% of users are using it successfully
- Blast Radius: If a locking issue did mysteriously exist, it would only affect 10% of users, not 100%
The Sweet Spot
This acts as a continuous "Health Check" for your Disaster Recovery plan.
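One way the 90/10 split could be expressed for the TCP path is with Global Accelerator endpoint weights (the ARNs below are placeholders); the HTTPS path would get an equivalent weighting in Cloudflare's load balancer configuration.

```python
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Weights are relative: roughly 90% of new connections go to the Zone A NLB,
# 10% to the Zone B NLB. Flipping the weights is the failover/cutover action.
ga.update_endpoint_group(
    EndpointGroupArn="arn:aws:globalaccelerator::111111111111:accelerator/EXAMPLE/listener/abc/endpoint-group/def",
    EndpointConfigurations=[
        {"EndpointId": "arn:aws:elasticloadbalancing:us-west-2:111111111111:loadbalancer/net/nlb-zone-a/aaa", "Weight": 90},
        {"EndpointId": "arn:aws:elasticloadbalancing:us-west-2:111111111111:loadbalancer/net/nlb-zone-b/bbb", "Weight": 10},
    ],
)
```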
Option 3: Active / Active (50% / 50%)
The Nirvana State
How it works
Traffic is balanced evenly (50/50) across both zones.
Why:
- Maximum infrastructure ROI (no idle servers)
- Instant failover (losing a zone only reduces capacity, it does not cause an outage)
Constraint: Requires strict idempotency in all Kafka consumers and APIs.
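A minimal sketch of what that idempotency constraint looks like in a consumer, assuming kafka-python, a shared Redis for de-duplication, and a hypothetical handle_sms() business function; in practice a business-level message key is a better dedupe key than the offset.

```python
import redis
from kafka import KafkaConsumer

dedupe = redis.Redis(host="redis.internal.example", port=6379)  # placeholder endpoint

consumer = KafkaConsumer(
    "sms-service-us-west-2a",
    "sms-service-us-west-2b",
    bootstrap_servers=["kafka:9092"],   # placeholder broker
    group_id="sms-dispatcher",
    enable_auto_commit=False,
)

def handle_sms(payload: bytes) -> None:
    ...  # hypothetical business logic (submit to SMSC, update Mongo, etc.)

for msg in consumer:
    msg_id = f"{msg.topic}:{msg.partition}:{msg.offset}"
    # SET NX with a TTL: only the first delivery of this message is processed,
    # so a redelivery after a zone failover is a no-op.
    if dedupe.set(f"processed:{msg_id}", 1, nx=True, ex=86400):
        handle_sms(msg.value)
    consumer.commit()
```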
Risk & Value Comparison
| Metric | Option 1 (100/0) | Option 2 (90/10) | Option 3 (50/50) |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Failover Speed | Slow (Minutes) | Fast (Seconds) | Instant |
| Confidence | ⚠️ Low | ✅ High | ✅ High |
| Data Risk | None | Low | Managed |
| Wasted $ | High | Low | None |
Final Recommendation
Adopt Option 2 (90/10)
It bridges the gap between Party_A's desire for safety and Party_B's need for reliability. It proves the system works without immediately exposing the entire user base to full Active/Active complexity.
Why This Works
- Safety: 90% of traffic stays in proven Zone A
- Validation: 10% continuously tests Zone B's readiness
- Confidence: Real-world traffic validates the entire stack
- Failover: Zone B is already serving production traffic successfully
Next Steps
- Phase 1: Deploy identical infrastructure to both zones
- Phase 2: Configure 90/10 traffic split via Global Accelerator and Cloudflare
- Phase 3: Monitor Zone B performance for 2 weeks
- Phase 4: Document failover procedures
- Phase 5: Run quarterly DR drills with full cutover to Zone B
Reference Architecture Diagrams
Ingress Resilient Architecture
Egress Resilient Architecture
Full Resilient Architecture
SMSC Endpoints
```mermaid
graph TD
    subgraph EKS_Cluster [EKS Cluster SMS-Service Layer]
        SMSPod1[SMS Pod AZ1]
        SMSPod2[SMS Pod AZ2]
    end
    subgraph LoadBalancing [The Magic Layer]
        NLB[Internal AWS NLB TCP 2775]
        TG[Target Group: IP Mode]
    end
    subgraph Routing_Layer [DMZ VPC Routing Logic]
        RT_AZ1[Route Table AZ1]
        RT_AZ2[Route Table AZ2]
    end
    subgraph VPN_Gateways [The Bouncers]
        MikrotikA[Mikrotik A AZ1]
        MikrotikB[Mikrotik B AZ2]
    end
    subgraph Remote_World [The SMSC Targets]
        VTC[VTC SMSC Endpoint]
        STC[STC SMSC Endpoint]
    end

    %% Flows
    SMSPod1 -- "Connects to LB DNS" --> NLB
    SMSPod2 -- "Connects to LB DNS" --> NLB
    NLB -- "Distributes Connections" --> TG
    TG -. "Target 1 (IP)" .-> VTC
    TG -. "Target 2 (IP)" .-> STC

    %% The Physical Path (Hidden Complexity)
    TG -- "Traffic for VTC" --> RT_AZ1
    TG -- "Traffic for STC" --> RT_AZ2
    RT_AZ1 -- "Route: VTC IP -> ENI A" --> MikrotikA
    RT_AZ1 -- "Route: STC IP -> ENI B (Cross AZ)" --> MikrotikB
    RT_AZ2 -- "Route: VTC IP -> ENI A (Cross AZ)" --> MikrotikA
    RT_AZ2 -- "Route: STC IP -> ENI B" --> MikrotikB
    MikrotikA <== "IPSec Tunnel A" ==> VTC
    MikrotikB <== "IPSec Tunnel B" ==> STC

    style NLB fill:#ff9900,stroke:#333,stroke-width:2px,color:white
    style MikrotikA fill:#cc0000,stroke:#333,color:white
    style MikrotikB fill:#cc0000,stroke:#333,color:white
    style VTC fill:#009900,color:white
    style STC fill:#009900,color:white
```
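A sketch of the "Target Group: IP Mode" piece in the diagram above, with placeholder VPC ID and SMSC IPs: registering IP targets with AvailabilityZone="all" is what lets the internal NLB forward SMPP (TCP 2775) to addresses outside the VPC, which are then reached through the Mikrotik tunnels.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-west-2")

# Placeholder VPC and SMSC IPs.
tg = elbv2.create_target_group(
    Name="smsc-smpp",
    Protocol="TCP",
    Port=2775,
    VpcId="vpc-0123456789abcdef0",
    TargetType="ip",
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=tg["TargetGroupArn"],
    Targets=[
        {"Id": "10.200.1.10", "Port": 2775, "AvailabilityZone": "all"},  # VTC SMSC (placeholder)
        {"Id": "10.200.2.10", "Port": 2775, "AvailabilityZone": "all"},  # STC SMSC (placeholder)
    ],
)
```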