Cloud Routing Troubleshooting: Common Issues and Solutions

When packets don't reach their destination in the cloud, systematic debugging is essential. This guide covers the most common cloud routing issues and provides step-by-step troubleshooting approaches for AWS, Azure, and GCP.

The Troubleshooting Mindset

Cloud networking has many layers, and issues can occur at any of them. Before diving in, remember:

Common Issue #1: Security Group Blocking Traffic

Security groups are the most common cause of connectivity issues:

Symptoms

Debugging Steps

  1. Identify the security groups:
    # AWS CLI
    aws ec2 describe-instances \
      --instance-ids i-1234567890abcdef0 \
      --query 'Reservations[].Instances[].SecurityGroups'
  2. Check inbound rules:
    • Is the port allowed?
    • Is the source IP/CIDR allowed?
    • Is the protocol correct (TCP vs UDP)?
  3. Check outbound rules:
    • The default outbound rule allows all traffic, but it may have been removed or restricted
    • Check both source and destination security groups
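
The rule checks in steps 2 and 3 can be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function name sg_rules and the AWS_CLI override variable are hypothetical conveniences (not AWS CLI features), and the group ID in the usage comment is a placeholder.

```shell
# Print the inbound or outbound rules of one security group as JSON.
# AWS_CLI is a hypothetical override so the command can be dry-run with echo;
# it defaults to the real aws binary.
sg_rules() {
  group_id=$1
  direction=${2:-inbound}
  case "$direction" in
    inbound)  field=IpPermissions ;;
    outbound) field=IpPermissionsEgress ;;
    *) echo "usage: sg_rules GROUP_ID [inbound|outbound]" >&2; return 2 ;;
  esac
  ${AWS_CLI:-aws} ec2 describe-security-groups \
    --group-ids "$group_id" \
    --query "SecurityGroups[].$field" \
    --output json
}

# Usage (placeholder ID):
#   sg_rules sg-0123456789abcdef0 inbound
#   sg_rules sg-0123456789abcdef0 outbound
```

Run it against both the source and the destination instance's groups, then compare port, protocol, and source CIDR by hand.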

AWS Security Group Reference Gotcha

When referencing another security group as source/destination:

Common Issue #2: Network ACL Blocking Return Traffic

NACLs are stateless: they don't track connections, so return traffic needs its own explicit rule.

Symptoms

The Ephemeral Port Problem

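When your instance connects to an external service, the return traffic arrives on the ephemeral source port the instance picked (somewhere in 1024-65535, depending on the operating system). The subnet's NACL must allow that inbound range: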

# Common NACL mistake: 
# Outbound rule: Allow TCP 443 to 0.0.0.0/0 ✓
# Inbound rule: Allow TCP 443 from 0.0.0.0/0 ✗ (return is on ephemeral!)

# Correct NACL:
# Outbound: Allow TCP 443 to 0.0.0.0/0
# Inbound: Allow TCP 1024-65535 from 0.0.0.0/0
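
To see why the first inbound rule fails, the evaluation can be modeled in a few lines of shell. This is only a sketch: nacl_verdict and its four-column rule format are invented for illustration, and real NACL entries also match on protocol and CIDR, which this model ignores.

```shell
# Evaluate a TCP port against NACL-style rules read from stdin.
# Each rule line: RULE_NUMBER ACTION FROM_PORT TO_PORT (hypothetical, simplified format).
# Rules are checked in ascending rule-number order and the first match decides.
nacl_verdict() {
  port=$1
  sort -n | awk -v port="$port" '
    port >= $3 && port <= $4 { print $2; found = 1; exit }
    END { if (!found) print "deny" }   # models the implicit final deny rule
  '
}

# Return traffic for an outbound HTTPS connection lands on an ephemeral port, e.g. 51514.
printf '100 allow 443 443\n'    | nacl_verdict 51514   # prints: deny
printf '100 allow 1024 65535\n' | nacl_verdict 51514   # prints: allow
```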

Debugging Steps

  1. Find the NACL associated with the subnet
  2. Check rules for both inbound and outbound
  3. Remember that rules are evaluated in order (lowest number first) and the first matching rule wins
  4. Look for explicit DENY rules that may match before ALLOW
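
Steps 1 to 4 can be approximated with a single AWS CLI query. The sketch below assumes the subnet is associated with exactly one NACL; if the filter matches none, the sort_by expression fails instead of returning an empty table. The function name and the AWS_CLI override variable are hypothetical, and the subnet ID is a placeholder.

```shell
# List the entries of the NACL associated with a subnet, lowest rule number first.
# Columns: rule number, egress flag, action, protocol, CIDR, port range start and end.
# AWS_CLI is a hypothetical override for dry runs; it defaults to the real aws binary.
nacl_entries_for_subnet() {
  ${AWS_CLI:-aws} ec2 describe-network-acls \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'sort_by(NetworkAcls[0].Entries, &RuleNumber)[].[RuleNumber,Egress,RuleAction,Protocol,CidrBlock,PortRange.From,PortRange.To]' \
    --output table
}

# Usage (placeholder ID):
#   nacl_entries_for_subnet subnet-12345678
```

Rows with Egress set to False are inbound rules. Read each direction from the top and stop at the first row that matches your traffic.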

Common Issue #3: Missing Route

If no route matches the destination, packets are dropped silently:

Symptoms

Debugging Steps

  1. Check the route table:
    # AWS CLI
    aws ec2 describe-route-tables \
      --filters "Name=association.subnet-id,Values=subnet-12345678"
  2. Verify route exists for destination CIDR:
    • Most specific route wins (longest prefix match)
    • 0.0.0.0/0 should point to IGW, NAT GW, or Transit Gateway
  3. Check the target is healthy:
    • NAT Gateway in available state?
    • Internet Gateway attached?
    • Transit Gateway route table active?
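
Longest prefix match is easy to get wrong when reading a route table by eye. The following sketch reproduces the selection logic in POSIX shell for IPv4 only. The helper names and the two-column "CIDR TARGET" input format are assumptions made for illustration, and the targets are placeholder IDs.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 falls inside CIDR block $2 (for example 10.0.0.0/16).
ip_in_cidr() {
  bits=${2#*/}
  if [ "$bits" -eq 0 ]; then return 0; fi   # a /0 route matches everything
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "${2%/*}") & $mask )) ]
}

# Print the route with the longest prefix that contains destination $1.
# Routes are read from stdin, one "CIDR TARGET" pair per line.
# Prints nothing and returns non-zero when no route matches.
best_route() {
  dest=$1
  best_len=-1
  best=""
  while read -r cidr target; do
    len=${cidr#*/}
    if ip_in_cidr "$dest" "$cidr" && [ "$len" -gt "$best_len" ]; then
      best_len=$len
      best="$cidr $target"
    fi
  done
  [ -n "$best" ] && echo "$best"
}

routes='10.0.0.0/16 local
10.1.0.0/16 pcx-11112222
0.0.0.0/0 nat-12345678'
printf '%s\n' "$routes" | best_route 10.1.4.20   # prints: 10.1.0.0/16 pcx-11112222
printf '%s\n' "$routes" | best_route 8.8.8.8     # prints: 0.0.0.0/0 nat-12345678
```

An empty result corresponds to the missing-route case: the packet has nowhere to go and is dropped.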

Common Issue #4: Asymmetric Routing

Traffic takes one path out and returns by another, which breaks stateful firewalls that only see one direction of the flow:

Symptoms

Common Causes

Debugging Steps

  1. Trace the path in both directions
  2. Check BGP route advertisements from both ends
  3. Verify stateful devices see both directions of flow

Common Issue #5: NAT Gateway Problems

Symptoms

Debugging Steps

  1. Check NAT Gateway exists and is healthy:
    aws ec2 describe-nat-gateways \
      --nat-gateway-ids nat-12345678
  2. Verify route table points to NAT Gateway:
    • 0.0.0.0/0 → nat-xxxxx
  3. Confirm NAT Gateway is in public subnet:
    • NAT Gateway itself needs internet access via IGW
  4. Check for port exhaustion:
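    • A NAT Gateway supports up to 55,000 simultaneous connections to each unique destination (destination IP, destination port, and protocol)
    • Many concurrent connections to the same destination can exhaust that limit; the ErrorPortAllocation CloudWatch metric rises when this happens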
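
Port exhaustion shows up in CloudWatch rather than in the gateway's state. A minimal sketch of the check, assuming the AWS CLI is configured. The function name and the AWS_CLI override variable are hypothetical, and the gateway ID and timestamps in the usage comment are placeholders.

```shell
# Sum NAT Gateway port allocation errors in 5-minute buckets between two UTC timestamps.
# Arguments: NAT_GATEWAY_ID START_TIME END_TIME (ISO 8601, for example 2024-01-01T00:00:00Z)
# AWS_CLI is a hypothetical override for dry runs; it defaults to the real aws binary.
nat_port_errors() {
  ${AWS_CLI:-aws} cloudwatch get-metric-statistics \
    --namespace AWS/NATGateway \
    --metric-name ErrorPortAllocation \
    --dimensions "Name=NatGatewayId,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Sum
}

# Usage (placeholder values):
#   nat_port_errors nat-12345678 2024-01-01T00:00:00Z 2024-01-01T01:00:00Z
```

Any bucket with a non-zero Sum suggests the gateway could not allocate a source port during that window.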

Common Issue #6: VPC Peering Not Working

Symptoms

Debugging Steps

  1. Check peering connection status: Must be "active"
  2. Verify route tables on BOTH sides:
    • VPC A route table must have route to VPC B CIDR via peering connection
    • VPC B route table must have route to VPC A CIDR via peering connection
  3. Check for overlapping CIDRs: A peering connection cannot be established between VPCs whose CIDR blocks overlap
  4. Verify security groups: Must allow traffic from the peered VPC's CIDR

Remember: VPC peering is NOT transitive. If A peers with B and B peers with C, A cannot reach C through B.
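
The overlap check in step 3 can be done before requesting a peering connection. The sketch below is IPv4 only. The helper names are assumptions for illustration, and it relies on the fact that two CIDR blocks overlap exactly when they agree on the bits of the shorter prefix.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if two IPv4 CIDR blocks share at least one address.
cidrs_overlap() {
  len_a=${1#*/}
  len_b=${2#*/}
  # Compare both network addresses under the shorter of the two prefixes.
  if [ "$len_a" -lt "$len_b" ]; then bits=$len_a; else bits=$len_b; fi
  if [ "$bits" -eq 0 ]; then return 0; fi   # a /0 block overlaps everything
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "${1%/*}") & $mask )) -eq $(( $(ip_to_int "${2%/*}") & $mask )) ]
}

if cidrs_overlap 10.0.0.0/16 10.0.128.0/17; then
  echo "overlap: peering will be rejected"
else
  echo "no overlap"
fi
```

With the example values the script reports an overlap, because 10.0.128.0/17 sits entirely inside 10.0.0.0/16.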

Common Issue #7: DNS Resolution Failures

Symptoms

Debugging Steps

  1. Check VPC DNS settings:
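    # AWS: check that DNS resolution and DNS hostnames are both enabled
    aws ec2 describe-vpc-attribute \
      --vpc-id vpc-12345678 \
      --attribute enableDnsSupport
    aws ec2 describe-vpc-attribute \
      --vpc-id vpc-12345678 \
      --attribute enableDnsHostnames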
  2. Verify DHCP options: Custom DNS servers configured correctly?
  3. Check private hosted zone association: Zone must be associated with VPC
  4. Test from instance:
    # Check resolver
    cat /etc/resolv.conf
    
    # Test resolution
    dig +short internal-host.example.com
    nslookup internal-host.example.com
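
Step 3 can also be checked from the command line. The sketch below lists the private hosted zones Route 53 reports for a VPC, assuming the AWS CLI is configured. The function name and the AWS_CLI override variable are hypothetical, and the VPC ID and region are placeholders.

```shell
# List the private hosted zones associated with a VPC (zone ID and name per line).
# Arguments: VPC_ID VPC_REGION
# AWS_CLI is a hypothetical override for dry runs; it defaults to the real aws binary.
zones_for_vpc() {
  ${AWS_CLI:-aws} route53 list-hosted-zones-by-vpc \
    --vpc-id "$1" \
    --vpc-region "$2" \
    --query 'HostedZoneSummaries[].[HostedZoneId,Name]' \
    --output text
}

# Usage (placeholder values):
#   zones_for_vpc vpc-12345678 us-east-1
```

If the zone that should answer for internal-host.example.com is missing from the output, the VPC is probably not associated with it.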

Useful Troubleshooting Tools

AWS

Instance-Level Tools

# Test TCP connectivity
nc -zv 10.0.1.100 443

# Trace route
traceroute 10.0.1.100
mtr 10.0.1.100

# Check listening ports
ss -tlnp
netstat -tlnp

# Capture packets
tcpdump -i eth0 host 10.0.1.100

# DNS debugging
dig @169.254.169.253 internal-host.example.com

Curl Timing

curl -w "
DNS:        %{time_namelookup}s
Connect:    %{time_connect}s
TLS:        %{time_appconnect}s
TTFB:       %{time_starttransfer}s
Total:      %{time_total}s
" -o /dev/null -s https://example.com
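
curl's timers are cumulative from the start of the request, so the time spent in each phase is the difference between neighbouring values. A small sketch that does the subtraction follows. The function name phase_durations is hypothetical, and it assumes the five timers are passed in the order shown.

```shell
# Turn curl's cumulative timers into per-phase durations (seconds).
# Arguments: time_namelookup time_connect time_appconnect time_starttransfer time_total
phase_durations() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    if (tls == 0) tls = conn   # plain HTTP: curl reports 0 for time_appconnect
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f transfer=%.3f\n", dns, conn - dns, tls - conn, ttfb - tls, total - ttfb
  }'
}

phase_durations 0.010 0.030 0.090 0.200 0.250
# prints: dns=0.010 tcp=0.020 tls=0.060 wait=0.110 transfer=0.050

# Usage against a live endpoint:
#   phase_durations $(curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' https://example.com)
```

A large tcp value points at the network path, while a large wait value points at the server or whatever sits in front of it.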

Troubleshooting Flowchart

Start
  │
  ▼
[Can ping target IP?]──No──► Check security groups
  │                          Check NACLs
  │ Yes                      Check route tables
  ▼
[Can resolve hostname?]──No──► Check VPC DNS settings
  │                            Check private hosted zones
  │ Yes                        Check /etc/resolv.conf
  ▼
[Connection times out?]──Yes──► Check security groups (inbound)
  │                             Check NACLs (ephemeral ports)
  │ No                          Check target is listening
  ▼
[Connection refused?]──Yes──► Target service not running
  │                           Wrong port
  │ No
  ▼
[Slow performance?]──Yes──► Check MTU issues
  │                         Check NAT Gateway capacity
  │                         Use VPC endpoints
  ▼
[Intermittent issues?]──Yes──► Check asymmetric routing
                               Check health check thresholds
                               Check for flapping routes
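
The first branches of the flowchart can be approximated by looking at curl's exit status. This is a rough sketch rather than a full diagnosis: the function names are hypothetical, the hints are the flowchart's own suggestions, and exit status 7 covers any failure to connect, not only a refused connection.

```shell
# Map a curl exit status to the closest branch of the flowchart.
classify_curl_exit() {
  case "$1" in
    0)  echo "connected: move on to performance or intermittent-issue checks" ;;
    6)  echo "dns failure: check VPC DNS settings, private hosted zones, /etc/resolv.conf" ;;
    7)  echo "connect failed: check the service is running and the port is right" ;;
    28) echo "timeout: check security groups, NACLs (ephemeral ports), route tables" ;;
    *)  echo "curl exit $1: not covered by this flowchart" ;;
  esac
}

# Probe a URL with a 5 second connect timeout and print the matching hint.
probe() {
  status=0
  curl -s -o /dev/null --connect-timeout 5 "$1" || status=$?
  classify_curl_exit "$status"
}

# Usage:
#   probe https://10.0.1.100
```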

Key Takeaways
