Cloud Routing Troubleshooting: Common Issues and Solutions
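When packets don't reach their destination in the cloud, systematic debugging is essential. This guide covers the most common cloud routing issues and provides step-by-step troubleshooting approaches. The examples use AWS; the underlying concepts (stateful and stateless filters, route tables, NAT, peering, DNS) generally have counterparts in Azure and GCP.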
The Troubleshooting Mindset
Cloud networking has many layers, and issues can occur at any of them. Before diving in, remember:
- Layer by layer: Start at the lowest layer and work up
- One change at a time: Make a single change, test, then proceed
- Check both directions: Routing is often asymmetric
- Verify assumptions: "It should work" often doesn't
Common Issue #1: Security Group Blocking Traffic
Security groups are the most common cause of connectivity issues:
Symptoms
- Connection timeouts (packets sent but no response)
- VPC Flow Logs show traffic as REJECT
- Works from some sources but not others
Debugging Steps
- Identify the security groups:
# AWS CLI
aws ec2 describe-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'Reservations[].Instances[].SecurityGroups'
- Check inbound rules:
  - Is the port allowed?
  - Is the source IP/CIDR allowed?
  - Is the protocol correct (TCP vs UDP)?
- Check outbound rules:
  - Default is allow-all, but may be restricted
  - Check both source and destination security groups
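The inbound checks above (port, source CIDR, protocol) can be sketched as a small matcher. This is a minimal illustration, not how AWS evaluates rules internally: the `InboundRule` class and `is_allowed` function are hypothetical names, and only CIDR-based rules are modeled (no security group references or prefix lists).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    """Hypothetical, simplified stand-in for one CIDR-based inbound rule."""
    protocol: str   # "tcp", "udp", or "-1" for all protocols
    from_port: int
    to_port: int
    cidr: str


def is_allowed(rules, protocol, port, source_ip):
    """Return True if any rule matches; security groups contain allow rules only."""
    source = ipaddress.ip_address(source_ip)
    for rule in rules:
        protocol_ok = rule.protocol in ("-1", protocol)
        # With protocol "-1" the port range is ignored (all ports match).
        port_ok = rule.protocol == "-1" or rule.from_port <= port <= rule.to_port
        source_ok = source in ipaddress.ip_network(rule.cidr)
        if protocol_ok and port_ok and source_ok:
            return True
    return False


rules = [
    InboundRule("tcp", 443, 443, "10.0.0.0/16"),
    InboundRule("tcp", 22, 22, "203.0.113.0/24"),
]
print(is_allowed(rules, "tcp", 443, "10.0.1.5"))    # True: port, protocol, source match
print(is_allowed(rules, "udp", 443, "10.0.1.5"))    # False: wrong protocol
print(is_allowed(rules, "tcp", 443, "192.0.2.10"))  # False: source outside the CIDR
```

Walking a failing connection through the three questions one at a time usually shows which of them is the mismatch.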
AWS Security Group Reference Gotcha
When referencing another security group as source/destination:
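- Both instances must be in the same VPC, or in peered VPCs in the same Region (cross-Region peering does not support security group references)
- The reference matches the private IP addresses of the instances' network interfaces, so traffic that arrives via a public or Elastic IP does not match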
Common Issue #2: Network ACL Blocking Return Traffic
NACLs are stateless—they don't track connections:
Symptoms
- Outbound connections time out
- Security groups look correct
- One direction works, return doesn't
The Ephemeral Port Problem
When your instance connects to an external service, the return traffic comes back on an ephemeral port (1024-65535). NACL must allow this:
# Common NACL mistake:
# Outbound rule: Allow TCP 443 to 0.0.0.0/0 ✓
# Inbound rule: Allow TCP 443 from 0.0.0.0/0 ✗ (return is on ephemeral!)
# Correct NACL:
# Outbound: Allow TCP 443 to 0.0.0.0/0
# Inbound: Allow TCP 1024-65535 from 0.0.0.0/0
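To see why the first rule set fails, it helps to model the reply packet explicitly. The sketch below is a simplified illustration with hypothetical names (`Packet`, `reply_to`, `inbound_allows`); it only looks at destination ports and assumes the outbound rule already allowed the request.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src_port: int
    dst_port: int


def reply_to(packet):
    """The reply swaps the ports: it is addressed to the client's ephemeral port."""
    return Packet(src_port=packet.dst_port, dst_port=packet.src_port)


def inbound_allows(packet, allowed_port_ranges):
    """A stateless filter checks the reply against inbound rules like any other packet."""
    return any(low <= packet.dst_port <= high for low, high in allowed_port_ranges)


request = Packet(src_port=50123, dst_port=443)  # 50123 is an example ephemeral port
reply = reply_to(request)

mistaken_inbound = [(443, 443)]     # the "common mistake" above
correct_inbound = [(1024, 65535)]   # the corrected rule above

print(inbound_allows(reply, mistaken_inbound))  # False: the reply targets port 50123
print(inbound_allows(reply, correct_inbound))   # True
```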
Debugging Steps
- Find the NACL associated with the subnet
- Check rules for both inbound and outbound
- Remember rules are evaluated in order (lowest number first), and the first matching rule is applied
- Look for explicit DENY rules that may match before ALLOW
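The ordering behavior in the last two steps can be sketched as a first-match evaluation. This is a simplified model under stated assumptions: `NaclRule` and `evaluate` are hypothetical names, each rule carries a single port range and CIDR, and the fallback stands in for the implicit catch-all deny at the end of every NACL.

```python
import ipaddress
from typing import NamedTuple, Tuple


class NaclRule(NamedTuple):
    """Hypothetical, simplified model of one NACL entry."""
    number: int
    action: str                  # "allow" or "deny"
    protocol: str                # "tcp", "udp", or "-1" for all protocols
    port_range: Tuple[int, int]  # inclusive; use (0, 65535) together with "-1"
    cidr: str


def evaluate(rules, protocol, port, ip):
    """Return (action, rule_number) of the first rule that matches, lowest number first."""
    address = ipaddress.ip_address(ip)
    for rule in sorted(rules, key=lambda r: r.number):
        low, high = rule.port_range
        if (rule.protocol in ("-1", protocol)
                and low <= port <= high
                and address in ipaddress.ip_network(rule.cidr)):
            return rule.action, rule.number
    return "deny", None  # nothing matched: the implicit final deny applies


inbound = [
    NaclRule(100, "allow", "tcp", (1024, 65535), "0.0.0.0/0"),
    NaclRule(90, "deny", "tcp", (1024, 65535), "198.51.100.0/24"),
]
print(evaluate(inbound, "tcp", 50123, "198.51.100.7"))  # ('deny', 90): DENY matches first
print(evaluate(inbound, "tcp", 50123, "203.0.113.9"))   # ('allow', 100)
print(evaluate(inbound, "udp", 50123, "203.0.113.9"))   # ('deny', None)
```

Returning the rule number alongside the action makes it easy to see which entry shadowed the one you expected to match.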
Common Issue #3: Missing Route
No route = packets dropped silently:
Symptoms
- Completely unreachable destination
- No REJECT entry in VPC Flow Logs (flow logs record security group and NACL decisions, not routing drops)
- Works to some destinations but not others
Debugging Steps
- Check the route table:
# AWS CLI
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-12345678"
- Verify route exists for destination CIDR:
  - Most specific route wins (longest prefix match)
  - 0.0.0.0/0 should point to IGW, NAT GW, or Transit Gateway
- Check the target is healthy:
  - NAT Gateway in available state?
  - Internet Gateway attached?
  - Transit Gateway route table active?
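Longest prefix match is easy to get wrong when reading a route table by eye, so a small sketch may help. The `select_route` function and the target names in the example table are hypothetical; the table is just a mapping of destination CIDR to target.

```python
import ipaddress


def select_route(routes, destination_ip):
    """Return the target of the most specific matching route, or None if no route matches."""
    destination = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if destination in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: the packet is dropped
    # The longest prefix (largest prefix length) wins.
    return max(matches, key=lambda match: match[0])[1]


# Hypothetical route table for a private subnet.
routes = {
    "10.0.0.0/16": "local",
    "10.1.0.0/16": "pcx-example",
    "0.0.0.0/0": "nat-example",
}
print(select_route(routes, "10.1.2.3"))  # pcx-example (the /16 beats the /0)
print(select_route(routes, "8.8.8.8"))   # nat-example (only the default route matches)
```

Removing the `0.0.0.0/0` entry from the table makes the second lookup return `None`, which corresponds to the silent drop described above.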
Common Issue #4: Asymmetric Routing
Traffic goes one path, returns another—breaking stateful firewalls:
Symptoms
- Intermittent connectivity
- Works sometimes, fails others
- Firewall logs show INVALID packets
Common Causes
- Multiple NAT Gateways with different routing
- VPN/Direct Connect with unequal prefix advertisements
- Load balancer in one path but not return
Debugging Steps
- Trace the path in both directions
- Check BGP route advertisements from both ends
- Verify stateful devices see both directions of flow
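Once both paths have been traced, the comparison in the last step can be sketched as a set check. The function name and the hop labels below are hypothetical; in practice the hop lists would come from traceroute output or from your own knowledge of the topology.

```python
def devices_seeing_one_direction(forward_path, return_path, stateful_devices):
    """Return stateful devices that appear on exactly one of the two paths."""
    forward_hops = set(forward_path)
    return_hops = set(return_path)
    return sorted(
        device for device in stateful_devices
        if (device in forward_hops) != (device in return_hops)
    )


# Hypothetical topology: requests traverse fw-a, replies come back through fw-b.
forward = ["app", "fw-a", "tgw", "db"]
backward = ["db", "tgw", "fw-b", "app"]
print(devices_seeing_one_direction(forward, backward, {"fw-a", "fw-b"}))
# ['fw-a', 'fw-b']: each firewall sees only half of the flow
```

Any device in the result is a candidate for the INVALID packets listed under Symptoms, since it tracks connection state from only one direction.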
Common Issue #5: NAT Gateway Problems
Symptoms
- Private instances can't reach internet
- Slow connections or timeouts
- Works for some instances, not others
Debugging Steps
- Check NAT Gateway exists and is healthy:
aws ec2 describe-nat-gateways \
  --nat-gateway-ids nat-12345678
- Verify route table points to NAT Gateway:
  - 0.0.0.0/0 → nat-xxxxx
- Confirm NAT Gateway is in public subnet:
  - NAT Gateway itself needs internet access via IGW
- Check for port exhaustion:
  - A NAT Gateway supports up to 55,000 simultaneous connections to each unique destination
  - Many connections to the same destination can exhaust the available ports; the ErrorPortAllocation CloudWatch metric rises when this happens
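To judge whether port exhaustion is plausible, count concurrent connections per destination and compare against the limit quoted above. This is a rough sketch: `busiest_destinations` is a hypothetical helper, the 80% warning threshold is an arbitrary assumption, and the connection list would have to come from your own source (for example flow logs or connection tracking on the clients).

```python
from collections import Counter

NAT_LIMIT_PER_DESTINATION = 55_000  # figure quoted in the text above


def busiest_destinations(connections, threshold=0.8):
    """Return (destination, count) pairs at or above threshold * limit, busiest first.

    `connections` is an iterable of (dst_ip, dst_port, protocol) tuples,
    one per concurrent connection through the NAT Gateway.
    """
    counts = Counter(connections)
    warn_at = NAT_LIMIT_PER_DESTINATION * threshold
    return [(dest, n) for dest, n in counts.most_common() if n >= warn_at]


# Synthetic example: one hot destination and one quiet one.
sample = [("198.51.100.10", 443, "tcp")] * 45_000 + [("203.0.113.5", 443, "tcp")] * 10
print(busiest_destinations(sample))
# [(('198.51.100.10', 443, 'tcp'), 45000)]
```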
Common Issue #6: VPC Peering Not Working
Symptoms
- Can't reach resources across peered VPC
- Peering shows as "active" but no connectivity
Debugging Steps
- Check peering connection status: Must be "active"
- Verify route tables on BOTH sides:
  - VPC A route table must have route to VPC B CIDR via peering connection
  - VPC B route table must have route to VPC A CIDR via peering connection
- Check for overlapping CIDRs: A peering connection cannot be created between VPCs whose CIDR blocks overlap
- Verify security groups: Must allow traffic from peered VPC
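The CIDR and route checks above can be sketched offline before touching the console. The names here (`peering_problems` and its parameters) are hypothetical; each route list is assumed to contain only the destination CIDRs whose target is the peering connection, and a route counts as present if it covers any part of the peer's CIDR.

```python
import ipaddress


def _routes_reach(route_cidrs, peer_network):
    """True if at least one peering route covers some part of the peer's CIDR."""
    return any(ipaddress.ip_network(cidr).overlaps(peer_network) for cidr in route_cidrs)


def peering_problems(cidr_a, cidr_b, peering_routes_a, peering_routes_b):
    """Return a list of human-readable problems; an empty list means none were found."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    problems = []
    if net_a.overlaps(net_b):
        problems.append("VPC CIDR blocks overlap")
    if not _routes_reach(peering_routes_a, net_b):
        problems.append("VPC A has no route to VPC B via the peering connection")
    if not _routes_reach(peering_routes_b, net_a):
        problems.append("VPC B has no route to VPC A via the peering connection")
    return problems


# Only VPC A has its route in place, the classic one-sided setup.
print(peering_problems("10.0.0.0/16", "10.1.0.0/16", ["10.1.0.0/16"], []))
# ['VPC B has no route to VPC A via the peering connection']
```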
Common Issue #7: DNS Resolution Failures
Symptoms
- Can ping IP but not hostname
- Internal hostnames not resolving
- Private hosted zone records not found
Debugging Steps
- Check VPC DNS settings:
# AWS: Both DNS attributes should report "Value": true
aws ec2 describe-vpc-attribute \
  --vpc-id vpc-12345678 \
  --attribute enableDnsSupport
aws ec2 describe-vpc-attribute \
  --vpc-id vpc-12345678 \
  --attribute enableDnsHostnames
- Verify DHCP options: Custom DNS servers configured correctly?
- Check private hosted zone association: Zone must be associated with VPC
- Test from instance:
# Check resolver
cat /etc/resolv.conf
# Test resolution
dig +short internal-host.example.com
nslookup internal-host.example.com
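When reading `/etc/resolv.conf`, the question is usually whether the instance points at the VPC resolver. The sketch below parses the file contents and computes the addresses where the Amazon-provided resolver is reachable (the VPC base address plus two, and 169.254.169.253). The function names are hypothetical, and hosts running a local stub resolver such as systemd-resolved will show 127.0.0.53 instead, which this sketch does not follow further.

```python
import ipaddress


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def expected_vpc_resolvers(vpc_cidr):
    """Addresses of the Amazon-provided resolver for a VPC with this primary CIDR."""
    base = ipaddress.ip_network(vpc_cidr).network_address
    return {str(base + 2), "169.254.169.253"}


sample = """# generated by the DHCP client
search ec2.internal
nameserver 10.0.0.2
"""
servers = parse_nameservers(sample)
print(servers)                                                # ['10.0.0.2']
print(set(servers) & expected_vpc_resolvers("10.0.0.0/16"))   # {'10.0.0.2'}
```

An empty intersection suggests custom DNS servers from the DHCP options set, which is where private hosted zone lookups often go wrong.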
Useful Troubleshooting Tools
AWS
- VPC Reachability Analyzer: Traces path and identifies blockers
- VPC Flow Logs: Shows accepted/rejected traffic
- CloudWatch Logs Insights: Query flow logs for patterns
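If flow logs are exported as text, a few lines of parsing are enough to pull out rejected traffic. This sketch assumes the default (version 2) record format with its 14 space-separated fields; custom formats need a different field list. The function names are hypothetical, and the two sample records use placeholder account and interface IDs.

```python
# Field order of the default (version 2) flow log format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Split one default-format flow log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines):
    """Return the parsed records whose action is REJECT."""
    return [record for record in map(parse_record, lines) if record["action"] == "REJECT"]


sample_lines = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
for record in rejected(sample_lines):
    print(record["srcaddr"], "->", record["dstaddr"], "port", record["dstport"])
# 172.31.9.69 -> 172.31.9.12 port 3389
```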
Instance-Level Tools
# Test TCP connectivity
nc -zv 10.0.1.100 443
# Trace route
traceroute 10.0.1.100
mtr 10.0.1.100
# Check listening ports
ss -tlnp
netstat -tlnp
# Capture packets
tcpdump -i eth0 host 10.0.1.100
# DNS debugging
dig @169.254.169.253 internal-host.example.com
Curl Timing
curl -w "
DNS: %{time_namelookup}s
Connect: %{time_connect}s
TLS: %{time_appconnect}s
TTFB: %{time_starttransfer}s
Total: %{time_total}s
" -o /dev/null -s https://example.com
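The values curl prints are cumulative: each one is measured from the start of the request, so the time spent in a single phase is the difference between neighbours. A small sketch of that arithmetic follows; `phase_durations` is a hypothetical helper and the timings in the example are made up.

```python
def phase_durations(timings):
    """Convert cumulative curl timings (seconds) into per-phase durations."""
    dns = timings["time_namelookup"]
    connect = timings["time_connect"]
    appconnect = timings["time_appconnect"]  # 0 for plain HTTP (no TLS handshake)
    first_byte = timings["time_starttransfer"]
    total = timings["time_total"]

    handshake_done = appconnect if appconnect > 0 else connect
    phases = {
        "dns": dns,
        "tcp_connect": connect - dns,
        "tls_handshake": handshake_done - connect,
        "server_wait": first_byte - handshake_done,
        "transfer": total - first_byte,
    }
    # Round to microseconds to hide floating-point noise.
    return {name: round(value, 6) for name, value in phases.items()}


# Made-up timings for illustration.
example = {
    "time_namelookup": 0.010,
    "time_connect": 0.030,
    "time_appconnect": 0.090,
    "time_starttransfer": 0.150,
    "time_total": 0.200,
}
print(phase_durations(example))
# {'dns': 0.01, 'tcp_connect': 0.02, 'tls_handshake': 0.06, 'server_wait': 0.06, 'transfer': 0.05}
```

A large `tcp_connect` value points at the network path, while a large `server_wait` points at the application behind it.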
Troubleshooting Flowchart
Start
  │
  ▼
[Can ping target IP?]──No──► Check security groups
  │                          Check NACLs
  │ Yes                      Check route tables
  ▼
[Can resolve hostname?]──No──► Check VPC DNS settings
  │                            Check private hosted zones
  │ Yes                        Check /etc/resolv.conf
  ▼
[Connection times out?]──Yes──► Check security groups (inbound)
  │                             Check NACLs (ephemeral ports)
  │ No                          Check target is listening
  ▼
[Connection refused?]──Yes──► Target service not running
  │                           Wrong port
  │ No
  ▼
[Slow performance?]──Yes──► Check MTU issues
  │                         Check NAT Gateway capacity
  │ No                      Use VPC endpoints
  ▼
[Intermittent issues?]──Yes──► Check asymmetric routing
                               Check health check thresholds
                               Check for flapping routes
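The flowchart treats a timeout and a refused connection as different branches, and a short probe can tell them apart. This is a minimal sketch using only the standard library; `probe` is a hypothetical helper and the 3-second default timeout is an arbitrary choice.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing listens on this port.
        return "refused"
    except socket.timeout:
        # No answer at all: typically a filter (security group, NACL) or a missing route.
        return "timeout"
    except OSError as exc:
        # Anything else, such as DNS failure or "network unreachable".
        return f"error: {exc}"
```

For example, `probe("10.0.1.100", 443)` returning `"timeout"` sends you down the security group and NACL branch, while `"refused"` means the packet arrived and the service or port is the problem.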
Key Takeaways
- Security groups and NACLs are the most common issues
- Remember NACL is stateless—allow ephemeral ports for return traffic
- Always check route tables on both sides of a connection
- Use VPC Flow Logs to see what's being accepted/rejected
- AWS Reachability Analyzer can save hours of debugging