Cloud Network Monitoring: Visibility and Alerting Guide

You can't fix what you can't see. Network observability is critical for troubleshooting, security, and performance optimization. This guide covers monitoring strategies across AWS, Azure, and GCP.

Network Observability Pillars

Complete network visibility requires multiple data sources:

VPC Flow Logs

AWS VPC Flow Logs

Capture IP traffic information for network interfaces:

# Flow log format (default v2)
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status

# Example log entry:
2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 52986 443 6 15 1500 1609459200 1609459260 ACCEPT OK
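
For ad-hoc analysis outside AWS tooling, a record in the default format can be split into named fields. The following is a minimal Python sketch, not an official parser. The function name `parse_flow_record` and the underscore-style field names are illustrative assumptions, and it only handles the default version 2 layout shown above.

```python
# Field order of the default (version 2) flow log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = ("version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end")


def parse_flow_record(line):
    """Split one space-separated flow log line into a field dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        # NODATA and SKIPDATA records carry "-" in fields that have no value.
        record[name] = None if record[name] == "-" else int(record[name])
    return record


if __name__ == "__main__":
    rec = parse_flow_record(
        "2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 52986 443 6 15 1500 "
        "1609459200 1609459260 ACCEPT OK"
    )
    # Destination port, action, and capture window length in seconds.
    print(rec["dstport"], rec["action"], rec["end"] - rec["start"])
```

Protocol 6 in the example entry is TCP, so the record describes an accepted HTTPS flow of 15 packets and 1,500 bytes over a 60-second capture window.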

Query Examples (CloudWatch Logs Insights)

# Find rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100

# Top talkers by bytes
stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 10

# Traffic to specific port
fields @timestamp, srcAddr, dstAddr, bytes
| filter dstPort = 443
| stats sum(bytes) as totalBytes by srcAddr
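
The same "top talkers" aggregation can be approximated offline when flow log lines are available as plain text, for example after export. This is a rough Python sketch under that assumption. The function name `top_talkers` is hypothetical, and the field positions assume the default version 2 format.

```python
from collections import Counter


def top_talkers(lines, limit=10):
    """Return (srcaddr, total_bytes) pairs, largest first."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Default v2 layout: srcaddr is the 4th field, bytes is the 10th.
        # Skip malformed lines, header rows, and NODATA/SKIPDATA records,
        # none of which have a numeric bytes field.
        if len(fields) != 14 or not fields[9].isdigit():
            continue
        totals[fields[3]] += int(fields[9])
    return totals.most_common(limit)


if __name__ == "__main__":
    sample = [
        "2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 52986 443 6 15 1500 1609459200 1609459260 ACCEPT OK",
        "2 123456789012 eni-abc123 10.0.1.101 10.0.2.50 52987 443 6 5 400 1609459200 1609459260 ACCEPT OK",
    ]
    for src, total in top_talkers(sample):
        print(src, total)
```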

GCP VPC Flow Logs

Azure NSG Flow Logs

Network Metrics

Key Metrics to Monitor

Metric           | What It Tells You     | Alert Threshold
Network In/Out   | Bandwidth utilization | >80% of capacity
Packets In/Out   | Traffic volume        | Sudden spikes
Connection Count | Active connections    | Near limits
Packet Loss      | Network health        | >0.1%
Latency          | Response time         | P99 > baseline + 2σ
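
The latency threshold (P99 > baseline + 2σ) can be made concrete with a short calculation. The sketch below rests on an assumption the guide does not spell out: "baseline" is taken as the mean of historical P99 samples, and σ as their sample standard deviation. The function names are illustrative.

```python
import statistics


def latency_threshold(baseline_p99_ms, sigmas=2.0):
    """Mean of historical P99 samples plus `sigmas` standard deviations."""
    mean = statistics.fmean(baseline_p99_ms)
    stdev = statistics.stdev(baseline_p99_ms)  # sample standard deviation
    return mean + sigmas * stdev


def should_alert(current_p99_ms, baseline_p99_ms, sigmas=2.0):
    """True when the current P99 exceeds the baseline-derived threshold."""
    return current_p99_ms > latency_threshold(baseline_p99_ms, sigmas)


if __name__ == "__main__":
    history = [100, 110, 90, 105, 95]  # hypothetical P99 samples in ms
    print(round(latency_threshold(history), 1))
    print(should_alert(120, history))
```

With these hypothetical samples the mean is 100 ms and the sample standard deviation is about 7.9 ms, so the alert would fire above roughly 115.8 ms.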

AWS CloudWatch Network Metrics

Latency Monitoring

Sources of Latency Data

Synthetic Monitoring Tools

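// CloudWatch Synthetics canary example (Node.js runtime)
const synthetics = require('Synthetics');

exports.handler = async () => {
    // The callback receives the HTTP response; throwing inside it
    // fails the step and therefore the canary run.
    const validate = async (res) => {
        if (res.statusCode !== 200) {
            throw new Error(`Health check failed: HTTP ${res.statusCode}`);
        }
    };

    await synthetics.executeHttpStep(
        'Check API',
        'https://api.example.com/health',
        validate
    );
};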

Alerting Strategy

Alert Levels

Effective Alerting Principles

Example Alert Configurations

# NAT Gateway port exhaustion warning
Metric: ErrorPortAllocation
Threshold: > 0 for 5 minutes
Action: Page on-call, add NAT Gateway

# VPC Flow Log rejected traffic spike
Query: filter action = "REJECT" | stats count(*) by bin(1m)
Threshold: > 1000/minute (baseline-dependent)
Action: Security team notification
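
A condition such as "> 0 for 5 minutes" is usually evaluated as a run of consecutive datapoints rather than a single sample, which keeps one-off blips from paging anyone. The following is a simplified Python sketch of that evaluation, assuming one datapoint per minute. The function name `sustained_breach` is hypothetical and is not part of any cloud provider's API.

```python
def sustained_breach(datapoints, threshold, periods):
    """True if the most recent `periods` datapoints all exceed `threshold`."""
    if periods <= 0 or len(datapoints) < periods:
        return False  # not enough data to decide
    return all(value > threshold for value in datapoints[-periods:])


if __name__ == "__main__":
    # Hypothetical per-minute ErrorPortAllocation counts, oldest first.
    errors_per_minute = [0, 0, 1, 2, 1, 3, 1]
    print(sustained_breach(errors_per_minute, threshold=0, periods=5))
```

Treating missing data as "no breach" is a design choice. Some teams prefer to alert on missing data for critical metrics.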

Centralized Logging

Log Aggregation Architecture

VPC Flow Logs ─────┐
                   │
Firewall Logs ─────┼────► Log Aggregation ────► Analysis/SIEM
                   │      (CloudWatch, ELK,
DNS Query Logs ────┤       Splunk, Datadog)
                   │
App Logs ──────────┘

Cost Optimization

Third-Party Monitoring Tools

Full-Stack Observability

Network-Focused Tools

Network Performance Testing

Tools for Testing

# iperf3 for bandwidth testing
iperf3 -c target-server -t 30 -P 4

# mtr for path analysis
mtr -rw --report-cycles 60 target-host

# curl timing
curl -w "@timing.txt" -o /dev/null -s https://api.example.com

# timing.txt format:
time_namelookup:  %{time_namelookup}\n
time_connect:     %{time_connect}\n
time_appconnect:  %{time_appconnect}\n
time_starttransfer: %{time_starttransfer}\n
time_total:       %{time_total}\n
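
curl reports each of these values as cumulative seconds since the start of the request, so per-phase durations come from subtracting adjacent values. The sketch below shows one way to do that in Python. It assumes an HTTPS request (for plain HTTP, `time_appconnect` is 0 and the TLS figure is meaningless), and the function names and phase labels are illustrative.

```python
def parse_timings(output):
    """Parse 'name: seconds' lines produced by the timing.txt template."""
    timings = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            timings[name.strip()] = float(value)
    return timings


def phase_durations(t):
    """Convert curl's cumulative timestamps into per-phase durations (seconds)."""
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls_handshake": t["time_appconnect"] - t["time_connect"],
        # Request sent until first response byte received.
        "wait_first_byte": t["time_starttransfer"] - t["time_appconnect"],
        "transfer": t["time_total"] - t["time_starttransfer"],
    }


if __name__ == "__main__":
    # Hypothetical curl output; real values depend on the target.
    sample = (
        "time_namelookup:  0.010\n"
        "time_connect:     0.030\n"
        "time_appconnect:  0.080\n"
        "time_starttransfer: 0.200\n"
        "time_total:       0.250\n"
    )
    for phase, seconds in phase_durations(parse_timings(sample)).items():
        print(f"{phase}: {seconds * 1000:.0f} ms")
```

A large `wait_first_byte` with small connect and TLS phases points at the server or application rather than the network path.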

Key Takeaways
