Cloud Network Monitoring: Visibility and Alerting Guide
You can't fix what you can't see. Network observability is critical for troubleshooting, security, and performance optimization. This guide covers monitoring strategies across AWS, Azure, and GCP.
Network Observability Pillars
Complete network visibility requires multiple data sources:
- Flow logs: Who talked to whom, when, how much
- Metrics: Bandwidth, packet counts, latency
- Logs: Firewall decisions, DNS queries, errors
- Traces: Request path through distributed systems
VPC Flow Logs
AWS VPC Flow Logs
Capture IP traffic information for network interfaces:
# Flow log format (default v2)
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
# Example log entry:
2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 52986 443 6 15 1500 1609459200 1609459260 ACCEPT OK
- Capture level: VPC, subnet, or ENI
- Destinations: CloudWatch Logs, S3, Kinesis Data Firehose
- Custom formats: Select only needed fields to reduce cost
- Aggregation: 1-minute or 10-minute intervals
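For quick offline analysis, the default v2 record shown above can be split on whitespace into its named fields. A minimal Python sketch (the field list and helper name are illustrative, not an AWS library):

```python
# Parse an AWS VPC Flow Log record (default v2 format) into a dict.
# Field order matches: version account-id interface-id srcaddr dstaddr
# srcport dstport protocol packets bytes start end action log-status
V2_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}

def parse_flow_log(line: str) -> dict:
    record = dict(zip(V2_FIELDS, line.split()))
    for field in INT_FIELDS:
        # NODATA/SKIPDATA records log "-" for numeric fields; leave those as-is.
        if record[field] != "-":
            record[field] = int(record[field])
    return record

entry = ("2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 "
         "52986 443 6 15 1500 1609459200 1609459260 ACCEPT OK")
record = parse_flow_log(entry)
print(record["dstport"], record["bytes"], record["action"])  # 443 1500 ACCEPT
```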
Query Examples (CloudWatch Logs Insights)
# Find rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100
# Top talkers by bytes
stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 10
# Traffic to specific port
fields @timestamp, srcAddr, dstAddr, bytes
| filter dstPort = 443
| stats sum(bytes) as totalBytes by srcAddr
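The same aggregations can be reproduced offline against exported logs. A sketch of the top-talkers query in Python, using hypothetical pre-parsed records:

```python
from collections import Counter

# Hypothetical parsed flow records as (srcAddr, bytes) pairs.
records = [
    ("10.0.1.100", 1500), ("10.0.1.100", 9000),
    ("10.0.2.50", 4000), ("10.0.3.7", 200),
]

# Equivalent of:
#   stats sum(bytes) as totalBytes by srcAddr | sort totalBytes desc | limit 10
totals = Counter()
for src, nbytes in records:
    totals[src] += nbytes

for src, total in totals.most_common(10):
    print(src, total)
```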
GCP VPC Flow Logs
- Metadata: Source/dest VM, zone, region included
- Sample rate: Configurable sampling rate, from a small fraction up to 100% (default 50%)
- Export: Cloud Logging, BigQuery, Pub/Sub
Azure NSG Flow Logs
- Version 2: Includes throughput information
- Storage: Azure Storage Account
- Analysis: Traffic Analytics for visualization
Network Metrics
Key Metrics to Monitor
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Network In/Out | Bandwidth utilization | >80% of capacity |
| Packets In/Out | Traffic volume | Sudden spikes |
| Connection Count | Active connections | Near limits |
| Packet Loss | Network health | >0.1% |
| Latency | Response time | P99 > baseline + 2σ |
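The latency threshold in the table (baseline + 2σ) is straightforward to derive from historical samples. A sketch with hypothetical P99 values:

```python
import statistics

def anomaly_threshold(samples, sigmas=2.0):
    """Alert threshold = baseline (mean) + N standard deviations."""
    baseline = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return baseline + sigmas * sigma

# Hypothetical P99 latency samples (ms) from a normal week:
history = [120, 118, 125, 122, 119, 121, 124]
threshold = anomaly_threshold(history)  # ~126.4 ms: alert above this
```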
AWS CloudWatch Network Metrics
- EC2: NetworkIn, NetworkOut, NetworkPacketsIn/Out
- ELB: RequestCount, Latency (Classic) / TargetResponseTime (ALB), HealthyHostCount
- NAT Gateway: BytesIn/Out, ConnectionAttemptCount, ErrorPortAllocation
- Transit Gateway: BytesIn/Out, PacketsIn/Out per attachment
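CloudWatch reports NetworkIn/NetworkOut as bytes per period, so comparing a datapoint against the ">80% of capacity" threshold takes a unit conversion. A sketch (the instance capacity is a hypothetical input you supply, e.g. from the instance type's rated bandwidth):

```python
def utilization_pct(network_out_bytes: float, period_s: float,
                    capacity_gbps: float) -> float:
    """Convert a NetworkOut datapoint (bytes per period) to % of capacity."""
    bits_per_s = network_out_bytes * 8 / period_s
    return 100.0 * bits_per_s / (capacity_gbps * 1e9)

# Hypothetical: 75 GB sent in a 300-second period on a 10 Gbps instance.
pct = utilization_pct(75e9, 300, 10)
print(f"{pct:.0f}% of capacity")  # 20% of capacity
```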
Latency Monitoring
Sources of Latency Data
- Load balancer metrics: Target response time
- Application traces: Full request timing breakdown
- Synthetic monitoring: Scheduled probes from multiple locations
- Real user monitoring: Actual user experience
Synthetic Monitoring Tools
- AWS CloudWatch Synthetics: Canary scripts with screenshots
- GCP Uptime Checks: HTTP, TCP, SSL checks globally
- Azure Application Insights: Availability tests
- Third-party: Datadog Synthetics, Pingdom, Catchpoint
# CloudWatch Synthetics canary example (Node.js runtime)
const synthetics = require('Synthetics');

const checkApi = async function () {
    // executeHttpStep hands the HTTP response to a validation callback
    await synthetics.executeHttpStep(
        'Check API',
        'https://api.example.com/health',
        async (response) => {
            if (response.statusCode !== 200) {
                throw new Error(`Health check failed: ${response.statusCode}`);
            }
        }
    );
};

exports.handler = async () => {
    return await checkApi();
};
Alerting Strategy
Alert Levels
- Info: Awareness, no action needed
- Warning: Investigation needed soon
- Critical: Immediate action required
Effective Alerting Principles
- Alert on symptoms, not causes: "Service down" not "CPU high"
- Avoid alert fatigue: Every alert should be actionable
- Use appropriate thresholds: Baseline + standard deviations
- Include runbook links: How to investigate and fix
Example Alert Configurations
# NAT Gateway port exhaustion warning
Metric: ErrorPortAllocation
Threshold: > 0 for 5 minutes
Action: Page on-call, add NAT Gateway
# VPC Flow Log rejected traffic spike
Query: filter action = "REJECT" | stats count()
Threshold: > 1000/minute (baseline-dependent)
Action: Security team notification
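The "> 0 for 5 minutes" condition mirrors CloudWatch's consecutive-evaluation-period semantics, which suppress one-off blips. A minimal sketch of that logic (function name and datapoints are illustrative):

```python
def should_alarm(datapoints, threshold=0, periods=5):
    """Fire only if the metric breaches the threshold for `periods`
    consecutive datapoints (e.g. ErrorPortAllocation > 0 for 5 minutes
    at 1-minute resolution)."""
    recent = datapoints[-periods:]
    return len(recent) == periods and all(v > threshold for v in recent)

print(should_alarm([0, 0, 3, 0, 0]))      # False: single blip, no page
print(should_alarm([0, 2, 1, 4, 1, 3]))   # True: sustained breach, page on-call
```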
Centralized Logging
Log Aggregation Architecture
VPC Flow Logs ─────┐
                   │
Firewall Logs ─────┼────► Log Aggregation ────► Analysis/SIEM
                   │      (CloudWatch, ELK,
DNS Query Logs ────┤       Splunk, Datadog)
                   │
App Logs ──────────┘
Cost Optimization
- Sample flow logs: where supported (e.g., GCP), 10% sampling cuts ingestion volume roughly 10x vs. 100%
- Custom log formats: Only capture needed fields
- Tiered storage: Hot (queries) → Warm → Cold (archive)
- Retention policies: Match compliance requirements
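When sampling, remember to scale aggregates back up during analysis; a trivial but easy-to-forget adjustment:

```python
def estimate_total_bytes(sampled_bytes: float, sample_rate: float) -> float:
    """Scale sampled flow-log byte counts back to an estimated total.
    At a 10% sample rate you ingest ~1/10 the log volume, but must
    divide aggregates by 0.10 to estimate true traffic."""
    return sampled_bytes / sample_rate

# Hypothetical: 12 GB of bytes observed in a 10% sample.
print(estimate_total_bytes(12_000_000_000, 0.10))  # ~120 GB actual
```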
Third-Party Monitoring Tools
Full-Stack Observability
- Datadog: Metrics, logs, APM, network monitoring
- New Relic: Infrastructure, APM, synthetics
- Dynatrace: AI-powered, full-stack
- Splunk: Log analysis, SIEM capabilities
Network-Focused Tools
- Kentik: Network traffic analysis, DDoS detection
- ThousandEyes: Internet/cloud visibility
- NetFlow/IPFIX analyzers: Deep traffic analysis
Network Performance Testing
Tools for Testing
# iperf3 for bandwidth testing
iperf3 -c target-server -t 30 -P 4
# mtr for path analysis
mtr -rw --report-cycles 60 target-host
# curl timing
curl -w "@timing.txt" -o /dev/null -s https://api.example.com
# timing.txt format:
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
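The write-out variables are cumulative times from request start, so per-phase latencies come from subtracting adjacent values. A sketch that parses the output of the curl command above (sample numbers are hypothetical):

```python
def parse_timing(output: str) -> dict:
    """Parse `curl -w` write-out lines of the form `name: seconds`."""
    times = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition(":")
        times[key.strip()] = float(value)
    return times

sample = """\
time_namelookup: 0.012
time_connect: 0.045
time_appconnect: 0.120
time_starttransfer: 0.180
time_total: 0.210
"""
t = parse_timing(sample)
print("DNS lookup:       ", t["time_namelookup"])
print("TCP handshake:    ", t["time_connect"] - t["time_namelookup"])
print("TLS handshake:    ", t["time_appconnect"] - t["time_connect"])
print("Server think time:", t["time_starttransfer"] - t["time_appconnect"])
```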
Key Takeaways
- VPC Flow Logs are essential for network visibility
- Monitor both infrastructure metrics and user-facing latency
- Synthetic monitoring catches issues before users do
- Alert on symptoms, include runbooks, avoid fatigue
- Centralize logs for correlation and analysis
Need Network Observability Help?
We design comprehensive monitoring solutions. Contact us for a consultation.