Cloud Network Monitoring: Visibility and Alerting Guide
You can't fix what you can't see. Network observability is critical for troubleshooting, security, and performance optimization. This guide covers monitoring strategies across AWS, Azure, and GCP.
Network Observability Pillars
Complete network visibility requires multiple data sources:
- Flow logs: Who talked to whom, when, how much
- Metrics: Bandwidth, packet counts, latency
- Logs: Firewall decisions, DNS queries, errors
- Traces: Request path through distributed systems
VPC Flow Logs
AWS VPC Flow Logs
Capture IP traffic information for network interfaces:
# Flow log format (default v2)
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
# Example log entry:
2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 52986 443 6 15 1500 1609459200 1609459260 ACCEPT OK
- Capture level: VPC, subnet, or ENI
- Destinations: CloudWatch Logs, S3, Kinesis Data Firehose
- Custom formats: Select only needed fields to reduce cost
- Aggregation: 1-minute or 10-minute intervals
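For quick offline analysis, the default v2 record shown above can be split on whitespace into its named fields. A minimal Python sketch (the field list and helper name are illustrative, not an AWS library):

```python
# Parse an AWS VPC Flow Log record (default v2 format) into a dict.
# Field order matches: version account-id interface-id srcaddr dstaddr
# srcport dstport protocol packets bytes start end action log-status
V2_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}

def parse_flow_log(line: str) -> dict:
    record = dict(zip(V2_FIELDS, line.split()))
    for field in INT_FIELDS:
        # NODATA/SKIPDATA records log "-" for numeric fields; leave those as-is.
        if record[field] != "-":
            record[field] = int(record[field])
    return record

entry = ("2 123456789012 eni-abc123 10.0.1.100 10.0.2.50 "
         "52986 443 6 15 1500 1609459200 1609459260 ACCEPT OK")
record = parse_flow_log(entry)
print(record["dstport"], record["bytes"], record["action"])  # 443 1500 ACCEPT
```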
Query Examples (CloudWatch Logs Insights)
# Find rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100
# Top talkers by bytes
stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 10
# Traffic to specific port
fields @timestamp, srcAddr, dstAddr, bytes
| filter dstPort = 443
| stats sum(bytes) as totalBytes by srcAddr
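The same aggregations can be reproduced offline against exported logs. A sketch of the top-talkers query in Python, using hypothetical pre-parsed records:

```python
from collections import Counter

# Hypothetical parsed flow records as (srcAddr, bytes) pairs.
records = [
    ("10.0.1.100", 1500), ("10.0.1.100", 9000),
    ("10.0.2.50", 4000), ("10.0.3.7", 200),
]

# Equivalent of:
#   stats sum(bytes) as totalBytes by srcAddr | sort totalBytes desc | limit 10
totals = Counter()
for src, nbytes in records:
    totals[src] += nbytes

for src, total in totals.most_common(10):
    print(src, total)
```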
GCP VPC Flow Logs
- Metadata: Source/dest VM, zone, region included
- Sample rate: Configurable sampling rate, from a small fraction up to 100% (default 50%)
- Export: Cloud Logging, BigQuery, Pub/Sub
Azure NSG Flow Logs
- Version 2: Includes throughput information
- Storage: Azure Storage Account
- Analysis: Traffic Analytics for visualization
Network Metrics
Key Metrics to Monitor
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Network In/Out | Bandwidth utilization | >80% of capacity |
| Packets In/Out | Traffic volume | Sudden spikes |
| Connection Count | Active connections | Near limits |
| Packet Loss | Network health | >0.1% |
| Latency | Response time | P99 > baseline + 2σ |
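The latency threshold in the table (baseline + 2σ) is straightforward to derive from historical samples. A sketch with hypothetical P99 values:

```python
import statistics

def anomaly_threshold(samples, sigmas=2.0):
    """Alert threshold = baseline (mean) + N standard deviations."""
    baseline = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return baseline + sigmas * sigma

# Hypothetical P99 latency samples (ms) from a normal week:
history = [120, 118, 125, 122, 119, 121, 124]
threshold = anomaly_threshold(history)  # ~126.4 ms: alert above this
```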
AWS CloudWatch Network Metrics
- EC2: NetworkIn, NetworkOut, NetworkPacketsIn/Out
- ELB: RequestCount, Latency (Classic) / TargetResponseTime (ALB), HealthyHostCount
- NAT Gateway: BytesIn/Out, ConnectionAttemptCount, ErrorPortAllocation
- Transit Gateway: BytesIn/Out, PacketsIn/Out per attachment
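CloudWatch reports NetworkIn/NetworkOut as bytes per period, so comparing a datapoint against the ">80% of capacity" threshold takes a unit conversion. A sketch (the instance capacity is a hypothetical input you supply, e.g. from the instance type's rated bandwidth):

```python
def utilization_pct(network_out_bytes: float, period_s: float,
                    capacity_gbps: float) -> float:
    """Convert a NetworkOut datapoint (bytes per period) to % of capacity."""
    bits_per_s = network_out_bytes * 8 / period_s
    return 100.0 * bits_per_s / (capacity_gbps * 1e9)

# Hypothetical: 75 GB sent in a 300-second period on a 10 Gbps instance.
pct = utilization_pct(75e9, 300, 10)
print(f"{pct:.0f}% of capacity")  # 20% of capacity
```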
Latency Monitoring
Sources of Latency Data
- Load balancer metrics: Target response time
- Application traces: Full request timing breakdown
- Synthetic monitoring: Scheduled probes from multiple locations
- Real user monitoring: Actual user experience
Synthetic Monitoring Tools
- AWS CloudWatch Synthetics: Canary scripts with screenshots
- GCP Uptime Checks: HTTP, TCP, SSL checks globally
- Azure Application Insights: Availability tests
- Third-party: Datadog Synthetics, Pingdom, Catchpoint
# CloudWatch Synthetics canary example (Node.js runtime)
const synthetics = require('Synthetics');

const checkApi = async function () {
    // executeHttpStep hands the HTTP response to a validation callback
    await synthetics.executeHttpStep(
        'Check API',
        'https://api.example.com/health',
        async (response) => {
            if (response.statusCode !== 200) {
                throw new Error(`Health check failed: ${response.statusCode}`);
            }
        }
    );
};

exports.handler = async () => {
    return await checkApi();
};
Alerting Strategy
Alert Levels
- Info: Awareness, no action needed
- Warning: Investigation needed soon
- Critical: Immediate action required
Effective Alerting Principles
- Alert on symptoms, not causes: "Service down" not "CPU high"
- Avoid alert fatigue: Every alert should be actionable
- Use appropriate thresholds: Baseline + standard deviations
- Include runbook links: How to investigate and fix
Example Alert Configurations
# NAT Gateway port exhaustion warning
Metric: ErrorPortAllocation
Threshold: > 0 for 5 minutes
Action: Page on-call, add NAT Gateway
# VPC Flow Log rejected traffic spike
Query: filter action = "REJECT" | stats count()
Threshold: > 1000/minute (baseline-dependent)
Action: Security team notification
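The "> 0 for 5 minutes" condition mirrors CloudWatch's consecutive-evaluation-period semantics, which suppress one-off blips. A minimal sketch of that logic (function name and datapoints are illustrative):

```python
def should_alarm(datapoints, threshold=0, periods=5):
    """Fire only if the metric breaches the threshold for `periods`
    consecutive datapoints (e.g. ErrorPortAllocation > 0 for 5 minutes
    at 1-minute resolution)."""
    recent = datapoints[-periods:]
    return len(recent) == periods and all(v > threshold for v in recent)

print(should_alarm([0, 0, 3, 0, 0]))      # False: single blip, no page
print(should_alarm([0, 2, 1, 4, 1, 3]))   # True: sustained breach, page on-call
```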
Centralized Logging
Log Aggregation Architecture
VPC Flow Logs ─────┐
                   │
Firewall Logs ─────┼────► Log Aggregation ────► Analysis/SIEM
                   │      (CloudWatch, ELK,
DNS Query Logs ────┤       Splunk, Datadog)
                   │
App Logs ──────────┘
Cost Optimization
- Sample flow logs: where supported (e.g., GCP), 10% sampling cuts ingestion volume roughly 10x vs. 100%
- Custom log formats: Only capture needed fields
- Tiered storage: Hot (queries) → Warm → Cold (archive)
- Retention policies: Match compliance requirements
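When sampling, remember to scale aggregates back up during analysis; a trivial but easy-to-forget adjustment:

```python
def estimate_total_bytes(sampled_bytes: float, sample_rate: float) -> float:
    """Scale sampled flow-log byte counts back to an estimated total.
    At a 10% sample rate you ingest ~1/10 the log volume, but must
    divide aggregates by 0.10 to estimate true traffic."""
    return sampled_bytes / sample_rate

# Hypothetical: 12 GB of bytes observed in a 10% sample.
print(estimate_total_bytes(12_000_000_000, 0.10))  # ~120 GB actual
```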
Third-Party Monitoring Tools
Full-Stack Observability
- Datadog: Metrics, logs, APM, network monitoring
- New Relic: Infrastructure, APM, synthetics
- Dynatrace: AI-powered, full-stack
- Splunk: Log analysis, SIEM capabilities
Network-Focused Tools
- Kentik: Network traffic analysis, DDoS detection
- ThousandEyes: Internet/cloud visibility
- NetFlow/IPFIX analyzers: Deep traffic analysis
Network Performance Testing
Tools for Testing
# iperf3 for bandwidth testing
iperf3 -c target-server -t 30 -P 4
# mtr for path analysis
mtr -rw --report-cycles 60 target-host
# curl timing
curl -w "@timing.txt" -o /dev/null -s https://api.example.com
# timing.txt format:
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
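The write-out variables are cumulative times from request start, so per-phase latencies come from subtracting adjacent values. A sketch that parses the output of the curl command above (sample numbers are hypothetical):

```python
def parse_timing(output: str) -> dict:
    """Parse `curl -w` write-out lines of the form `name: seconds`."""
    times = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition(":")
        times[key.strip()] = float(value)
    return times

sample = """\
time_namelookup: 0.012
time_connect: 0.045
time_appconnect: 0.120
time_starttransfer: 0.180
time_total: 0.210
"""
t = parse_timing(sample)
print("DNS lookup:       ", t["time_namelookup"])
print("TCP handshake:    ", t["time_connect"] - t["time_namelookup"])
print("TLS handshake:    ", t["time_appconnect"] - t["time_connect"])
print("Server think time:", t["time_starttransfer"] - t["time_appconnect"])
```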
Key Takeaways
- VPC Flow Logs are essential for network visibility
- Monitor both infrastructure metrics and user-facing latency
- Synthetic monitoring catches issues before users do
- Alert on symptoms, include runbooks, avoid fatigue
- Centralize logs for correlation and analysis
Need Network Observability Help?
We design comprehensive monitoring solutions. Contact us for a consultation.