Monitoring VPN / Direct Connect Connectivity

In an on-premises network it is typical to do some type of topology discovery via the MIB-II SNMP tables of all your network gear, and then use ICMP to ping each discovered router's or switch's management interface to validate connectivity. That data contributes intelligence when parts of the network go dark and speeds up root cause analysis.

There are a few challenges with this basic topology continuity testing once AWS gets hooked up to the enterprise network.

  1. No SNMP discovery. You will need to write some type of adapter that integrates with the AWS EC2 API to discover the AWS network topology and translate it into your existing model of the network (a rough sketch follows this list).

  2. No ICMP to check router connectivity. You are not able to ICMP ping AWS VPC routers from your network management servers.
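A rough sketch of what such a discovery adapter might look like, assuming boto3 with credentials and region already configured; it mostly comes down to walking the `describe_*` calls and reshaping the output for whatever model your NMS uses (all names and structure here are illustrative, not a finished tool):

```python
# Sketch only: assumes boto3 credentials/region are configured and the
# account is small enough that unpaginated describe_* responses are fine.
import json

import boto3

ec2 = boto3.client("ec2")


def discover_vpc_topology():
    """Pull VPCs, subnets and route tables so they can be translated
    into the NMS's existing model of the network."""
    topology = {}
    for vpc in ec2.describe_vpcs()["Vpcs"]:
        topology[vpc["VpcId"]] = {
            "cidr": vpc["CidrBlock"],
            "subnets": [],
            "routes": [],
        }

    for vpc_id, entry in topology.items():
        by_vpc = [{"Name": "vpc-id", "Values": [vpc_id]}]

        for subnet in ec2.describe_subnets(Filters=by_vpc)["Subnets"]:
            entry["subnets"].append({
                "id": subnet["SubnetId"],
                "cidr": subnet["CidrBlock"],
                "az": subnet["AvailabilityZone"],
            })

        for table in ec2.describe_route_tables(Filters=by_vpc)["RouteTables"]:
            for route in table["Routes"]:
                entry["routes"].append({
                    "destination": route.get("DestinationCidrBlock"),
                    "target": route.get("GatewayId")
                    or route.get("VpcPeeringConnectionId")
                    or route.get("NatGatewayId"),
                })

    return topology


if __name__ == "__main__":
    print(json.dumps(discover_vpc_topology(), indent=2))
```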

ICMP and the AWS EC2 VPC router

There are two constructs at play in an AWS VPC:

  • VPC - A logical construct that spans the region and gives you a main route table with a local route for the VPC's CIDR, keeping traffic inside the VPC.

  • Subnet - A more tangible, Availability Zone specific construct that creates a router for that AZ and provides AZ isolation. The first usable IP address of the subnet is used as the local next hop within the subnet and will respond to ICMP from EC2 instances inside the subnet (a quick sketch of this address follows).
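
As a quick illustration of where that next-hop address sits (the CIDR below is made up), the implied router address is simply the subnet's network address plus one:

```python
# Sketch: compute the implied VPC router address for a subnet.
# AWS reserves the first usable address (network address + 1) of
# every subnet for the VPC router. The CIDR here is an example.
import ipaddress

subnet = ipaddress.ip_network("10.20.30.0/24")
vpc_router_ip = subnet.network_address + 1  # 10.20.30.1

print(f"Implied VPC router address for {subnet}: {vpc_router_ip}")
```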

From inside a VPC subnet you can ICMP ping that subnet's VPC router IP. But you cannot ping the router from outside the subnet; to be clear, you also cannot ping it from another subnet in the same VPC.

This is also true over VPC peering or Direct Connect: you are not able to ping the VPC subnet router over either a peering connection or DX.

For some reason a ping to a peered subnet's router doesn't show up in the VPC Flow Logs on either end, and this is stated in the documentation:

Flow logs do not capture all IP traffic. The following types of traffic are not logged:

  • Traffic to the reserved IP address for the default VPC router. …

What about using AWS Lambda to ICMP ping across the VPN or Direct Connect?

This will not work: the Lambda container has CAP_NET_RAW disabled, which means you have no way to craft an ICMP echo request packet from any Lambda runtime. You can, however, use plain TCP or UDP instead. For example, run an HTTP server on premises and have a Lambda in each VPC check in every 5 minutes; the HTTP server can translate those check-ins into an availability metric on your NMS to help with topology and segment availability calculations. In the case of a full AWS connectivity failure, you can treat a lack of heartbeats from all the Lambdas/VPCs as a link failure and consolidate the events to reduce noise. (Some NMS platforms will do this for you without any extra work.)
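
A rough sketch of such a heartbeat function, assuming an on-prem HTTP endpoint reachable over the VPN/DX and the function deployed in each VPC on a 5-minute schedule; HEARTBEAT_URL and VPC_ID are hypothetical environment variables, and the on-prem side that turns check-ins into an NMS metric is left out:

```python
# Heartbeat Lambda sketch: POST a small payload to an on-prem endpoint.
# If the VPN/DX path (or the VPC itself) is broken, the heartbeat simply
# never arrives, and the absence is the signal the NMS acts on.
import json
import os
import urllib.request

HEARTBEAT_URL = os.environ["HEARTBEAT_URL"]   # e.g. an on-prem collector URL
VPC_ID = os.environ.get("VPC_ID", "unknown")  # tag the beat with its source VPC


def handler(event, context):
    payload = json.dumps({"vpc": VPC_ID}).encode()
    req = urllib.request.Request(
        HEARTBEAT_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Keep the timeout short; a timeout or connection error just means
    # no heartbeat this cycle.
    with urllib.request.urlopen(req, timeout=5) as resp:
        return {"status": resp.status}
```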

How to monitor all the VPCs

It is possible to ping the Direct Connect VIF IP address, but ICMP traffic is de-prioritized on the AWS network. You may observe ICMP packet loss when AWS drops those packets due to traffic engineering, even while other traffic is still flowing.

There is no way to ping all your VPC subnet routers to detect when things go bad, and things can go bad in ways that only affect certain VPCs. Some examples I have had the misfortune to experience (a simple TCP probe sketch follows the list):

  • NACLs introduced that cause traffic to be blocked

  • Security group configuration errors

  • BGP route removed or dropped at either end of the link due to error, human or otherwise

  • Sloppy changes to the VPC main or subnet route tables

  • Link level problems (trench digger in the street)

  • Datacenter migrations unknowingly introducing asymmetric routing that causes stateful firewalls to drop traffic

  • Firewall bugs that cause some traffic to get dropped depending on the weather conditions on Kepler-452b

  • MTU mismatches

  • Network changes in the datacenter: "What is this VLAN? I don't think we need it anymore... no vlan 123... oh no, the VIF is down!"
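
Going back to the plain TCP idea from the Lambda section, a simple probe run from the NMS side can at least give you per-VPC segment reachability in place of the ICMP checks you cannot do. This is only a sketch: it assumes you have some known listener in each VPC (SSH, HTTPS, a health-check service), and the target list below is made up.

```python
# On-prem side sketch: TCP connect probes toward a known listener in each
# VPC, as a stand-in for pinging the unreachable VPC subnet routers.
import socket
import time

# Hypothetical per-VPC targets: (host, port) of anything known to listen.
TARGETS = {
    "vpc-prod": ("10.10.1.10", 443),
    "vpc-dev": ("10.20.1.10", 22),
}


def probe(host, port, timeout=3):
    """Return the TCP connect time in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


if __name__ == "__main__":
    for vpc, (host, port) in TARGETS.items():
        rtt = probe(host, port)
        status = f"up ({rtt * 1000:.1f} ms)" if rtt is not None else "DOWN"
        print(f"{vpc} {host}:{port} {status}")
```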

War story on network outages

Tron not authorized

I once experienced a DX outage because the AWS DX partner we purchased our link from was unknowingly dependent on another vendor (vendor B) to reach AWS. Vendor B notified the partner that the link would be terminated, but it seems nobody at the partner remembered the importance of the link or bothered to check what it was for. So vendor B decommissioned the link, and I experienced an outage.

I was only able to discover what caused this after a heap of back and forth between AWS and the partner that went nowhere, so I decided to take matters into my own hands.

Since my equipment was down the road from this datacenter I decided to walk there and talk to them. I was able to Jedi-magic my way into the datacenter and explained the situation to the local DC dude (a case of beer goes a long way). I was able to tell him the exact location of the partner's device and the port we connected to, so he helped me chase the thankfully small number of cables that came out of it. We took a lift to where the DC manager's paperwork said the cable terminated, only to find another room full of racks. As we navigated our way down the aisles we got to one spot where there was no rack between many other racks. Not an empty rack. No rack - at all, just floor tile and some cables hanging out of the overhead. (I assume the DC manager also took a case of beer to turn a blind eye to this DC no-no.)

The DC manager casually informed me the rack had been removed just a few hours before. As is typical with your DC manager type of person, he was unconcerned with putting two and two together. I suspect he knew all along but wanted to appear to be working hard for his case of beer.

I also suspect it was no coincidence he happened to pick the cable that went to the rack he authorized to be removed. Come to think of it I actually saw them loading the rack into a truck at the dock as I arrived.

Long story short: the partner (vendor A) begged vendor B to undo the change, vendor B in turn gave the DC manager another case of beer, the truck came back, and not long after my network was back.

The moral of the story is both that crazy stuff happens on the network, and that the security of the biggest, most reputable datacenters in the world is not the best 🤷‍♂️. But I do appreciate the DC manager's help and understanding, what a champ!

There were a few things working for me in order to Jedi 🧙‍♂️ my way into the datacenter (one of them being that I had a pass to one of their other DCs, just not this one). I am sure that any person walking in off the street would not be able to easily pull it off, and I was supervised the entire time I was in there.
