Application calling AWS internal load balancer in same subnet is timing out

Question

Some background:

I've created a moderately complex network using Amazon VPC. It's a three-tiered network across two availability zones. Each layer has a subnet in zone-a and zone-b. The presentation layer is at the top, there’s an application layer in the middle, and a core layer at the bottom.

All of the security groups and ACLs for the subnets currently allow ALL inbound and outbound traffic, to rule them out as the cause of the issue.

The presentation layer’s routing table is pointing all traffic to an internet gateway. The NAT gateway is in a segregated subnet also pointing all traffic to the internet gateway.

My application has two components, a UI (React.js) and an API (Node/Express). They are deployed as docker images. In front of each is a classic load balancer.

The UI-ELB is internet facing and resides in the presentation layer, routing traffic from 80/443 to port 8080 and is associated with my app-ec2 that is placed in the application-layer subnet.

My API has an internal load balancer in front of it. The API-ELB is in the application layer (in the same subnet as the app-ec2), and takes traffic on port 80/443 and routes it down to the api-ec2 in the core on port 3000.

Both load balancers are offloading the certificate before passing traffic to their instances.

I have both of my load balancers set up as aliases in Route53, and referenced in the applications by their pretty URLs (e.g. https://app.website.com). Each load balancer passes the defined healthchecks and reports all EC2 instances as in service.

Lastly, on the API I have enabled CORS using the cors Node.js package.

Here's a quick and dirty diagram of my network.

The problem:

The APP-ELB successfully routes me to the application. However, when the app tries to send a GET request to the API-ELB, it first sends an OPTIONS request that times out with the error code 408.

Where it gets weird

Some of the weirdest things I've encountered while debugging are:

I can SSH into the app-ec2 instance and can run a successful curl against the API-ELB. I’ve tried many, and they all work. A few examples are: curl -L https://api.website.com/system/healthcheck and curl -L -X OPTIONS https://api.website.com/system/healthcheck. It always returns the desired information.
I've moved the entire application out of my network into a public default vpc and it works as it's supposed to.
I have the api-ec2 writing all network requests to the console. While it shows the healthcheck requests, it does not show any requests from the app-ec2. This leads me to believe traffic is not even reaching the api.

Really the biggest thing that has me at a complete loss is that curling the internal api elb works, but the axios request to the same exact url does not. This doesn't make sense to me at all.

What I've tried

I originally spent a lot of time playing with ACL rules and security groups thinking I did something wrong. Eventually I just said, "screw it", and opened everything up to try and take that piece out of the equation.

I've spent way too much time playing with CORS on my API, eventually landing on the configuration I have now, which is the default app.use(cors()) call provided by the cors Node package. I've also included the app.options('*', cors()) that is recommended in the documentation.

I've googled everything under the sun, specifically whether I need to define some special custom headers on the ELBs, but I can't seem to find anything. Plus, when I moved my app out of the network it worked just fine.

I'm sure I've tried many other things, but these seem to be the most pertinent. What am I missing? I realize this is potentially a very vague and broad issue, and an enormous post, but I appreciate any insight and your time in reading it!

score 9 · Accepted Answer · answered Jun 07 '17 at 08:38

So what you have actually is this:

As your API ELB is in a private zone, it can't be accessed from the internet.
Your React.js frontend runs in the user's browser and not on the UI servers; those servers just serve static files.
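One way to see this for yourself is to check what the API hostname resolves to. The following is a minimal sketch, assuming Node.js and only its built-in dns module; the helper names isPrivateIPv4 and resolvesToPrivateOnly are made up for this illustration. If the name resolves only to private (RFC 1918) addresses, a browser outside the VPC has no route to it, while curl on the app-ec2 does.

```javascript
const dns = require('dns').promises;

// Hypothetical helper: true if the IPv4 address is in an RFC 1918 private range
// (10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16).
function isPrivateIPv4(ip) {
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(ip)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  const parts = ip.split('.').map(Number);
  if (parts.some((p) => p > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  const [a, b] = parts;
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Hypothetical helper: resolve a hostname and report whether every A record
// it returns is a private address.
async function resolvesToPrivateOnly(hostname) {
  const records = await dns.lookup(hostname, { family: 4, all: true });
  return records.every((r) => isPrivateIPv4(r.address));
}
```

Calling resolvesToPrivateOnly('api.website.com') (the hostname from the question) from any machine shows whether the name points at private addresses only. A browser on the public internet cannot connect to such addresses, which would be consistent with the preflight request timing out.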

You have two options: configure your frontend servers to forward API calls to the API ELB, or update the API ELB to be internet facing.

The usual pitfall with JavaScript apps is forgetting that they run inside the user's browser and not on the frontend servers, as a JEE application would.
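The first option could look roughly like the sketch below, which uses only Node's built-in http and https modules. The /api path prefix, the helper names and port 8080 are assumptions for illustration (8080 is the port the question says the UI-ELB forwards to); api.website.com is the hostname from the question. This is not the asker's actual server, and a real UI server would also serve the static files.

```javascript
const http = require('http');
const https = require('https');

// Assumptions: the internal API ELB hostname from the question, and a
// hypothetical path prefix that the browser code would call instead.
const API_HOST = 'api.website.com';
const API_PREFIX = '/api';

// Map a browser-facing URL to the path on the API, or null if it is not an API call.
function toUpstreamPath(url) {
  const isApiCall =
    url === API_PREFIX ||
    url.startsWith(API_PREFIX + '/') ||
    url.startsWith(API_PREFIX + '?');
  if (!isApiCall) return null;
  const rest = url.slice(API_PREFIX.length);
  return rest.startsWith('/') ? rest : '/' + rest;
}

// Build a server that forwards API calls to the internal ELB from inside the VPC.
function createProxy() {
  return http.createServer((req, res) => {
    const upstreamPath = toUpstreamPath(req.url);
    if (upstreamPath === null) {
      res.writeHead(404);
      res.end('not an API route');
      return;
    }
    const upstream = https.request(
      {
        host: API_HOST,
        port: 443,
        method: req.method,
        path: upstreamPath,
        headers: { ...req.headers, host: API_HOST },
      },
      (upstreamRes) => {
        // Relay the API's status, headers and body back to the browser.
        res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
        upstreamRes.pipe(res);
      }
    );
    upstream.on('error', () => {
      if (!res.headersSent) res.writeHead(502);
      res.end('upstream error');
    });
    req.pipe(upstream);
  });
}

// To run it on the UI instance: createProxy().listen(8080);
```

With something like this in place, the browser would request https://app.website.com/api/system/healthcheck, and the UI server, which sits inside the VPC, would make the call to the internal ELB. Because the request is then same-origin from the browser's point of view, no CORS preflight should be needed.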

James Shewey · Answer 2 · 2017-06-06T15:37:50.840

This sounds like an asymmetric or n-path routing issue. Here is what is probably happening:

Machine A at IP address 192.168.1.1 initiates a [SYN] request through the LB at 192.168.1.10. The LB then proxies the payload to Machine B at 192.168.1.2, so the payload now has source 192.168.1.1 and destination 192.168.1.2 (which used to be 192.168.1.10).

So what happens now when 192.168.1.2 responds with a [SYN, ACK]? What should happen is that Machine B responds to Machine A through the load balancer, typically because a default route or gateway on the server routes traffic through the LB. In this case, however, the machines are on the same subnet, so the route/gateway is not used and the server's routing table is bypassed. This means that when the server responds, the [SYN, ACK] appears to Machine A to come from a different IP than the one it sent the request to: it was expecting a source IP of 192.168.1.10 (the LB) but sees a [SYN, ACK] coming from 192.168.1.2 (Machine B). The LB is therefore unable to establish a connection with Machine B in this scenario, because the response went to the wrong device.
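The same-subnet decision described above can be sketched in a few lines of JavaScript. The /24 prefix length is an assumption (the addresses in the example suggest it, but no mask is given), and the function names are made up for this illustration.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True if both addresses fall inside the same network for the given prefix length.
function sameSubnet(ipA, ipB, prefixLength) {
  const mask = prefixLength === 0 ? 0 : (0xffffffff << (32 - prefixLength)) >>> 0;
  return ((ipToInt(ipA) & mask) >>> 0) === ((ipToInt(ipB) & mask) >>> 0);
}

// How a server would send its reply: straight to the client when the client
// is on its own subnet, otherwise through its gateway.
function replyPath(serverIp, clientIp, prefixLength) {
  return sameSubnet(serverIp, clientIp, prefixLength) ? 'direct' : 'via gateway';
}

// Machine B (192.168.1.2) replying to Machine A (192.168.1.1), assuming a /24:
console.log(replyPath('192.168.1.2', '192.168.1.1', 24)); // prints "direct"
```

For a client outside the subnet, for example replyPath('192.168.1.2', '10.0.0.5', 24), the result is 'via gateway', which is the case where the reply would pass back through the device acting as the gateway.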

The reason this works for external traffic is because of your default route - the responses to everyone else are routed through the ELB. The ELB sees that it was initiating a connection and automagically intercepts the response and swaps the source of 192.168.1.2 back to 192.168.1.10.

So, for one solution to this issue, you could implement one-armed load balancing (also known as a load balancer on a stick). What this will do is use a Source NAT on the inside interface of the load balancer (so assume you had outside interface 192.168.1.10 on your load balancer and 192.168.1.11 on the inside interface). This will make all traffic appear to be coming from 192.168.1.11 from the perspective of Machine B which should solve your connection issue.

It appears, however, that your AWS ELB doesn't support SNAT, so you will either need to put your hosts and ELB on different subnets or use something that supports SNAT, like F5's Virtual Edition, which comes in hourly or BYOL flavors. Beware of connection limitations with SNAT though: if you need over about 30k simultaneous connections you will run into SNAT port exhaustion and need to start using a SNAT pool.

Hence, your best solution (for cost and to prevent future issues) would be to make sure the client and server are on different subnets.

The best way to confirm would be to use tcpdump on the connecting host and/or back-end server and look for responses coming directly to/from the back-end server instead of going through the load balancer. You can then load your dump file into Wireshark to figure out exactly what is going on.
