By Eric Lathrop

I wrote Living in Syndication to scrape websites and generate missing RSS feeds. I was having trouble with a certain site rate-limiting my scraping to something ridiculous like 5 requests per day! It's not like I'm hammering the site or doing anything unethical.

To work around the rate limiting, I needed to "borrow" other IP addresses for a short time. I realized that AWS Lambda has a large pool of IP addresses I could use temporarily for little to no cost. I originally tried to build an HTTP proxy inside a Lambda function, but I ran into a problem: HTTPS requests through a proxy use the HTTP CONNECT method, which turns the connection into a raw TCP tunnel, and you can't get a raw TCP connection because of the way API Gateway works.
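To make the CONNECT problem concrete, here is a minimal sketch of an ordinary single-process CONNECT proxy using only Node's built-in `http` and `net` modules. It is not the code from this project; `parseConnectTarget` and `createConnectProxy` are hypothetical names chosen for illustration. The point is that after the `200` response the proxy has to shuttle raw bytes in both directions over the client's socket, which is the part a Lambda function behind API Gateway can't do.

```javascript
const http = require("node:http");
const net = require("node:net");

// Split a CONNECT target such as "example.com:443" into host and port.
// Hypothetical helper; also accepts bracketed IPv6 literals like "[::1]:8443".
function parseConnectTarget(target) {
  const idx = target.lastIndexOf(":");
  if (idx <= 0) throw new Error(`invalid CONNECT target: ${target}`);
  const host = target.slice(0, idx).replace(/^\[|\]$/g, "");
  const port = Number(target.slice(idx + 1));
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port in CONNECT target: ${target}`);
  }
  return { host, port };
}

// Build (but do not start) a proxy server that only understands CONNECT.
function createConnectProxy() {
  const server = http.createServer((req, res) => {
    // Plain HTTP requests are out of scope for this sketch.
    res.writeHead(405, { "Content-Type": "text/plain" });
    res.end("This proxy only supports CONNECT\n");
  });

  // Node emits "connect" when a client sends "CONNECT host:port HTTP/1.1".
  server.on("connect", (req, clientSocket, head) => {
    let target;
    try {
      target = parseConnectTarget(req.url);
    } catch (err) {
      clientSocket.end("HTTP/1.1 400 Bad Request\r\n\r\n");
      return;
    }

    let established = false;
    const upstream = net.connect(target.port, target.host, () => {
      established = true;
      // Tell the client the tunnel is open, then relay raw bytes both ways.
      clientSocket.write("HTTP/1.1 200 Connection Established\r\n\r\n");
      upstream.write(head);
      upstream.pipe(clientSocket);
      clientSocket.pipe(upstream);
    });

    upstream.on("error", () => {
      if (established) clientSocket.destroy();
      else clientSocket.end("HTTP/1.1 502 Bad Gateway\r\n\r\n");
    });
    clientSocket.on("error", () => upstream.destroy());
  });

  return server;
}
```

To try it, something like `createConnectProxy().listen(8080)` followed by `curl --proxy http://localhost:8080 https://example.com/` should work; port 8080 is an arbitrary choice.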

My second attempt was to split the proxy into two pieces, a "frontend" running on my server, and a "backend" running on AWS Lambda. It works like this:

  1. Some software connects to the frontend and issues an HTTP CONNECT request.
  2. The frontend fires an HTTP request at the API Gateway / Lambda function to trigger it to start.
  3. The backend opens a WebSocket connection back to the frontend.
  4. The backend opens a TCP connection to the final destination, and pipes it to the WebSocket connection from the previous step.
  5. The frontend receives the WebSocket connection from the backend, and pipes it to the HTTP socket that got the CONNECT request.
  6. The original software can now send arbitrary bytes to the destination through the two proxies, appearing to come from the AWS Lambda IP address.
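One detail these steps leave open is how the frontend knows which waiting CONNECT socket (step 1) belongs to which incoming backend connection (step 5). The sketch below shows one possible way to do it, under the assumption that the frontend hands the backend a random session id in step 2 and the backend presents it again when it connects back. The `TunnelRegistry` class and the session id are hypothetical and not necessarily how the real project works. The backend connection is assumed to be exposed as a Node duplex stream; with the `ws` package, for example, `createWebSocketStream` can wrap a WebSocket that way.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical bookkeeping for the frontend: pairs a client socket that is
// waiting on a CONNECT request with the backend connection that arrives later.
class TunnelRegistry {
  constructor() {
    this.pending = new Map(); // session id -> waiting client socket
  }

  // Steps 1-2: remember the client socket and return an id that the
  // frontend would include in its request to the Lambda backend.
  register(clientSocket) {
    const id = randomUUID();
    this.pending.set(id, clientSocket);
    return id;
  }

  // Step 5: the backend connected back and presented an id. If a client is
  // waiting under that id, pipe the two streams together in both directions.
  attach(id, backendStream) {
    const clientSocket = this.pending.get(id);
    if (!clientSocket) return false; // unknown or already used id
    this.pending.delete(id);
    clientSocket.pipe(backendStream);
    backendStream.pipe(clientSocket);
    return true;
  }

  // Forget a client that disconnected before the backend showed up.
  cancel(id) {
    return this.pending.delete(id);
  }
}
```

Each id is removed as soon as it is used, so a second connection presenting the same id is rejected rather than being spliced into someone else's tunnel.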

I built this in Node.js, got it all working locally, and tested it with curl. When I deployed it, I ran into a bug that shows up when the frontend runs behind Traefik: Traefik routes the request to the wrong container. Hopefully it will be fixed soon, and I can get back to reading my RSS feeds.

Once I'm sure it fully works when deployed, I'll make the repository public.