I wrote Living in Syndication to scrape websites and generate missing RSS feeds. I was having trouble with a certain site rate-limiting my scraping to something ridiculous like 5 requests per day! It's not like I'm hammering the site or doing anything unethical.
To work around the rate limiting, I needed to "borrow" other IP addresses for a short time. I realized that AWS Lambda has a large pool of IP addresses I could use temporarily for little to no cost. I originally tried to build an HTTP proxy inside a Lambda function, but I ran into a problem: HTTPS requests through a proxy use the HTTP CONNECT method, which turns the connection into a raw TCP tunnel, and a Lambda function never gets that raw TCP connection because of the way API Gateway works.
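To make the problem concrete, here is a minimal sketch of an ordinary single-process CONNECT proxy using Node's built-in `http` and `net` modules. It is not the code from this project, and the names `parseConnectTarget` and `createConnectProxy` are made up for illustration. The point is that the handler receives the client's raw socket and pipes bytes in both directions, which is exactly the part that has no equivalent inside a Lambda function behind API Gateway.

```javascript
const http = require('node:http');
const net = require('node:net');

// Split the "host:port" authority of a CONNECT request into its parts.
// (Hypothetical helper; brackets around IPv6 literals are stripped.)
function parseConnectTarget(authority) {
  const i = authority.lastIndexOf(':');
  if (i <= 0) throw new Error(`bad CONNECT target: ${authority}`);
  const host = authority.slice(0, i).replace(/^\[|\]$/g, '');
  const port = Number(authority.slice(i + 1));
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`bad CONNECT port: ${authority}`);
  }
  return { host, port };
}

// Build (but do not start) a proxy that only understands CONNECT.
function createConnectProxy() {
  const server = http.createServer((req, res) => {
    res.writeHead(405, { 'Content-Type': 'text/plain' });
    res.end('Only CONNECT is supported\n');
  });

  // Node emits 'connect' with the raw client socket for CONNECT requests.
  server.on('connect', (req, clientSocket, head) => {
    let target;
    try {
      target = parseConnectTarget(req.url);
    } catch {
      clientSocket.end('HTTP/1.1 400 Bad Request\r\n\r\n');
      return;
    }

    let established = false;
    const upstream = net.connect(target.port, target.host, () => {
      established = true;
      clientSocket.write('HTTP/1.1 200 Connection Established\r\n\r\n');
      upstream.write(head); // bytes the client sent right after the request
      upstream.pipe(clientSocket);
      clientSocket.pipe(upstream);
    });

    upstream.on('error', () => {
      if (established) clientSocket.destroy();
      else clientSocket.end('HTTP/1.1 502 Bad Gateway\r\n\r\n');
    });
    clientSocket.on('error', () => upstream.destroy());
  });

  return server;
}
```

Started with something like `createConnectProxy().listen(8080)`, this could be exercised with `curl -x http://localhost:8080 https://example.com/` (the port is arbitrary).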
My second attempt was to split the proxy into two pieces, a "frontend" running on my server, and a "backend" running on AWS Lambda. It works like this:
- Some software connects to the frontend, and issues an HTTP CONNECT command.
- The frontend fires an HTTP request to the API Gateway / Lambda function to trigger it to start.
- The backend opens a WebSocket connection back to the frontend.
- The backend opens a TCP connection to the final destination, and pipes it to the WebSocket connection from the previous step.
- The frontend receives the WebSocket connection from the backend, and pipes it to the HTTP socket that got the CONNECT request.
- The original software can now send arbitrary bytes to the destination through the two proxies, appearing to come from the AWS Lambda IP address.
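The frontend's half of these steps can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `pending`, `onConnect`, `attachBackend`, and `createFrontend` are hypothetical names, `triggerBackend` stands in for whatever HTTP call starts the Lambda function, and the backend's WebSocket is represented as a generic Node `Duplex` stream (a WebSocket library would have to supply that wrapper). The idea is a rendezvous: each CONNECT gets an id, the client socket waits in a map, and the two ends are piped together when the backend dials in with that id.

```javascript
const http = require('node:http');
const crypto = require('node:crypto');

// CONNECT requests that are still waiting for a backend, keyed by tunnel id.
const pending = new Map();

// A client sent CONNECT: park its socket and ask the backend to dial in.
// `triggerBackend` is a hypothetical callback, e.g. an HTTPS request to the
// API Gateway endpoint carrying the id and the "host:port" target.
function onConnect(req, clientSocket, head, triggerBackend) {
  const id = crypto.randomUUID();
  pending.set(id, { clientSocket, head });
  clientSocket.on('close', () => pending.delete(id));

  // Wrapping in a Promise turns synchronous throws into rejections too.
  new Promise((resolve) => resolve(triggerBackend({ id, target: req.url })))
    .catch(() => {
      if (pending.delete(id)) {
        clientSocket.end('HTTP/1.1 502 Bad Gateway\r\n\r\n');
      }
    });
  return id;
}

// The backend's connection for `id` arrived, exposed as a Duplex stream.
// Returns false (and drops the stream) if nobody is waiting for that id.
function attachBackend(id, backendStream) {
  const entry = pending.get(id);
  if (!entry) {
    backendStream.destroy();
    return false;
  }
  pending.delete(id);

  const { clientSocket, head } = entry;
  clientSocket.write('HTTP/1.1 200 Connection Established\r\n\r\n');
  if (head.length > 0) backendStream.write(head);
  clientSocket.pipe(backendStream);
  backendStream.pipe(clientSocket);

  // If either side fails or closes, tear down the other one as well.
  const teardown = () => {
    clientSocket.destroy();
    backendStream.destroy();
  };
  for (const stream of [clientSocket, backendStream]) {
    stream.on('error', teardown);
    stream.on('close', teardown);
  }
  return true;
}

// Wire onConnect into an HTTP server; accepting the backend's WebSocket and
// calling attachBackend is left out because it depends on the library used.
function createFrontend(triggerBackend) {
  const server = http.createServer((req, res) => {
    res.writeHead(405);
    res.end();
  });
  server.on('connect', (req, socket, head) => {
    onConnect(req, socket, head, triggerBackend);
  });
  return server;
}
```

A real frontend would also want a timeout that removes entries from `pending` when the backend never shows up, and some way to authenticate the backend before attaching it.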
I built this in Node.js, got it all working locally, and tested it with curl.
When I deployed it, I ran into a bug when the frontend is running behind Traefik, where Traefik routes the request to an incorrect container. Hopefully it will be fixed soon, and I can get back to reading my RSS feeds.
Once I'm sure it fully works when deployed, I'll make the repository public.