Save Copies of All Data You Import

By Eric Lathrop

Working as a programmer today, you end up writing a lot of plumbing code, gluing systems under your control to external systems. Many times you have to build processes to ingest data from other companies. Sometimes you pull the data: querying an HTTP API, downloading CSV files from an FTP server, or even connecting directly to a partner's database. Other times, data is pushed to you through your own API, webhooks, or a shared S3 bucket.

Over years of building such data-ingestion processes, I've noticed a design pattern that has helped me recover from errors, made my systems more reliable, and even helped debug the systems generating the data: whenever you ingest external data, save a raw, unmodified copy of it for as long as you can afford to.
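As a concrete illustration, here's a minimal sketch of that archiving step in Python. The `ARCHIVE_DIR` path, the `source` label, and the file-naming scheme are all illustrative choices, not a prescription; in production the archive might just as well be an S3 bucket with versioning enabled.

```python
import hashlib
import os
from datetime import datetime, timezone

# Hypothetical archive root; in practice this could be an S3 bucket instead.
ARCHIVE_DIR = "/var/data/ingest-archive"

def archive_raw(source: str, payload: bytes) -> str:
    """Write the raw, unmodified payload to a timestamped, content-hashed file."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    digest = hashlib.sha256(payload).hexdigest()[:12]  # disambiguates same-instant arrivals
    directory = os.path.join(ARCHIVE_DIR, source)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{timestamp}-{digest}.raw")
    with open(path, "xb") as f:  # "x" mode refuses to silently overwrite
        f.write(payload)
    return path
```

The important property is that archiving happens before any parsing, so even a payload your importer can't handle yet still lands safely on disk.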

The first benefit of this pattern is that if your import code ever processes some data incorrectly, or you realize you skipped over a data field, you can just re-process the original data from the archive once your code is fixed. The best part is that you won't have to talk to the person or company on the other end: you already have all the data and won't need it resent.
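Once the fix ships, re-processing can be a one-off script that replays the archive through your importer. A rough sketch, assuming the same hypothetical `ARCHIVE_DIR` layout as above, with your fixed import function passed in as a parameter:

```python
import glob
import os
from typing import Callable

ARCHIVE_DIR = "/var/data/ingest-archive"  # same hypothetical location as above

def reprocess_archive(source: str, process: Callable[[bytes], None]) -> None:
    """Replay every archived payload through the current import code."""
    for path in sorted(glob.glob(os.path.join(ARCHIVE_DIR, source, "*.raw"))):
        with open(path, "rb") as f:
            process(f.read())  # your fixed parse-and-insert step
```

A call like `reprocess_archive("partner-feed", import_record)` replays everything without asking the partner to resend a byte.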

Another benefit comes up during development and testing. When you return to the code in the future, you'll have tons of real data you can run through your system to catch any regressions.
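For example, archived payloads can be copied into a test fixtures directory and fed straight into a parametrized test. This sketch assumes pytest, and the `myapp.ingest` module and `parse_payload` entry point are hypothetical stand-ins for your own importer:

```python
import glob

import pytest

from myapp.ingest import parse_payload  # hypothetical: your importer's parse step

# Real payloads copied out of the archive and into the repo as fixtures.
FIXTURES = sorted(glob.glob("tests/fixtures/partner-feed/*.raw"))

@pytest.mark.parametrize("path", FIXTURES)
def test_parser_accepts_real_payloads(path):
    with open(path, "rb") as f:
        payload = f.read()
    # Data the partner actually sent should never crash the parser.
    assert parse_payload(payload) is not None
```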

Many times I've come across a communication problem between my system and an external system, where the owner pushes back and says the error is on my side. After looking through my archive of the data they've sent me, I've been able to show that they were sending broken or incomplete data. I can even email them back their own data, which they no longer had, helping get the issue solved faster!

I usually split my ingest processes into two parts. The first part permanently stores a read-only archive of the data, and the second part processes the data in the archive (parsing it, inserting it into my database, and so on). Splitting the system this way gives you a reliability benefit: the data-receiving code is simple and rarely changes, so your importing code can go offline without you ever losing data. The data will be waiting safely in your archive when the importer comes back up.
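Sketched with the standard library, the split might look like this: the receiver does nothing but archive bytes and acknowledge, while the processor walks the archive and marks each file it has imported. The `archive_raw` helper is the one from the first sketch, and the `.done` marker files are just one illustrative way to track progress:

```python
import glob
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

ARCHIVE_DIR = "/var/data/ingest-archive"  # same hypothetical location as above

class IngestReceiver(BaseHTTPRequestHandler):
    """Part one: accept the push and archive it verbatim. No parsing here."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        archive_raw("partner-feed", payload)  # the helper from the first sketch
        self.send_response(202)  # acknowledge receipt, nothing more
        self.end_headers()

def run_processor(source, process):
    """Part two: walk the archive, import anything new, and mark it done."""
    for path in sorted(glob.glob(os.path.join(ARCHIVE_DIR, source, "*.raw"))):
        marker = path + ".done"
        if os.path.exists(marker):
            continue  # already imported on a previous run
        with open(path, "rb") as f:
            process(f.read())
        open(marker, "w").close()  # record progress so restarts are safe

if __name__ == "__main__":
    HTTPServer(("", 8080), IngestReceiver).serve_forever()
```

Because the receiver acknowledges before any parsing happens, a bug in the processor never causes a partner's push to be lost; restart the processor and it picks up where the markers leave off.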

I've gotten a lot of mileage out of this pattern over the years, and the costs are usually negligible: a little extra code to archive the data, plus the disk space to store it all. It's always been worth it in my experience.