Simple ETL with bash, jq, and SQLite
I run a couple of simple ETL pipelines for side projects, and after experimenting with Lambda to kick off Python jobs (and finding it frustrating), I've settled on simple bash scripts (along with the awesome `jq` util) run via cron on a DigitalOcean instance. Not only is it more stable, but I've also found it to be much faster. Recently, a thread on HN convinced me to try adding SQLite to the mix, and I've been happy with the results.
For my Twitter timeline scrape, I run a script every five minutes that scrapes the last 200 timeline tweets (via `twurl`), converts the JSON into a CSV, and imports that CSV into a SQLite db. A unique key on the `id` col prevents duplicates, and so far the setup has proven solid.
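To make the unique key concrete, here is a minimal sketch of the one-time table setup; the `tweets.db` file name and every column besides `id` are assumptions, not the actual schema from the pipeline:

```bash
# One-time setup: a unique id means re-imported tweets are rejected
# as constraint violations instead of piling up as duplicates.
# (Hypothetical schema -- the real table likely keeps more fields.)
sqlite3 tweets.db "CREATE TABLE IF NOT EXISTS tweets (
  id          INTEGER PRIMARY KEY,  -- tweet id, enforces uniqueness
  created_at  TEXT,
  screen_name TEXT,
  text        TEXT
);"
```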
The trick is using `jq`'s `@csv` filter (along with the `-r` flag to not escape the output as JSON) to output the tweets to CSV:
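(The field selection and file paths below are my assumptions; the command illustrates the approach rather than reproducing the original script.)

```bash
# Pull the most recent 200 home-timeline tweets and flatten each one
# into a CSV row. Field names follow the Twitter v1.1 home_timeline
# payload; which fields to keep is an assumption.
twurl "/1.1/statuses/home_timeline.json?count=200" \
  | jq -r '.[] | [.id_str, .created_at, .user.screen_name, .text] | @csv' \
  > /tmp/tweets.csv
```

`@csv` quotes string fields, so tweet text containing commas or newlines still lands in a single CSV field.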
Then all that remains is to import the CSV into your SQLite db:
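(Same assumed file and table names as above.)

```bash
# .mode csv has to be set before .import, so it goes in a -cmd flag;
# the .import itself runs as the final positional command.
sqlite3 -cmd ".mode csv" tweets.db ".import /tmp/tweets.csv tweets"
```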
You can pass a single command to `sqlite3`'s CLI as its final argument, but when you need to chain multiple commands (importing from CSV requires `.mode csv` first), pass each of the first N-1 commands with its own `-cmd` flag; they are executed in order before the final command runs.
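For example, stacking a second setting on top of `.mode csv` looks like this (the `.timeout` value is purely illustrative, not a recommendation from the original setup):

```bash
# Each -cmd runs in the order given, before the final positional
# command, so both settings are in effect when .import executes.
sqlite3 \
  -cmd ".timeout 5000" \
  -cmd ".mode csv" \
  tweets.db \
  ".import /tmp/tweets.csv tweets"
```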
The HN thread gives some tips (recommended SQLite pragmas, etc.) for using this approach in larger pipelines, and I'm looking forward to trying it with my transit scraper (thousands of vehicle locations updating every minute).