Easy filter, sort, and limit of JSON files with jq
jqjsondebuggingBackground / Motivation #
Lately I've been doing a lot of debugging of a data pipeline. There are a variety of ETL steps, each of which produces a big JSON file as part of an audit trail. We use these files primarily for debugging the pipeline when something goes wrong.
These are JSON files 50 megabytes or larger. They typically contain arrays of > 100,000 objects.
Regardless of your particular situation, at some point as a developer it is fairly likely you will be trying to extract some key pieces of infromation from a large JSON file as part of some debugging process. The tool jq
is a crucial option for making this easier.
With jq
we can:
- format ("pretty print") large JSON files to make them easier to read
- perform familiar operations such as filter, order, and limit
jq for fun and profit (or just fun) #
Pretty printing / formatting #
I find myself often working with JSON files which are too large for vscode to format them. Instead, I use jq
There are multiple ways to do this, here are 3 that should cover most use cases.
If you want to keep the original file #
This creates a new file my-large-json-formatted.json
and leaves the original intact.
jq . my.json > my-large-json-formatted.json
If you want to overwrite the original file #
Use a temporary file and mv
jq . my.json > my-formatted.json && mv my-formatted.json my.json
Or, if you have sponge installed
jq . my.json | sponge my.json
sponge
is part of moreutils
Filtering, sorting, limiting #
Setting #
Let's say we have a file events.json
which is an array of objects of the shape:
{
"event": "some_event_type",
"time": 1665365562000
}
Operations #
Sort by time (ascending) and get top 5
jq '[.[]] | sort_by(.time) | limit(5;.[])' events.json
Sort by time (descending) and get top 5
jq '[.[]] | sort_by(.time) | reverse | limit(5;.[])' events.json
Find 5 most recent "install" events"
jq '[.[] | select(.event == "install" )] | sort_by(.time) | reverse | limit(5;.[])' events.json
Formatting and conversions #
Unfortunately, our dates are millisecond epoch timetamps which means they aren't easy to read in the output.
We can fix that!
jq provides todate
- but it takes seconds, not milliseonds. So we need to convert:
jq '[.[]] | sort_by(.time) | reverse | limit(1;.[]) | .time / 1000 | todate' events.json