Joseph O'Shea

Easy filter, sort, and limit of JSON files with jq

jqjsondebugging

Background / Motivation

Lately I've been doing a lot of debugging of a data pipeline. There are a variety of ETL steps, each of which produces a big JSON file as part of an audit trail. We use these files primarily for debugging the pipeline when something goes wrong.

These are JSON files 50 megabytes or larger. They typically contain arrays of > 100,000 objects.

Regardless of your particular situation, at some point as a developer it is fairly likely you will be trying to extract some key pieces of infromation from a large JSON file as part of some debugging process. The tool jq is a crucial option for making this easier.

With jq we can:

jq for fun and profit (or just fun)

Pretty printing / formatting

I find myself often working with JSON files which are too large for vscode to format them. Instead, I use jq

There are multiple ways to do this, here are 3 that should cover most use cases.

If you want to keep the original file

This creates a new file my-large-json-formatted.json and leaves the original intact.

jq . my.json > my-large-json-formatted.json

If you want to overwrite the original file

Use a temporary file and mv

jq . my.json > my-formatted.json && mv my-formatted.json  my.json

Or, if you have sponge installed

jq . my.json | sponge my.json

sponge is part of moreutils

Filtering, sorting, limiting

Setting

Let's say we have a file events.json which is an array of objects of the shape:

{
  "event": "some_event_type",
  "time": 1665365562000
}

Operations

Sort by time (ascending) and get top 5

jq '[.[]] | sort_by(.time) | limit(5;.[])' events.json

Sort by time (descending) and get top 5

jq '[.[]] | sort_by(.time) | reverse | limit(5;.[])' events.json

Find 5 most recent "install" events"

jq '[.[] | select(.event == "install" )] | sort_by(.time) | reverse | limit(5;.[])' events.json

Formatting and conversions

Unfortunately, our dates are millisecond epoch timetamps which means they aren't easy to read in the output.

We can fix that!

jq provides todate - but it takes seconds, not milliseonds. So we need to convert:

jq '[.[]] | sort_by(.time) | reverse | limit(1;.[]) | .time / 1000 | todate' events.json