
zq

TL;DR zq is a command-line tool that uses the Zed language for pipeline-style search and analytics. zq can query a variety of data formats in files, over HTTP, or in S3 storage. It is particularly fast when operating on data in the Zed-native ZNG format.

The zq design philosophy blends the query/search-tool approach of jq, awk, and grep with the command-line, embedded database approach of sqlite and duckdb.

1. Usage

zq [ options ] [ query ] input [ input ... ]
zq [ options ] query

zq is a command-line tool for processing data in diverse input formats, providing search, analytics, and extensive transformations using the Zed language. A query typically applies Boolean logic or keyword search to filter the input, then transforms or analyzes the filtered stream. Output is written to one or more files or to standard output.

Each input argument must be a file path, an HTTP or HTTPS URL, an S3 URL, or standard input specified with -.

For built-in command help and a listing of all available options, simply run zq with no arguments.

zq supports a number of formats but ZNG tends to be the most space-efficient and most performant. ZNG has efficiency similar to Avro and Protocol Buffers but its comprehensive Zed type system obviates the need for schema specification or registries. Also, the ZSON format is human-readable and entirely one-to-one with ZNG so there is no need to represent non-readable formats like Avro or Protocol Buffers in a clunky JSON encapsulated form.

zq typically operates on ZNG-encoded data and when you want to inspect human-readable bits of output, you merely format it as ZSON, which is the default format when output is directed to the terminal. ZNG is the default when redirecting to a non-terminal output like a file or pipe.

When run with input arguments, each input's format is automatically inferred (as described below) and each input is scanned in the order appearing on the command line forming the input stream.

A query expressed in the Zed language may be optionally specified and applied to the input stream.

If no query is specified, the inputs are scanned without modification and output in the desired format as described below. This latter approach provides a convenient means to convert files from one format to another.

To determine whether the first argument is a query or an input, zq checks the local file system for the existence of a file by that name or whether the name is a URL. If no such file or URL exists, it attempts to parse the text as a Zed program. If both checks fail, then an error is reported and zq exits.

This heuristic is convenient but can result in a rare surprise when a simple Zed query (like a keyword search) happens to correspond with a file of the same name in the local directory. To avoid this, you can provide the query with the -query flag, which specifies the Zed program to run and forces all arguments to be interpreted as inputs.
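The heuristic described above can be sketched in shell. This is only an illustration of the documented behavior, not zq's actual implementation: the function name classify_first_arg is hypothetical, and the sketch skips the step where zq tries to parse the text as a Zed program.

```shell
#!/bin/sh
# Sketch of the documented heuristic for the first argument.
# classify_first_arg is a hypothetical helper, not part of zq.
classify_first_arg() {
    arg=$1
    case $arg in
        # HTTP, HTTPS, and S3 URLs are treated as inputs.
        http://*|https://*|s3://*)
            echo input
            return
            ;;
    esac
    if [ -e "$arg" ]; then
        # A file by this name exists, so it is treated as an input.
        echo input
    else
        # zq would now try to parse the text as a Zed program.
        echo query
    fi
}

classify_first_arg 'count()'                     # "query" unless a file named count() exists
classify_first_arg 'https://example.com/a.json'  # "input"
```

The sketch makes the source of the surprise visible: a keyword search such as foo is classified as an input as soon as a file named foo appears in the current directory.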

When zq is run with a query and no input arguments, then the query must begin with a from, file, or get operator, or with an explicit or implied yield operator.

In the case of a yield with no inputs, the query is run with a single input value of null. This provides a convenient means to run in a "calculator mode" where input is produced by the yield and can be operated upon by the Zed query, e.g.,

zq -z '1+1'

emits

2

Note here that the query 1+1 implies yield 1+1.

2. Input Formats

zq currently supports the following input formats:

Option   Auto  Specification
json     yes   JSON RFC 8259
csv      yes   CSV RFC 4180
parquet  no    Apache Parquet
zson     yes   ZSON - Human-readable Format
zng      yes   ZNG - Binary Row Format
zst      no    ZST - Binary Columnar Format
zjson    yes   ZJSON - Zed over JSON
zeek     yes   Zeek Logs

The input format is typically detected automatically and the formats for which Auto is yes in the table above support auto detection. Formats without auto detection require the -i option.

2.1 Hard-wired Input Format

The input format is specified with the -i flag.

When -i is specified, all of the inputs on the command-line must be in the indicated format.
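As a small sketch, a wrapper script could decide when -i is needed. The helper name input_format_flag is hypothetical, and the mapping from file extension to format name is an assumption made for illustration; zq itself does not require any particular extension.

```shell
#!/bin/sh
# Hypothetical helper: print the -i option needed for a path, if any.
# Assumes file extensions match the format names in the input format table.
input_format_flag() {
    case $1 in
        *.parquet) echo "-i parquet" ;;  # Parquet is not auto-detected
        *.zst)     echo "-i zst" ;;      # ZST is not auto-detected
        *)         echo "" ;;            # other formats are auto-detected
    esac
}

input_format_flag out.parquet   # prints "-i parquet"
input_format_flag conn.zng      # prints an empty line
```

A caller could then run something like zq -z $(input_format_flag out.parquet) out.parquet, keeping in mind that once -i is given, every input on that command line must be in that format.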

2.2 Auto-detection

When using auto detection, each input's format is independently determined so it is possible to easily blend different input formats into a unified output format.

For example, suppose this content is in a file sample.csv:

a,b
1,foo
2,bar

and this content is in sample.json

{"a":3,"b":"baz"}

then the command

zq -z sample.csv sample.json

would produce this output in the default ZSON format

{a:1.,b:"foo"}
{a:2.,b:"bar"}
{a:3,b:"baz"}

2.3 ZSON-JSON Auto-detection

Since ZSON is a superset of JSON, zq must be careful about whether it interprets input as ZSON or JSON. While you can always clarify your intent with -i zson or -i json, zq attempts to "just do the right thing" when you run it with JSON vs. ZSON.

While zq can parse any JSON using its built-in ZSON parser, this is typically not desirable because (1) the ZSON parser is not particularly performant and (2) all JSON numbers are floating point, but the ZSON parser will parse any number that appears without a decimal point as an integer type.

The ZSON parser has not been tuned for speed because the ZNG and ZST formats are semantically equivalent to ZSON but much more efficient, and the design intent is that these binary formats be used wherever performance matters. ZSON is typically used only when data needs to be human-readable in interactive settings or in automated tests.

To this end, zq uses a heuristic to select between ZSON and JSON when the -i option is not specified. Specifically, JSON is selected when the first values of the input are parsable as valid JSON and include a JSON object either as an outer object or as a value nested somewhere within a JSON array.

This heuristic almost always works in practice because ZSON records typically omit quotes around field names.

3. Output Formats

The output format defaults to either ZSON or ZNG and may be specified with the -f option. The supported output formats include all of the input formats along with text and table formats, which are useful for displaying data. (They do not capture all the information required to reconstruct the original data so they are not supported input formats.)

Since ZSON is a common format choice, the -z flag is a shortcut for -f zson. Also, -Z is a shortcut for -f zson with -pretty 4 as described below.

And since JSON is another common format choice, the -j flag is a shortcut for -f json.

3.1 Output Format Selection

When the format is not specified with -f, it defaults to ZSON if the output is a terminal and to ZNG otherwise.

While this can cause an occasional surprise (e.g., forgetting -f or -z in a scripted test that works fine on the command line but fails in CI), we felt that the design of having a uniform default had worse consequences:

  • If the default format were ZSON, it would be very easy to create pipelines and deploy to production systems that were accidentally using ZSON instead of the much more efficient ZNG format because the -f zng had been mistakenly omitted from some command. The beauty of Zed is that all of this "just works" but it would otherwise perform poorly.
  • If the default format were ZNG, then users would be endlessly annoyed by binary output to their terminal when forgetting to type -f zson.

In practice, we have found that the output defaults "just do the right thing" almost all of the time.
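The terminal check behind this default can be sketched with the standard shell test for a terminal on standard output. This is an illustration of the idea only; default_format is a hypothetical helper and says nothing about how zq itself performs the check.

```shell
#!/bin/sh
# Sketch of terminal detection; default_format is a hypothetical helper.
default_format() {
    # [ -t 1 ] is true when file descriptor 1 (stdout) is a terminal.
    if [ -t 1 ]; then
        echo zson
    else
        echo zng
    fi
}

default_format          # prints "zson" when run directly in a terminal
default_format | cat    # prints "zng" because stdout is a pipe
```

Scripts and CI jobs usually run without a terminal, which is why passing -f or -z explicitly in scripted tests avoids the surprise mentioned above.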

3.2 ZSON Pretty Printing

ZSON text may be "pretty printed" with the -pretty option, which takes the number of spaces to use for indentation. As this is a common option, the -Z option is a shortcut for -f zson -pretty 4.

For example,

echo '{a:{b:1,c:[1,2]},d:"foo"}' | zq -Z -

produces

{
    a: {
        b: 1,
        c: [
            1,
            2
        ]
    },
    d: "foo"
}

and

echo '{a:{b:1,c:[1,2]},d:"foo"}' | zq -f zson -pretty 2 -

produces

{
  a: {
    b: 1,
    c: [
      1,
      2
    ]
  },
  d: "foo"
}

When pretty printing, colorization is enabled by default when writing to a terminal, and can be disabled with -color false.

3.3 Pipeline-friendly ZNG

Though it's a compressed binary format, ZNG data is self-describing and stream-oriented and thus is pipeline friendly.

Since data is self-describing you can simply take ZNG output of one command and pipe it to the input of another. It doesn't matter if the value sequence is scalars, complex types, or records. There is no need to declare or register schemas or "protos" with the downstream entities.

In particular, ZNG data can simply be concatenated together, e.g.,

zq -f zng 'yield 1,[1,2,3]' > a.zng
zq -f zng 'yield {s:"hello"},{s:"world"}' > b.zng
cat a.zng b.zng | zq -z -

produces

1
[1,2,3]
{s:"hello"}
{s:"world"}

And while this ZSON output is human readable, the ZNG files are binary, e.g.,

zq -f zng 'yield 1,[1,2,3]' > a.zng
hexdump -C a.zng

produces

00000000  02 00 01 09 1b 00 09 02  02 1e 07 02 02 02 04 02  |................|
00000010  06 ff                                             |..|
00000012

3.4 Schema-rigid Outputs

Certain data formats like Parquet are "schema rigid" in the sense that they require a schema to be defined before values can be written into the file and all the values in the file must conform to this schema.

Zed, however, has a fine-grained type system instead of schemas and a sequence of data values are completely self-describing and may be heterogeneous in nature. This creates a challenge converting the type-flexible Zed formats to a schema-rigid format like Parquet.

For example, this seemingly simple conversion:

echo '{x:1}{s:"hello"}' | zq -o out.parquet -f parquet -

causes this error

Parquet output requires uniform records but multiple types encountered (consider 'fuse')

3.4.1 Fusing Schemas

As suggested by the error above, the Zed fuse operator can merge different record types into a blended type, e.g., here we create the file and read it back:

echo '{x:1}{s:"hello"}' | zq -o out.parquet -f parquet fuse -
zq -z -i parquet out.parquet

but the data was necessarily changed (by inserting nulls):

{x:1,s:null(string)}
{x:null(int64),s:"hello"}

3.4.2 Splitting Schemas

Another common approach to dealing with the schema-rigid limitation of Parquet is to create a separate file for each schema.

zq can do this too with the -split option, which specifies a path to a directory for the output files. If the path is ., then files are written to the current directory.

The files are named using the -o option as a prefix and the suffix is -<n>.<ext> where the <ext> is determined from the output format and where <n> is a unique integer for each distinct output file.

For example, the example above would produce two output files, which can then be read separately to reproduce the original data, e.g.,

echo '{x:1}{s:"hello"}' | zq -o out -split . -f parquet -
zq -z -i parquet out-*.parquet

produces the original data

{x:1}
{s:"hello"}

While the -split option is most useful for schema-rigid formats, it can be used with any output format.

4. Query Debugging

If you are ever stumped about how the zq compiler is parsing your query, you can always run zq -C to compile and display your query in canonical form without running it. This can be especially handy when you are learning the language and its shortcuts.

For example, this query

zq -C 'has(foo)'

is an implied where operator, which matches values that have a field foo, i.e.,

where has(foo)

while this query

zq -C 'lower(foo)'

is an implied yield operator, which produces the lower case version of the presumed string in field foo, i.e.,

yield lower(foo)

5. Error Handling

Fatal errors like "file not found" or "file system full" are reported as soon as they happen and cause the zq process to exit.

On the other hand, runtime errors resulting from the Zed query itself do not halt execution. Instead, these error conditions produce first-class Zed errors in the data output stream interleaved with any valid results. Such errors are easily queried with the is_error function.

This approach provides a robust technique for debugging complex query pipelines, where errors can be wrapped in one another providing stack-trace-like debugging output alongside the output data. This approach has emerged as a more powerful alternative to the traditional technique of looking through logs for errors or trying to debug a halted program with a vague error message.

For example, this query

echo '1 2 0 3' | zq '10.0/this' -

produces

10.
5.
error("divide by zero")
3.3333333333333335

and

echo '1 2 0 3' | zq '10.0/this' - | zq 'is_error(this)' -

produces just

error("divide by zero")
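Because errors are ordinary values in the output stream, a script can route them to a separate file from the valid results. The sketch below assumes zq is on the PATH and does nothing otherwise. The helper name split_errors and the output file names are hypothetical, and the use of not inside a where expression is assumed to be accepted by your version of Zed.

```shell
#!/bin/sh
# Separate error values from valid results using is_error.
# split_errors is a hypothetical helper; it does nothing if zq is missing.
split_errors() {
    outdir=$1
    command -v zq >/dev/null 2>&1 || { echo "zq not found" >&2; return 0; }
    # Keep the full output stream, errors included, as ZSON text.
    echo '1 2 0 3' | zq -z '10.0/this' - > "$outdir/all.zson"
    # Errors only.
    zq -z 'is_error(this)' "$outdir/all.zson" > "$outdir/errors.zson"
    # Everything except errors.
    zq -z 'where not is_error(this)' "$outdir/all.zson" > "$outdir/results.zson"
}

workdir=$(mktemp -d)
split_errors "$workdir"
```

Keeping the full stream in one file and filtering it twice preserves the original interleaving for later inspection.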

6. Examples

As you may have noticed, many examples of the Zed language are illustrated using this pattern

echo <values> | zq <query> -

which is used throughout the language documentation and operator reference.

The language documentation and tutorials directory have many examples, but here are a few more simple zq use cases.

Hello, world

echo '"hello, world"' | zq -z 'yield this' -

produces this ZSON output

"hello, world"

Some values of available data types

echo '1 1.5 [1,"foo"] |["apple","banana"]|' | zq -z 'yield this' -

produces

1
1.5
[1,"foo"]
|["apple","banana"]|

The types of various data

echo '1 1.5 [1,"foo"] |["apple","banana"]|' | zq -z 'yield typeof(this)' -

produces

<int64>
<float64>
<[(int64,string)]>
<|[string]|>

A simple aggregation

echo '{key:"foo",val:1}{key:"bar",val:2}{key:"foo",val:3}' | zq -z 'sum(val) by key | sort key' -

produces

{key:"bar",sum:2}
{key:"foo",sum:4}

Convert CSV to Zed and cast a to an integer from default float

printf "a,b\n1,foo\n2,bar\n" | zq 'a:=int64(a)' -

produces

{a:1,b:"foo"}
{a:2,b:"bar"}

Convert JSON to Zed and cast to an integer from default float

echo '{"a":1,"b":"foo"}{"a":2,"b":"bar"}' | zq 'a:=int64(a)' -

produces

{a:1,b:"foo"}
{a:2,b:"bar"}

Make a schema-rigid Parquet file using fuse and turn it back into Zed

echo '{a:1}{a:2}{b:3}' | zq -f parquet -o tmp.parquet fuse -
zq -z -i parquet tmp.parquet

produces

{a:1,b:null(int64)}
{a:2,b:null(int64)}
{a:null(int64),b:3}

7. Performance

Your mileage may vary, but many new users of zq are surprised by its speed compared to tools like jq, grep, awk, or sqlite especially when running zq over files in the ZNG format.

7.1 Fast Pattern Matching

One important technique that helps zq run fast is to take advantage of queries that involve fine-grained searches.

When a query begins with a logical expression containing either a search or a predicate match with a constant value, and the input data format is ZNG, the runtime optimizes the query by performing an efficient, byte-oriented "pre-search" for the values required by the predicate. This pre-search scans the bytes of a large buffer of values for the serialized forms of those values. If they are not present, the entire buffer is discarded, since no individual value in that buffer could match.

For example, if the Zed query is

"http error" and ipsrc==10.0.0.1 | count()

then the pre-search would look for the string "http error" and the Zed encoding of the IP address 10.0.0.1, and the buffer is discarded unless both values are present.
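The pre-search idea can be sketched on plain text buffers. This is an illustration only: zq searches serialized ZNG bytes (for example, the binary encoding of an IP address), whereas this sketch uses text and grep -F. The helper buffer_may_match and the sample buffers are hypothetical.

```shell
#!/bin/sh
# Sketch of the pre-search idea on plain text buffers.
# buffer_may_match is a hypothetical helper, not part of zq.
buffer_may_match() {
    buf=$1
    shift
    for needle in "$@"; do
        # A missing required value rules out every record in the buffer.
        printf '%s\n' "$buf" | grep -qF -- "$needle" || return 1
    done
    return 0
}

buf1='ipsrc=10.0.0.1 msg="http error"
ipsrc=10.0.0.2 msg="ok"'
buf2='ipsrc=10.0.0.3 msg="http error"'

for buf in "$buf1" "$buf2"; do
    if buffer_may_match "$buf" "http error" "10.0.0.1"; then
        echo "scan"       # the full predicate must still be evaluated per value
    else
        echo "discard"    # no value in this buffer can match
    fi
done
```

A buffer that passes the pre-search may still contain no matching value, since the required values could appear in different records, so the check only decides which buffers are safe to skip.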

Moreover, ZNG data is compressed and arranged into frames that can be decompressed and processed in parallel. This allows the decompression and pre-search to run in parallel very efficiently across a large number of threads. When searching for sparse results, many frames are discarded without their uncompressed bytes having to be processed any further.

While this pre-search technique results in very fast brute-force pattern matching, search indexes can also be created when Zed data is managed by a Zed lake thereby avoiding scans of data altogether as the index pinpoints the locations of specific values in the lake.

7.2 Efficient JSON Processing

While processing data in the ZNG format is far more efficient than JSON, there is substantial JSON data in the world and it is important for JSON input to perform well.

This proved a challenge as zq is written in Go and Go's JSON package is not particularly performant. To this end, zq has its own lean and simple JSON tokenizer, which performs quite well, and is integrated tightly with Zed's internal data representation. Moreover, like jq, zq's JSON parser does not require objects to be newline delimited and can incrementally parse the input to minimize memory overhead and improve processor cache performance.

The net effect is a JSON parser that is typically a bit faster than the native C implementation in jq.

7.3 Performance Comparisons

To provide a rough sense of the performance tradeoffs between zq and other tooling, this section provides results of a few simple speed tests.

7.3.1 Test Data

These tests are easy to reproduce. The input data comes from the Zed sample data repository, where we used a semi-structured Zeek "conn" log from the zeek-default directory.

It is easy to convert the Zeek logs to a local ZNG file using zq's built-in get operator:

zq -o conn.zng 'get https://raw.githubusercontent.com/brimdata/zed-sample-data/main/zeek-default/conn.log.gz'

This creates a new file conn.zng from the Zeek log file fetched from GitHub.

Note that this data is a gzip'd file in the Zeek format and zq's auto-detector figures out both that it is gzip'd and that the uncompressed format is Zeek. There's no need to specify flags for this.

Next, a JSON file can be converted from ZNG using:

zq -f json conn.zng > conn.json

Note here that we lose information in this conversion because the rich data types of Zed (that were translated from the Zeek format) are lost.

We'll also make a SQLite database in the file conn.db as the table named conn. One easy way to do this is to install sqlite-utils and run

sqlite-utils insert conn.db conn conn.json --nl

(If you need a cup of coffee, a good time to get it would be when loading the JSON into SQLite.)

7.3.2 File Sizes

Note the resulting file sizes:

% du -h conn.json conn.db conn.zng
416M conn.json
192M conn.db
38M conn.zng

Much of the performance of ZNG derives from an efficient, parallelizable structure where frames of data are compressed (currently with LZ4 though the specification supports multiple algorithms) and the sequence of values can be processed with only partial deserialization.

That said, there are quite a few more opportunities to further improve the performance of zq and the Zed system and we have a number of projects forthcoming on this front.

7.3.3 Tests

We ran three styles of tests on a Mac quad-core 2.3GHz i7:

  • count - compute the number of values present
  • search - find a value in a field
  • agg - sum a field grouped by another field

Each test was run for jq, zq on JSON, sqlite3, and zq on ZNG.

We used the Bash time command to measure elapsed time.

The command lines for the count test were:

jq -s length conn.json
sqlite3 conn.db 'select count(*) from conn'
zq 'count()' conn.zng
zq 'count()' conn.json

The command lines for the search test were:

jq 'select(.id.orig_h=="10.47.23.5")' conn.json
sqlite3 conn.db 'select * from conn where json_extract(id, "$.orig_h")=="10.47.23.5"'
zq 'id.orig_h==10.47.23.5' conn.zng
zq 'id.orig_h==10.47.23.5' conn.json

Here, we look for an IP address (10.47.23.5) in a specific field id.orig_h in the semi-structured data. Note when using ZNG, the IP is a native type whereas for jq and SQLite it is a string. Note that sqlite must use its json_extract function since nested JSON objects are stored as minified JSON text.

The command lines for the agg test were:

jq -n -f agg.jq conn.json
sqlite3 conn.db 'select sum(orig_bytes),json_extract(id, "$.orig_h") as orig_h from conn group by orig_h'
zq "sum(orig_bytes) by id.orig_h" conn.zng
zq "sum(orig_bytes) by id.orig_h" conn.json

where the agg.jq script is:

def adder(stream):
  reduce stream as $s ({}; .[$s.key] += $s.val);
adder(inputs | {key:.id.orig_h,val:.orig_bytes})
| to_entries[]
| {orig_h: (.key), sum: .value}

7.3.4 Results

The following table summarizes the results of each test as a column and each tool as a row with the speed-up factor (relative to jq) shown in parentheses:

          count             search            agg
jq        11,540ms (1X)     10,730ms (1X)     20,175ms (1X)
zq-json   7,150ms (1.6X)    7,230ms (1.5X)    7,390ms (2.7X)
sqlite    100ms (115X)      620ms (17X)       1,475ms (14X)
zq-zng    110ms (105X)      135ms (80X)       475ms (42X)

To summarize, zq with ZNG is fastest on the search and agg tests, though sqlite was a bit faster at counting rows.

In particular, zq is substantially faster (40-100X) than jq with the efficient ZNG format but more modestly faster (50-170%) when processing the bulky JSON input. This is expected because parsing JSON becomes the bottleneck.
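The speed-up factors can be recomputed from the elapsed times in the results table. The helper name speedup is hypothetical; it simply divides the jq baseline time by the time for the other tool.

```shell
#!/bin/sh
# Recompute speed-up factors from the elapsed times in the results table.
# speedup is a hypothetical helper: baseline_ms / tool_ms.
speedup() {
    LC_ALL=C awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", base / t }'
}

speedup 11540 110     # count:  jq vs. zq on ZNG
speedup 10730 135     # search: jq vs. zq on ZNG
speedup 20175 475     # agg:    jq vs. zq on ZNG
speedup 11540 7150    # count:  jq vs. zq on JSON
```

This prints 104.9, 79.5, 42.5, and 1.6, which the table rounds to 105X, 80X, 42X, and 1.6X. A factor of f corresponds to (f - 1) x 100 percent faster, so the 1.5X to 2.7X factors for JSON input give the 50-170% range.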

While SQLite is much faster than jq, it is not as fast as zq. The primary reason for this is that SQLite stores its semi-structured columns as minified JSON text, so it must scan and parse the JSON when executing the where clause above as well as the aggregated fields.

Also, note that the inferior performance of sqlite is in areas where databases perform extraordinarily well if you do the work to (1) transform semi-structured columns to relational columns by flattening nested JSON objects (which are not indexable by sqlite) and (2) configure database indexes.

In fact, if you implement these changes, sqlite performs better than zq on these tests.

However, the benefit of Zed is that no flattening is required. And unlike sqlite, zq is not intended to be a database. That said, there is no reason why database performance techniques cannot be applied to the Zed model and this is precisely what the open-source Zed project intends to do. As a first step, with a Zed lake, you can build type-flexible search indexes to scale searches across very large stores of Zed data.

Stay tuned!