zq
TL;DR
zq is a command-line tool that uses the Zed language for pipeline-style search and analytics. zq can query a variety of data formats in files, over HTTP, or in S3 storage. It is particularly fast when operating on data in the Zed-native ZNG format.
The zq design philosophy blends the query/search-tool approach of jq, awk, and grep with the command-line, embedded database approach of sqlite and duckdb.
Usage
zq [ options ] [ query ] input [ input ... ]
zq [ options ] query
zq is a command-line tool for processing data in diverse input
formats, providing search, analytics, and extensive transformations
using the Zed language. A query typically applies Boolean logic
or keyword search to filter the input, then transforms or analyzes
the filtered stream. Output is written to one or more files or to
standard output.
Each input argument must be a file path, an HTTP or HTTPS URL,
an S3 URL, or standard input specified with -.
For built-in command help and a listing of all available options,
simply run zq with no arguments.
zq supports a number of formats but ZNG tends to be the most space-efficient and most performant. ZNG has efficiency similar to Avro and Protocol Buffers but its comprehensive Zed type system obviates
the need for schema specification or registries.
Also, the ZSON format is human-readable and entirely one-to-one with ZNG
so there is no need to represent non-readable formats like Avro or Protocol Buffers
in a clunky JSON encapsulated form.
zq typically operates on ZNG-encoded data and when you want to inspect
human-readable bits of output, you merely format it as ZSON, which is the
default format when output is directed to the terminal. ZNG is the default
when redirecting to a non-terminal output like a file or pipe.
When run with input arguments, each input's format is automatically inferred and each input is scanned in the order appearing on the command line forming the input stream.
A query expressed in the Zed language may be optionally specified and applied to the input stream.
If no query is specified, the inputs are scanned without modification and output in the desired format as described below. This latter approach provides a convenient means to convert files from one format to another.
To determine whether the first argument is a query or an input,
zq checks the local file system for the existence of a file by that name
or whether the name is a URL.
If no such file or URL exists, it attempts to parse the text as a Zed program.
If both checks fail, then an error is reported and zq exits.
This heuristic is convenient but can result in a rare surprise when a simple
Zed query (like a keyword search) happens to correspond with a file of the
same name in the local directory.
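The ordering of these checks can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not zq's actual implementation: the function names, the return labels, and the set of URL schemes are assumptions made for the example, and the final "parse as a Zed program" step is left as a label because no Zed parser is available here.

```python
import os
from urllib.parse import urlparse


def looks_like_url(arg: str) -> bool:
    """Return True for the URL kinds this document names (HTTP, HTTPS, S3)."""
    # Assumption: scheme matching is a stand-in for zq's real URL check.
    return urlparse(arg).scheme in {"http", "https", "s3"}


def classify_first_arg(arg: str, file_exists=os.path.exists) -> str:
    """Apply the described order: file or URL first, otherwise a query candidate."""
    if file_exists(arg) or looks_like_url(arg):
        return "input"
    # zq would now try to parse the text as a Zed program and
    # report an error if that fails too.
    return "query-candidate"


# A keyword search named "error" becomes an input as soon as a file
# called "error" exists in the current directory: the rare surprise.
print(classify_first_arg("error", file_exists=lambda p: p == "error"))
print(classify_first_arg("count()", file_exists=lambda p: False))
```

The `file_exists` parameter is injected only so the surprise case can be demonstrated without touching the file system.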
When zq is run with a query and no input arguments, then the query must
begin with
- a from, file, or get operator, or
- an explicit or implied yield operator.
In the case of a yield with no inputs, the query is run with
a single input value of null. This provides a convenient means to run in a
"calculator mode" where input is produced by the yield and can be operated upon
by the Zed query, e.g.,
zq -z '1+1'
emits
2
Note here that the query 1+1 implies yield 1+1.
Input Formats
zq currently supports the following input formats:
Option | Auto | Specification |
---|---|---|
arrows | yes | Arrow IPC Stream Format |
json | yes | JSON RFC 8259 |
csv | yes | CSV RFC 4180 |
line | no | One string value per input line |
parquet | yes | Apache Parquet |
tsv | yes | TSV - Tab-Separated Values |
vng | yes | VNG - Binary Columnar Format |
zson | yes | ZSON - Human-readable Format |
zng | yes | ZNG - Binary Row Format |
zjson | yes | ZJSON - Zed over JSON |
zeek | yes | Zeek Logs |
The input format is typically detected automatically and the formats for which
Auto is yes in the table above support auto-detection.
Formats without auto-detection require the -i option.
Hard-wired Input Format
The input format is specified with the -i flag.
When -i is specified, all of the inputs on the command-line must be
in the indicated format.
Auto-detection
When using auto-detection, each input's format is independently determined so it is possible to easily blend different input formats into a unified output format.
For example, suppose this content is in a file sample.csv:
a,b
1,foo
2,bar
and this content is in sample.json:
{"a":3,"b":"baz"}
then the command
zq -z sample.csv sample.json
would produce this output in the default ZSON format
{a:1.,b:"foo"}
{a:2.,b:"bar"}
{a:3,b:"baz"}
ZSON-JSON Auto-detection
Since ZSON is a superset of JSON, zq must be careful in whether it
interprets input as ZSON or JSON. While you can always clarify your intent
with -i zson or -i json, zq attempts to "just do the right thing"
when you run it with JSON vs. ZSON.
While zq can parse any JSON using its built-in ZSON parser, this is typically
not desirable because (1) the ZSON parser is not particularly performant and
(2) all JSON numbers are floating point but the ZSON parser will parse
any JSON number that appears without a decimal point as an integer type.
The reason zq is not particularly performant for ZSON is that the ZNG or VNG formats are semantically equivalent to ZSON but much more efficient and the design intent is that these efficient binary formats should be used in use cases where performance matters. ZSON is typically used only when data needs to be human-readable in interactive settings or in automated tests.
To this end, zq uses a heuristic to select between ZSON and JSON when the
-i option is not specified. Specifically, JSON is selected when the first values
of the input are parsable as valid JSON and include a JSON object either
as an outer object or as a value nested somewhere within a JSON array.
This heuristic almost always works in practice because ZSON records typically omit quotes around field names.
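A rough Python model of this selection rule may help make it concrete. It is a sketch under simplifying assumptions: it inspects a single first value passed in as text, it uses Python's json module in place of zq's own tokenizer, and the function names are hypothetical.

```python
import json


def contains_object(value) -> bool:
    """True if value is a JSON object, or an array with an object nested inside it."""
    if isinstance(value, dict):
        return True
    if isinstance(value, list):
        return any(contains_object(item) for item in value)
    return False


def guess_format(first_value_text: str) -> str:
    """Pick "json" only for valid JSON that includes an object; otherwise "zson"."""
    try:
        value = json.loads(first_value_text)
    except json.JSONDecodeError:
        return "zson"  # not valid JSON, e.g. unquoted field names
    return "json" if contains_object(value) else "zson"


print(guess_format('{"a":1}'))        # quoted field name, valid JSON object
print(guess_format('{a:1}'))          # unquoted field name, not valid JSON
print(guess_format('[1,2,3]'))        # valid JSON but no object anywhere
print(guess_format('[1,[{"a":1}]]'))  # object nested within an array
```

Under this model, a bare array of numbers is valid JSON yet is still treated as ZSON, so its integers would keep their integer type.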
Output Formats
The output format defaults to either ZSON or ZNG and may be specified
with the -f option. The supported output formats include all of
the input formats along with text and table formats, which are useful
for displaying data. (They do not capture all the information required
to reconstruct the original data so they are not supported input formats.)
Since ZSON is a common format choice, the -z flag is a shortcut for
-f zson.
Also, -Z is a shortcut for -f zson with -pretty 4 as
described below.
And since JSON is another common format choice, the -j flag is a shortcut for
-f json.
Output Format Selection
When the format is not specified with -f, it defaults to ZSON if the output
is a terminal and to ZNG otherwise.
While this can cause an occasional surprise (e.g., forgetting -f or -z
in a scripted test that works fine on the command line but fails in CI),
we felt that the design of having a uniform default had worse consequences:
- If the default format were ZSON, it would be very easy to create pipelines
and deploy to production systems that were accidentally using ZSON instead of
the much more efficient ZNG format because the -f zng had been mistakenly omitted from some command. The beauty of Zed is that all of this "just works" but it would otherwise perform poorly.
- If the default format were ZNG, then users would be endlessly annoyed by
binary output to their terminal when forgetting to type -f zson.
In practice, we have found that the output defaults "just do the right thing" almost all of the time.
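The selection rule is small enough to state as a function. The Python below is a minimal sketch of the rule as described in this section, not zq's code; the function and parameter names are made up for the example.

```python
import sys


def select_output_format(explicit_format=None, is_terminal=None) -> str:
    """Return the -f value if given; otherwise ZSON for a terminal, ZNG for anything else."""
    if explicit_format is not None:
        return explicit_format
    if is_terminal is None:
        # Ask the operating system whether standard output is a terminal.
        is_terminal = sys.stdout.isatty()
    return "zson" if is_terminal else "zng"


print(select_output_format(is_terminal=True))          # interactive shell
print(select_output_format(is_terminal=False))         # pipe, file, or CI job
print(select_output_format("json", is_terminal=True))  # explicit format always wins
```

The second call models the CI surprise mentioned above: the same command that printed ZSON interactively emits ZNG once its output is no longer a terminal.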
ZSON Pretty Printing
ZSON text may be "pretty printed" with the -pretty option, which takes
the number of spaces to use for indentation. As this is a common option,
the -Z option is a shortcut for -f zson -pretty 4.
For example,
echo '{a:{b:1,c:[1,2]},d:"foo"}' | zq -Z -
produces
{
    a: {
        b: 1,
        c: [
            1,
            2
        ]
    },
    d: "foo"
}
and
echo '{a:{b:1,c:[1,2]},d:"foo"}' | zq -f zson -pretty 2 -
produces
{
  a: {
    b: 1,
    c: [
      1,
      2
    ]
  },
  d: "foo"
}
When pretty printing, colorization is enabled by default when writing to a terminal,
and can be disabled with -color false.
Pipeline-friendly ZNG
Though it's a compressed binary format, ZNG data is self-describing and stream-oriented and thus is pipeline friendly.
Since data is self-describing you can simply take ZNG output of one command and pipe it to the input of another. It doesn't matter if the value sequence is scalars, complex types, or records. There is no need to declare or register schemas or "protos" with the downstream entities.
In particular, ZNG data can simply be concatenated together, e.g.,
zq -f zng 'yield 1,[1,2,3]' > a.zng
zq -f zng 'yield {s:"hello"},{s:"world"}' > b.zng
cat a.zng b.zng | zq -z -
produces
1
[1,2,3]
{s:"hello"}
{s:"world"}
And while this ZSON output is human readable, the ZNG files are binary, e.g.,
zq -f zng 'yield 1,[1,2,3]' > a.zng
hexdump -C a.zng
produces
00000000 02 00 01 09 1b 00 09 02 02 1e 07 02 02 02 04 02 |................|
00000010 06 ff |..|
00000012
Schema-rigid Outputs
Certain data formats like Arrow and Parquet are "schema rigid" in the sense that they require a schema to be defined before values can be written into the file and all the values in the file must conform to this schema.
Zed, however, has a fine-grained type system instead of schemas, and a sequence of data values is completely self-describing and may be heterogeneous in nature. This creates a challenge when converting the type-flexible Zed formats to a schema-rigid format like Arrow or Parquet.
For example, this seemingly simple conversion:
echo '{x:1}{s:"hello"}' | zq -o out.parquet -f parquet -
causes this error
parquetio: encountered multiple types (consider 'fuse'): {x:int64} and {s:string}
Fusing Schemas
As suggested by the error above, the Zed fuse operator can merge different record
types into a blended type, e.g., here we create the file and read it back:
echo '{x:1}{s:"hello"}' | zq -o out.parquet -f parquet fuse -
zq -z out.parquet
but the data was necessarily changed (by inserting nulls):
{x:1,s:null(string)}
{x:null(int64),s:"hello"}
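To see why fusing necessarily changes the data, here is a small Python analogy using flat dictionaries. It is only a sketch: the real fuse operator works on Zed types (including nested records and typed nulls), whereas this hypothetical helper just takes the union of field names and fills the gaps with None.

```python
def fuse(records):
    """Give every record the union of all field names, using None for missing fields."""
    fields = []
    for rec in records:
        for name in rec:
            if name not in fields:
                fields.append(name)  # keep first-seen field order
    return [{name: rec.get(name) for name in fields} for rec in records]


fused = fuse([{"x": 1}, {"s": "hello"}])
for rec in fused:
    print(rec)
```

Every output record now has the same shape, which is what a schema-rigid format requires, at the cost of inserting nulls that were not in the input.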
Splitting Schemas
Another common approach to dealing with the schema-rigid limitation of Arrow and Parquet is to create a separate file for each schema.
zq can do this too with the -split option, which specifies a path
to a directory for the output files. If the path is ., then files
are written to the current directory.
The files are named using the -o option as a prefix and the suffix is
-<n>.<ext> where the <ext> is determined from the output format and
where <n> is a unique integer for each distinct output file.
For example, the example above would produce two output files, which can then be read separately to reproduce the original data, e.g.,
echo '{x:1}{s:"hello"}' | zq -o out -split . -f parquet -
zq -z out-*.parquet
produces the original data
{x:1}
{s:"hello"}
While the -split option is most useful for schema-rigid formats, it can
be used with any output format.
Query Debugging
If you are ever stumped about how the zq compiler is parsing your query,
you can always run zq -C to compile and display your query in canonical form
without running it.
This can be especially handy when you are learning the language and
its shortcuts.
For example, this query
zq -C 'has(foo)'
is an implied where operator, which matches values
that have a field foo, i.e.,
where has(foo)
while this query
zq -C 'a:=x+1'
is an implied put operator, which creates a new field a
with the value x+1, i.e.,
put a:=x+1
Error Handling
Fatal errors like "file not found" or "file system full" are reported
as soon as they happen and cause the zq process to exit.
On the other hand, runtime errors resulting from the Zed query itself do not halt execution. Instead, these error conditions produce first-class Zed errors in the data output stream interleaved with any valid results. Such errors are easily queried with the is_error function.
This approach provides a robust technique for debugging complex query pipelines, where errors can be wrapped in one another providing stack-trace-like debugging output alongside the output data. This approach has emerged as a more powerful alternative to the traditional technique of looking through logs for errors or trying to debug a halted program with a vague error message.
For example, this query
echo '1 2 0 3' | zq '10.0/this' -
produces
10.
5.
error("divide by zero")
3.3333333333333335
and
echo '1 2 0 3' | zq '10.0/this' - | zq 'is_error(this)' -
produces just
error("divide by zero")
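The same idea, errors as ordinary values in the output stream rather than exceptions that halt the run, can be mimicked in Python. This is an analogy only: ErrorValue, divide_all, and this is_error are hypothetical stand-ins written for the example, not part of Zed or zq.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorValue:
    """A stand-in for a first-class error that travels with the data."""
    message: str


def divide_all(numerator, values):
    """Divide numerator by each value, emitting an ErrorValue instead of stopping."""
    out = []
    for v in values:
        try:
            out.append(numerator / v)
        except ZeroDivisionError:
            out.append(ErrorValue("divide by zero"))
    return out


def is_error(value) -> bool:
    """Filter predicate, analogous in spirit to the is_error function above."""
    return isinstance(value, ErrorValue)


results = divide_all(10.0, [1, 2, 0, 3])
errors = [r for r in results if is_error(r)]
print(results)
print(errors)
```

The failing input does not prevent the value after it from being processed, and the error can be selected afterwards like any other value.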
Examples
As you may have noticed, many examples of the Zed language are illustrated using this pattern
echo <values> | zq <query> -
which is used throughout the language documentation and operator reference.
The language documentation and tutorials directory
have many examples, but here are a few more simple zq use cases.
Hello, world
echo '"hello, world"' | zq -z 'yield this' -
produces this ZSON output
"hello, world"
Some values of available data types
echo '1 1.5 [1,"foo"] |["apple","banana"]|' | zq -z 'yield this' -
produces
1
1.5
[1,"foo"]
|["apple","banana"]|
The types of various data
echo '1 1.5 [1,"foo"] |["apple","banana"]|' | zq -z 'yield typeof(this)' -
produces
<int64>
<float64>
<[(int64,string)]>
<|[string]|>
A simple aggregation
echo '{key:"foo",val:1}{key:"bar",val:2}{key:"foo",val:3}' | zq -z 'sum(val) by key | sort key' -
produces
{key:"bar",sum:2}
{key:"foo",sum:4}
Convert CSV to Zed and cast a to an integer from default float
printf "a,b\n1,foo\n2,bar\n" | zq 'a:=int64(a)' -
produces
{a:1,b:"foo"}
{a:2,b:"bar"}
Convert JSON to Zed and cast to an integer from default float
echo '{"a":1,"b":"foo"}{"a":2,"b":"bar"}' | zq 'a:=int64(a)' -
produces
{a:1,b:"foo"}
{a:2,b:"bar"}
Make a schema-rigid Parquet file using fuse and turn it back into Zed
echo '{a:1}{a:2}{b:3}' | zq -f parquet -o tmp.parquet fuse -
zq -z tmp.parquet
produces
{a:1,b:null(int64)}
{a:2,b:null(int64)}
{a:null(int64),b:3}
Performance
Your mileage may vary, but many new users of zq are surprised by its speed
compared to tools like jq, grep, awk, or sqlite, especially when running
zq over files in the ZNG format.
Fast Pattern Matching
One important technique that helps zq run fast is to take advantage of queries
that involve fine-grained searches.
When a query begins with a logical expression containing either a search or a predicate match with a constant value, and the input data format is ZNG, the runtime optimizes the query by performing an efficient, byte-oriented "pre-search" for the values required by the predicate. This pre-search scans the bytes that comprise a large buffer of values and looks for the serialized form of those values. If they are not present, the entire buffer is discarded, since no individual value in that buffer could match.
For example, if the Zed query is
"http error" and ipsrc==10.0.0.1 | count()
then the pre-search looks for the string "http error" and the Zed encoding of the IP address 10.0.0.1, and unless both values are present, the buffer is discarded.
Moreover, ZNG data is compressed and arranged into frames that can be decompressed and processed in parallel. This allows the decompression and pre-search to run in parallel very efficiently across a large number of threads. When searching for sparse results, many frames are discarded without their uncompressed bytes having to be processed any further.
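The pre-search idea can be illustrated with plain byte-substring checks in Python. This is a sketch, not the zq runtime: the buffers are hand-made byte strings rather than ZNG frames, the IP address is assumed to serialize as its four raw bytes purely for illustration, and the parallel decompression described above is not modeled.

```python
def presearch(buffers, needles):
    """Yield only the buffers that contain every needle as a byte substring."""
    for buf in buffers:
        if all(needle in buf for needle in needles):
            yield buf  # still needs full per-value evaluation afterwards


# Assumption for this example: 10.0.0.1 is encoded as the bytes 0a 00 00 01.
ip_bytes = bytes([10, 0, 0, 1])
needles = [b"http error", ip_bytes]

buffers = [
    b"GET /index http error " + ip_bytes + b" more values",
    b"http error but a different address " + bytes([10, 0, 0, 2]),
    b"no matching text at all",
]

kept = list(presearch(buffers, needles))
print(len(kept), "of", len(buffers), "buffers survive the pre-search")
```

A surviving buffer is only a candidate: the two needles might appear in different values, so the real predicate must still be applied to each value. The gain comes from the buffers that can be skipped without deserializing anything.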
Efficient JSON Processing
While processing data in the ZNG format is far more efficient than JSON, there is substantial JSON data in the world and it is important for JSON input to perform well.
This proved a challenge as zq is written in Go and Go's JSON package
is not particularly performant. To this end, zq has its own lean and simple
JSON tokenizer, which performs quite well, and is integrated tightly
with Zed's internal data representation.
Moreover, like jq, zq's JSON parser does not require objects to be newline delimited and can
incrementally parse the input to minimize memory overhead and improve
processor cache performance.
The net effect is a JSON parser that is typically a bit faster than the
native C implementation in jq.
Performance Comparisons
To provide a rough sense of the performance tradeoffs between zq and
other tooling, this section provides results of a few simple speed tests.
Test Data
These tests are easy to reproduce. The input data comes from the
Zed sample data repository,
where we used a semi-structured Zeek "conn" log from the zeek-default directory.
It is easy to convert the Zeek logs to a local ZNG file using
zq's built-in get operator:
zq -o conn.zng 'get https://raw.githubusercontent.com/brimdata/zed-sample-data/main/zeek-default/conn.log.gz'
This creates a new file conn.zng from the Zeek log file fetched from GitHub.
Note that this data is a gzip'd file in the Zeek format and zq's auto-detector
figures out both that it is gzip'd and that the uncompressed format is Zeek.
There's no need to specify flags for this.
Next, a JSON file can be converted from ZNG using:
zq -f json conn.zng > conn.json
Note here that we lose information in this conversion because the rich data types of Zed (that were translated from the Zeek format) are lost.
We'll also make a SQLite database in the file conn.db as the table named conn.
One easy way to do this is to install sqlite-utils and run
sqlite-utils insert conn.db conn conn.json --nl
(If you need a cup of coffee, a good time to get it would be when loading the JSON into SQLite.)
File Sizes
Note the resulting file sizes:
% du -h conn.json conn.db conn.zng
416M conn.json
192M conn.db
38M conn.zng
Much of the performance of ZNG derives from an efficient, parallelizable structure where frames of data are compressed (currently with LZ4 though the specification supports multiple algorithms) and the sequence of values can be processed with only partial deserialization.
That said, there are quite a few more opportunities to further improve
the performance of zq and the Zed system and we have a number of projects
forthcoming on this front.
Tests
We ran three styles of tests on a Mac quad-core 2.3GHz i7:
- count - compute the number of values present
- search - find a value in a field
- agg - sum a field grouped by another field
Each test was run for jq, zq on JSON, sqlite3, and zq on ZNG.
We used the Bash time command to measure elapsed time.
The command lines for the count test were:
jq -s length conn.json
sqlite3 conn.db 'select count(*) from conn'
zq 'count()' conn.zng
zq 'count()' conn.json
The command lines for the search test were:
jq 'select(.id.orig_h=="10.47.23.5")' conn.json
sqlite3 conn.db 'select * from conn where json_extract(id, "$.orig_h")=="10.47.23.5"'
zq 'id.orig_h==10.47.23.5' conn.zng
zq 'id.orig_h==10.47.23.5' conn.json
Here, we look for an IP address (10.47.23.5) in a specific
field id.orig_h in the semi-structured data. Note when using ZNG,
the IP is a native type whereas for jq and SQLite it is a string.
Note that sqlite must use its json_extract function since nested JSON objects
are stored as minified JSON text.
The command lines for the agg test were:
jq -n -f agg.jq conn.json
sqlite3 conn.db 'select sum(orig_bytes),json_extract(id, "$.orig_h") as orig_h from conn group by orig_h'
zq "sum(orig_bytes) by id.orig_h" conn.zng
zq "sum(orig_bytes) by id.orig_h" conn.json
where the agg.jq script is:
def adder(stream):
  reduce stream as $s ({}; .[$s.key] += $s.val);
adder(inputs | {key:.id.orig_h,val:.orig_bytes})
| to_entries[]
| {orig_h: (.key), sum: .value}
Results
The following table summarizes the results of each test as a column and
each tool as a row with the speed-up factor (relative to jq)
shown in parentheses:
Tool | count | search | agg |
---|---|---|---|
jq | 11,540ms (1X) | 10,730ms (1X) | 20,175ms (1X) |
zq-json | 7,150ms (1.6X) | 7,230ms (1.5X) | 7,390ms (2.7X) |
sqlite | 100ms (115X) | 620ms (17X) | 1,475ms (14X) |
zq-zng | 110ms (105X) | 135ms (80X) | 475ms (42X) |
To summarize, zq with ZNG is consistently fastest though sqlite
was a bit faster counting rows.
In particular, zq is substantially faster (40-100X) than jq with the efficient
ZNG format but more modestly faster (50-170%) when processing the bulky JSON input.
This is expected because parsing JSON becomes the bottleneck.
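For readers who want to re-derive the speed-up factors, the short Python script below divides the jq time by each tool's time using the millisecond figures from the table. It is a convenience sketch; the dictionary layout and function name are choices made for this example, and the computed ratios are unrounded, so they can differ slightly from the rounded values shown in parentheses.

```python
# Elapsed times in milliseconds, copied from the results table.
timings_ms = {
    "jq":      {"count": 11540, "search": 10730, "agg": 20175},
    "zq-json": {"count": 7150,  "search": 7230,  "agg": 7390},
    "sqlite":  {"count": 100,   "search": 620,   "agg": 1475},
    "zq-zng":  {"count": 110,   "search": 135,   "agg": 475},
}


def speedups(timings, baseline="jq"):
    """Return baseline_time / tool_time for every tool and test."""
    base = timings[baseline]
    return {
        tool: {test: base[test] / ms for test, ms in tests.items()}
        for tool, tests in timings.items()
    }


factors = speedups(timings_ms)
for tool, tests in factors.items():
    row = ", ".join(f"{test}: {factor:.1f}X" for test, factor in tests.items())
    print(f"{tool:8s} {row}")
```

A factor of 1.5X corresponds to "50% faster" in the prose above, since the percentage is the factor minus one.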
While SQLite is much faster than jq, it is not as fast as zq. The primary
reason for this is that SQLite stores its semi-structured columns as minified JSON text,
so it must scan and parse the JSON when executing the where clause above
as well as the aggregated fields.
Also, note that the inferior performance of sqlite is in areas where databases
perform extraordinarily well if you do the work to
(1) transform semi-structured columns to relational columns by flattening
nested JSON objects (which are not indexable by sqlite) and
(2) configure database indexes.
In fact, if you implement these changes, sqlite performs better than zq
on these tests.
However, the benefit of Zed is that no flattening is required. And unlike sqlite,
zq is not intended to be a database. That said, there is no reason why database
performance techniques cannot be applied to the Zed model and this is precisely what the
open-source Zed project intends to do.
Stay tuned!