super Tutorial
This tour provides new users of super
an overview of the tool and
the SuperPipe language
by walking through a number of examples on the command-line.
This should get you started without having to read through all the gory details
of the SuperPipe language or super
command-line usage.
We'll start with some simple one-liners on the command line where we feed
some data to super
with echo
and specify -
for super
input to indicate
that standard input should be used, e.g.,
echo '"hello, world"' | super -
Then, toward the end of the tour, we'll experiment with some real-world GitHub data pulled from the GitHub API.
If you want to follow along on the command line,
just make sure the super
command is installed
as well as jq
.
But JSON
While super
is based on a new type of data model,
Zed just so happens to be a superset of JSON.
So if all you ever use zq
for is manipulating JSON data,
it can serve you well as a handy, go-to tool. In this way, zq
is kind of
like jq
. As you probably know, jq
is a popular command-line tool for taking a sequence of JSON values as input,
doing interesting things on that input, and emitting results, of course, as JSON.
jq
is awesome and powerful, but its syntax and computational model can
sometimes be daunting and difficult. We tried to make zq
really easy and intuitive,
and it is usually faster, sometimes much faster,
than jq
.
To this end, if you want full JSON compatibility without having to delve into the
details of Zed, just use the -j
option with zq
and this will tell zq
to
expect JSON values as input and produce JSON values as output, much like jq
.
If your downstream JSON tooling expects only a single JSON value, we can use
-j
along with collect()
to aggregate
multiple input values into an array. A collect()
example is shown
later in this tutorial.
this
vs .
For example, to add 1 to some numbers with jq
, you say:
echo '1 2 3' | jq '.+1'
and you get
2
3
4
With zq
, the mysterious jq
value .
is instead called
the almost-as-mysterious value
this
and you say:
echo '1 2 3' | super -z -c 'this+1' -
which also gives
2
3
4
Note that we are using the
-z
option withzq
in all of the examples, which causeszq
to format the output as ZSON. When runningzq
on the terminal, you do not need-z
as it is the default, but we include it here for clarity and because all of these examples are run through automated testing, which is not attached to a terminal.
Search vs Transformation
Unlike jq
, which leads with transformation, zq
leads with search but
transformation is also pretty easy. Let's show what we mean here with an
example.
If we run this jq
command,
echo '1 2 3' | jq 2
we get
2
2
2
Hmm, that's a little odd, but it did what we told it to do. In jq
, the
expression 2
is evaluated for each input value, and the value 2
is produced each time, so three copies of 2
are emitted.
In zq
however, 2
by itself is interpreted as a search and is
shorthand for search 2
so the command
echo '1 2 3' | super -z -c '? 2' -
produces this "search result":
2
In fact, this search syntax generalizes, and if we search over a more complex input:
echo '1 2 [1,2,3] [4,5,6] {r:{x:1,y:2}} {r:{x:3,y:4}} "hello" "Number 2"' |
super -z -c '? 2' -
we naturally find all the 2's whether as a value, inside a value, or inside a string:
2
[1,2,3]
{r:{x:1,y:2}}
"Number 2"
You can also do keyword-text search, e.g.,
echo '1 2 [1,2,3] [4,5,6] {r:{x:1,y:2}} {r:{x:3,y:4}} "hello" "Number 2"' |
super -z -c '? hello or Number' -
produces
"hello"
"Number 2"
Doing searches like this in jq
would be hard.
That said, we can emulate the jq
transformation stance by explicitly
indicating that we want to yield
the result of the expression evaluated for each input value, e.g.,
echo '1 2 3' | super -z -c 'yield 2' -
now gives the same answer as jq
:
2
2
2
Cool, but doesn't it seem like search is a better disposition for shorthand syntax? What do you think?
On to ZSON
JSON is super easy and ubiquitous, but it can be limiting and frustrating when trying to do high-precision stuff with data.
When using zq
, it's handy to operate in the
domain of Zed data and only output to JSON when needed.
The human-readable format of Zed is called ZSON (and yes, that's a play on the acronym JSON).
ZSON is nice because it has a comprehensive type system and you can go from ZSON to an efficient binary row format (Super Binary) and columnar (Super Columnar) --- and vice versa --- with complete fidelity and no loss of information. In this tour, we'll stick to ZSON (though for large data sets, Super Binary is much faster).
The first thing you'll notice about ZSON is that you don't need
quotations around field names. We can see this by taking some JSON
as input (the JSON format is auto-detected by zq
) and formatting
it as pretty-printed ZSON with -Z
:
echo '{"s":"hello","val":1,"a":[1,2],"b":true}' | super -Z -
which gives
{
s: "hello",
val: 1,
a: [
1,
2
],
b: true
}
s
, val
, a
, and b
all appear as unquoted identifiers here.
Of course if you have funny characters in a field name, ZSON can handle
it with quotes just like JSON:
echo '{"funny@name":1}' | super -z -
produces
{"funny@name":1}
Moreover, ZSON is fully compatible with all of JSON's corner cases like empty string as a field name and empty object as a value, e.g.,
echo '{"":{}}' | super -z -
produces
{"":{}}
Comprehensive Types
ZSON also has a comprehensive type system.
For example, here is ZSON "record" with a taste of different types of values as record fields:
{
v1: 1.5,
v2: 1,
v3: 1 (uint8),
v4: 2018-03-24T17:30:20.600852Z,
v5: 2m30s,
v6: 192.168.1.1,
v7: 192.168.1.0/24,
v8: [
1,
2,
3
],
v9: |[
"GET",
"PUT",
"POST"
]|,
v10: |{
"key1": 123,
"key2": 456
}|,
v11: {
a: 1,
r: {
s1: "hello",
s2: "world"
}
}
}
Here, v1
is a 64-bit IEEE floating-point value just like JSON.
Unlike JSON, v2
is a 64-bit integer. And there are other integer
types as with v3
,
which utilizes a ZSON type decorator,
in this case,
to clarify its specific type of integer as unsigned 8 bits.
a
v4
has type time
and v5
type duration
.
v6
is type ip
and v7
type net
.
v8
is an array of elements of type int64
, which in Zed, is a type
written as [int64]
.
v9
is a "set of strings", which is written like an array but with the
enclosing syntax |[
and ]|
.
v10
is a "map" type, which in other languages is often called a "table"
or a "dictionary". In Zed, a value of any type can be used for the key or the
value though all of the keys and all of the values must have the same type.
Finally, v11
is a Zed "record", which is similar to a JSON "object", but the
keys are called "fields", the order of the fields is significant and
is always preserved.
Records
As is often the case with semi-structured systems, you deal with nested values all the time: in JSON, data is nested with objects and arrays, while in Zed, data is nested with "records" and arrays (as well as other complex types).
Record expressions
are rather flexible with zq
and look a bit like JavaScript
or jq
syntax, e.g.,
echo '1 2 3' | super -z -c 'yield {kind:"counter",val:this}' -
produces
{kind:"counter",val:1}
{kind:"counter",val:2}
{kind:"counter",val:3}
Note that like the search shortcut, you can also drop the yield keyword here because the record literal implies the yield operator, e.g.,
echo '1 2 3' | super -z -c '{kind:"counter",val:this}' -
also produces
{kind:"counter",val:1}
{kind:"counter",val:2}
{kind:"counter",val:3}
zq
can also use a spread operator like JavaScript, e.g.,
echo '{a:{s:"foo", val:1}}{b:{s:"bar"}}' | super -z -c '{...a,s:"baz"}' -
produces
{s:"baz",val:1}
{s:"baz"}
while
echo '{a:{s:"foo", val:1}}{b:{s:"bar"}}' | super -z -c '{d:2,...a,...b}' -
produces
{d:2,s:"foo",val:1}
{d:2,s:"bar"}
Record Mutation
Sometimes you just want to extract or mutate certain fields of records.
Similar to the Unix cut
command, the Zed cut operator
extracts fields, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'cut s' -
produces
{s:"foo"}
{s:"bar"}
while the put operator mutates existing fields or adds new fields, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'put val:=123,pi:=3.14' -
produces
{s:"foo",val:123,pi:3.14}
{s:"bar",val:123,pi:3.14}
Note that put
is also an implied operator so the command with put
omitted
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'val:=123,pi:=3.14' -
produces the very same output:
{s:"foo",val:123,pi:3.14}
{s:"bar",val:123,pi:3.14}
Finally, it's worth mentioning that errors in Zed are
first class.
This means they can just show up in the data as values. In particular,
a common error is error("missing")
which occurs most often when referencing
a field that does not exist, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'cut val' -
produces
{val:1}
{val:error("missing")}
Sometimes you expect missing errors to occur sporadically and just want to ignore them, which can you easily do with the quiet function, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'cut quiet(val)' -
produces
{val:1}
Union Types
One of the tricks zq
uses to represent JSON data in its structured type system
is union types.
Most of the time, you don't need to worry about unions
but they show up from time to time. Even when
they show up, Zed just tries to "do the right thing" so you usually
don't have to worry about them even when they show up.
For example, this query is perfectly happy to operate on the union values that are implied by a mixed-type array:
echo '[1, "foo", 2, "bar"]' | super -z -c 'yield this[2],this[1]' -
produces
2
"foo"
but under the covers, the elements of the array have a union type of
int64
and string
, which is written (int64,string)
, e.g,.
echo '[1, "foo", 2, "bar"]' | super -z -c 'yield typeof(this)' -
produces
<[(int64,string)]>
which is a type value representing an array of union values.
As you learn more about Zed and want to use zq
to do data discovery and
preparation, union types are really quite powerful. They allow records
with fields of different types or mixed-type arrays to be easily expressed
while also having a very precise type definition. This is the essence
of Zed's new
super-structured data model.
First-class Types
Note that in the type value above, the type is wrapped in angle brackets. This is how ZSON represents types when expressed as values. In other words, Zed has first-class types.
The type of any value in zq
can be accessed via the
typeof function, e.g.,
echo '1 "foo" 10.0.0.1' | super -z -c 'yield typeof(this)' -
produces
<int64>
<string>
<ip>
What's the big deal here? We can print out the type of something. Yawn.
Au contraire, this is really quite powerful because we can use types as values to functions, e.g., as a dynamic argument to the cast function:
echo '{a:0,b:"2"}{a:0,b:"3"}' | super -z -c 'yield cast(b, typeof(a))' -
produces
2
3
But more powerfully, types can be used anywhere a value can be used and in particular, they can be group-by keys, e.g.,
echo '{x:1,y:2}{s:"foo"}{x:3,y:4}' |
super -f table -c "count() by \`shape\`:=typeof(this) |> sort count" -
produces
shape count
<{s:string}> 1
<{x:int64,y:int64}> 2
When run over large data sets, this gives you an insightful count of each "shape" of data in the input. This is a powerful building block for data discovery.
It's worth mentioning jq
also has a type operator, but it produces a
simple string instead of first-class types, and arrays and objects have
no detail about their structure, e,g.,
echo '1 true [1,2,3] {"s":"foo"}' | jq type
produces
"number"
"boolean"
"array"
"object"
Moreover, if we compare types of different objects
echo '{"a":{"s":"foo"},"b":{"x":1,"y":2}}' | jq '(.a|type)==(.b|type)'
we get "object" here for each type and thus the result:
true
i.e., they match even though their underlying shape is different.
With zq
of course, these are different super-structured types so
the result is false, e.g.,
echo '{"a":{"s":"foo"},"b":{"x":1,"y":2}}' |
super -z -c 'yield typeof(a)==typeof(b)' -
produces
false
Sample
Sometimes you'd like to see a sample value of each shape, not its type. This is easy to do with the any aggregate function, e.g,
echo '{x:1,y:2}{s:"foo"}{x:3,y:4}' |
super -z -c 'val:=any(this) by typeof(this) |> sort val |> yield val' -
produces
{s:"foo"}
{x:1,y:2}
We like this pattern so much there is a shortcut sample operator, e.g.,
echo '{x:1,y:2}{s:"foo"}{x:3,y:4}' | super -z -c 'sample this |> sort this' -
emits the same result:
{s:"foo"}
{x:1,y:2}
Fuse
Sometimes JSON data can get really messy with lots of variations in fields, with null values appearing sometimes and sometimes not, and with the same fields having different data types. Most annoyingly, when you see a JSON object like this in isolation:
{a:1,b:null}
you have no idea what the expected data type of b
will be. Maybe it's another
number? Or maybe a string? Or maybe an array or an embedded object?
zq
and ZSON don't have this problem because every value (even null
) is
comprehensively typed. However, zq
in fact must deal with this thorny problem
when reading JSON and converting it to Zed's super-structure.
This is where you might have to spend a little bit of time coding up
the right zq
logic to disentangle a JSON mess. But once the data is cleaned up,
you can leave it in a Zed format and not worry again.
To do so, the fuse operator comes in handy. Let's say you have this sequence of data:
{a:1,b:null}
{a:null,b:[2,3,4]}
As we said,
you can't tell by looking at either value what the types of both a
and b
should be. But if you merge the values into a common type, things begin to make
sense, e.g.,
echo '{a:1,b:null}{a:null,b:[2,3,4]}' | super -z -c fuse -
produces this transformed and comprehensively-typed ZSON output:
{a:1,b:null([int64])}
{a:null(int64),b:[2,3,4]}
Now you can see all the detail.
This turns out to be so useful, especially with large amounts of messy input data, you will often find yourself fusing data then sampling it, e.g.,
echo '{a:1,b:null}{a:null,b:[2,3,4]}' | super -Z -c 'fuse |> sample' -
produces a comprehensively-typed sample:
{
a: 1,
b: null ([int64])
}
As you explore data in this fashion, you will often type various searches
to slice and dice the data as you get a feel for it all while sending
your interactive search results to fuse |> sample
.
To appreciate all this, let's have a look next at some real-world data...
Real-world GitHub Data
Now that we've covered the basics of zq
and the Zed language, let's
use the query patterns from above to explore some GitHub data.
First, we need to grab the data. You can use curl
for this or you can
just use zq
as zq
can take URLs in addition to file name arguments.
This command will grab descriptions of first 30 PRs created in the
public zed
repository and place it in a file called prs.json
:
super -f json \
https://api.github.com/repos/brimdata/super/pulls\?state\=all\&sort\=desc\&per_page=30 \
> prs.json
Now that you have this JSON file on your local file system, how would you query it
with zq
?
Data Discovery
Before you can do anything, you need to know its structure but you generally don't know anything after pulling some random data from an API.
So, let's poke around a bit and figure it out. This process of data introspection is often called data discovery.
You could start by using jq
to pretty-print the JSON data,
jq . prs.json
That's 10,592 lines. Ugh, quite a challenge to sift through.
Instead, let's start out by figuring out how many values are in the input, e.g.,
super -f text -c 'count()' prs.json
produces
1
Hmm, there's just one value. It's probably a big JSON array but let's check with the kind function, and as expected:
super -z -c 'kind(this)' prs.json
produces
"array"
Ok got it. But, how many items are in the array?
super -z -c 'len(this)' prs.json
produces
30
Of course! We asked GitHub to return 30 items and the API returns the pull-request objects as elements of one array representing a single JSON value.
Let's see what sorts of things are in this array. Here, we need to enumerate the items from the array and do something with them. So how about we use the over operator to traverse the array and count the array items by their "kind",
super -z -c 'over this |> count() by kind(this)' prs.json
produces
{kind:"record",count:30(uint64)}
Ok, they're all records. Good, this should be easy!
The Zed records were all originally JSON objects. Maybe we can just use "sample" to have a deeper look...
super -Z -c 'over this |> sample' prs.json
Here we are using
-Z
, which is like-z
, but instead of formatting each ZSON value on its own line, it pretty-prints the ZSON with vertical formatting likejq
does for JSON.
Ugh, that output is still pretty big. It's not 10k lines but it's still more than 700 lines of pretty-printed ZSON.
Ok, maybe it's not so bad. Let's check how many shapes there are with sample
...
super -z -c 'over this |> sample |> count()' prs.json
produces
3(uint64)
All that data across the samples and only three shapes. They must each be really big. Let's check that out.
We can use the len function on the records to see the size of each of the four records:
super -z -c 'over this |> sample |> len(this) |> sort this' prs.json
and we get
0
36
36
Ok, this isn't so bad... two shapes each have 36 fields but one is length zero?! That outlier could only be the empty record. Let's check:
super -z -c 'over this |> sample |> len(this)==0' prs.json
produces
{}
Sure enough, there it is. We could also double check with jq
that there are
blank records in the GitHub results, and sure enough
jq '.[] | select(length==0)' prs.json
produces
{}
{}
Try opening your editor on that JSON file to look for the empty objects. Who knows why they are there? No fun. Real-world data is messy.
How about we fuse the 3 shapes together and have a look at the result:
super -Z -c 'over this |> fuse |> sample' prs.json
We won't display the result here as it's still pretty big. But you can give it a try. It's 379 lines.
But let's break down what's taking up all this space.
We can take the output from fuse |> sample
and list the fields with
and their "kind". Note that when we do an over this
with records as
input, we get a new record value for each field structured as a key/value pair:
super -f table -c '
over this
|> fuse
|> sample
|> over this
|> {field:key[0],kind:kind(value)}
' prs.json
produces
field kind
url primitive
id primitive
node_id primitive
html_url primitive
diff_url primitive
patch_url primitive
issue_url primitive
number primitive
state primitive
locked primitive
title primitive
user record
body primitive
created_at primitive
updated_at primitive
closed_at primitive
merged_at primitive
merge_commit_sha primitive
assignee primitive
assignees array
requested_reviewers array
requested_teams array
labels array
milestone primitive
draft primitive
commits_url primitive
review_comments_url primitive
review_comment_url primitive
comments_url primitive
statuses_url primitive
head record
base record
_links record
author_association primitive
auto_merge primitive
active_lock_reason primitive
With this list of top-level fields, we can easily explore the different pieces of their structure with sample. Let's have a look at a few of the record fields by giving these one-liners each a try and looking at the output:
super -Z -c 'over this |> sample head' prs.json
super -Z -c 'over this |> sample base' prs.json
super -Z -c 'over this |> sample _links' prs.json
While these fields have some useful information, we'll decide to drop them here and focus on other top-level fields. To do this, we can use the drop operator to whittle down the data:
super -Z -c 'over this |> fuse |> drop head,base,_link |> sample' prs.json
Ok, this looks more reasonable and is now only 120 lines of pretty-printed ZSON.
One more annoying detail here about JSON: time values are stored as strings, in this case, in ISO format, e.g., we can pull this value out with this query:
super -z -c 'over this |> head 1 |> yield created_at' prs.json
which produces this string:
"2019-11-11T19:50:46Z"
Since Zed has a native time
type and we might want to do native date comparisons
on these time fields, we can easily translate the string to a time with a cast, e.g.,
super -z -c 'over this |> head 1 |> yield time(created_at)' prs.json
produces the native time value:
2019-11-11T19:50:46Z
To be sure, you can check any value's type with the typeof
function, e.g.,
super -z -c 'over this |> head 1 |> yield time(created_at) |> typeof(this)' prs.json
produces the native time value:
<time>
Cleaning up the Messy JSON
Okay, now that we've explored the data, we have a sense of it and can "clean it up" with some Zed logic. We'll do this one step at a time, then put it all together.
First, let's get rid of the outer array and generate elements of an array as a sequence of Zed records that have been fused and let's filter out the empty records:
super -c 'over this |> len(this) != 0 |> fuse' prs.json > prs1.bsup
We can check that worked with count:
super -z -c 'count()' prs1.bsup
super -z -c 'sample |> count()' prs1.bsup
produces
{count:28(uint64)}
{count:1(uint64)}
Okay, good. There are 28 values (the 30 requested less the two empty records) and exactly one shape since the data was fused.
Now, let's drop the fields we aren't interested in:
super -c 'drop head,base,_links' prs1.bsup > prs2.bsup
Finally, let's clean up those dates. To track down all the candidates, we can run this Zed to group field names by their type and limit the output to primitive types:
super -z -c '
over this
|> kind(value)=="primitive"
|> fields:=union(key[0]) by type:=typeof(value)
' prs2.bsup
which gives
{type:<string>,fields:|["url","body","state","title","node_id","diff_url","html_url","closed_at","issue_url","merged_at","patch_url","created_at","updated_at","commits_url","comments_url","statuses_url","merge_commit_sha","author_association","review_comment_url","review_comments_url"]|}
{type:<int64>,fields:|["id","number"]|}
{type:<bool>,fields:|["draft","locked"]|}
{type:<null>,fields:|["assignee","milestone","auto_merge","active_lock_reason"]|}
Note that this use of
over
traverses each record and generates a key-value pair for each field in each record.
Looking through the fields that are strings, the candidates for ISO dates appear to be
closed_at
,merged_at
,created_at
, andupdated_at
. You can do a quick check of the theory by running...
super -z -c '{closed_at,merged_at,created_at,updated_at}' prs2.bsup
and you will get strings that are all ISO dates:
{closed_at:"2019-11-11T20:00:22Z",merged_at:"2019-11-11T20:00:22Z",created_at:"2019-11-11T19:50:46Z",updated_at:"2019-11-11T20:00:25Z"}
{closed_at:"2019-11-11T21:00:15Z",merged_at:"2019-11-11T21:00:15Z",created_at:"2019-11-11T20:57:12Z",updated_at:"2019-11-11T21:00:26Z"}
...
To fix those strings, we simply transform the fields in place using the
(implied) put operator and redirect the final
output the ZNG file prs.bsup
:
super -c '
closed_at:=time(closed_at),
merged_at:=time(merged_at),
created_at:=time(created_at),
updated_at:=time(updated_at)
' prs2.bsup > prs.bsup
We can check the result with our type analysis:
super -z -c '
over this
|> kind(value)=="primitive"
|> fields:=union(key[0]) by type:=typeof(value)
|> sort type
' prs.bsup
which now gives:
{type:<int64>,fields:|["id","number"]|}
{type:<time>,fields:|["closed_at","merged_at","created_at","updated_at"]|}
{type:<bool>,fields:|["draft","locked"]|}
{type:<string>,fields:|["url","body","state","title","node_id","diff_url","html_url","issue_url","patch_url","commits_url","comments_url","statuses_url","merge_commit_sha","author_association","review_comment_url","review_comments_url"]|}
{type:<null>,fields:|["assignee","milestone","auto_merge","active_lock_reason"]|}
and we can see that the date fields are correctly typed as type time
!
Note that we sorted the output values here using the sort operator to produce a consistent output order since aggregations can be run in parallel to achieve scale and do not guarantee their output order.
Putting It All Together
Instead of running each step above into a temporary file, we can put all the transformations together in a single Zed pipeline, where the Zed source text might look like this:
over this // traverse the array of objects
|> len(this) != 0 // skip empty objects
|> fuse // fuse objects into records of a combined type
|> drop head,base,_links // drop fields that we don't need
|> closed_at:=time(closed_at), // transform string dates to type time
merged_at:=time(merged_at),
created_at:=time(created_at),
updated_at:=time(updated_at)
Note that the
//
syntax indicates a single-line comment.
We can then put this in a file, called say transform.zed
, and use the -I
argument to run all the transformations in one fell swoop:
super -I transform.zed prs.json > prs.bsup
Running Analytics
Now that we've cleaned up our data, we can reliably and easily run analytics
on the finalized ZNG file prs.bsup
.
Zed gives us the best of both worlds of JSON and relational tables: we have the structure and clarity of the relational model while retaining the flexibility of JSON's document model. No need to create tables then issue SQL insert commands to put your clean data into all the right places.
Let's start with something simple. How about we output a "PR Report" listing the title of each PR along with its PR number and creation date:
super -f table -c '{DATE:created_at,NUMBER:f"PR #{number}",TITLE:title}' prs.bsup
and you'll see this output...
DATE NUMBER TITLE
2019-11-11T19:50:46Z PR #1 Make "make" work in zq
2019-11-11T20:57:12Z PR #2 fix install target
2019-11-11T23:24:00Z PR #3 import github.com/looky-cloud/lookytalk
2019-11-12T16:25:46Z PR #5 Make zq -f work
2019-11-12T16:49:07Z PR #6 a few clarifications to the zson spec
...
Note that we used a formatted string literal
to convert the field number
into a string and format it with surrounding text.
Instead of old PRs, we can get the latest list of PRs using the tail operator since we know the data is sorted chronologically. This command retrieves the last five PRs in the dataset:
super -f table -c '
tail 5
|> {DATE:created_at,"NUMBER":f"PR #{number}",TITLE:title}
' prs.bsup
and the output is:
DATE NUMBER TITLE
2019-11-18T22:14:08Z PR #26 ndjson writer
2019-11-18T22:43:07Z PR #27 Add reader for ndjson input
2019-11-19T00:11:46Z PR #28 fix TS_ISO8601, TS_MILLIS handling in NewRawAndTsFromJSON
2019-11-19T21:14:46Z PR #29 Return count of "dropped" fields from zson.NewRawAndTsFromJSON
2019-11-20T00:36:30Z PR #30 zval.sizeBytes incorrect
How about some aggregations? We can count the number of PRs and sort by the count highest first:
super -z -c "count() by user:=user.login |> sort count desc" prs.bsup
produces
{user:"mattnibs",count:10(uint64)}
{user:"aswan",count:7(uint64)}
{user:"mccanne",count:6(uint64)}
{user:"nwt",count:4(uint64)}
{user:"henridf",count:1(uint64)}
How about getting a list of all of the reviewers? To do this, we need to
traverse the records in the requested_reviewers
array and collect up
the login field from each record:
super -z -c 'over requested_reviewers |> collect(login)' prs.bsup
Oops, this gives us an array of the reviewer logins with repetitions since collect collects each item that it encounters into an array:
["mccanne","nwt","henridf","mccanne","nwt","mccanne","mattnibs","henridf","mccanne","mattnibs","henridf","mccanne","mattnibs","henridf","mccanne","nwt","aswan","henridf","mccanne","nwt","aswan","philrz","mccanne","mccanne","aswan","henridf","aswan","mccanne","nwt","aswan","mikesbrown","henridf","aswan","mattnibs","henridf","mccanne","aswan","nwt","henridf","mattnibs","aswan","aswan","mattnibs","aswan","henridf","aswan","henridf","mccanne","aswan","aswan","mccanne","nwt","aswan","henridf","aswan"]
What we'd prefer is a set of reviewers where each reviewer appears only once. This
is easily done with the union aggregate function
(not to be confused with union types) which
computes the set-wise union of its input and produces a Zed set
type as its
output. In this case, the output is a set of strings, written |[string]|
in the Zed language. For example:
super -z -c 'over requested_reviewers |> reviewers:=union(login)' prs.bsup
produces
{reviewers:|["nwt","aswan","philrz","henridf","mccanne","mattnibs","mikesbrown"]|}
Ok, that's pretty neat.
Let's close with an analysis that's a bit more sophisticated. Suppose we want to look at the reviewers that each user tends to ask for. We can think about this question as a "graph problem" where the user requesting reviews is one node in the graph and each set of reviewers is another node.
So as a first step, let's figure out how to create each edge, where an edge is a relation between the requesting user and the set of reviewers. We can create this in Zed with a "lateral subquery". Instead of computing a set-union over all the reviewers across all PRs, we instead want to compute the set-union over the reviewers in each PR. We can do this as follows:
super -z -c 'over requested_reviewers => ( reviewers:=union(login) )' prs.bsup
which produces an output like this:
{reviewers:|["nwt","mccanne"]|}
{reviewers:|["nwt","henridf","mccanne"]|}
{reviewers:|["mccanne","mattnibs"]|}
{reviewers:|["henridf","mccanne","mattnibs"]|}
{reviewers:|["henridf","mccanne","mattnibs"]|}
...
Note that the syntax => ( ... )
defines a lateral scope where any Zed subquery can
run in isolation over the input values created from the sequence of values
traversed by the outer over
.
But we need a "graph edge" between the requesting user and the reviewers.
To do this, we need to reference the user.login
from the top-level scope within the
lateral scope. This can be done by
bringing that value into the scope using a with
clause appended to the
over
expression and yielding a
record literal with the desired value:
super -z -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
|> {user,reviewers}
)
|> sort user,len(reviewers)
' prs.bsup
which gives us
{user:"aswan",reviewers:|["mccanne"]|}
{user:"aswan",reviewers:|["nwt","mccanne"]|}
{user:"aswan",reviewers:|["nwt","henridf","mccanne"]|}
{user:"aswan",reviewers:|["henridf","mccanne","mattnibs"]|}
{user:"aswan",reviewers:|["henridf","mccanne","mattnibs"]|}
{user:"henridf",reviewers:|["nwt","aswan","mccanne"]|}
{user:"mattnibs",reviewers:|["aswan","mccanne"]|}
{user:"mattnibs",reviewers:|["aswan","henridf"]|}
...
The final step is to simply aggregate the "reviewer sets" with the user
field
as the group-by key:
super -Z -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
|> {user,reviewers}
)
|> groups:=union(reviewers) by user
|> sort user,len(groups)
' prs.bsup
and we get
{
user: "aswan",
groups: |[
|[
"mccanne"
]|,
|[
"nwt",
"mccanne"
]|,
|[
"nwt",
"henridf",
"mccanne"
]|,
|[
"henridf",
"mccanne",
"mattnibs"
]|
]|
}
{
user: "henridf",
groups: |[
|[
"nwt",
"aswan",
"mccanne"
]|
]|
}
{
user: "mattnibs",
groups: |[
|[
"aswan",
"henridf"
]|,
|[
"aswan",
"mccanne"
]|,
|[
"aswan",
"henridf",
"mccanne"
]|,
|[
"nwt",
"aswan",
"henridf",
"mccanne"
]|,
|[
"nwt",
"aswan",
"mccanne",
"mikesbrown"
]|,
|[
"nwt",
"aswan",
"philrz",
"henridf",
"mccanne"
]|
]|
}
{
user: "mccanne",
groups: |[
|[
"nwt"
]|,
|[
"aswan"
]|,
|[
"mattnibs"
]|
]|
}
{
user: "nwt",
groups: |[
|[
"aswan"
]|,
|[
"aswan",
"mattnibs"
]|,
|[
"henridf",
"mattnibs"
]|,
|[
"mccanne",
"mattnibs"
]|
]|
}
After a quick glance here, you can tell that mccanne
looks for
very targeted reviews while mattnibs
casts a wide net, at least
for the PRs from the beginning of the zed
repo.
To quantify this concept, we can easily modify this query to compute the average number of reviewers requested instead of the set of groups of reviewers. To do this, we just average the reviewer set size with an aggregation:
super -z -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
|> {user,reviewers}
)
|> avg_reviewers:=avg(len(reviewers)) by user
|> sort avg_reviewers
' prs.bsup
which produces
{user:"mccanne",avg_reviewers:1.}
{user:"nwt",avg_reviewers:1.75}
{user:"aswan",avg_reviewers:2.4}
{user:"mattnibs",avg_reviewers:2.9}
{user:"henridf",avg_reviewers:3.}
Of course, if you'd like the query output in JSON, you can just say -j
and
zq
will happily format the Zed sets as JSON arrays, e.g.,
super -j -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
|> {user,reviewers}
)
|> groups:=union(reviewers) by user
|> sort user,len(groups)
' prs.bsup
produces
{"user":"aswan","groups":[["mccanne"],["nwt","mccanne"],["nwt","henridf","mccanne"],["henridf","mccanne","mattnibs"]]}
{"user":"henridf","groups":[["nwt","aswan","mccanne"]]}
{"user":"mattnibs","groups":[["aswan","henridf"],["aswan","mccanne"],["aswan","henridf","mccanne"],["nwt","aswan","henridf","mccanne"],["nwt","aswan","mccanne","mikesbrown"],["nwt","aswan","philrz","henridf","mccanne"]]}
{"user":"mccanne","groups":[["nwt"],["aswan"],["mattnibs"]]}
{"user":"nwt","groups":[["aswan"],["aswan","mattnibs"],["henridf","mattnibs"],["mccanne","mattnibs"]]}
Key Takeaways
So to summarize, we gave you a tour here of zq
and how the Zed data model
provide a powerful way do search, transformation, and analytics in a structured-like
way on data that begins its life as semi-structured JSON and is transformed
into the powerful super-structured format without having to create relational
tables and schemas.
As you can see, zq
is a general-purpose tool that you can add to your bag
of tricks to:
- explore messy and confusing JSON data using shaping and sampling,
- transform JSON data in ad hoc ways, and
- develop transform logic for hitting APIs like the GitHub API to produce
clean data for analysis by
zq
or even export into other systems or for testing.
If you'd like to learn more, feel free to read through the language docs in depth or see how you can organize data into a lake using a git-like commit model.