zed Tutorial
zq is great, but what if we have a lot of data on which we want to perform search and analytics? This is where the zed command comes in. zed builds on the type system and language found in zq and adds a high performance data lake on top.
Note: zed is currently in alpha form. Check out its current status in the super db command documentation.
Creating a Lake
We start by creating our Zed lake. First we'll set the ZED_LAKE environment variable that tells zed where we want to store our lake:
$ export ZED_LAKE=$HOME/.zedlake
Next we instruct zed to initialize our lake:
$ zed init
=>
lake created: /path/to/home/.zedlake
Adding data to our lake
Let's add some data.
Data is stored in pools in a Zed lake. A pool is similar to a table in a SQL database, except that a Zed pool has no schema to which the underlying data must adhere. Any data is welcome in a Zed pool! A Zed pool does have a pool key (a field) by which data is sorted. You might think of a pool key as a pool's primary index. Individual values in a pool are not required to have the pool key field, but it helps to choose a pool key that fits the data, since this allows Zed to efficiently query data within a range of the pool key without having to touch the entire data set.
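To make the idea concrete, here is a minimal Python sketch of a schema-free collection kept sorted by a key. It is not how Zed is implemented: the ToyPool class, the sample records, and the choice to place values that lack the key at the end are all assumptions made purely for illustration.

```python
class ToyPool:
    """Hypothetical stand-in for a pool: any record shape, sorted by one key."""

    def __init__(self, key, descending=True):
        self.key = key
        self.descending = descending
        self.records = []

    def load(self, new_records):
        # No schema check: every dict is accepted as-is.
        self.records.extend(new_records)
        with_key = [r for r in self.records if self.key in r]
        without_key = [r for r in self.records if self.key not in r]
        with_key.sort(key=lambda r: r[self.key], reverse=self.descending)
        # Assumption for this sketch only: records lacking the key go last.
        self.records = with_key + without_key


pool = ToyPool("created_at")
pool.load([
    {"created_at": "2019-11-11T19:50:46Z", "title": "made-up example"},
    {"note": "this record has no created_at field"},
    {"created_at": "2019-12-05T16:56:57Z", "number": 1},
])
print([r.get("created_at") for r in pool.records])
# ['2019-12-05T16:56:57Z', '2019-11-11T19:50:46Z', None]
```

Because the records stay ordered by the key, a reader looking for a key range only has to visit a contiguous slice rather than every record.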
For this primer we'll work with pull requests on this public repository via the
GitHub API.
Let's create a pool to store this data and use the field created_at as the pool key, sorted in descending order:
$ zed create -orderby created_at:desc prs
=>
pool created: prs <unique pool ID>
Using zed ls we can view all the pools in the lake:
$ zed ls
=>
prs <pool_id> key created_at order desc
Let's add some pull request data I've prefetched from the GitHub API here:
$ zed load -use prs github1.bsup
=>
<commit_id> committed
Our data has been committed. The -use prs argument in zed load tells zed to load our data into the prs pool.
Querying our data
With our data now loaded let's run a quick count() query to verify that we have the expected data. To do this we'll use the zed query command. To those familiar with super, zed query operates similarly except it doesn't accept file input arguments since it queries pools.
$ zed query -use prs 'count()'
=>
{count:100(uint64)}
This looks good so far, but let's do something more interesting. First let's use the zed use command to set prs as our default pool so we don't have to type the -use argument every time we operate on this pool.
$ zed use prs
We can run an aggregation to see who has created the most PRs during the time range of this first data set:
$ zed query 'count() by user:=user.login |> sort count desc'
=>
{user:"mccanne",count:40(uint64)}
{user:"mattnibs",count:23(uint64)}
{user:"aswan",count:20(uint64)}
{user:"henridf",count:9(uint64)}
{user:"nwt",count:5(uint64)}
{user:"philrz",count:3(uint64)}
A productive few weeks for McCanne!
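For readers more at home in a general-purpose language, the aggregation above is roughly equivalent to the following Python sketch. The records list is made-up sample data shaped like GitHub pull requests, not the contents of the prs pool.

```python
from collections import Counter

# Made-up sample records; only the nested user.login field matters here.
records = [
    {"user": {"login": "mccanne"}},
    {"user": {"login": "aswan"}},
    {"user": {"login": "mccanne"}},
]

# Equivalent of: count() by user:=user.login
counts = Counter(r["user"]["login"] for r in records)

# Equivalent of: sort count desc
result = [{"user": user, "count": n} for user, n in counts.most_common()]
print(result)
# [{'user': 'mccanne', 'count': 2}, {'user': 'aswan', 'count': 1}]
```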
We can use the min and max aggregations to see the time range of our data set:
$ zed query -Z 'min(created_at), max(created_at)'
=>
{
min: 2019-11-11T19:50:46Z,
max: 2019-12-05T16:56:57Z
}
That's not a lot of data, so let's add some more.
Adding additional data
Additional data can be added to our pool by running zed load on our second data set:
$ zed load github2.bsup
Running our min(created_at), max(created_at) query, we'll see that we now have almost two years of pull requests:
$ zed query -Z 'min(created_at), max(created_at)'
=>
{
min: 2019-11-11T19:50:46Z,
max: 2021-09-19T19:31:43Z
}
Now let's run a bucketed aggregation to count PRs per approximate month (specifically, PRs bucketed into spans of one twelfth of a year):
$ zed query 'count() by ts:=bucket(created_at, 1y/12) |> sort ts'
=>
{ts:2019-10-20T04:00:00Z,count:28(uint64)}
{ts:2019-11-19T14:00:00Z,count:123(uint64)}
{ts:2019-12-20T00:00:00Z,count:72(uint64)}
{ts:2020-01-19T10:00:00Z,count:102(uint64)}
{ts:2020-02-18T20:00:00Z,count:114(uint64)}
{ts:2020-03-20T06:00:00Z,count:111(uint64)}
{ts:2020-04-19T16:00:00Z,count:137(uint64)}
{ts:2020-05-20T02:00:00Z,count:74(uint64)}
...
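The bucket boundaries above do not fall on calendar months because 1y/12 is a fixed duration. The Python sketch below reproduces them under two assumptions that are consistent with the output shown but are not stated in this tutorial: that 1y means 365 days (so 1y/12 is 730 hours) and that buckets are aligned to the Unix epoch. The bucket function here is a stand-in written for illustration, not Zed's own code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SPAN = timedelta(days=365) / 12  # assumed meaning of 1y/12: 730 hours


def bucket(ts, span):
    # Round ts down to the nearest multiple of span since the epoch.
    whole_spans = (ts - EPOCH) // span
    return EPOCH + whole_spans * span


# The earliest created_at in the data set lands in the first bucket shown.
first = bucket(datetime(2019, 11, 11, 19, 50, 46, tzinfo=timezone.utc), SPAN)
print(first.isoformat())
# 2019-10-20T04:00:00+00:00
```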
Lots of PRs were created in the ~30-day block starting on 2020-04-19, so let's zoom in here and see who created these PRs:
$ zed query 'from prs range 2020-04-19T16:00:00Z to 2020-05-20T02:00:00Z
|> count() by user:=user.login |> sort count desc'
=>
{user:"mccanne",count:35(uint64)}
{user:"henridf",count:34(uint64)}
{user:"aswan",count:27(uint64)}
{user:"mattnibs",count:14(uint64)}
{user:"alfred-landrum",count:12(uint64)}
{user:"philrz",count:9(uint64)}
{user:"mikesbrown",count:5(uint64)}
{user:"nwt",count:1(uint64)}
McCanne is once again in the lead but Henri is not far behind.
The important thing demonstrated in the above query is the use of the from operator. The from operator specifies that the query runs against the main branch of the prs pool and also defines a time range for the query. The range part of the query is an important distinction from zq. Whereas zq would be required to scan the entire data set to execute this query, this Zed pool, which stores data sorted by created_at, can skip all data that doesn't fall within the range 2020-04-19T16:00:00Z to 2020-05-20T02:00:00Z. This results in a much faster query over the limited range.
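One way to picture why a range on the sort key helps is the Python sketch below. It assumes stored data is split into objects that each record the smallest and largest key they contain, so whole objects can be skipped without being read. The DataObject class, its min_key and max_key fields, and the sample boundaries are hypothetical; the lake's actual storage layout is not described in this tutorial.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    """Hypothetical stored object plus the key span of the values it holds."""
    name: str
    min_key: str
    max_key: str


def objects_to_scan(objects, start, end):
    # Keep only objects whose key span overlaps the range [start, end].
    return [o for o in objects if o.max_key >= start and o.min_key <= end]


# Made-up objects covering consecutive slices of created_at.
objects = [
    DataObject("a", "2019-11-11T00:00:00Z", "2020-03-01T00:00:00Z"),
    DataObject("b", "2020-03-01T00:00:01Z", "2020-08-01T00:00:00Z"),
    DataObject("c", "2020-08-01T00:00:01Z", "2021-01-01T00:00:00Z"),
]
hits = objects_to_scan(objects, "2020-04-19T16:00:00Z", "2020-05-20T02:00:00Z")
print([o.name for o in hits])
# ['b']
```

Only one of the three objects needs to be read for this range, whereas a scan without the sort key would have to touch all of them.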
Time travel
Suppose we made a mistake by loading the last chunk of data. Perhaps we applied the wrong transform to the incoming data. Is there any way we can fix this? Similar to version control systems like git, a Zed lake maintains a linear history (or commit log) of all the changes made to a pool. There are many advantages to having data stored in this manner, one of which is that we can easily discard changes we don't want.
First we'll use the zed log command to view the history of commits (IDs will vary in your output):
$ zed log
=>
commit 26i2N0uu6wEo5XAhPMid6eQsamF
Author: nibs@Matthews-MacBook-Air-2.local
Date: 2022-03-21T26:03:25Z
loaded 1 data object
26i2MyhTem11tTOS2HSa1cgnYyz 1900 records in 765024 data bytes
commit 26i2MeIlGMoGHzjpbZttKtUuSFb
Author: nibs@Matthews-MacBook-Air-2.local
Date: 2022-03-21T19:47:19Z
loaded 1 data object
26i2Mi5xPdaTRxbho05DUhTYHIx 100 records in 46000 data bytes
Let's revert the most recent commit:
$ zed revert 26i2N0uu6wEo5XAhPMid6eQsamF
=>
"main": 26i2N0uu6wEo5XAhPMid6eQsamF reverted in 26nY9AYOxx2WtSfKGjof9R2MOYb
We can run count() to see we're back to our original 100 values.
$ zed query 'count()'
=>
{count:100(uint64)}
If the revert itself was a mistake and we'd like to keep the data after all, we can revert the revert commit:
$ zed revert 26nY9AYOxx2WtSfKGjof9R2MOYb
Running count() will show we're back to 2000 values:
$ zed query 'count()'
=>
{count:2000(uint64)}
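The behavior above, where a revert is itself a new commit that can be reverted in turn, can be sketched in a few lines of Python. This is a toy model that assumes each commit simply adds or removes whole data objects. ToyBranch, its methods, and the object names are hypothetical, and this is not a description of how the lake is actually implemented.

```python
class ToyBranch:
    """Hypothetical append-only commit log; history is never rewritten."""

    def __init__(self):
        self.commits = []  # each commit: {"add": {...}, "remove": {...}}

    def load(self, object_id, record_count):
        self.commits.append({"add": {object_id: record_count}, "remove": {}})
        return len(self.commits) - 1  # the list index doubles as a commit ID

    def revert(self, commit_id):
        # Undo a commit by appending its inverse as a new commit.
        c = self.commits[commit_id]
        self.commits.append({"add": dict(c["remove"]), "remove": dict(c["add"])})
        return len(self.commits) - 1

    def count(self):
        # Replay the log to find which objects are currently visible.
        visible = {}
        for c in self.commits:
            visible.update(c["add"])
            for object_id in c["remove"]:
                visible.pop(object_id, None)
        return sum(visible.values())


branch = ToyBranch()
branch.load("obj1", 100)
second = branch.load("obj2", 1900)
undo = branch.revert(second)
print(branch.count())  # 100
branch.revert(undo)
print(branch.count())  # 2000
```

Note that the log grows to four commits here: nothing is deleted, which is what makes it possible to undo an undo.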
Running as a service
Now that we've compiled an interesting data set, how might we share this with others? Using the zed serve command we can launch our Zed lake as a service that will allow multiple clients to query and add data to the same lake. In a separate console window run:
$ zed serve -lake $HOME/.zedlake
=>
{"level":"info","ts":1647957396.828584,"msg":"Open files limit raised","limit":10240}
{"level":"info","ts":1647957396.8318028,"logger":"core","msg":"Started"}
{"level":"info","ts":1647957396.83288,"logger":"httpd","msg":"Listening","addr":"[::]:9867"}
We now have a service running on http://localhost:9867. If we set the ZED_LAKE environment variable we defined at the beginning to this URL, we can run the full set of zed commands against this service:
$ export ZED_LAKE=http://localhost:9867
$ zed query -Z 'min(created_at), max(created_at)'
=>
{
min: 2019-11-11T19:50:46Z,
max: 2021-08-10T19:48:56Z
}
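Since the service listens over HTTP, programs other than the zed command could in principle talk to it as well. The Python sketch below only builds a request with the standard library and does not send it. It assumes the service exposes a POST /query endpoint that accepts a JSON body with a query field; that endpoint, the body shape, and the Accept header are assumptions to verify against the lake API documentation, and build_query_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_query_request(base_url, query):
    # Assumed endpoint and body shape; check the API docs before relying on them.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/query",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_query_request("http://localhost:9867", "from prs |> count()")
print(req.get_method(), req.full_url)
# POST http://localhost:9867/query

# With the service running, the request could then be sent with:
#   urllib.request.urlopen(req).read()
```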
Where to go from here?
Obviously this is only the tip of the iceberg in terms of things that can be done with the zed command. Some suggested next steps:
- Dig deeper into SuperDB data lakes by having a look at the super db command documentation.
- Get a better idea of ways you can query your data by looking at the Zed language documentation.
If you have any questions or run into any snags, join the friendly Zed community at the Brim Data Slack workspace.