Version: v1.5.0

Join Overview

This is a brief primer on Zed's experimental join operator.

Currently, join is limited in the following ways:

the joined inputs both come from the parent so the query must be split before join,
only merge join is implemented, requiring inputs to be explicitly sorted, and
only equi-join (i.e., a join predicate containing =) is supported.

A more comprehensive join design with easier-to-use syntax is forthcoming.

Example Data

The first input data source for our usage examples is fruit.ndjson, which describes the characteristics of some fresh produce.

{"name":"apple","color":"red","flavor":"tart"}
{"name":"banana","color":"yellow","flavor":"sweet"}
{"name":"avocado","color":"green","flavor":"savory"}
{"name":"strawberry","color":"red","flavor":"sweet"}
{"name":"dates","color":"brown","flavor":"sweet","note":"in season"}
{"name":"figs","color":"brown","flavor":"plain"}

The other input data source is people.ndjson, which describes the traits and preferences of some potential eaters of fruit.

{"name":"morgan","age":61,"likes":"tart"}
{"name":"quinn","age":14,"likes":"sweet","note":"many kids enjoy sweets"}
{"name":"jessie","age":30,"likes":"plain"}
{"name":"chris","age":47,"likes":"tart"}

Inner Join

We'll start by outputting only the fruits liked by at least one person. The name of the matching person is copied into a field of a different name in the joined results.

Because we're performing an inner join (the default), the explicit inner is not strictly necessary, but including it clarifies our intention.

Notice how each input is specified separately within the parentheses-wrapped from() block before the join appears in our Zed pipeline.

The Zed script inner-join.zed:

from (
  file fruit.ndjson => sort flavor
  file people.ndjson => sort likes
) | inner join on flavor=likes eater:=name

Executing the Zed script:

zq -z -I inner-join.zed

produces

{name:"figs",color:"brown",flavor:"plain",eater:"jessie"}
{name:"banana",color:"yellow",flavor:"sweet",eater:"quinn"}
{name:"strawberry",color:"red",flavor:"sweet",eater:"quinn"}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eater:"quinn"}
{name:"apple",color:"red",flavor:"tart",eater:"morgan"}
{name:"apple",color:"red",flavor:"tart",eater:"chris"}

Left Join

note

In some databases a left join is called a left outer join.

By performing a left join that targets the same key fields, now all of our fruits will be shown in the results even if no one likes them (e.g., avocado).

As another variation, we'll also copy over the age of the matching person. By referencing only the field name rather than using := for assignment, the original field name age is maintained in the results.

The Zed script left-join.zed:

from (
  file fruit.ndjson => sort flavor
  file people.ndjson => sort likes
) | left join on flavor=likes eater:=name,age

Executing the Zed script:

zq -z -I left-join.zed

produces

{name:"figs",color:"brown",flavor:"plain",eater:"jessie",age:30}
{name:"avocado",color:"green",flavor:"savory"}
{name:"banana",color:"yellow",flavor:"sweet",eater:"quinn",age:14}
{name:"strawberry",color:"red",flavor:"sweet",eater:"quinn",age:14}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eater:"quinn",age:14}
{name:"apple",color:"red",flavor:"tart",eater:"morgan",age:61}
{name:"apple",color:"red",flavor:"tart",eater:"chris",age:47}

Right join

note

In some databases a right join is called a right outer join.

Next we'll change the join type from left to right. Notice that this causes the note field from the right-hand input to appear in the joined results.

The Zed script right-join.zed:

from (
  file fruit.ndjson => sort flavor
  file people.ndjson => sort likes
) | right join on flavor=likes fruit:=name

Executing the Zed script:

zq -z -I right-join.zed

produces

{name:"jessie",age:30,likes:"plain",fruit:"figs"}
{name:"quinn",age:14,likes:"sweet",note:"many kids enjoy sweets",fruit:"banana"}
{name:"quinn",age:14,likes:"sweet",note:"many kids enjoy sweets",fruit:"strawberry"}
{name:"quinn",age:14,likes:"sweet",note:"many kids enjoy sweets",fruit:"dates"}
{name:"morgan",age:61,likes:"tart",fruit:"apple"}
{name:"chris",age:47,likes:"tart",fruit:"apple"}

Inputs from Pools

As our prior examples all used zq, we used file in our from() block to pull our respective inputs from named file sources. However, if the inputs are stored in pools in a Zed lake, the pool names would instead be specified in the from() block.

Here we'll load our input data to pools in a temporary Zed Lake, then execute our inner join using zed query.

Notice that because we happened to use -orderby to sort our pools by the same keys that we reference in our join, we did not need to use any explicit upstream sort.

The Zed script inner-join-pools.zed:

from (
  pool fruit
  pool people
) | inner join on flavor=likes eater:=name

Populating the pools, then executing the Zed script:

export ZED_LAKE=lake
zed init -q
zed create -q -orderby flavor:asc fruit
zed create -q -orderby likes:asc people
zed load -q -use fruit fruit.ndjson
zed load -q -use people people.ndjson
zed query -z -I inner-join-pools.zed

produces

{name:"figs",color:"brown",flavor:"plain",eater:"jessie"}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eater:"quinn"}
{name:"banana",color:"yellow",flavor:"sweet",eater:"quinn"}
{name:"strawberry",color:"red",flavor:"sweet",eater:"quinn"}
{name:"apple",color:"red",flavor:"tart",eater:"chris"}
{name:"apple",color:"red",flavor:"tart",eater:"morgan"}

Self Joins

In addition to the named files and pools like we've used in the prior examples, Zed is also intended to work on a single sequence of data that is split and joined to itself. Here we'll combine our file sources into a stream that we'll pipe into zq via stdin. Because join requires two separate inputs, here we'll use the has() function to identify the records in the stream that will be treated as the left and right sides.

The Zed script inner-join-streamed.zed:

switch (
  case has(color) => sort flavor
  case has(age) => sort likes
) | inner join on flavor=likes eater:=name

Executing the Zed script:

cat fruit.ndjson people.ndjson | zq -z -I inner-join-streamed.zed -

produces

{name:"figs",color:"brown",flavor:"plain",eater:"jessie"}
{name:"banana",color:"yellow",flavor:"sweet",eater:"quinn"}
{name:"strawberry",color:"red",flavor:"sweet",eater:"quinn"}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eater:"quinn"}
{name:"apple",color:"red",flavor:"tart",eater:"morgan"}
{name:"apple",color:"red",flavor:"tart",eater:"chris"}

Multi-value Joins

The equality test in a Zed join accepts only one named key from each input. However, joins on multiple matching values can still be performed by making the values available in comparable complex types, such as embedded records.

To illustrate this, we'll introduce some new input data inventory.ndjson that represents a vendor's available quantity of fruit for sale. As the colors indicate, they separately offer both ripe and unripe fruit.

{"name":"banana","color":"yellow","quantity":1000}
{"name":"banana","color":"green","quantity":5000}
{"name":"strawberry","color":"red","quantity":3000}
{"name":"strawberry","color":"white","quantity":6000}

Let's assume we're interested in seeing the available quantities of only the ripe fruit in our fruit.ndjson records. In the Zed script multi-value-join.zed, we create the keys as embedded records inside each input record, using the same field names and data types in each. We'll leave the created fruitkey records intact to show what they look like, but since it represents redundant data, in practice we'd typically drop it after the join in our Zed pipeline.

from (
  file fruit.ndjson => put fruitkey:={name,color} | sort fruitkey
  file inventory.ndjson => put invkey:={name,color} | sort invkey
) | inner join on fruitkey=invkey quantity

Executing the Zed script:

zq -z -I multi-value-join.zed

produces

{name:"banana",color:"yellow",flavor:"sweet",fruitkey:{name:"banana",color:"yellow"},quantity:1000}
{name:"strawberry",color:"red",flavor:"sweet",fruitkey:{name:"strawberry",color:"red"},quantity:3000}

Embedding the entire opposite record

In the current join implementation, explicit entries must be provided in the [field-list] in order to copy values from the opposite input into the joined results (a possible future enhancement zed/2815 may improve upon this). This can be cumbersome if your goal is to copy over many fields or you don't know the names of all desired fields.

One way to work around this limitation is to specify this in the field list to copy the contents of the entire opposite record into an embedded record in the result.

The Zed script embed-opposite.zed:

from (
  file fruit.ndjson => sort flavor
  file people.ndjson => sort likes
) | inner join on flavor=likes eaterinfo:=this

Executing the Zed script:

zq -z -I embed-opposite.zed

produces

{name:"figs",color:"brown",flavor:"plain",eaterinfo:{name:"jessie",age:30,likes:"plain"}}
{name:"banana",color:"yellow",flavor:"sweet",eaterinfo:{name:"quinn",age:14,likes:"sweet",note:"many kids enjoy sweets"}}
{name:"strawberry",color:"red",flavor:"sweet",eaterinfo:{name:"quinn",age:14,likes:"sweet",note:"many kids enjoy sweets"}}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eaterinfo:{name:"quinn",age:14,likes:"sweet",note:"many kids enjoy sweets"}}
{name:"apple",color:"red",flavor:"tart",eaterinfo:{name:"morgan",age:61,likes:"tart"}}
{name:"apple",color:"red",flavor:"tart",eaterinfo:{name:"chris",age:47,likes:"tart"}}

Join Overview

Example Data​

Inner Join​

Left Join​

Right join​

Inputs from Pools​

Self Joins​

Multi-value Joins​

Embedding the entire opposite record​

Example Data

Inner Join

Left Join

Right join

Inputs from Pools

Self Joins

Multi-value Joins

Embedding the entire opposite record