Data Types
The SuperPipe language includes most data types of a typical programming language as defined in the super data model.
The syntax of individual literal values generally follows the Super JSON syntax with the exception that type decorators are not included in the language. Instead, a type cast may be used in any expression for explicit type conversion.
In particular, the syntax of primitive types follows the primitive-value definitions in Super JSON, as do the various complex types like records, arrays, sets, and so forth. However, unlike Super JSON, complex values are not limited to constants and may be composed from literal expressions.
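For example, a complex value can be built from expressions over the input, with a cast applied inside it. This is a minimal sketch; the exact -z output decorators shown are our expectation and may vary by version:
echo '1 2' | super -z -c 'yield {x:this,y:uint8(this+1)}' -
produces
{x:1,y:2(uint8)}
{x:2,y:3(uint8)}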
First-class Types
As in the super data model, the SuperPipe language has first-class types: any type may be used as a value.
The primitive types are listed in the data model specification and have the same syntax in SuperPipe. Complex types also follow the Super JSON syntax. Note that the type of a type value is simply type.
As in Super JSON, when types are used as values, e.g., in an expression, they must be referenced within angle brackets. That is, the integer type int64 is expressed as a type value using the syntax <int64>.
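For example, a type value can be produced directly in a query:
super -z -c 'yield <int64>'
produces
<int64>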
Complex types in the SuperPipe language follow the Super JSON syntax as well. Here are a few examples:
- a simple record type -
{x:int64,y:int64}
- an array of integers -
[int64]
- a set of strings -
|[string]|
- a map of string keys to integer values -
|{string:int64}|
- a union of string and integer -
(string,int64)
Complex types may be composed, as in [({s:string},{x:int64})], which is an array of a union of two record types.
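As a quick check, such a composed type can be yielded as a value (a sketch; the output is our expectation of the -z format):
super -z -c 'yield <[({s:string},{x:int64})]>'
produces
<[({s:string},{x:int64})]>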
The typeof function returns a value's type as a value, e.g., typeof(1) is <int64> and typeof(<int64>) is <type>.
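Both cases can be checked from the shell (a minimal sketch; a yield of two comma-separated expressions emits two values per input):
echo '1' | super -z -c 'yield typeof(this), typeof(typeof(this))' -
produces
<int64>
<type>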
First-class types are quite powerful because types can serve as group-by keys or be used in "data shaping" logic. A common workflow for data introspection is to first perform an exploratory search of the data and then count the occurrences of each data shape as follows:
search ... |> count() by typeof(this)
For example,
echo '1 2 "foo" 10.0.0.1 <string>' |
super -z -c 'count() by typeof(this) |> sort this' -
produces
{typeof:<int64>,count:2(uint64)}
{typeof:<string>,count:1(uint64)}
{typeof:<ip>,count:1(uint64)}
{typeof:<type>,count:1(uint64)}
When running such a query over complex, semi-structured data, the results can be quite illuminating and can inform the design of "data shaping" queries to transform raw, messy data into clean data for downstream tooling.
Note the somewhat subtle difference between a record value with a field t of type type whose value is type string
{t:<string>}
and a record type used as a value
<{t:string}>
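The two can be produced side by side for comparison (a sketch; the yield of two expressions emits both values):
super -z -c 'yield {t:<string>}, <{t:string}>'
produces
{t:<string>}
<{t:string}>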
Named Types
As in any modern programming language, types can be named and the type names persist into the data model and thus into the serialized input and output.
Named types may be defined in four ways:
- with a type statement (sketched below),
- with the cast function (also sketched below),
- with a definition inside of another type, or
- by the input data itself.
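As a rough sketch of the first two ways (the second assumes cast also accepts a string name, naming the value's existing type; that signature is our assumption, not confirmed by this document):
echo '80' | super -z -c 'type port = uint16 yield cast(this, <port>)' -
produces
80(port=uint16)
and
echo '80' | super -z -c 'yield cast(this, "port")' -
produces
80(=port)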
Type names that are embedded in another type have the form
name=type
and create a binding between the indicated string name and the specified type. For example,
type socket = {addr:ip,port:port=uint16}
defines a named type socket that is a record with field addr of type ip and field port of type "port", where type "port" is a named type for type uint16.
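Shaping a value with this definition illustrates the embedded binding (a sketch; the output decorators are our expectation of the -z format):
echo '{addr:10.0.0.1,port:8080}' |
super -z -c 'type socket = {addr:ip,port:port=uint16} yield cast(this, <socket>)' -
produces
{addr:10.0.0.1,port:8080(port=uint16)}(=socket)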
Named types may also be defined by the input data itself, as super data is comprehensively self-describing. When named types are defined in the input data, there is no need to declare their type in a query. In this case, a SuperPipe expression may refer to the type by the name that simply appears to the runtime as a side effect of operating upon the data. If the type name referred to in this way does not exist, then the type value reference results in error("missing"). For example,
echo '1(=foo) 2(=bar) 3(=foo)' | super -z -c 'typeof(this)==<foo>' -
results in
1(=foo)
3(=foo)
and
echo '1(=foo)' | super -z -c 'yield <foo>' -
results in
<foo=int64>
but
super -z -c 'yield <foo>'
gives
error("missing")
Each instance of a named type definition overrides any earlier definition. In this way, types are local in scope. Each value that references a named type retains its local definition of that type, preserving the proper type binding while accommodating changes to the type over time. For example,
echo '1(=foo) 2(=bar) "hello"(=foo) 3(=foo)' |
super -z -c 'count() by typeof(this) |> sort this' -
results in
{typeof:<bar=int64>,count:1(uint64)}
{typeof:<foo=int64>,count:2(uint64)}
{typeof:<foo=string>,count:1(uint64)}
Here, the two versions of type "foo" are retained in the group-by results.
In general, it is bad practice to define multiple versions of a single named type, though the SuperDB system and super data model accommodate such dynamic bindings. Managing and enforcing the relationship between type names and their type definitions on a global basis (e.g., across many different data pools in a data lake) is outside the scope of the Zed data model and language. That said, Zed provides flexible building blocks so systems can define their own schema versioning and schema management policies on top of these Zed primitives.
The super-structured data model is a superset of relational tables and SuperPipe's type system can easily make this connection. As an example, consider this type definition for "employee":
type employee = {id:int64,first:string,last:string,job:string,salary:float64}
In SQL, you might find the top five salaries by last name with
SELECT last,salary
FROM employee
ORDER BY salary DESC
LIMIT 5
In SuperPipe, you would say
from anywhere
|> typeof(this)==<employee>
|> cut last,salary
|> sort -r salary
|> head 5
and since type comparisons are so useful and common, the is function can be used to perform the type match:
from anywhere
|> is(<employee>)
|> cut last,salary
|> sort -r salary
|> head 5
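A runnable variant over inline data, where the named type is defined by the input itself as described above (a sketch with hypothetical sample values):
echo '{id:1,first:"Ada",last:"Lovelace",job:"eng",salary:250000.}(=employee) {note:"contractor"}' |
super -z -c 'is(<employee>) |> cut last,salary' -
produces
{last:"Lovelace",salary:250000.}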
The power of SuperPipe is that you can interpret data on the fly as belonging to a certain schema, in this case "employee", and those records can be intermixed with other relevant data. There is no need to create a table called "employee" and put the data into the table before that data can be queried as an "employee". And if the schema or type name for "employee" changes, queries continue to work.
First-class Errors
As with types, errors in SuperPipe are first-class: any value can be transformed into an error by wrapping it in an error type.
In general, expressions and functions that result in errors simply return a value of type error as a result. This encourages a powerful flow-style of error handling where errors simply propagate from one operation to the next and land in the output alongside non-error values, providing helpful context and rich information for tracking down the source of errors. There is no need to check for error conditions everywhere or look through auxiliary logs to find out what happened.
For example, input values can be transformed to errors as follows:
echo '0 "foo" 10.0.0.1' | super -z -c 'error(this)' -
produces
error(0)
error("foo")
error(10.0.0.1)
More practically, errors from the runtime show up as error values. For example,
echo 0 | super -z -c '1/this' -
produces
error("divide by zero")
And since errors are first-class and just values, they have a type. In particular, an error has the complex type error containing the type of its underlying value. For example,
echo 0 | super -z -c 'typeof(1/this)' -
produces
<error(string)>
First-class errors are particularly useful for creating structured errors. When a SuperPipe query encounters a problematic condition, instead of silently dropping the offending value or obscurely logging an error to some hard-to-find system log as so many ETL pipelines do, the Zed logic can instead wrap the offending value as an error and propagate it to its output.
For example, suppose a bad value shows up:
{kind:"bad", stuff:{foo:1,bar:2}}
A shaper could catch the bad value (e.g., as a default case in a switch topology) and propagate it as an error using the expression:
yield error({message:"unrecognized input",input:this})
then such errors could be detected and searched for downstream with the is_error function.
For example, applied to the bad value above, this expression produces
error({message:"unrecognized input",input:{kind:"bad", stuff:{foo:1,bar:2}}})
and is_error(this) then returns true for such wrapped values.
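The wrapping can be tried end to end over the example value (a minimal sketch):
echo '{kind:"bad", stuff:{foo:1,bar:2}}' |
super -z -c 'yield error({message:"unrecognized input",input:this})' -
produces
error({message:"unrecognized input",input:{kind:"bad",stuff:{foo:1,bar:2}}})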
There is no need to create special tables in a complex warehouse-style ETL to land such errors as they can simply land next to the output values themselves.
And when transformations cascade one into the next as different stages of an ETL pipeline, errors can be wrapped one by one forming a "stack trace" or lineage of where the error started and what stages it traversed before landing at the final output stage.
Errors will unfortunately and inevitably occur even in production, but having a first-class data type to manage them all while allowing them to peacefully coexist with valid production data is a novel and useful approach that Zed enables.
Missing and Quiet
SuperDB's heterogeneous data model allows for queries
that operate over different types of data whose structure and type
may not be known ahead of time, e.g., different
types of records with different field names and varying structure.
Thus, a reference to a field, e.g., this.x, may be valid for some values that include a field called x but not valid for those that do not. What is the value of x when the field x does not exist?
A similar question faced SQL when it was adapted in various different forms to operate on semi-structured data like JSON or XML. SQL already had the NULL value, so perhaps a reference to a missing value could simply be NULL.
But JSON also has null, so a reference to x in the JSON value
{"x":null}
and a reference to x in the JSON value
{}
would have the same value of NULL. Furthermore, an expression like x==NULL could not differentiate between these two cases.
To solve this problem, the MISSING value was proposed to represent the value that results from accessing a field that is not present. Thus, x==NULL and x==MISSING could disambiguate the two cases above.
SuperPipe, instead, recognizes that the SQL value MISSING is a paradox: I'm here but I'm not.
In reality, a MISSING value is not a value. It's an error condition that resulted from trying to reference something that didn't exist. So why should we pretend that this is a bona fide value? SQL adopted this approach because it lacks first-class errors.
But SuperPipe has first-class errors, so a reference to something that does not exist is an error of type error(string) whose value is error("missing"). For example,
echo "{x:1} {y:2}" | super -z -c 'yield x' -
produces
1
error("missing")
Sometimes you want missing errors to show up and sometimes you don't. The quiet function transforms missing errors into "quiet errors". A quiet error is the value error("quiet") and is ignored by most operators, in particular yield. For example,
echo "{x:1} {y:2}" | super -z -c "yield quiet(x)" -
produces
1
And what if you want a default value instead of a missing error? The coalesce function returns the first value that is not null, error("missing"), or error("quiet"). For example,
echo "{x:1} {y:2}" | super -z -c "yield coalesce(x, 0)" -
produces
1
0