Zed/Zeek Data Type Compatibility
As the super data model was in many ways inspired by the Zeek TSV log format, SuperDB's rich storage formats (Super JSON, Super Binary, etc.) maintain comprehensive interoperability with Zeek. When Zeek is configured to output its logs in JSON format, much of the rich type information is lost in translation, but this can be restored by following the guidance for shaping Zeek JSON. On the other hand, Zeek TSV can be converted to Zed storage formats and back to Zeek TSV without any loss of information.
This document describes how the Zed type system is able to represent each of the types that may appear in Zeek logs.
Zed tools maintain an internal Zed-typed representation of any Zeek data that is read or imported. Therefore, knowing the equivalent types will prove useful when performing operations in the Zed language such as type casting or looking at the data when output as Super JSON.
Equivalent Types
The following table summarizes which Zed data type corresponds to each Zeek data type that may appear in a Zeek TSV log. While most types have a simple 1-to-1 mapping from Zeek to Zed and back to Zeek again, the sections linked from the Additional Detail column describe cosmetic differences and other subtleties applicable to handling certain types.
Zeek Type | Zed Type | Additional Detail |
---|---|---|
bool | bool | |
count | uint64 | |
int | int64 | |
double | float64 | See double details |
time | time | |
interval | duration | |
string | string | See string details about escaping |
port | uint16 | See port details |
addr | ip | |
subnet | net | |
enum | string | See enum details |
set | set | See set details |
vector | array | |
record | record | See record details |
The Zeek data types page describes the types in the context of the Zeek scripting language. The Zeek types available in scripting are a superset of the data types that may appear in Zeek log files. The encodings of the types also differ in some ways between the two contexts. However, we link to this reference because there is no authoritative specification of the Zeek TSV log format.
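As a rough illustration of the table above, the mapping can be written as a small lookup. This is a minimal sketch, not how SuperDB implements it: the helper name `zed_type_for` is hypothetical, and the `set[...]`/`array[...]` strings it returns are descriptive labels only, not Zed type syntax.

```python
# Lookup table copied from the Equivalent Types table above.
ZEEK_TO_ZED = {
    "bool": "bool",
    "count": "uint64",
    "int": "int64",
    "double": "float64",
    "time": "time",
    "interval": "duration",
    "string": "string",
    "port": "uint16",
    "addr": "ip",
    "subnet": "net",
    "enum": "string",
}


def zed_type_for(zeek_type: str) -> str:
    """Return a descriptive Zed type label for one entry of a Zeek #types line.

    Hypothetical helper for illustration only.
    """
    # Container types wrap an inner type, e.g. set[string] or vector[addr].
    for zeek_container, zed_container in (("set", "set"), ("vector", "array")):
        prefix = zeek_container + "["
        if zeek_type.startswith(prefix) and zeek_type.endswith("]"):
            inner = zeek_type[len(prefix):-1]
            return f"{zed_container}[{zed_type_for(inner)}]"
    return ZEEK_TO_ZED[zeek_type]


# The #types line from the example log in this document.
types_line = (
    "bool count int double time interval string string port addr subnet "
    "enum set[string] vector[string] string count"
)
for zeek_type in types_line.split():
    print(f"{zeek_type:16} -> {zed_type_for(zeek_type)}")
```

The record type is not in the lookup because, as described in the record details, it never appears by name in a log's `#types` line.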
Example
The following example shows a TSV log that includes each Zeek data type, how
it's output as Super JSON by super, and then how it's written back out again
as a Zeek log. You may find it helpful to refer to this example when reading
the type-specific details.
Viewing the TSV log:
cat zeek_types.log
Output:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#fields my_bool my_count my_int my_double my_time my_interval my_printable_string my_bytes_string my_port my_addr my_subnet my_enum my_set my_vector my_record.name my_record.age
#types bool count int double time interval string string port addr subnet enum set[string] vector[string] string count
T 123 456 123.4560 1592502151.123456 123.456 smile😁smile \x09\x07\x04 80 127.0.0.1 10.0.0.0/8 tcp things,in,a,set order,is,important Jeanne 122
Reading the TSV log, outputting as Super JSON, and saving a copy:
super -Z zeek_types.log | tee zeek_types.jsup
Output:
{
my_bool: true,
my_count: 123 (uint64),
my_int: 456,
my_double: 123.456,
my_time: 2020-06-18T17:42:31.123456Z,
my_interval: 2m3.456s,
my_printable_string: "smile😁smile",
my_bytes_string: "\t\u0007\u0004",
my_port: 80 (port=uint16),
my_addr: 127.0.0.1,
my_subnet: 10.0.0.0/8,
my_enum: "tcp" (=zenum),
my_set: |[
"a",
"in",
"set",
"things"
]|,
my_vector: [
"order",
"is",
"important"
],
my_record: {
name: "Jeanne",
age: 122 (uint64)
}
}
Reading the saved Super JSON output and outputting as Zeek TSV:
super -f zeek zeek_types.jsup
Output:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#fields my_bool my_count my_int my_double my_time my_interval my_printable_string my_bytes_string my_port my_addr my_subnet my_enum my_set my_vector my_record.name my_record.age
#types bool count int double time interval string string port addr subnet enum set[string] vector[string] string count
T 123 456 123.456 1592502151.123456 123.456000 smile😁smile \x09\x07\x04 80 127.0.0.1 10.0.0.0/8 tcp a,in,set,things order,is,important Jeanne 122
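To see which fields changed textually during the round trip, the original and final data lines can be compared field by field. This is a small sketch for illustration only: it copies the two data lines from this example and splits on whitespace, which works here because none of the example values contain whitespace (a real Zeek TSV log is tab-separated).

```python
# Field names and data lines copied from the example logs above.
fields = (
    "my_bool my_count my_int my_double my_time my_interval "
    "my_printable_string my_bytes_string my_port my_addr my_subnet "
    "my_enum my_set my_vector my_record.name my_record.age"
).split()

original = (
    r"T 123 456 123.4560 1592502151.123456 123.456 smile😁smile "
    r"\x09\x07\x04 80 127.0.0.1 10.0.0.0/8 tcp things,in,a,set "
    r"order,is,important Jeanne 122"
).split()

round_trip = (
    r"T 123 456 123.456 1592502151.123456 123.456000 smile😁smile "
    r"\x09\x07\x04 80 127.0.0.1 10.0.0.0/8 tcp a,in,set,things "
    r"order,is,important Jeanne 122"
).split()

# Collect only the fields whose text differs between the two logs.
changed = {
    name: (before, after)
    for name, before, after in zip(fields, original, round_trip)
    if before != after
}

for name, (before, after) in changed.items():
    print(f"{name}: {before} -> {after}")
```

Only `my_double`, `my_interval`, and `my_set` differ, and in each case the text changes while the value does not: the digits after the decimal point are formatted differently and the set elements appear in a different order.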
Type-Specific Details
As zq acts as a reference implementation for SuperDB storage formats such as
Super JSON and ZNG, it's helpful to understand how it reads the following Zeek data
types into readable text equivalents in the Super JSON format, then writes them back
out again in the Zeek TSV log format. Other implementations of the Zed storage
formats (should they exist) may handle these differently.
Multiple Zeek types discussed below are represented via a type definition
that binds a type name to one of Zed's primitive types. The Zed type
definitions maintain the history of the field's original Zeek type name
such that zq may restore it if the field is later output in
Zeek TSV format. Knowledge of its original Zeek type may also enable special
operations in Zed that are unique to values known to have originated as a
specific Zeek type, though no such operations are currently implemented in
zq.
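The role of the type name can be modeled in a few lines. The sketch below is an illustrative model, not the Zed implementation: the `NamedType` class, the two lookup tables, and `zeek_type_for` are all hypothetical, while the names `port` and `zenum` are taken from the Super JSON example output above (`80 (port=uint16)` and `"tcp" (=zenum)`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedType:
    """A type name bound to an underlying primitive type (illustrative model)."""
    name: str
    underlying: str


# Named types as they appear in the Super JSON example output.
PORT = NamedType("port", "uint16")
ZENUM = NamedType("zenum", "string")

# Hypothetical lookups used when writing Zeek TSV.
ZEEK_NAME_FOR_NAMED_TYPE = {"port": "port", "zenum": "enum"}
ZEEK_NAME_FOR_PRIMITIVE = {"string": "string", "uint64": "count"}


def zeek_type_for(zed_type) -> str:
    """Pick the Zeek type to write in the #types header for a field."""
    if isinstance(zed_type, NamedType):
        # The type name records which Zeek type the value started as.
        return ZEEK_NAME_FOR_NAMED_TYPE[zed_type.name]
    return ZEEK_NAME_FOR_PRIMITIVE[zed_type]


print(zeek_type_for(ZENUM))     # enum
print(zeek_type_for("string"))  # string
print(zeek_type_for(PORT))      # port
```

Both `ZENUM` and a plain `string` hold string values. Only the type name distinguishes them, which is what allows the original Zeek type to be restored on output.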
double
As they do not affect accuracy, "trailing zero" decimal digits on Zeek double
values will not be preserved when they are formatted into a string, such as
via the -f jsup|zeek|table output options in zq (e.g., 123.4560 becomes
123.456).
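The same effect can be seen in any language that stores the value as a 64-bit float and then formats it again. The following minimal sketch uses Python's own float formatting as a stand-in. The formatting rules that zq applies are not reproduced here, and `format_double` is a hypothetical helper.

```python
def format_double(zeek_text: str) -> str:
    """Parse a Zeek double and format it back to text (illustrative only)."""
    value = float(zeek_text)  # the trailing zero is not part of the value
    return str(value)


original = "123.4560"
formatted = format_double(original)
print(formatted)  # 123.456

# The text differs, but both spellings parse to the same float64 value.
assert float(formatted) == float(original)
```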
enum
As they're encountered in common programming languages, enum variables
typically hold one of a set of predefined values. While this is
how Zeek's enum type behaves inside the Zeek scripting language,
when the enum type is output in a Zeek log, the log does not communicate
any such set of "allowed" values as they were originally defined. Therefore,
these values are represented with a type name bound to the Zed string
type. See the text above regarding type definitions
for more details.
port
The numeric values that appear in Zeek logs under this type are represented
in Zed with a type name of port bound to the uint16 type. See the text
above regarding type names for more details.
set
Because order within sets is not significant, no attempt is made to maintain
the order of set elements as they originally appeared in a Zeek log.
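A short sketch shows why two differently ordered log entries represent the same set value. The helper names are hypothetical. Writing the elements in sorted order matches the round-trip example in this document, but treating that as the general output rule is an assumption.

```python
def read_zeek_set(field: str, set_separator: str = ",",
                  empty_field: str = "(empty)") -> frozenset:
    """Parse a Zeek TSV set field into an unordered set (illustrative only)."""
    if field == empty_field:  # the #empty_field marker from the log header
        return frozenset()
    return frozenset(field.split(set_separator))


def write_zeek_set(values, set_separator: str = ",") -> str:
    """Write set elements in sorted order so the output is deterministic."""
    return set_separator.join(sorted(values))


as_logged = read_zeek_set("things,in,a,set")
after_round_trip = read_zeek_set("a,in,set,things")

print(as_logged == after_round_trip)  # True: same set, different text
print(write_zeek_set(as_logged))      # a,in,set,things
```

By contrast, the vector in the example keeps its original order (order,is,important) because arrays are ordered.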
string
Zeek's string data type is complicated by its ability to hold printable ASCII
and UTF-8 as well as arbitrary unprintable bytes represented as \x escapes.
Because such binary data may need to legitimately be captured (e.g. to record
the symptoms of DNS exfiltration), it's helpful that Zeek has a mechanism to
log it. Unfortunately, Zeek's use of the single string type for these
multiple uses leaves out important details about the intended interpretation
and presentation of the bytes that make up the value. For instance, one Zeek
string field may hold arbitrary network data that coincidentally sometimes
form byte sequences that could be interpreted as printable UTF-8, but they are
not intended to be read or presented as such. Meanwhile, another Zeek
string field may be populated such that it will only ever contain printable
UTF-8. These details are currently only captured within the Zeek source code
itself that defines how these values are generated.
Zed includes a primitive type called bytes that's suited to storing the
former "always binary" case and a string type for the latter "always
printable" case. However, Zeek logs do not currently communicate details that
would allow an implementation to know which Zeek string fields to store as
which of these two Zed data types. Instead, the Zed system does what the Zeek
system does when writing strings to JSON: any \x escapes used in Zeek TSV
strings are translated into valid Zed UTF-8 strings by escaping the backslash
before the x. In this way, you can still see binary-corrupted strings that are
generated by Zeek in the Zed data formats.
Unfortunately there is no way to distinguish whether a \x escape occurred
or whether that string pattern happened to occur in the original data. A nice
solution would be to convert Zeek strings that are valid UTF-8 strings into
Zed strings and convert invalid strings into a Zed bytes type, or we could
convert both of them into a Zed union of string and bytes. If you have
interest in a capability like this, please let us know and we can elevate
the priority.
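The "valid UTF-8 becomes a string, anything else becomes bytes" idea described above could look roughly like the following. This is a sketch of a capability that, per the text, is not currently implemented. The function names are hypothetical, and the sketch shares the ambiguity already noted: it cannot tell a genuine \x escape from the same four characters occurring literally in the data.

```python
import re

# Matches a Zeek-style hex escape such as \x09 in the encoded field text.
_HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def unescape_zeek(field: str) -> bytes:
    """Turn \\xNN escapes in a Zeek TSV string field back into raw bytes."""
    raw = field.encode("utf-8")
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def classify(field: str):
    """Return ("string", text) for valid UTF-8, otherwise ("bytes", data)."""
    data = unescape_zeek(field)
    try:
        return ("string", data.decode("utf-8"))
    except UnicodeDecodeError:
        return ("bytes", data)


print(classify("smile😁smile"))    # printable UTF-8 stays a string
print(classify(r"\x09\x07\x04"))   # control characters are still valid UTF-8
print(classify(r"\xff\xfe"))       # not valid UTF-8, so kept as bytes
```

Note that the `my_bytes_string` value from the example is classified as a string, because tab, bell, and end-of-transmission are valid (if unprintable) UTF-8. This illustrates why validity alone does not reveal how a field is intended to be interpreted.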
If Zeek were to provide an option to output logs directly in one or more of
Zed's richer storage formats, this would create an opportunity to
assign the appropriate Zed bytes or string type at the point of origin,
depending on what's known about how the field's value is intended to be
populated and used.
record
Zeek's record type is unique in that every Zeek log line effectively is a
record, with its schema defined via the #fields and #types directives in
the headers of each log file. The word "record" never appears explicitly in
the schema definition in Zeek logs.
Embedded records also subtly appear within Zeek log lines in the form of
dot-separated field names. A common example in Zeek is the id
record, which captures the source and destination IP addresses and ports for a
network connection as fields id.orig_h, id.orig_p, id.resp_h, and
id.resp_p. When reading such fields into their Zed equivalent, zq restores
the hierarchical nature of the record as it originally existed inside of Zeek
itself before it was output by its logging system. This enables operations in
Zed that refer to the record at a higher level but affect all values lower
down in the record hierarchy.
For instance, revisiting the data from our example, we can output all fields within
my_record using Zed's cut operator.
Command:
super -f zeek -c 'cut my_record' zeek_types.jsup
Output:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#fields my_record.name my_record.age
#types string count
Jeanne 122
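The conversion between dot-separated field names and nested records can be sketched as follows. This is an illustrative model rather than the zq implementation: `nest_fields` and `flatten_fields` are hypothetical helpers, and plain Python dictionaries stand in for Zed records.

```python
def nest_fields(flat: dict) -> dict:
    """Rebuild nested records from dot-separated Zeek field names."""
    nested = {}
    for dotted_name, value in flat.items():
        *parents, leaf = dotted_name.split(".")
        node = nested
        for parent in parents:
            # Create the enclosing record the first time it is seen.
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


def flatten_fields(nested: dict, prefix: str = "") -> dict:
    """Produce dot-separated field names again, as written to Zeek TSV."""
    flat = {}
    for name, value in nested.items():
        path = prefix + name
        if isinstance(value, dict):
            flat.update(flatten_fields(value, path + "."))
        else:
            flat[path] = value
    return flat


# Field names and values taken from the example log in this document.
row = {"my_bool": True, "my_record.name": "Jeanne", "my_record.age": 122}

record = nest_fields(row)
print(record["my_record"])  # {'name': 'Jeanne', 'age': 122}

# Selecting the parent record keeps every field beneath it, like the cut above.
print(flatten_fields({"my_record": record["my_record"]}))
```

Referring to `my_record` once selects both `my_record.name` and `my_record.age`, which is the higher-level addressing that the restored hierarchy makes possible.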