5 changes: 5 additions & 0 deletions di/sort/init.q
@@ -0,0 +1,5 @@
/ library for sorting and applying attributes to on-disk kdb+ tables

\l ::sort.q

export:([init;readcsv;sorttab])
178 changes: 178 additions & 0 deletions di/sort/sort.md
@@ -0,0 +1,178 @@
# di.sort

Module for sorting and applying attributes to on-disk kdb+ tables. It is driven by a **config table** that specifies, per table, which columns to sort by and which attributes to apply. You pass that table straight to `sorttab`; if you keep your config in a CSV file, `readcsv` reads one into the right shape for you. If a table has no explicit entry, `di.sort` falls back to a `default` row. Extracted from the `.sort` namespace in TorQ's `dbwriteutils.q`.

## Usage

```q
srt:use`di.sort

/ inject dependencies — log is required
log:use`di.log
logdep:`info`warn`error!(log.info;log.warn;log.error)
srt.init[enlist[`log]!enlist logdep]

/ build a config table directly ...
config:([] tabname:`trade`trade`default; att:`p``p; column:`sym`time`sym; sort:101b)
srt.sorttab[config; `trade; `:/hdb/2000.01.01/trade`:/hdb/2000.01.02/trade]

/ ... or read the config from a csv and pass it straight in
srt.sorttab[srt.readcsv `:config/sort.csv; `trade; `:/hdb/2000.01.01/trade]
```

`di.sort` holds no state of its own: every `sorttab` call takes the config it should use, so you can build that config however you like (by hand, from a query, or from a CSV) and reuse or vary it freely.

### Typical HDB sort loop

```q
srt:use`di.sort
log:use`di.log
logdep:`info`warn`error!(log.info;log.warn;log.error)
srt.init[enlist[`log]!enlist logdep]

config:srt.readcsv `:config/sort.csv

/ .Q.par[root;date;table] builds the on-disk partition path
hdb:`:/hdb
dates:2000.01.01 2000.01.02
tabs:`trade`quote
/ for each table, build its partition paths, then sort every table with the same config
pdirs:{[hdb;dates;t] .Q.par[hdb;;t] each dates}[hdb;dates] each tabs
srt.sorttab[config]'[tabs; pdirs]
```

## The config table

`di.sort` is configured with a table of this shape:

| Column | Type | Description |
|---|---|---|
| `tabname` | symbol | Table name, or `default` to apply to all unlisted tables |
| `att` | symbol | Attribute to apply after sort: `p` `s` `g` `u` or empty (`` ` ``) for none |
| `column` | symbol | Column to sort/attribute |
| `sort` | boolean | `1b` to use this column as a sort key; `0b` to apply an attribute only |

Build it any way you like — in code, from a query, or from a CSV via `readcsv`:

```q
([] tabname:`trade`trade`default; att:`p``p; column:`sym`time`sym; sort:101b)
```

Multiple rows for the same table are supported. All rows with `sort=1b` for a table form the compound sort key, in the order they appear. Attribute rows with `sort=0b` are applied independently after sorting.
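
As a rough illustration of how those rows split into a sort key and attribute requests, the sketch below queries a config table directly. The `config` value and the variable names are illustrative and not part of the `di.sort` API.

```q
/ illustrative config: trade sorts by sym then time, and parts sym
config:([] tabname:`trade`trade`default; att:`p``p; column:`sym`time`sym; sort:110b)

/ rows that apply to one table
sp:select from config where tabname=`trade

/ compound sort key: columns flagged sort=1b, in row order
sortcols:exec column from sp where sort

/ columns that request a real attribute (the empty symbol means none)
attrcols:select column,att from sp where att in `p`s`g`u
```

Here `sortcols` holds the two `trade` sort columns in row order, and `attrcols` holds only the `sym` row, since the `time` row has an empty `att`.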

**Attributes:**

| Value | Description |
|---|---|
| `p` | Parted — all rows with the same value are contiguous. Requires the column to be sorted first (`sort=1b`). |
| `s` | Sorted — values are in ascending order. Applied automatically by `xasc`; set explicitly here if wanted after the sort step. |
| `g` | Grouped — inverse index stored on disk. Suitable for low-to-medium cardinality unsorted columns. |
| `u` | Unique — all values are distinct. |
| `` ` `` (empty) | No attribute applied. Column may still participate in the sort if `sort=1b`. |
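
A small in-memory sketch of the parted attribute, to show why the column is sorted before `p` is applied. The vector `v` is made up for illustration; `di.sort` itself applies attributes to columns of a table on disk rather than to a variable.

```q
/ hypothetical column values, not yet grouped together
v:`b`a`b`a

/ reorder so equal values are contiguous (iasc gives the ascending sort order)
sorted:v iasc v

/ parted can now be applied; attr reports the attribute on a list
parted:`p#sorted
attr parted

/ applying `p to the unsorted vector signals an error, trapped here as a string
@[{`p#x}; v; {"could not apply p: ",x}]
```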

### sort.csv format (for `readcsv`)

A CSV consumed by `readcsv` must have exactly these four columns. They may appear in any order; the example uses the canonical order:

```
tabname,att,column,sort
trade,p,sym,1
trade,,time,0
quote,p,sym,1
default,p,sym,1
```

## API

### `init[deps]`

Wire injectable dependencies. Must be called before any other function.

| Key | Required | Type | Description |
|---|---|---|---|
| `` `log `` | yes | dict | Functions keyed `` `info`warn`error ``, each with signature `{[ctx;msg]}` |

Errors with prefix `di.sort:` if `deps` is not a dict, if `log` is missing, or if the log dict does not contain all three required keys.

```q
srt.init[enlist[`log]!enlist logdep]
```

---

### `readcsv[file]`

Read a config CSV and **return** it as a table. Does not store it — pass the result to `sorttab`. Use this only when your config lives in a CSV; a hand-built table goes straight to `sorttab`.

| Parameter | Type | Description |
|---|---|---|
| `file` | hsym (or symbol) | Path to the CSV. Coerced with `hsym`, so `` `:config/sort.csv `` and `` `config/sort.csv `` both work. |

The CSV must have exactly the four columns `tabname`, `att`, `column`, `sort` — in **any** order (the result is normalised to canonical column order). The header is validated as it is read: a missing, extra, or misnamed column raises a clear `di.sort:` error rather than silently mis-parsing or dropping data. Logs info messages while reading (the read start and the row count) and an error message on file-read failure (then rethrows). Attribute-value validation (e.g. an unknown `att`) happens later in `sorttab`.

```q
config:srt.readcsv `:config/sort.csv
srt.sorttab[config; `trade; dirs]

/ or in one line
srt.sorttab[srt.readcsv `:config/sort.csv; `trade; dirs]
```
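
To make the any-order behaviour concrete, here is a minimal sketch that maps a parse type to each header name and then reorders the result. It works on in-memory lines rather than a file, and the sample rows are invented for illustration.

```q
/ sample csv lines with the columns in a non-canonical order
lines:("sort,tabname,column,att";"1,trade,sym,p";"0,trade,time,";"1,default,sym,p")

/ read the header names, then choose a type per name: boolean for sort, symbol otherwise
hdr:`$"," vs first lines
types:{$[x=`sort;"B";"S"]} each hdr

/ parse with the header row supplying column names, then put columns in canonical order
t:`tabname`att`column`sort#(types;enlist ",") 0: lines
```

Because the type string is derived from the header rather than fixed, the same lines parse correctly whichever order the columns arrive in.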

---

### `sorttab[config;tabname;dirs]`

Sort and apply attributes to on-disk partitions for a single table, using the supplied config table.

| Parameter | Type | Description |
|---|---|---|
| `config` | table | A config table with columns `` `tabname`att`column`sort `` (see [The config table](#the-config-table)) |
| `tabname` | symbol | Table name |
| `dirs` | hsym, or list of hsyms | Partition directory (or directories) for that table |

`config` is validated first; `sorttab` errors (prefixed `di.sort:`) if it is not a table, has unknown or missing columns, has a non-boolean `sort` column, or has an `att` value that is neither empty nor one of `` `p`s`g`u ``. It also errors (prefixed `di.sort:`) if `tabname` is not a symbol.

Lookup order for sort parameters within `config`:
1. Rows where `tabname` matches the supplied table name
2. Rows where `` tabname = `default ``
3. If neither found — logs a warn and returns `()` without error
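
The lookup order above can be sketched as a small standalone function. The name `lookup` and the sample config are hypothetical; the module's own resolver also logs at each step, which is left out here.

```q
/ sample config with one explicit table and a default row
config:([] tabname:`trade`default; att:`p`p; column:`sym`sym; sort:11b)

/ hypothetical helper: a table's own rows win, then the default rows, else an empty table
lookup:{[config;tab]
  if[count r:select from config where tabname=tab; :r];
  if[count r:select from config where tabname=`default; :r];
  0#config}

lookup[config;`trade]      / rows for trade
lookup[config;`quote]      / falls back to the default row
lookup[1#config;`quote]    / no match and no default: empty table
```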

Each partition directory is processed independently: a failure on one partition is logged (as an error) and does not halt remaining partitions.

```q
/ single partition
srt.sorttab[config; `trade; enlist `:/hdb/2000.01.01/trade]

/ multiple partitions
srt.sorttab[config; `trade; `:/hdb/2000.01.01/trade`:/hdb/2000.01.02/trade]
```
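
The per-partition isolation described above relies on q's protected evaluation. The following is a minimal sketch of that pattern with a made-up worker function; `work`, `onfail` and `safe` are illustrative names, and the real module reports through the injected logger instead of printing.

```q
/ hypothetical worker that fails on negative input
work:{[x] if[x<0; '"negative input"]; 2*x}

/ trap handler: report the error and return () so the caller can carry on
onfail:{[x;e] -1 "failed on ",string[x],": ",e; ()}
safe:{[x] @[work; x; onfail[x]]}

/ the second item fails, but the first and third are still processed
res:safe each 1 -2 3
```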

## Log dependency contract

`di.sort` requires a log dependency dictionary with keys `` `info`warn`error ``, each a function with signature `{[ctx;msg]}`:

```q
`info`warn`error!({[ctx;msg] ...};{[ctx;msg] ...};{[ctx;msg] ...})
```

`di.log` satisfies this contract out of the box:

```q
log:use`di.log
logdep:`info`warn`error!(log.info;log.warn;log.error)
srt.init[enlist[`log]!enlist logdep]
```

You can supply any custom implementation with the same signatures.
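
As one possible custom implementation, the sketch below builds the three functions by projecting a single helper. The helper name `mklog` and its output format are assumptions for illustration only.

```q
/ hypothetical helper: print level, context and message to stdout
mklog:{[lvl;ctx;msg] -1 string[lvl]," [",string[ctx],"] ",msg;}

/ fix the level by projection, leaving a two-argument (ctx;msg) function for each key
logdep:`info`warn`error!(mklog[`info];mklog[`warn];mklog[`error])

/ each entry is then called the way the module calls its logger
logdep[`warn][`sorttab;"example message"]
```

A dict built this way has the same keys and call shape as the `di.log` version, so it can be passed to `srt.init` in the same manner.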

Context symbols used by `di.sort` in log calls:

| Context | Level | When |
|---|---|---|
| `` `readcsv `` | info | CSV read start and row count on successful read |
| `` `readcsv `` | error | File read failure (rethrown after logging) |
| `` `sorttab `` | info | Sort start, params lookup result, column list, sort completion |
| `` `sorttab `` | warn | Table has no matching params and no default row |
| `` `sorttab `` | error | `xasc` failure on a partition (non-fatal — remaining partitions continue) |
| `` `applyattr `` | info | Attribute applied to a column |
| `` `applyattr `` | error | Attribute application failure (non-fatal) |
152 changes: 152 additions & 0 deletions di/sort/sort.q
@@ -0,0 +1,152 @@
/ library for sorting and applying attributes to on-disk kdb+ tables

/ attributes that may legitimately be applied on disk (empty leaves a column unattributed)
validatts:``p`s`g`u;

init:{[deps]
/ wire the injectable log dependency so the module reports through the host's logger
/ deps: a dict with a `log key -> `info`warn`error!({[c;m]};{[c;m]};{[c;m]})
/ see di.log for a default implementation, or pass any matching dict
/ example: srt.init[enlist[`log]!enlist logdep]
if[99h<>type deps;
'"di.sort: deps must be a dict with a `log key; see di.log for a default logger"];
if[not `log in key deps;
'"di.sort: log dependency is required; pass `info`warn`error functions keyed on `log"];
if[99h<>type deps`log;
'"di.sort: log value must be a dict of `info`warn`error functions"];
if[not all (`info`warn`error) in key deps`log;
'"di.sort: log dict must have `info`warn`error keys; got: ",(", " sv string key deps`log)];
.z.m.log:deps`log;
};

readcsv:{[file]
/ convenience for the common case where config lives in a csv - returns it, does not store
/ pass the result to sorttab, e.g. srt.sorttab[srt.readcsv `:sort.csv;`trade;dirs]
/ the csv must have the columns tabname,att,column,sort (in any order)
file:hsym file;
t:parsecsv @[readfile; file; readerr[file]];
.z.m.log[`info][`readcsv;"read ",(string count t)," sort param row(s) from ",string file];
:t;
};

/ internal - protected file read; only the i/o so a genuine read failure gets the readerr message
readfile:{[file]
/ returns the raw csv lines; header validation and parsing happen in parsecsv
.z.m.log[`info][`readcsv;"reading sort params from ",string file];
:read0 file;
};


The format string "SSSB" in readfile parses all four CSV columns as S S S B (symbol symbol symbol boolean). The att column is a symbol, but the column column is also a symbol — that is fine. However, the att column will read the empty string "" as the symbol ` which is the intended no-attribute sentinel, so that part is correct. The real bug is that a missing or mismatched column count in the CSV will silently misalign the parse (e.g. tmp_sort_3col.csv has only 3 columns but "SSSB" will still attempt to read 4 columns, producing a type error or wrong data rather than a clear config-validation error). This is a correctness/robustness issue: 0: with a type-list longer than the actual column count causes a length error instead of the clear di.sort: prefixed error that checkconfig would produce, so the user gets a confusing low-level error rather than the documented one.

/ internal - log and rethrow a csv read failure
readerr:{[file;e]
/ build the message once, surface it under the readcsv context, then rethrow it to the caller
m:"failed to read ",string[file],": ",e;
.z.m.log[`error][`readcsv;m];
'm;
};

/ internal - validate the header and parse csv lines into a config table
parsecsv:{[lines]
/ map types by column name so the csv column order does not matter; reject any other shape
/ outside the readfile i/o trap so a bad header surfaces as a clear di.sort: error
if[0=count lines;
'"di.sort: csv has no header row"];
hdr:`$"," vs first lines;
if[not (asc hdr)~`att`column`sort`tabname;
'"di.sort: csv header must be exactly tabname,att,column,sort; got: ",", " sv string hdr];
types:{$[x=`sort;"B";"S"]} each hdr;
:`tabname`att`column`sort#(types;enlist",") 0: lines;
};

sorttab:{[config;tabname;dirs]


readerr constructs the rethrown error message by string-concatenating e directly: '"failed to read ",string[file],": ",e. In q, the error value e passed to a trap handler is a string (character vector). If e is not a string (e.g. it is a symbol or other type in some edge cases), this concatenation will fail with a type error, swallowing the original error. This is a low-probability edge case but could mask real failures.

/ sort and apply attributes to the on-disk partition directories for one table
/ config: a sort-config table (build it directly or via readcsv); tabname: symbol; dirs: hsym or list of hsyms
/ example: srt.sorttab[srt.readcsv `:sort.csv;`trade;`:/hdb/2024.01.01/trade]
checkconfig config;
if[not -11h=type tabname;
'"di.sort: tabname must be a symbol, got type ",string type tabname];
st:string tabname;
.z.m.log[`info][`sorttab;"sorting the ",st," table"];
sp:getsortparams[config;tabname;st];
if[not count sp; :()];
sortdir[sp] each distinct (),dirs;


`getsortparams` is called with `[config;tabname;st]`, where `tabname` is the raw symbol and `st` is its string form. Inside `getsortparams` the parameter is named `t`, and `select from config where tabname=t` compares the config `tabname` column against the local `t`. In q, inside `select … where tabname=t`, `tabname` refers to the column and `t` refers to the local variable, so this works correctly. `config` is passed as the first parameter and used as the table in the select, which is fine in isolation. The real risk is that `tabname` is both a column name in `config` and a column name in the result: the select uses `t` (the local symbol parameter), but if `t` were ever a table it would be silently misinterpreted. This is a latent naming-collision risk rather than an immediate bug, but worth noting. More concretely, `getsortparams` returns `0#config` (the empty config table) on the no-match path, and `sorttab` checks `if[not count sp; :()]`. The `count` of a 0-row table is 0, so this returns `()` correctly. No bug here, but the `0#config` shape is important and only works because `count` of an empty table is 0.

.z.m.log[`info][`sorttab;"finished sorting the ",st," table"];
};


`sorttab` does not validate that `tabname` is a symbol before passing it to `getsortparams`. The test CSV (`test.csv`, line ~72) includes a test, ``srt.sorttab[cfg;42;enlist`:/]``, that is expected to fail, but `checkconfig` only validates `config`; it does not validate `tabname`. The call `getsortparams[config;42;st]`, where `st:string 42` is `"42"`, will attempt `select from config where tabname=42`, which in q produces a type error (comparing a symbol column to a long). This signals an error, so the test passes, but the error message is a raw q type error rather than the documented `di.sort:`-prefixed error. The API contract says errors are prefixed `di.sort:`, and this case breaks that contract.


/ internal - validate a sort-config table, signalling a clear error if it is malformed
checkconfig:{[t]
/ guards every sorttab call so a hand-built or csv-derived table is rejected early if wrong
if[98h<>type t;
'"di.sort: config must be a table with columns `tabname`att`column`sort"];
c:cols t;
badcols:c where not c in `tabname`att`column`sort;
if[count badcols;
'"di.sort: unrecognised config column(s): ",", " sv string badcols];
missingcols:(`tabname`att`column`sort) where not (`tabname`att`column`sort) in c;
if[count missingcols;
'"di.sort: missing required config column(s): ",", " sv string missingcols];
if[not 1h=type t`sort;
'"di.sort: the sort column must be boolean"];
badatts:at where not (at:distinct t`att) in validatts;
if[count badatts;
'"di.sort: unrecognised attribute(s) in att column: ",", " sv string badatts];
};


`checkconfig` checks for unrecognised columns with ``badcols:c where not c in `tabname`att`column`sort`` and then separately checks for missing columns. A table with all four required columns plus extra columns will trigger the `badcols` error, which is intentional. A table with only a subset of the four columns and no extra columns will pass the `badcols` check and fail only at `missingcols`. At first the `badcols` check itself looked like the bug, because it flags any column not in the allowed set. However, a config table with exactly `` `tabname`att`column`sort `` passes, and one with those four in a different order also passes because `in` is order-independent. This is correct. No bug here on reflection; withdrawing this point.


/ internal - log a sorttab message then return the resolved rows
logreturn:{[lvl;msg;rows]
/ keeps each branch body in getsortparams to a single statement
.z.m.log[lvl][`sorttab;msg];
:rows;
};

/ internal - resolve which config rows apply to a table
getsortparams:{[config;tab;st]
/ tab is the table-name symbol (NOT a table); named to avoid clashing with the tabname column
/ a table uses its own rows; unlisted tables fall back to the default row, else are skipped
if[count tabsp:select from config where tabname=tab;
:logreturn[`info;"sort parameters have been retrieved for: ",st;tabsp]];
if[count defsp:select from config where tabname=`default;
:logreturn[`info;"no sort parameters have been specified for: ",st,". using default parameters";defsp]];
:logreturn[`warn;"no sort parameters have been found for: ",st,". the table will not be sorted";0#config];
};

/ internal - log a sort failure without rethrowing so remaining partitions still run
sorterr:{[sc;dl;e]
/ a single partition failure should not halt the whole run
.z.m.log[`error][`sorttab;"failed to sort ",string[dl]," by these columns: ",(", " sv string sc),". the error was: ",e];
:();
};

/ internal - sort one partition directory by the given columns
sortcolumns:{[dloc;sortcols]
/ split out of sortdir so the conditional body there stays a single statement
.z.m.log[`info][`sorttab;"sorting ",string[dloc]," by these columns: ",", " sv string sortcols];
.[xasc;(sortcols;dloc);
sorterr[sortcols;dloc]];
};

/ internal - sort columns and apply attributes for a single on-disk partition directory
sortdir:{[sp;dloc]


In `sortdir`, the `attrcols` filter is `select column,att from sp where not null att`, but it does not filter out the empty-symbol sentinel (which represents "no attribute"). In q, `null` of a symbol returns `0b` for the empty symbol (the empty symbol is not null). So rows whose `att` is the empty symbol (the no-attribute sentinel) will pass the `not null att` filter and be included in `attrcols`, causing `applyattr` to be called with an empty `att`. `applyattr` does guard against this with `if[null att; :()]`, but `null` of the empty symbol is `0b` in q, so that guard also fails to catch the empty symbol. The ``@[{@[x;y;z#]};(dloc;colname;`);attrerr[…]]`` call will then attempt to apply `` `# `` to the column on disk, which will either be a no-op or an error. The `validatts` list includes the empty symbol as a valid `att` value meaning "none", but it should be explicitly excluded from the `attrcols` selection in `sortdir` using ``where (not null att), att<>` `` or, equivalently, ``where att in `p`s`g`u``.

/ sort by the columns flagged sort=1b, then attribute the columns that request one
sortcols:exec column from sp where sort, not null column;
if[count sortcols; sortcolumns[dloc;sortcols]];
attrcols:select column,att from sp where att in `p`s`g`u;
if[count attrcols; applyattr[dloc;;]'[attrcols`column;attrcols`att]];
};

/ internal - log an attribute application failure without rethrowing
attrerr:{[dl;cn;at;e]
/ logs failure and continues so other columns and partitions still get processed
.z.m.log[`error][`applyattr;"unable to apply ",string[at]," attr to the ",string[cn]," column in ",string[dl],". the error was: ",e];
:();
};

/ internal - apply a single attribute to a specific column in an on-disk partition
applyattr:{[dloc;colname;att]


In `applyattr`, the null-att guard `if[null att; :()]` uses `null`, which in q returns `0b` for the empty symbol, so passing an empty `att` bypasses this guard. The comment says "sortdir only passes non-null atts", but as noted above, `not null` of the empty symbol is `1b`, so `sortdir` will pass the empty symbol through. The guard here should be ``if[(null att) or att=`; :()]`` or simply ``if[not att in `p`s`g`u; :()]`` to be safe.

/ skip anything that is not a real attribute - covers the empty none-sentinel and any bad value
/ sortdir already filters to valid atts; this guards a direct call to applyattr
if[not att in `p`s`g`u; :()];
.z.m.log[`info][`applyattr;"applying ",string[att]," attr to the ",string[colname]," column in ",string dloc];
.[{@[x;y;z#]};
(dloc;colname;att);
attrerr[dloc;colname;att]];
};