Feature Sort #102
base: main
Changes from all commits
c7f8ed0
c5183f4
44d3e7e
ff64401
094d291
4c4ece8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| / library for sorting and applying attributes to on-disk kdb+ tables | ||
|
|
||
| \l ::sort.q | ||
|
|
||
| export:([init;readcsv;sorttab]) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,178 @@ | ||
| # di.sort | ||
|
|
||
| Module for sorting and applying attributes to on-disk kdb+ tables. Driven by a **config table** that specifies which columns to sort by and which attributes to apply per table. You pass that table straight to `sorttab`; if you keep your config in a CSV file, `readcsv` reads one into the right shape for you. If a table has no explicit entry, `di.sort` falls back to a `default` row. Extracted from the `.sort` namespace in TorQ's `dbwriteutils.q`. | ||
|
|
||
| ## Usage | ||
|
|
||
| ```q | ||
| srt:use`di.sort | ||
|
|
||
| / inject dependencies — log is required | ||
| log:use`di.log | ||
| logdep:`info`warn`error!(log.info;log.warn;log.error) | ||
| srt.init[enlist[`log]!enlist logdep] | ||
|
|
||
| / build a config table directly ... | ||
| config:([] tabname:`trade`trade`default; att:`p``p; column:`sym`time`sym; sort:101b) | ||
| srt.sorttab[config; `trade; `:/hdb/2000.01.01/trade`:/hdb/2000.01.02/trade] | ||
|
|
||
| / ... or read the config from a csv and pass it straight in | ||
| srt.sorttab[srt.readcsv `:config/sort.csv; `trade; `:/hdb/2000.01.01/trade] | ||
| ``` | ||
|
|
||
| `di.sort` holds no state of its own: every `sorttab` call takes the config it should use, so you can build that config however you like (by hand, from a query, or from a CSV) and reuse or vary it freely. | ||
|
|
||
| ### Typical HDB sort loop | ||
|
|
||
| ```q | ||
| srt:use`di.sort | ||
| log:use`di.log | ||
| logdep:`info`warn`error!(log.info;log.warn;log.error) | ||
| srt.init[enlist[`log]!enlist logdep] | ||
|
|
||
| config:srt.readcsv `:config/sort.csv | ||
|
|
||
| / .Q.par[root;date;table] builds the on-disk partition path | ||
| hdb:`:/hdb | ||
| dates:2000.01.01 2000.01.02 | ||
| tabs:`trade`quote | ||
| / for each table, build its partition paths, then sort every table with the same config | ||
| pdirs:{[hdb;dates;t] .Q.par[hdb;;t] each dates}[hdb;dates] each tabs | ||
| srt.sorttab[config]'[tabs; pdirs] | ||
| ``` | ||
|
|
||
| ## The config table | ||
|
|
||
| `di.sort` is configured with a table of this shape: | ||
|
|
||
| | Column | Type | Description | | ||
| |---|---|---| | ||
| | `tabname` | symbol | Table name, or `default` to apply to all unlisted tables | | ||
| | `att` | symbol | Attribute to apply after sort: `p` `s` `g` `u` or empty (`` ` ``) for none | | ||
| | `column` | symbol | Column to sort/attribute | | ||
| | `sort` | boolean | `1b` to use this column as a sort key; `0b` to apply an attribute only | | ||
|
|
||
| Build it any way you like — in code, from a query, or from a CSV via `readcsv`: | ||
|
|
||
| ```q | ||
| ([] tabname:`trade`trade`default; att:`p``p; column:`sym`time`sym; sort:101b) | ||
| ``` | ||
|
|
||
| Multiple rows for the same table are supported. All rows with `sort=1b` for a table form the compound sort key, in the order they appear. Attribute rows with `sort=0b` are applied independently after sorting. | ||
|
|
||
| **Attributes:** | ||
|
|
||
| | Value | Description | | ||
| |---|---| | ||
| | `p` | Parted — all rows with the same value are contiguous. Requires the column to be sorted first (`sort=1b`). | | ||
| | `s` | Sorted — values are in ascending order. Applied automatically by `xasc`; set explicitly here if wanted after the sort step. | | ||
| | `g` | Grouped — inverse index stored on disk. Suitable for low-to-medium cardinality unsorted columns. | | ||
| | `u` | Unique — all values are distinct. | | ||
| | ` ` (empty) | No attribute applied. Column may still participate in the sort if `sort=1b`. | | ||
|
|
||
| ### sort.csv format (for `readcsv`) | ||
|
|
||
| A CSV consumed by `readcsv` must have exactly these four columns, in any order (the example below uses the canonical order): | ||
|
|
||
| ``` | ||
| tabname,att,column,sort | ||
| trade,p,sym,1 | ||
| trade,,time,0 | ||
| quote,p,sym,1 | ||
| default,p,sym,1 | ||
| ``` | ||
|
|
||
| ## API | ||
|
|
||
| ### `init[deps]` | ||
|
|
||
| Wire injectable dependencies. Must be called before any other function. | ||
|
|
||
| | Key | Required | Type | Description | | ||
| |---|---|---|---| | ||
| | `` `log `` | yes | dict | Functions keyed `` `info`warn`error ``, each with signature `{[ctx;msg]}` | | ||
|
|
||
| Errors with prefix `di.sort:` if `deps` is not a dict, if `log` is missing, if the `log` value is not itself a dict, or if the log dict does not contain all three required keys. | ||
|
|
||
| ```q | ||
| srt.init[enlist[`log]!enlist logdep] | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ### `readcsv[file]` | ||
|
|
||
| Read a config CSV and **return** it as a table. Does not store it — pass the result to `sorttab`. Use this only when your config lives in a CSV; a hand-built table goes straight to `sorttab`. | ||
|
|
||
| | Parameter | Type | Description | | ||
| |---|---|---| | ||
| | `file` | hsym (or symbol) | Path to the CSV. Coerced with `hsym`, so `` `:config/sort.csv `` and `` `config/sort.csv `` both work. | | ||
|
|
||
| The CSV must have exactly the four columns `tabname`, `att`, `column`, `sort` — in **any** order (the result is normalised to canonical column order). The header is validated as it is read: a missing, extra, or misnamed column raises a clear `di.sort:` error rather than silently mis-parsing or dropping data. Logs info messages while reading (the read start and the row count) and an error message on file-read failure (then rethrows). Attribute-value validation (e.g. an unknown `att`) happens later in `sorttab`. | ||
|
|
||
| ```q | ||
| config:srt.readcsv `:config/sort.csv | ||
| srt.sorttab[config; `trade; dirs] | ||
|
|
||
| / or in one line | ||
| srt.sorttab[srt.readcsv `:config/sort.csv; `trade; dirs] | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ### `sorttab[config;tabname;dirs]` | ||
|
|
||
| Sort and apply attributes to on-disk partitions for a single table, using the supplied config table. | ||
|
|
||
| | Parameter | Type | Description | | ||
| |---|---|---| | ||
| | `config` | table | A config table with columns `` `tabname`att`column`sort `` (see [The config table](#the-config-table)) | | ||
| | `tabname` | symbol | Table name | | ||
| | `dirs` | hsym, or list of hsyms | Partition directory (or directories) for that table | | ||
|
|
||
| `config` is validated first; `sorttab` errors (prefixed `di.sort:`) if it is not a table, has unknown or missing columns, has a non-boolean `sort` column, or has an `att` value outside `` ` `p`s`g`u ``. It also errors (prefixed `di.sort:`) if `tabname` is not a symbol. | ||
|
|
||
| Lookup order for sort parameters within `config`: | ||
| 1. Rows where `tabname` matches the supplied table name | ||
| 2. Rows where `tabname` is `` `default `` | ||
| 3. If neither is found, logs a warning and returns `()` without error | ||
|
|
||
| Each partition directory is processed independently: a failure on one partition is logged (as an error) and does not halt remaining partitions. | ||
|
|
||
| ```q | ||
| / single partition | ||
| srt.sorttab[config; `trade; enlist `:/hdb/2000.01.01/trade] | ||
|
|
||
| / multiple partitions | ||
| srt.sorttab[config; `trade; `:/hdb/2000.01.01/trade`:/hdb/2000.01.02/trade] | ||
| ``` | ||
|
|
||
| ## Log dependency contract | ||
|
|
||
| `di.sort` requires a log dependency dictionary with keys `` `info`warn`error ``, each a function with signature `{[ctx;msg]}`: | ||
|
|
||
| ```q | ||
| `info`warn`error!({[ctx;msg] ...};{[ctx;msg] ...};{[ctx;msg] ...}) | ||
| ``` | ||
|
|
||
| `di.log` satisfies this contract out of the box: | ||
|
|
||
| ```q | ||
| log:use`di.log | ||
| logdep:`info`warn`error!(log.info;log.warn;log.error) | ||
| srt.init[enlist[`log]!enlist logdep] | ||
| ``` | ||
|
|
||
| You can supply any custom implementation with the same signatures. | ||
|
|
||
| Context symbols used by `di.sort` in log calls: | ||
|
|
||
| | Context | Level | When | | ||
| |---|---|---| | ||
| | `` `readcsv `` | info | CSV read start and row count on successful read | | ||
| | `` `readcsv `` | error | File read failure (rethrown after logging) | | ||
| | `` `sorttab `` | info | Sort start, params lookup result, column list, sort completion | | ||
| | `` `sorttab `` | warn | Table has no matching params and no default row | | ||
| | `` `sorttab `` | error | `xasc` failure on a partition (non-fatal — remaining partitions continue) | | ||
| | `` `applyattr `` | info | Attribute applied to a column | | ||
| | `` `applyattr `` | error | Attribute application failure (non-fatal) | |
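
The lookup order the README describes for `sorttab` (the table's own rows, then the `default` row, then nothing) can be tried in isolation. This is only a minimal sketch: `lookup` is a hypothetical stand-in for the module's internal resolution, it does no logging and touches no disk, and the config reuses the example table from the README.

```q
/ example config from the README: two trade rows plus a default row
config:([] tabname:`trade`trade`default; att:`p``p; column:`sym`time`sym; sort:101b)

/ hypothetical helper, not part of di.sort: resolve the rows that apply to one table
lookup:{[config;tab]
  own:select from config where tabname=tab;
  if[count own; :own];
  / no explicit rows: fall back to default; the result is empty if there is no default either
  select from config where tabname=`default}

tradeRows:lookup[config;`trade]                                  / the table's own rows
quoteRows:lookup[config;`quote]                                  / falls back to the default row
noRows:lookup[delete from config where tabname=`default;`quote]  / nothing to apply

/ rows flagged sort=1b supply the sort key, in the order they appear
sortcols:exec column from tradeRows where sort
```

Naming the parameter `tab` rather than `tabname` keeps it from colliding with the `tabname` column inside the `select`.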
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,152 @@ | ||
| / library for sorting and applying attributes to on-disk kdb+ tables | ||
|
|
||
| / attributes that may legitimately be applied on disk (empty leaves a column unattributed) | ||
| validatts:``p`s`g`u; | ||
|
|
||
| init:{[deps] | ||
| / wire the injectable log dependency so the module reports through the host's logger | ||
| / deps: a dict with a `log key -> `info`warn`error!({[c;m]};{[c;m]};{[c;m]}) | ||
| / see di.log for a default implementation, or pass any matching dict | ||
| / example: srt.init[enlist[`log]!enlist logdep] | ||
| if[99h<>type deps; | ||
| '"di.sort: deps must be a dict with a `log key; see di.log for a default logger"]; | ||
| if[not `log in key deps; | ||
| '"di.sort: log dependency is required; pass `info`warn`error functions keyed on `log"]; | ||
| if[99h<>type deps`log; | ||
| '"di.sort: log value must be a dict of `info`warn`error functions"]; | ||
| if[not all (`info`warn`error) in key deps`log; | ||
| '"di.sort: log dict must have `info`warn`error keys; got: ",(", " sv string key deps`log)]; | ||
| .z.m.log:deps`log; | ||
| }; | ||
|
|
||
| readcsv:{[file] | ||
| / convenience for the common case where config lives in a csv - returns it, does not store | ||
| / pass the result to sorttab, e.g. srt.sorttab[srt.readcsv `:sort.csv;`trade;dirs] | ||
| / the csv must have the columns tabname,att,column,sort (in any order) | ||
| file:hsym file; | ||
| t:parsecsv @[readfile; file; readerr[file]]; | ||
| .z.m.log[`info][`readcsv;"read ",(string count t)," sort param row(s) from ",string file]; | ||
| :t; | ||
| }; | ||
|
|
||
| / internal - protected file read; only the i/o so a genuine read failure gets the readerr message | ||
| readfile:{[file] | ||
| / returns the raw csv lines; header validation and parsing happen in parsecsv | ||
| .z.m.log[`info][`readcsv;"reading sort params from ",string file]; | ||
| :read0 file; | ||
| }; | ||
|
|
||
| / internal - log and rethrow a csv read failure | ||
| readerr:{[file;e] | ||
| / build the message once, surface it under the readcsv context, then rethrow it to the caller | ||
| m:"failed to read ",string[file],": ",e; | ||
| .z.m.log[`error][`readcsv;m]; | ||
| 'm; | ||
| }; | ||
|
|
||
| / internal - validate the header and parse csv lines into a config table | ||
| parsecsv:{[lines] | ||
| / map types by column name so the csv column order does not matter; reject any other shape | ||
| / outside the readfile i/o trap so a bad header surfaces as a clear di.sort: error | ||
| if[0=count lines; | ||
| '"di.sort: csv has no header row"]; | ||
| hdr:`$"," vs first lines; | ||
| if[not (asc hdr)~`att`column`sort`tabname; | ||
| '"di.sort: csv header must be exactly tabname,att,column,sort; got: ",", " sv string hdr]; | ||
| types:{$[x=`sort;"B";"S"]} each hdr; | ||
| :`tabname`att`column`sort#(types;enlist",") 0: lines; | ||
| }; | ||
|
|
||
| sorttab:{[config;tabname;dirs] | ||
|
|
||
| / sort and apply attributes to the on-disk partition directories for one table | ||
| / config: a sort-config table (build it directly or via readcsv); tabname: symbol; dirs: hsym or list of hsyms | ||
| / example: srt.sorttab[srt.readcsv `:sort.csv;`trade;`:/hdb/2024.01.01/trade] | ||
| checkconfig config; | ||
| if[not -11h=type tabname; | ||
| '"di.sort: tabname must be a symbol, got type ",string type tabname]; | ||
| st:string tabname; | ||
| .z.m.log[`info][`sorttab;"sorting the ",st," table"]; | ||
| sp:getsortparams[config;tabname;st]; | ||
| if[not count sp; :()]; | ||
| sortdir[sp] each distinct (),dirs; | ||
|
|
||
| .z.m.log[`info][`sorttab;"finished sorting the ",st," table"]; | ||
| }; | ||
|
|
||
|
|
||
| / internal - validate a sort-config table, signalling a clear error if it is malformed | ||
| checkconfig:{[t] | ||
| / guards every sorttab call so a hand-built or csv-derived table is rejected early if wrong | ||
| if[98h<>type t; | ||
| '"di.sort: config must be a table with columns `tabname`att`column`sort"]; | ||
| c:cols t; | ||
| badcols:c where not c in `tabname`att`column`sort; | ||
| if[count badcols; | ||
| '"di.sort: unrecognised config column(s): ",", " sv string badcols]; | ||
| missingcols:(`tabname`att`column`sort) where not (`tabname`att`column`sort) in c; | ||
| if[count missingcols; | ||
| '"di.sort: missing required config column(s): ",", " sv string missingcols]; | ||
| if[not 1h=type t`sort; | ||
| '"di.sort: the sort column must be boolean"]; | ||
| badatts:at where not (at:distinct t`att) in validatts; | ||
| if[count badatts; | ||
| '"di.sort: unrecognised attribute(s) in att column: ",", " sv string badatts]; | ||
| }; | ||
|
|
||
|
|
||
| / internal - log a sorttab message then return the resolved rows | ||
| logreturn:{[lvl;msg;rows] | ||
| / keeps each branch body in getsortparams to a single statement | ||
| .z.m.log[lvl][`sorttab;msg]; | ||
| :rows; | ||
| }; | ||
|
|
||
| / internal - resolve which config rows apply to a table | ||
| getsortparams:{[config;tab;st] | ||
| / tab is the table-name symbol (NOT a table); named to avoid clashing with the tabname column | ||
| / a table uses its own rows; unlisted tables fall back to the default row, else are skipped | ||
| if[count tabsp:select from config where tabname=tab; | ||
| :logreturn[`info;"sort parameters have been retrieved for: ",st;tabsp]]; | ||
| if[count defsp:select from config where tabname=`default; | ||
| :logreturn[`info;"no sort parameters have been specified for: ",st,". using default parameters";defsp]]; | ||
| :logreturn[`warn;"no sort parameters have been found for: ",st,". the table will not be sorted";0#config]; | ||
| }; | ||
|
|
||
| / internal - log a sort failure without rethrowing so remaining partitions still run | ||
| sorterr:{[sc;dl;e] | ||
| / a single partition failure should not halt the whole run | ||
| .z.m.log[`error][`sorttab;"failed to sort ",string[dl]," by these columns: ",(", " sv string sc),". the error was: ",e]; | ||
| :(); | ||
| }; | ||
|
|
||
| / internal - sort one partition directory by the given columns | ||
| sortcolumns:{[dloc;sortcols] | ||
| / split out of sortdir so the conditional body there stays a single statement | ||
| .z.m.log[`info][`sorttab;"sorting ",string[dloc]," by these columns: ",", " sv string sortcols]; | ||
| .[xasc;(sortcols;dloc); | ||
| sorterr[sortcols;dloc]]; | ||
| }; | ||
|
|
||
| / internal - sort columns and apply attributes for a single on-disk partition directory | ||
| sortdir:{[sp;dloc] | ||
|
|
||
| / sort by the columns flagged sort=1b, then attribute the columns that request one | ||
| sortcols:exec column from sp where sort, not null column; | ||
| if[count sortcols; sortcolumns[dloc;sortcols]]; | ||
| attrcols:select column,att from sp where att in `p`s`g`u; | ||
| if[count attrcols; applyattr[dloc;;]'[attrcols`column;attrcols`att]]; | ||
| }; | ||
|
|
||
| / internal - log an attribute application failure without rethrowing | ||
| attrerr:{[dl;cn;at;e] | ||
| / logs failure and continues so other columns and partitions still get processed | ||
| .z.m.log[`error][`applyattr;"unable to apply ",string[at]," attr to the ",string[cn]," column in ",string[dl],". the error was: ",e]; | ||
| :(); | ||
| }; | ||
|
|
||
| / internal - apply a single attribute to a specific column in an on-disk partition | ||
| applyattr:{[dloc;colname;att] | ||
|
|
||
| / skip anything that is not a real attribute - covers the empty none-sentinel and any bad value | ||
| / sortdir already filters to valid atts; this guards a direct call to applyattr | ||
| if[not att in `p`s`g`u; :()]; | ||
| .z.m.log[`info][`applyattr;"applying ",string[att]," attr to the ",string[colname]," column in ",string dloc]; | ||
| .[{@[x;y;z#]}; | ||
| (dloc;colname;att); | ||
| attrerr[dloc;colname;att]]; | ||
| }; | ||
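
The header check in `parsecsv` above can be exercised on in-memory lines, without a file or the injected logger. The sketch below is an illustration under stated assumptions rather than the module's code: `parselines`, `expected`, `good`, `short` and `dup` are hypothetical names, and the comparison deliberately omits `distinct`, on the assumption that a duplicated column should be rejected as well.

```q
/ canonical column order for the config table
expected:`tabname`att`column`sort

/ hypothetical helper: validate the header, then parse each column by name
parselines:{[lines]
  if[0=count lines; '"di.sort: csv has no header row"];
  hdr:`$"," vs first lines;
  / compare the sorted header with the sorted expected names; no distinct, so duplicates fail too
  if[not (asc hdr)~asc expected;
    '"di.sort: csv header must be exactly tabname,att,column,sort; got: ",", " sv string hdr];
  / sort is boolean, every other column is a symbol, whatever order the csv uses
  types:{$[x=`sort;"B";"S"]} each hdr;
  expected#(types;enlist",") 0: lines}

good:("att,tabname,sort,column";"p,trade,1,sym";",trade,0,time")
short:("tabname,att,column";"trade,p,sym")
dup:("tabname,att,column,sort,sort";"trade,p,sym,1,1")

t:parselines good                   / columns come back in canonical order
shortErr:@[parselines;short;{x}]    / trapped error string for the 3-column header
dupErr:@[parselines;dup;{x}]        / trapped error string for the duplicated column
```

Because the type string is built from the header itself, its length always matches the number of columns handed to `0:`, so a malformed file fails at the header check instead of inside the parse.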
The format string `"SSSB"` in `readfile` parses all four CSV columns as `S S S B` (symbol, symbol, symbol, boolean). The `att` column is a symbol, but the `column` column is also a symbol, which is fine. However, the `att` column will read the empty string `""` as the null symbol `` ` ``, which is the intended no-attribute sentinel, so that part is correct. The real bug is that a missing or mismatched column count in the CSV will silently misalign the parse (e.g. `tmp_sort_3col.csv` has only 3 columns but `"SSSB"` will still attempt to read 4 columns, producing a type error or wrong data rather than a clear config-validation error). This is a correctness/robustness issue: `0:` with a type list longer than the actual column count causes a length error instead of the clear `di.sort:`-prefixed error that `checkconfig` would produce, so the user gets a confusing low-level error rather than the documented one.