Feature Sort#102
Conversation
…est.csv. working on ensuring code is robust
…rect sorting of on disk data
jonnypress
left a comment
There was a problem hiding this comment.
I think we need to modify the interface a little. We should be able to pass in a table for the sort params, which should be possible to read from a csv file, but isn’t required to be in a csv file. Does it do that? Don’t be afraid to rethink the interface to make it more user friendly - it looks quite complex at the minute.
…nctionality. Reduced number of public functions and ordering issues to reduce complexity
| .z.m.log[`info][`readcsv;"reading sort params from ",string file]; | ||
| :("SSSB";enlist",") 0: file; | ||
| }; | ||
|
|
There was a problem hiding this comment.
The format string "SSSB" in readfile parses all four CSV columns as S S S B (symbol symbol symbol boolean). The att column is a symbol, but the column column is also a symbol — that is fine. However, the att column will read the empty string "" as the symbol ` which is the intended no-attribute sentinel, so that part is correct. The real bug is that a missing or mismatched column count in the CSV will silently misalign the parse (e.g. tmp_sort_3col.csv has only 3 columns but "SSSB" will still attempt to read 4 columns, producing a type error or wrong data rather than a clear config-validation error). This is a correctness/robustness issue: 0: with a type-list longer than the actual column count causes a length error instead of the clear di.sort: prefixed error that checkconfig would produce, so the user gets a confusing low-level error rather than the documented one.
| .z.m.log[`info][`sorttab;"sorting the ",st," table"]; | ||
| sp:getsortparams[config;tabname;st]; | ||
| if[not count sp; :()]; | ||
| sortdir[sp] each distinct (),dirs; |
There was a problem hiding this comment.
getsortparams is called with [config;tabname;st] where tabname is the raw symbol and st is its string form. Inside getsortparams the parameter is named t, and select from config where tabname=t compares the config tabname column against the local t — but the local variable t shadows the column name tabname only if q resolves the column first. In q, inside select … where tabname=t, tabname refers to the column and t refers to the local variable; this works correctly. However, the variable name t is also the name of the function parameter for the table name, and the first line of the function body does select from config where tabname=t — config is both the table and a parameter name used elsewhere: this is fine in isolation, but within getsortparams, config is passed as the first parameter and is used as a table in the select. The real risk: tabname is both a column name in config AND a column name in the result. The select from config where tabname=t statement uses t (the local symbol parameter), but if t were ever a table it would be silently misinterpreted. This is a latent naming-collision risk rather than an immediate bug, but worth noting. More concretely: getsortparams returns 0#config (the empty config table) on the no-match path, but sorttab checks if[not count sp; :()] — count of a 0-row table is 0, so this returns () correctly. No bug here, but the 0#config shape is important and only works because count of an empty table is 0.
| badatts:at where not (at:distinct t`att) in validatts; | ||
| if[count badatts; | ||
| '"di.sort: unrecognised attribute(s) in att column: ",", " sv string badatts]; | ||
| }; |
There was a problem hiding this comment.
checkconfig checks for unrecognised columns with badcols:c where not c in \tabname`att`column`sortand then separately checks for missing columns. However, a table with all four required columns PLUS extra columns will trigger thebadcolserror. That is intentional. But a table with only a subset of the four columns and no extra columns will pass thebadcolscheck and only fail atmissingcols. The real bug is in the badcolscheck itself: it usesc where not c in `tabname`att`column`sort, which flags ANY column not in the allowed set — including legitimate columns. This means a config table with exactly `tabname`att`column`sortpasses, but one with those four in a different order also passes becausein` is order-independent. This is correct. No bug here on reflection — withdrawing this point.
| }; | ||
|
|
||
| / internal - sort columns and apply attributes for a single on-disk partition directory | ||
| sortdir:{[sp;dloc] |
There was a problem hiding this comment.
In sortdir, the attrcols filter is select column,att from sp where not null att but does NOT filter out the empty-symbol ` sentinel (which represents "no attribute"). In q, null of a symbol returns 0b for ` (the empty symbol is NOT null — null `` ` is `0b`). So rows with `att: `` `` (the no-attribute sentinel, which is the empty symbol) will pass thenot null attfilter and be included inattrcols, causing applyattrto be called withatt=`. applyattr does guard against this with if[null att; :()], but null`` ` is `0b` in q, so that guard also fails to catch the empty symbol. The `@[{@[x;y;z#]};(dloc;colname; ``);attrerr[…]] call will then attempt to apply `# to the column on disk, which will either be a no-op or error. The validatts list includes ` as a valid att value meaning "none", but it should be explicitly excluded from the attrcols selection in sortdir using where (not null att), att<>`` `` or equivalentlywhere att in `p`s`g`u`.
| }; | ||
|
|
||
| / internal - apply a single attribute to a specific column in an on-disk partition | ||
| applyattr:{[dloc;colname;att] |
There was a problem hiding this comment.
In applyattr, the null-att guard if[null att; :()] uses null, which in q returns 0b for the empty symbol `. So passing att:`` bypasses this guard. The comment says "sortdir only passes non-null atts", but as noted above, not null`` ` is `1b`, so sortdir WILL pass the empty symbol through. The guard here should be `if[(null att) or att= ``; :()] or simply if[not att in \p`s`g`u; :()]` to be safe.
| '"failed to read ",string[file],": ",e; | ||
| }; | ||
|
|
||
| sorttab:{[config;tabname;dirs] |
There was a problem hiding this comment.
readerr constructs the rethrown error message by string-concatenating e directly: '"failed to read ",string[file],": ",e. In q, the error value e passed to a trap handler is a string (character vector). If e is not a string (e.g. it is a symbol or other type in some edge cases), this concatenation will fail with a type error, swallowing the original error. This is a low-probability edge case but could mask real failures.
| if[not count sp; :()]; | ||
| sortdir[sp] each distinct (),dirs; | ||
| .z.m.log[`info][`sorttab;"finished sorting the ",st," table"]; | ||
| }; |
There was a problem hiding this comment.
sorttab does not validate that tabname is a symbol before passing it to getsortparams. The test CSV (test.csv line ~72) includes a test srt.sorttab[cfg;42;enlist\:/]that is expected tofail, but checkconfigonly validatesconfig— it does not validatetabname. The call getsortparams[config;42;st]wherest:string 42="42"will attemptselect from config where tabname=42, which in q produces a type error (comparing a symbol column to a long). This will signal an error, so the test passes, but the error message will be a raw q type error rather than the documented di.sort:-prefixed error. The API contract says errors are prefixed di.sort:` but this case breaks that contract.
DIReview Summary3 critical | 4 warning(s) | 0 suggestion(s) |
di.sort — TorQ Modularisation PR
Summary
Extracts the
.sortnamespace from TorQ'sdbwriteutils.qinto a standalone kdb-x module:di.sort. The module sorts on-disk kdb+ tables and applies column attributes, driven by asort.csvconfiguration file. It satisfies the di.* module contract: dependency injection viainit, a clean exported API, and no hard dependencies beyond an injected logger.Background
TorQ's
dbwriteutils.qcontains a.sortnamespace used to sort HDB partitions and apply kdb+ attributes (p,s,g,u) to columns after data is written to disk. This is a common post-write step in kdb+ database maintenance — sorting by a symbol column and applying the parted attribute, for example, is standard practice for efficient data access.This PR is part of the broader TorQ → kdb-x modularisation effort. The goal is to make each functional area independently loadable, testable, and injectable, removing the dependency on TorQ's global process framework.
Changes
New files
di/sort/sort.qinit,getsortcsv,sorttab,getparamsand internal helpersdi/sort/init.qsort.qand declares the export listdi/sort/test.csvdi/sort/sort.mdDifferences from TorQ original
.sortdi.sort.lg.o/.lg.ecallslogdependency ({[ctx;msg]}functions).proc.getconfigfile["sort.csv"]global lookupsavedirkey ininitdeps; defaults to`:sort.csv@[\.sort;`params;:;params]` amend.z.m.paramsmutable module state@[f;args;handler]traps onxascandapplyattr; per-partition failures are logged and do not halt the runinitvalidates all three keys (info,warn,error) are present and errors with a descriptive message if notusesingleton,init[deps]pattern,export:listExported API
sort.csv format
sort=1— column is used as a sort key (xasc)sort=0— attribute applied without sortingtabname=default— fallback row for tables not listed explicitlyTest coverage
Tests are in
test.csvand run viak4unit:Areas covered
init— dependency validation::, null log, empty dict, missinglogkey, non-dict log value, partial log dict (missing keys)init— error message prefix"di.sort"prefix on all thrown errorsinit— valid configssavedirgetparams— initial stategetsortcsvcallsorttab— auto-loadgetsortcsvfromsavedirwhenparamsis emptygetsortcsv— validationattvalues, missing required columns, non-existent filegetsortcsv— empty CSVparamsremains emptygetsortcsv— successful loadsorttab— no-params path()when no matching row and no default; emits warn logsorttab— empty partition list()partition list without errorgetsortcsvcontext symbol on success; error message and context on file failureapplyattrcontext symbol on attribute applicationxascandapplyattrfailuresTotal: 63 tests, all passing.
Test results screenshot
Integration test
Verified against real on-disk HDB data with two trade partitions and one quote partition:
symcolumn sorted ascending;p(parted) attribute applied correctlyvolcolumn receivedg(grouped) attribute without being used as a sort keypricecolumn (no sort config row) reordered in step withsymsort; no attribute applieddefaultrow applied correctly to thequotetable()without errorNotes
di.sorthas no hard module dependencies. Only the injectedlogdict is required.di.logsatisfies the log contract out of the box:logdep:\info`warn`error!(log.info;log.warn;log.error)`.xascfailure on one partition is logged as an error and the run continues — consistent with TorQ's original intent that a bad partition should not halt a full HDB sort.defaultrow insort.csvis looked up bytabname=\default. Any table name matching a row exactly takes precedence overdefault`.