Skip to content

feat: add findGrouped() to Adapter + ClickHouse implementation#119

Open
lohanidamodar wants to merge 1 commit into
mainfrom
feat-find-grouped
Open

feat: add findGrouped() to Adapter + ClickHouse implementation#119
lohanidamodar wants to merge 1 commit into
mainfrom
feat-find-grouped

Conversation

@lohanidamodar
Copy link
Copy Markdown
Contributor

Summary

Adds a new findGrouped(array $queries, string $groupBy, string $interval): array method to the audit Adapter abstraction and implements it in the ClickHouse adapter. It produces time-bucketed grouped counts for charts of the form events per <interval>, grouped by <attribute>.

Public API

/**
 * @param  array<\Utopia\Audit\Query>  $queries  Filter/limit/offset queries; callers should include a time BETWEEN filter
 * @param  string  $groupBy  One of {`event`, `userType`, `resourceType`}
 * @param  string  $interval One of {`hour`, `day`, `week`, `month`}
 * @return array<int, array{value: string, bucket: string, count: int}>
 */
abstract public function findGrouped(array $queries, string $groupBy, string $interval): array;

The Audit facade gains a matching passthrough.

Behavior

  • $groupBy and $interval are validated; invalid values throw.
  • Filter queries are passed through the existing parseQueries() pipeline so all existing Query::* filters work unchanged. getTenantFilter() is honored exactly like find() / count() so shared-table installs stay tenant-safe.
  • Query::limit() / Query::offset() are applied to the groups list. Default limit 25, hard max 100 (over-limit values are clamped). Top-N is selected by SUM(count) desc, ties by groupValue asc.
  • Every bucket between the smallest and largest bucket implied by the time filter is zero-filled for every returned group. Buckets are aligned to UTC via toStartOfInterval(time, INTERVAL 1 <unit>, 'UTC') and returned as ISO-8601 UTC strings (matching how find() formats time).
  • Result rows are ordered by value ASC, bucket ASC for predictable consumption.

SQL approach

Single round-trip using four CTEs:

  1. top_groups — top-N groups by SUM(count) desc.
  2. boundsMIN(time) / MAX(time) aligned to bucket starts.
  3. bucketsarrayJoin(arrayMap(i -> date_add(<unit>, i, bucket_from), range(0, ...))) produces the dense bucket axis.
  4. agg(value, bucket) -> COUNT(*) aggregate.

Then CROSS JOIN buckets + LEFT JOIN agg yields the dense (value, bucket, count) grid. All identifiers go through escapeIdentifier; all values go through ClickHouse bound parameters.

Database adapter

The Database adapter throws explicitly — grouped time-bucketed aggregations are designed for an analytical backend (ClickHouse).

Schema note

Added a _key_resource_type bloom-filter index to the ClickHouse adapter's getIndexes() so new installs cover the resourceType column natively. Existing installs need a one-off migration to add this index; that migration is intentionally not in this PR (separate concern).

Test plan

  • composer lint (pint --test) — pass
  • composer check (phpstan analyse --level max) — pass
  • Full PHPUnit suite (ClickHouse + Database adapters) — 99 tests, 942 assertions, all green
  • New tests cover: invalid groupBy, invalid interval, grouped counts by event, ordering invariant (value asc, bucket asc), zero-fill, limit clamping, offset paging, alternate groupBy=userType, empty time range, Database adapter "not supported"

Add a new method `findGrouped(array $queries, string $groupBy, string
$interval): array` to the audit Adapter abstraction and implement it in
the ClickHouse adapter. It produces time-bucketed grouped counts for
charts of the form "events per <interval>, grouped by <attribute>".

The method validates `$groupBy` against {event, userType, resourceType}
and `$interval` against {hour, day, week, month}, then reuses the
existing `parseQueries()` filter/param pipeline to apply both the query
filters and `getTenantFilter()` for tenant-safe scoping. `Query::limit()`
/ `Query::offset()` are applied to the group set (default 25, clamped
to 100), with top-N selection by `SUM(count)` desc and ties broken by
groupValue asc.

A single SQL round-trip drives the aggregation via CTEs: top-N groups,
time bounds, a zero-filled bucket axis built with `arrayJoin(range(...))`
+ `date_add(<unit>, i, bucket_from)`, and the per-bucket aggregate. A
final `CROSS JOIN` + `LEFT JOIN` produces the dense (groupValue, bucket,
count) result set, with buckets aligned via
`toStartOfInterval(time, INTERVAL 1 <unit>, 'UTC')` and formatted as
ISO-8601 UTC strings to match how `find()` returns `time`.

The Database adapter throws explicitly — grouped time-bucketed
aggregations are an analytical-store feature.

Schema: a `_key_resource_type` bloom-filter index has been added to the
ClickHouse `getIndexes()` definition so new installs get it. Existing
installs need a one-off migration to add it; left out of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 17, 2026

Greptile Summary

This PR adds findGrouped(array $queries, string $groupBy, string $interval): array to the Adapter abstraction and implements it in the ClickHouse adapter using a single-round-trip four-CTE query that produces zero-filled, time-bucketed group counts. The Database adapter explicitly throws "not supported".

  • The ClickHouse SQL is well-structured: $groupBy and $interval are validated against hardcoded whitelists before any interpolation, named bound parameters are used for all runtime values, and tenant scoping is applied consistently with find() / count().
  • The agg CTE computes per-(group, bucket) counts for all distinct group values in the filtered dataset before the top-N filter is applied, and the outer SELECT contains a redundant WHERE g.value IN (SELECT value FROM top_groups) predicate (both flagged as suggestions).
  • Test coverage is comprehensive, exercising invalid arguments, bucketed counts, ordering, zero-fill, limit clamping, offset paging, alternate groupBy, and the empty-range case.

Confidence Score: 4/5

The change is safe to merge; the only findings are minor query inefficiencies with no impact on correctness or security.

The SQL logic is correct, all user-supplied inputs are whitelist-validated before interpolation, and tenant isolation matches the existing methods. The two flagged issues are query optimisation opportunities that do not affect result correctness.

The four-CTE query in src/Audit/Adapter/ClickHouse.php is worth a second pass on the agg CTE filtering; all other files are straightforward.

Important Files Changed

Filename Overview
src/Audit/Adapter/ClickHouse.php Adds findGrouped() with a four-CTE ClickHouse query. SQL logic is correct; $unit and $groupBy are whitelist-validated before interpolation; tenant filter is applied consistently with other methods. The agg CTE aggregates all groups (not just top_groups), and the outer WHERE is redundant — both are minor inefficiencies.
src/Audit/Adapter.php Adds abstract findGrouped() declaration with clear contract documentation; no issues.
src/Audit/Adapter/Database.php Implements findGrouped() as a deliberate not-supported throw; correct and straightforward.
src/Audit/Audit.php Thin passthrough to the adapter; no issues.
tests/Audit/Adapter/ClickHouseTest.php Comprehensive test coverage for all documented behaviours: invalid args, bucketed counts, ordering, zero-fill, limit clamping, offset paging, alternate groupBy, and empty range.
tests/Audit/Adapter/DatabaseTest.php Single test confirming the Database adapter throws as expected; no issues.

Reviews (1): Last reviewed commit: "feat: add findGrouped() to Adapter + Cli..." | Re-trigger Greptile

Comment on lines +1218 to +1222
FROM top_groups g
CROSS JOIN buckets b
LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
WHERE g.value IN (SELECT value FROM top_groups)
ORDER BY value ASC, bucket ASC
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The outer WHERE g.value IN (SELECT value FROM top_groups) predicate is redundant — g is already derived directly from top_groups, so every row of g is guaranteed to satisfy this condition. The clause adds an extra correlated subquery that ClickHouse may or may not optimise away, and it obscures the intent of the JOIN.

Suggested change
FROM top_groups g
CROSS JOIN buckets b
LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
WHERE g.value IN (SELECT value FROM top_groups)
ORDER BY value ASC, bucket ASC
FROM top_groups g
CROSS JOIN buckets b
LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
ORDER BY value ASC, bucket ASC

Comment on lines +1206 to +1213
agg AS (
SELECT
{$escapedGroup} AS value,
toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
COUNT(*) AS count
FROM {$escapedTable}{$whereClause}
GROUP BY value, bucket
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The agg CTE aggregates every distinct group value that matches $whereClause, not just the top_groups. For high-cardinality columns like resourceType this can mean scanning and grouping many more rows than needed. Adding an IN-subquery filter restricts the aggregation to only the top-N groups already selected, at the cost of one extra read of the small top_groups result.

Suggested change
agg AS (
SELECT
{$escapedGroup} AS value,
toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
COUNT(*) AS count
FROM {$escapedTable}{$whereClause}
GROUP BY value, bucket
)
agg AS (
SELECT
{$escapedGroup} AS value,
toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
COUNT(*) AS count
FROM {$escapedTable}{$whereClause}
WHERE {$escapedGroup} IN (SELECT value FROM top_groups)
GROUP BY value, bucket
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant