feat: add findGrouped() to Adapter + ClickHouse implementation by lohanidamodar · Pull Request #119 · utopia-php/audit

lohanidamodar · 2026-05-17T03:58:04Z

Summary

Adds a new findGrouped(array $queries, string $groupBy, string $interval): array method to the audit Adapter abstraction and implements it in the ClickHouse adapter. It produces time-bucketed grouped counts for charts of the form events per <interval>, grouped by <attribute>.

Public API

/**
 * @param  array<\Utopia\Audit\Query>  $queries  Filter/limit/offset queries; callers should include a time BETWEEN filter
 * @param  string  $groupBy  One of {`event`, `userType`, `resourceType`}
 * @param  string  $interval One of {`hour`, `day`, `week`, `month`}
 * @return array<int, array{value: string, bucket: string, count: int}>
 */
abstract public function findGrouped(array $queries, string $groupBy, string $interval): array;

The Audit facade gains a matching passthrough.

Behavior

$groupBy and $interval are validated; invalid values throw.
Filter queries are passed through the existing parseQueries() pipeline so all existing Query::* filters work unchanged. getTenantFilter() is honored exactly like find() / count() so shared-table installs stay tenant-safe.
Query::limit() / Query::offset() are applied to the groups list. Default limit 25, hard max 100 (over-limit values are clamped). Top-N is selected by SUM(count) desc, ties by groupValue asc.
Every bucket between the smallest and largest bucket implied by the time filter is zero-filled for every returned group. Buckets are aligned to UTC via toStartOfInterval(time, INTERVAL 1 <unit>, 'UTC') and returned as ISO-8601 UTC strings (matching how find() formats time).
Result rows are ordered by value ASC, bucket ASC for predictable consumption.

SQL approach

Single round-trip using four CTEs:

top_groups — top-N groups by SUM(count) desc.
bounds — MIN(time) / MAX(time) aligned to bucket starts.
buckets — arrayJoin(arrayMap(i -> date_add(<unit>, i, bucket_from), range(0, ...))) produces the dense bucket axis.
agg — (value, bucket) -> COUNT(*) aggregate.

Then CROSS JOIN buckets + LEFT JOIN agg yields the dense (value, bucket, count) grid. All identifiers go through escapeIdentifier; all values go through ClickHouse bound parameters.

Database adapter

The Database adapter throws explicitly — grouped time-bucketed aggregations are designed for an analytical backend (ClickHouse).

Schema note

Added a _key_resource_type bloom-filter index to the ClickHouse adapter's getIndexes() so new installs cover the resourceType column natively. Existing installs need a one-off migration to add this index; that migration is intentionally not in this PR (separate concern).

Test plan

composer lint (pint --test) — pass
composer check (phpstan analyse --level max) — pass
Full PHPUnit suite (ClickHouse + Database adapters) — 99 tests, 942 assertions, all green
New tests cover: invalid groupBy, invalid interval, grouped counts by event, ordering invariant (value asc, bucket asc), zero-fill, limit clamping, offset paging, alternate groupBy=userType, empty time range, Database adapter "not supported"

Add a new method `findGrouped(array $queries, string $groupBy, string $interval): array` to the audit Adapter abstraction and implement it in the ClickHouse adapter. It produces time-bucketed grouped counts for charts of the form "events per <interval>, grouped by <attribute>". The method validates `$groupBy` against {event, userType, resourceType} and `$interval` against {hour, day, week, month}, then reuses the existing `parseQueries()` filter/param pipeline to apply both the query filters and `getTenantFilter()` for tenant-safe scoping. `Query::limit()` / `Query::offset()` are applied to the group set (default 25, clamped to 100), with top-N selection by `SUM(count)` desc and ties broken by groupValue asc. A single SQL round-trip drives the aggregation via CTEs: top-N groups, time bounds, a zero-filled bucket axis built with `arrayJoin(range(...))` + `date_add(<unit>, i, bucket_from)`, and the per-bucket aggregate. A final `CROSS JOIN` + `LEFT JOIN` produces the dense (groupValue, bucket, count) result set, with buckets aligned via `toStartOfInterval(time, INTERVAL 1 <unit>, 'UTC')` and formatted as ISO-8601 UTC strings to match how `find()` returns `time`. The Database adapter throws explicitly — grouped time-bucketed aggregations are an analytical-store feature. Schema: a `_key_resource_type` bloom-filter index has been added to the ClickHouse `getIndexes()` definition so new installs get it. Existing installs need a one-off migration to add it; left out of this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-17T04:02:36Z

Greptile Summary

This PR adds findGrouped(array $queries, string $groupBy, string $interval): array to the Adapter abstraction and implements it in the ClickHouse adapter using a single-round-trip four-CTE query that produces zero-filled, time-bucketed group counts. The Database adapter explicitly throws "not supported".

The ClickHouse SQL is well-structured: $groupBy and $interval are validated against hardcoded whitelists before any interpolation, named bound parameters are used for all runtime values, and tenant scoping is applied consistently with find() / count().
The agg CTE computes per-(group, bucket) counts for all distinct group values in the filtered dataset before the top-N filter is applied, and the outer SELECT contains a redundant WHERE g.value IN (SELECT value FROM top_groups) predicate (both flagged as suggestions).
Test coverage is comprehensive, exercising invalid arguments, bucketed counts, ordering, zero-fill, limit clamping, offset paging, alternate groupBy, and the empty-range case.

Confidence Score: 4/5

The change is safe to merge; the only findings are minor query inefficiencies with no impact on correctness or security.

The SQL logic is correct, all user-supplied inputs are whitelist-validated before interpolation, and tenant isolation matches the existing methods. The two flagged issues are query optimisation opportunities that do not affect result correctness.

The four-CTE query in src/Audit/Adapter/ClickHouse.php is worth a second pass on the agg CTE filtering; all other files are straightforward.

Important Files Changed

Filename	Overview
src/Audit/Adapter/ClickHouse.php	Adds `findGrouped()` with a four-CTE ClickHouse query. SQL logic is correct; `$unit` and `$groupBy` are whitelist-validated before interpolation; tenant filter is applied consistently with other methods. The `agg` CTE aggregates all groups (not just top_groups), and the outer WHERE is redundant — both are minor inefficiencies.
src/Audit/Adapter.php	Adds abstract `findGrouped()` declaration with clear contract documentation; no issues.
src/Audit/Adapter/Database.php	Implements `findGrouped()` as a deliberate not-supported throw; correct and straightforward.
src/Audit/Audit.php	Thin passthrough to the adapter; no issues.
tests/Audit/Adapter/ClickHouseTest.php	Comprehensive test coverage for all documented behaviours: invalid args, bucketed counts, ordering, zero-fill, limit clamping, offset paging, alternate groupBy, and empty range.
tests/Audit/Adapter/DatabaseTest.php	Single test confirming the Database adapter throws as expected; no issues.

_{Reviews (1): Last reviewed commit: "feat: add findGrouped() to Adapter + Cli..." | Re-trigger Greptile}

greptile-apps · 2026-05-17T04:02:40Z

+            FROM top_groups g
+            CROSS JOIN buckets b
+            LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket
+            WHERE g.value IN (SELECT value FROM top_groups)
+            ORDER BY value ASC, bucket ASC


The outer WHERE g.value IN (SELECT value FROM top_groups) predicate is redundant — g is already derived directly from top_groups, so every row of g is guaranteed to satisfy this condition. The clause adds an extra correlated subquery that ClickHouse may or may not optimise away, and it obscures the intent of the JOIN.

Suggested change

FROM top_groups g

CROSS JOIN buckets b

LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket

WHERE g.value IN (SELECT value FROM top_groups)

ORDER BY value ASC, bucket ASC

FROM top_groups g

CROSS JOIN buckets b

LEFT JOIN agg a ON a.value = g.value AND a.bucket = b.bucket

ORDER BY value ASC, bucket ASC

greptile-apps · 2026-05-17T04:02:41Z

+                agg AS (
+                    SELECT
+                        {$escapedGroup} AS value,
+                        toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,
+                        COUNT(*) AS count
+                    FROM {$escapedTable}{$whereClause}
+                    GROUP BY value, bucket
+                )


The agg CTE aggregates every distinct group value that matches $whereClause, not just the top_groups. For high-cardinality columns like resourceType this can mean scanning and grouping many more rows than needed. Adding an IN-subquery filter restricts the aggregation to only the top-N groups already selected, at the cost of one extra read of the small top_groups result.

Suggested change

agg AS (

SELECT

{$escapedGroup} AS value,

toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,

COUNT(*) AS count

FROM {$escapedTable}{$whereClause}

GROUP BY value, bucket

)

agg AS (

SELECT

{$escapedGroup} AS value,

toStartOfInterval({$escapedTime}, INTERVAL 1 {$unit}, 'UTC') AS bucket,

COUNT(*) AS count

FROM {$escapedTable}{$whereClause}

WHERE {$escapedGroup} IN (SELECT value FROM top_groups)

GROUP BY value, bucket

)

greptile-apps Bot reviewed May 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add findGrouped() to Adapter + ClickHouse implementation#119

feat: add findGrouped() to Adapter + ClickHouse implementation#119
lohanidamodar wants to merge 1 commit into
mainfrom
feat-find-grouped

lohanidamodar commented May 17, 2026

Uh oh!

greptile-apps Bot commented May 17, 2026

Uh oh!

greptile-apps Bot May 17, 2026

Uh oh!

greptile-apps Bot May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lohanidamodar commented May 17, 2026

Summary

Public API

Behavior

SQL approach

Database adapter

Schema note

Test plan

Uh oh!

greptile-apps Bot commented May 17, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

greptile-apps Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant