Skip to content

Commit 80909dd

Browse files
benjamib112benjamib112
authored andcommitted
added automatic timestamp detection, updated query generation script to generate both query files in one run, and added automatic streaming/inference config generation
1 parent 9f32c3a commit 80909dd

2 files changed

Lines changed: 329 additions & 178 deletions

File tree

asap-tools/execution-utilities/benchmark/README.md

Lines changed: 97 additions & 96 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,10 @@ data_file → export_to_database.py run_benchmark.py → results/
1919
ClickHouse :8123 (baseline)
2020
```
2121

22+
**Key difference from the old pipeline:** Arroyo reads directly from a local
23+
file (`single_file_custom` connector) rather than from a Kafka input topic.
24+
Kafka is still required for the **sketch output** topic (`sketch_topic`).
25+
2226
---
2327

2428
## Prerequisites
@@ -27,8 +31,8 @@ data_file → export_to_database.py run_benchmark.py → results/
2731
export INSTALL_DIR=/scratch/sketch_db_for_prometheus
2832
pip3 install --user -r requirements.txt
2933

30-
# Build binaries (one-time) — workspace target is at ~/ASAPQuery/target/release/
31-
cd ~/ASAPQuery && cargo build --release
34+
# Build binaries (one-time)
35+
cd ~/ASAPQuery/asap-query-engine && cargo build --release
3236
```
3337

3438
---
@@ -56,7 +60,6 @@ The Arroyo file source requires RFC3339 timestamps and string metadata columns.
5660
This step converts the raw ClickBench JSON:
5761

5862
```bash
59-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
6063
python prepare_data.py \
6164
--dataset clickbench \
6265
--input ./data/hits.json.gz \
@@ -71,19 +74,17 @@ This produces `hits_arroyo.json` with:
7174

7275
### Step 3 — Start infrastructure
7376

74-
Skip any service that is already running.
75-
7677
```bash
77-
# Kafka — skip if `kafka-topics.sh --list` succeeds
78+
# Kafka
7879
~/ASAPQuery/asap-tools/installation/kafka/run.sh $INSTALL_DIR/kafka
7980

80-
# Create sketch output topic — skip if sketch_topic already exists
81+
# Create sketch output topic
8182
KAFKA=$INSTALL_DIR/kafka/bin
8283
$KAFKA/kafka-topics.sh --bootstrap-server localhost:9092 --create \
8384
--topic sketch_topic --partitions 1 --replication-factor 1 \
8485
--config max.message.bytes=20971520
8586

86-
# ClickHouse — skip if port 8123 is already listening
87+
# ClickHouse
8788
~/ASAPQuery/asap-tools/installation/clickhouse/run.sh $INSTALL_DIR
8889
```
8990

@@ -95,12 +96,36 @@ $KAFKA/kafka-topics.sh --bootstrap-server localhost:9092 --create \
9596
> /tmp/arroyo.log 2>&1 &
9697
```
9798

98-
### Step 5 — Launch Arroyo sketch pipeline (file source)
99+
### Step 5 — Generate queries and configs
100+
101+
```bash
102+
python generate_queries.py \
103+
--table-name hits \
104+
--ts-column EventTime \
105+
--value-column ResolutionWidth \
106+
--group-by-columns RegionID,OS,UserAgent,TraficSourceID \
107+
--window-size 10 \
108+
--num-queries 50 \
109+
--window-form dateadd \
110+
--generate-configs \
111+
--auto-detect-timestamps \
112+
--data-file ./data/hits_arroyo.json \
113+
--data-file-format json \
114+
--output-prefix ./queries/clickbench
115+
```
116+
117+
This writes:
118+
- `queries/clickbench_asap.sql` — ASAP queries (ISO timestamps)
119+
- `queries/clickbench_clickhouse.sql` — ClickHouse queries (datetime timestamps)
120+
- `queries/clickbench_streaming.yaml` — Arroyo streaming config
121+
- `queries/clickbench_inference.yaml` — QueryEngineRust inference config
122+
123+
### Step 6 — Launch Arroyo sketch pipeline (file source)
99124

100125
```bash
101-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
102126
python export_to_arroyo.py \
103-
--streaming-config ./configs/clickbench_streaming.yaml \
127+
--streaming-config ./queries/clickbench_streaming.yaml \
128+
--source-type file \
104129
--input-file ./data/hits_arroyo.json \
105130
--file-format json \
106131
--ts-format rfc3339 \
@@ -109,21 +134,21 @@ python export_to_arroyo.py \
109134
--output-dir ./arroyo_outputs
110135
```
111136

112-
### Step 6 — Start QueryEngineRust
137+
### Step 7 — Start QueryEngineRust
113138

114139
```bash
115-
cd ~/ASAPQuery
140+
cd ~/ASAPQuery/asap-query-engine
116141
nohup ./target/release/query_engine_rust \
117142
--kafka-topic sketch_topic --input-format json \
118-
--config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/configs/clickbench_inference.yaml \
119-
--streaming-config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/configs/clickbench_streaming.yaml \
143+
--config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/queries/clickbench_inference.yaml \
144+
--streaming-config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/queries/clickbench_streaming.yaml \
120145
--http-port 8088 --delete-existing-db --log-level DEBUG \
121-
--output-dir ./asap-query-engine/output --streaming-engine arroyo \
146+
--output-dir ./output --streaming-engine arroyo \
122147
--query-language SQL --lock-strategy per-key \
123148
--prometheus-scrape-interval 1 > /tmp/query_engine.log 2>&1 &
124149
```
125150

126-
### Step 7 — Load data into ClickHouse (baseline)
151+
### Step 8 — Load data into ClickHouse (baseline)
127152

128153
```bash
129154
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
@@ -136,35 +161,14 @@ python export_to_database.py \
136161

137162
Verify: `$INSTALL_DIR/clickhouse client --query "SELECT count(*) FROM hits"`
138163

139-
### Step 8 — Generate SQL query files
140-
141-
```bash
142-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
143-
python generate_queries.py \
144-
--table-name hits \
145-
--ts-column EventTime \
146-
--value-column ResolutionWidth \
147-
--group-by-columns RegionID,OS,UserAgent,TraficSourceID \
148-
--window-size 10 \
149-
--num-queries 50 \
150-
--ts-format datetime \
151-
--window-form dateadd \
152-
--auto-detect-timestamps \
153-
--data-file ./data/hits_arroyo.json \
154-
--data-file-format json \
155-
--output-prefix ./queries/clickbench
156-
```
157-
158-
This writes `queries/clickbench.sql`.
159-
160164
### Step 9 — Run benchmark
161165

162166
```bash
163-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
164167
python run_benchmark.py \
165168
--mode both \
166-
--asap-sql-file ./queries/clickbench.sql \
167-
--baseline-sql-file ./queries/clickbench.sql \
169+
--asap-sql-file ./queries/clickbench_asap.sql \
170+
--baseline-sql-file ./queries/clickbench_clickhouse.sql \
171+
--asap-url "http://localhost:8088/api/v1/query" \
168172
--output-dir ./results \
169173
--output-prefix clickbench
170174
```
@@ -179,14 +183,12 @@ Results: `results/clickbench_asap.csv`, `results/clickbench_baseline.csv`,
179183
### Step 1 — Download dataset
180184

181185
```bash
182-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
183186
python download_dataset.py --dataset h2o --output-dir ./data
184187
```
185188

186189
### Step 2 — Prepare data for Arroyo file source
187190

188191
```bash
189-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
190192
python prepare_data.py \
191193
--dataset h2o \
192194
--input ./data/G1_1e7_1e2_0_0.csv \
@@ -196,12 +198,29 @@ python prepare_data.py \
196198

197199
### Steps 3–4 — Start infrastructure and Arroyo (same as ClickBench)
198200

199-
### Step 5 — Launch Arroyo sketch pipeline
201+
### Step 5 — Generate queries and configs
202+
203+
```bash
204+
python generate_queries.py \
205+
--table-name h2o_groupby \
206+
--ts-column timestamp \
207+
--value-column v1 \
208+
--group-by-columns id1,id2 \
209+
--window-size 10 \
210+
--num-queries 50 \
211+
--generate-configs \
212+
--auto-detect-timestamps \
213+
--data-file ./data/h2o_arroyo.json \
214+
--data-file-format json \
215+
--output-prefix ./queries/h2o
216+
```
217+
218+
### Step 6 — Launch Arroyo sketch pipeline
200219

201220
```bash
202-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
203221
python export_to_arroyo.py \
204-
--streaming-config ./configs/h2o_streaming.yaml \
222+
--streaming-config ./queries/h2o_streaming.yaml \
223+
--source-type file \
205224
--input-file ./data/h2o_arroyo.json \
206225
--file-format json \
207226
--ts-format rfc3339 \
@@ -210,57 +229,38 @@ python export_to_arroyo.py \
210229
--output-dir ./arroyo_outputs
211230
```
212231

213-
### Step 6 — Start QueryEngineRust
232+
### Step 7 — Start QueryEngineRust
214233

215234
```bash
216-
cd ~/ASAPQuery
235+
cd ~/ASAPQuery/asap-query-engine
217236
nohup ./target/release/query_engine_rust \
218237
--kafka-topic sketch_topic --input-format json \
219-
--config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/configs/h2o_inference.yaml \
220-
--streaming-config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/configs/h2o_streaming.yaml \
238+
--config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/queries/h2o_inference.yaml \
239+
--streaming-config ~/ASAPQuery/asap-tools/execution-utilities/benchmark/queries/h2o_streaming.yaml \
221240
--http-port 8088 --delete-existing-db --log-level DEBUG \
222-
--output-dir ./asap-query-engine/output --streaming-engine arroyo \
241+
--output-dir ./output --streaming-engine arroyo \
223242
--query-language SQL --lock-strategy per-key \
224243
--prometheus-scrape-interval 1 > /tmp/query_engine.log 2>&1 &
225244
```
226245

227-
### Step 7 — Load data into ClickHouse (baseline)
246+
### Step 8 — Load data into ClickHouse (baseline)
228247

229248
```bash
230-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
231249
python export_to_database.py \
232250
--dataset h2o \
233251
--file-path ./data/G1_1e7_1e2_0_0.csv \
234252
--init-sql-file ./configs/h2o_init.sql \
235253
--max-rows 1000000
236254
```
237255

238-
### Step 8 — Generate SQL query files
239-
240-
```bash
241-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
242-
python generate_queries.py \
243-
--table-name h2o_groupby \
244-
--ts-column timestamp \
245-
--value-column v1 \
246-
--group-by-columns id1,id2 \
247-
--window-size 10 \
248-
--num-queries 50 \
249-
--ts-format iso \
250-
--auto-detect-timestamps \
251-
--data-file ./data/h2o_arroyo.json \
252-
--data-file-format json \
253-
--output-prefix ./queries/h2o
254-
```
255-
256256
### Step 9 — Run benchmark
257257

258258
```bash
259-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
260259
python run_benchmark.py \
261260
--mode both \
262-
--asap-sql-file ./queries/h2o.sql \
263-
--baseline-sql-file ./queries/h2o.sql \
261+
--asap-sql-file ./queries/h2o_asap.sql \
262+
--baseline-sql-file ./queries/h2o_clickhouse.sql \
263+
--asap-url "http://localhost:8088/api/v1/query" \
264264
--output-dir ./results \
265265
--output-prefix h2o
266266
```
@@ -270,48 +270,49 @@ python run_benchmark.py \
270270
## Custom Dataset
271271

272272
```bash
273-
cd ~/ASAPQuery/asap-tools/execution-utilities/benchmark
274-
275273
# 1. Download (any HTTP URL)
276274
python download_dataset.py --dataset custom \
277275
--custom-url https://example.com/mydata.json.gz \
278276
--output-dir ./data
279277

280278
# 2. Prepare (edit prepare_data.py for your schema, or skip if already RFC3339)
281279

282-
# 3. Export to Arroyo
280+
# 3. Generate queries and configs
281+
python generate_queries.py \
282+
--table-name my_table \
283+
--ts-column event_time \
284+
--value-column metric_value \
285+
--group-by-columns region,host \
286+
--window-size 10 \
287+
--num-queries 50 \
288+
--generate-configs \
289+
--auto-detect-timestamps \
290+
--data-file ./data/mydata.json \
291+
--output-prefix ./queries/my_dataset
292+
293+
# 4. Export to Arroyo
283294
python export_to_arroyo.py \
284-
--streaming-config ./configs/my_streaming.yaml \
295+
--streaming-config ./queries/my_dataset_streaming.yaml \
296+
--source-type file \
285297
--input-file ./data/mydata.json \
286298
--file-format json \
287299
--ts-format rfc3339 \
288300
--pipeline-name my_pipeline \
289301
--arroyosketch-dir ~/ASAPQuery/asap-summary-ingest
290302

291-
# 4. Export to ClickHouse
303+
# 5. Export to ClickHouse
292304
python export_to_database.py \
293305
--dataset custom \
294306
--file-path ./data/mydata.json \
295307
--init-sql-file ./configs/my_init.sql \
296308
--table-name my_table
297309

298-
# 5. Generate queries
299-
python generate_queries.py \
300-
--table-name my_table \
301-
--ts-column event_time \
302-
--value-column metric_value \
303-
--group-by-columns region,host \
304-
--window-size 10 \
305-
--num-queries 50 \
306-
--auto-detect-timestamps \
307-
--data-file ./data/mydata.json \
308-
--output-prefix ./queries/my_dataset
309-
310310
# 6. Run benchmark
311311
python run_benchmark.py \
312312
--mode both \
313-
--asap-sql-file ./queries/my_dataset.sql \
314-
--baseline-sql-file ./queries/my_dataset.sql \
313+
--asap-sql-file ./queries/my_dataset_asap.sql \
314+
--baseline-sql-file ./queries/my_dataset_clickhouse.sql \
315+
--asap-url "http://localhost:8088/api/v1/query" \
315316
--output-dir ./results
316317
```
317318

@@ -344,8 +345,8 @@ $INSTALL_DIR/clickhouse client --query "TRUNCATE TABLE hits"
344345
|------|---------|
345346
| `download_dataset.py` | Download ClickBench, H2O, or custom datasets |
346347
| `prepare_data.py` | Convert raw data to Arroyo file source format (RFC3339, string columns) |
347-
| `export_to_arroyo.py` | Launch Arroyo sketch pipeline from a local file source |
348+
| `export_to_arroyo.py` | Launch Arroyo sketch pipeline (file or kafka source) |
348349
| `export_to_database.py` | Load data into ClickHouse for baseline |
349-
| `generate_queries.py` | Generate a single SQL query file (database-style, compatible with both ASAP and ClickHouse) |
350+
| `generate_queries.py` | Generate paired ASAP + ClickHouse SQL query files and streaming/inference YAML configs |
350351
| `run_benchmark.py` | Run queries and produce CSV results + plots |
351-
| `configs/` | Dataset-specific streaming/inference YAML and ClickHouse init SQL |
352+
| `configs/` | ClickHouse init SQL (CREATE TABLE statements) |

0 commit comments

Comments
 (0)