@@ -19,6 +19,10 @@ data_file → export_to_database.py run_benchmark.py → results/
1919 ClickHouse :8123 (baseline)
2020```
2121
22+ ** Key difference from the old pipeline:** Arroyo reads directly from a local
23+ file (` single_file_custom ` connector) rather than from a Kafka input topic.
24+ Kafka is still required for the ** sketch output** topic (` sketch_topic ` ).
25+
2226---
2327
2428## Prerequisites
@@ -27,8 +31,8 @@ data_file → export_to_database.py run_benchmark.py → results/
2731export INSTALL_DIR=/scratch/sketch_db_for_prometheus
2832pip3 install --user -r requirements.txt
2933
30- # Build binaries (one-time) — workspace target is at ~/ASAPQuery/target/release/
31- cd ~ /ASAPQuery && cargo build --release
34+ # Build binaries (one-time)
35+ cd ~ /ASAPQuery/asap-query-engine && cargo build --release
3236```
3337
3438---
@@ -56,7 +60,6 @@ The Arroyo file source requires RFC3339 timestamps and string metadata columns.
5660This step converts the raw ClickBench JSON:
5761
5862``` bash
59- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
6063python prepare_data.py \
6164 --dataset clickbench \
6265 --input ./data/hits.json.gz \
@@ -71,19 +74,17 @@ This produces `hits_arroyo.json` with:
7174
7275### Step 3 — Start infrastructure
7376
74- Skip any service that is already running.
75-
7677``` bash
77- # Kafka — skip if `kafka-topics.sh --list` succeeds
78+ # Kafka
7879~ /ASAPQuery/asap-tools/installation/kafka/run.sh $INSTALL_DIR /kafka
7980
80- # Create sketch output topic — skip if sketch_topic already exists
81+ # Create sketch output topic
8182KAFKA=$INSTALL_DIR /kafka/bin
8283$KAFKA /kafka-topics.sh --bootstrap-server localhost:9092 --create \
8384 --topic sketch_topic --partitions 1 --replication-factor 1 \
8485 --config max.message.bytes=20971520
8586
86- # ClickHouse — skip if port 8123 is already listening
87+ # ClickHouse
8788~ /ASAPQuery/asap-tools/installation/clickhouse/run.sh $INSTALL_DIR
8889```
8990
@@ -95,12 +96,36 @@ $KAFKA/kafka-topics.sh --bootstrap-server localhost:9092 --create \
9596 > /tmp/arroyo.log 2>&1 &
9697```
9798
98- ### Step 5 — Launch Arroyo sketch pipeline (file source)
99+ ### Step 5 — Generate queries and configs
100+
101+ ``` bash
102+ python generate_queries.py \
103+ --table-name hits \
104+ --ts-column EventTime \
105+ --value-column ResolutionWidth \
106+ --group-by-columns RegionID,OS,UserAgent,TraficSourceID \
107+ --window-size 10 \
108+ --num-queries 50 \
109+ --window-form dateadd \
110+ --generate-configs \
111+ --auto-detect-timestamps \
112+ --data-file ./data/hits_arroyo.json \
113+ --data-file-format json \
114+ --output-prefix ./queries/clickbench
115+ ```
116+
117+ This writes:
118+ - ` queries/clickbench_asap.sql ` — ASAP queries (ISO timestamps)
119+ - ` queries/clickbench_clickhouse.sql ` — ClickHouse queries (datetime timestamps)
120+ - ` queries/clickbench_streaming.yaml ` — Arroyo streaming config
121+ - ` queries/clickbench_inference.yaml ` — QueryEngineRust inference config
122+
123+ ### Step 6 — Launch Arroyo sketch pipeline (file source)
99124
100125``` bash
101- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
102126python export_to_arroyo.py \
103- --streaming-config ./configs/clickbench_streaming.yaml \
127+ --streaming-config ./queries/clickbench_streaming.yaml \
128+ --source-type file \
104129 --input-file ./data/hits_arroyo.json \
105130 --file-format json \
106131 --ts-format rfc3339 \
@@ -109,21 +134,21 @@ python export_to_arroyo.py \
109134 --output-dir ./arroyo_outputs
110135```
111136
112- ### Step 6 — Start QueryEngineRust
137+ ### Step 7 — Start QueryEngineRust
113138
114139``` bash
115- cd ~ /ASAPQuery
140+ cd ~ /ASAPQuery/asap-query-engine
116141nohup ./target/release/query_engine_rust \
117142 --kafka-topic sketch_topic --input-format json \
118- --config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/configs /clickbench_inference.yaml \
119- --streaming-config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/configs /clickbench_streaming.yaml \
143+ --config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/queries /clickbench_inference.yaml \
144+ --streaming-config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/queries /clickbench_streaming.yaml \
120145 --http-port 8088 --delete-existing-db --log-level DEBUG \
121- --output-dir ./asap-query-engine/ output --streaming-engine arroyo \
146+ --output-dir ./output --streaming-engine arroyo \
122147 --query-language SQL --lock-strategy per-key \
123148 --prometheus-scrape-interval 1 > /tmp/query_engine.log 2>&1 &
124149```
125150
126- ### Step 7 — Load data into ClickHouse (baseline)
151+ ### Step 8 — Load data into ClickHouse (baseline)
127152
128153``` bash
129154cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
@@ -136,35 +161,14 @@ python export_to_database.py \
136161
137162Verify: ` $INSTALL_DIR/clickhouse client --query "SELECT count(*) FROM hits" `
138163
139- ### Step 8 — Generate SQL query files
140-
141- ``` bash
142- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
143- python generate_queries.py \
144- --table-name hits \
145- --ts-column EventTime \
146- --value-column ResolutionWidth \
147- --group-by-columns RegionID,OS,UserAgent,TraficSourceID \
148- --window-size 10 \
149- --num-queries 50 \
150- --ts-format datetime \
151- --window-form dateadd \
152- --auto-detect-timestamps \
153- --data-file ./data/hits_arroyo.json \
154- --data-file-format json \
155- --output-prefix ./queries/clickbench
156- ```
157-
158- This writes ` queries/clickbench.sql ` .
159-
160164### Step 9 — Run benchmark
161165
162166``` bash
163- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
164167python run_benchmark.py \
165168 --mode both \
166- --asap-sql-file ./queries/clickbench.sql \
167- --baseline-sql-file ./queries/clickbench.sql \
169+ --asap-sql-file ./queries/clickbench_asap.sql \
170+ --baseline-sql-file ./queries/clickbench_clickhouse.sql \
171+ --asap-url " http://localhost:8088/api/v1/query" \
168172 --output-dir ./results \
169173 --output-prefix clickbench
170174```
@@ -179,14 +183,12 @@ Results: `results/clickbench_asap.csv`, `results/clickbench_baseline.csv`,
179183### Step 1 — Download dataset
180184
181185``` bash
182- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
183186python download_dataset.py --dataset h2o --output-dir ./data
184187```
185188
186189### Step 2 — Prepare data for Arroyo file source
187190
188191``` bash
189- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
190192python prepare_data.py \
191193 --dataset h2o \
192194 --input ./data/G1_1e7_1e2_0_0.csv \
@@ -196,12 +198,29 @@ python prepare_data.py \
196198
197199### Steps 3–4 — Start infrastructure and Arroyo (same as ClickBench)
198200
199- ### Step 5 — Launch Arroyo sketch pipeline
201+ ### Step 5 — Generate queries and configs
202+
203+ ``` bash
204+ python generate_queries.py \
205+ --table-name h2o_groupby \
206+ --ts-column timestamp \
207+ --value-column v1 \
208+ --group-by-columns id1,id2 \
209+ --window-size 10 \
210+ --num-queries 50 \
211+ --generate-configs \
212+ --auto-detect-timestamps \
213+ --data-file ./data/h2o_arroyo.json \
214+ --data-file-format json \
215+ --output-prefix ./queries/h2o
216+ ```
217+
218+ ### Step 6 — Launch Arroyo sketch pipeline
200219
201220``` bash
202- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
203221python export_to_arroyo.py \
204- --streaming-config ./configs/h2o_streaming.yaml \
222+ --streaming-config ./queries/h2o_streaming.yaml \
223+ --source-type file \
205224 --input-file ./data/h2o_arroyo.json \
206225 --file-format json \
207226 --ts-format rfc3339 \
@@ -210,57 +229,38 @@ python export_to_arroyo.py \
210229 --output-dir ./arroyo_outputs
211230```
212231
213- ### Step 6 — Start QueryEngineRust
232+ ### Step 7 — Start QueryEngineRust
214233
215234``` bash
216- cd ~ /ASAPQuery
235+ cd ~ /ASAPQuery/asap-query-engine
217236nohup ./target/release/query_engine_rust \
218237 --kafka-topic sketch_topic --input-format json \
219- --config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/configs /h2o_inference.yaml \
220- --streaming-config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/configs /h2o_streaming.yaml \
238+ --config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/queries /h2o_inference.yaml \
239+ --streaming-config ~ /ASAPQuery/asap-tools/execution-utilities/benchmark/queries /h2o_streaming.yaml \
221240 --http-port 8088 --delete-existing-db --log-level DEBUG \
222- --output-dir ./asap-query-engine/ output --streaming-engine arroyo \
241+ --output-dir ./output --streaming-engine arroyo \
223242 --query-language SQL --lock-strategy per-key \
224243 --prometheus-scrape-interval 1 > /tmp/query_engine.log 2>&1 &
225244```
226245
227- ### Step 7 — Load data into ClickHouse (baseline)
246+ ### Step 8 — Load data into ClickHouse (baseline)
228247
229248``` bash
230- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
231249python export_to_database.py \
232250 --dataset h2o \
233251 --file-path ./data/G1_1e7_1e2_0_0.csv \
234252 --init-sql-file ./configs/h2o_init.sql \
235253 --max-rows 1000000
236254```
237255
238- ### Step 8 — Generate SQL query files
239-
240- ``` bash
241- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
242- python generate_queries.py \
243- --table-name h2o_groupby \
244- --ts-column timestamp \
245- --value-column v1 \
246- --group-by-columns id1,id2 \
247- --window-size 10 \
248- --num-queries 50 \
249- --ts-format iso \
250- --auto-detect-timestamps \
251- --data-file ./data/h2o_arroyo.json \
252- --data-file-format json \
253- --output-prefix ./queries/h2o
254- ```
255-
256256### Step 9 — Run benchmark
257257
258258``` bash
259- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
260259python run_benchmark.py \
261260 --mode both \
262- --asap-sql-file ./queries/h2o.sql \
263- --baseline-sql-file ./queries/h2o.sql \
261+ --asap-sql-file ./queries/h2o_asap.sql \
262+ --baseline-sql-file ./queries/h2o_clickhouse.sql \
263+ --asap-url " http://localhost:8088/api/v1/query" \
264264 --output-dir ./results \
265265 --output-prefix h2o
266266```
@@ -270,48 +270,49 @@ python run_benchmark.py \
270270## Custom Dataset
271271
272272``` bash
273- cd ~ /ASAPQuery/asap-tools/execution-utilities/benchmark
274-
275273# 1. Download (any HTTP URL)
276274python download_dataset.py --dataset custom \
277275 --custom-url https://example.com/mydata.json.gz \
278276 --output-dir ./data
279277
280278# 2. Prepare (edit prepare_data.py for your schema, or skip if already RFC3339)
281279
282- # 3. Export to Arroyo
280+ # 3. Generate queries and configs
281+ python generate_queries.py \
282+ --table-name my_table \
283+ --ts-column event_time \
284+ --value-column metric_value \
285+ --group-by-columns region,host \
286+ --window-size 10 \
287+ --num-queries 50 \
288+ --generate-configs \
289+ --auto-detect-timestamps \
290+ --data-file ./data/mydata.json \
291+ --output-prefix ./queries/my_dataset
292+
293+ # 4. Export to Arroyo
283294python export_to_arroyo.py \
284- --streaming-config ./configs/my_streaming.yaml \
295+ --streaming-config ./queries/my_dataset_streaming.yaml \
296+ --source-type file \
285297 --input-file ./data/mydata.json \
286298 --file-format json \
287299 --ts-format rfc3339 \
288300 --pipeline-name my_pipeline \
289301 --arroyosketch-dir ~ /ASAPQuery/asap-summary-ingest
290302
291- # 4 . Export to ClickHouse
303+ # 5 . Export to ClickHouse
292304python export_to_database.py \
293305 --dataset custom \
294306 --file-path ./data/mydata.json \
295307 --init-sql-file ./configs/my_init.sql \
296308 --table-name my_table
297309
298- # 5. Generate queries
299- python generate_queries.py \
300- --table-name my_table \
301- --ts-column event_time \
302- --value-column metric_value \
303- --group-by-columns region,host \
304- --window-size 10 \
305- --num-queries 50 \
306- --auto-detect-timestamps \
307- --data-file ./data/mydata.json \
308- --output-prefix ./queries/my_dataset
309-
310310# 6. Run benchmark
311311python run_benchmark.py \
312312 --mode both \
313- --asap-sql-file ./queries/my_dataset.sql \
314- --baseline-sql-file ./queries/my_dataset.sql \
313+ --asap-sql-file ./queries/my_dataset_asap.sql \
314+ --baseline-sql-file ./queries/my_dataset_clickhouse.sql \
315+ --asap-url " http://localhost:8088/api/v1/query" \
315316 --output-dir ./results
316317```
317318
@@ -344,8 +345,8 @@ $INSTALL_DIR/clickhouse client --query "TRUNCATE TABLE hits"
344345| ------| ---------|
345346| ` download_dataset.py ` | Download ClickBench, H2O, or custom datasets |
346347| ` prepare_data.py ` | Convert raw data to Arroyo file source format (RFC3339, string columns) |
347- | ` export_to_arroyo.py ` | Launch Arroyo sketch pipeline from a local file source |
348+ | ` export_to_arroyo.py ` | Launch Arroyo sketch pipeline (file or kafka source) |
348349| ` export_to_database.py ` | Load data into ClickHouse for baseline |
349- | ` generate_queries.py ` | Generate a single SQL query file (database-style, compatible with both ASAP and ClickHouse) |
350+ | ` generate_queries.py ` | Generate paired ASAP + ClickHouse SQL query files and streaming/inference YAML configs |
350351| ` run_benchmark.py ` | Run queries and produce CSV results + plots |
351- | ` configs/ ` | Dataset-specific streaming/inference YAML and ClickHouse init SQL |
352+ | ` configs/ ` | ClickHouse init SQL (CREATE TABLE statements) |
0 commit comments