
Commit 1cf31d6

RichardHughes-amp authored and georgesittas committed
fix(annotate): register typing for FirstValue and RegexpExtract
FirstValue had no entry in base EXPRESSION_METADATA, so all dialects except BigQuery returned UNKNOWN, despite its sister LastValue being registered. Move it next to LastValue in the first-arg propagation block and drop the now-redundant BigQuery duplicate.

RegexpExtract had no base entry either, so only BigQuery and Snowflake typed it. Register it as constant-VARCHAR in the Hive typing module (covers Hive/Spark2/Spark/Databricks through the existing chain). Keep BigQuery's _annotate_by_args override since BigQuery genuinely overloads on STRING vs BYTES input. Snowflake's existing entry is preserved.

Scoping the registration to Hive (not base) avoids leaking VARCHAR onto dialects with different semantics, most notably DuckDB, where REGEXP_EXTRACT can return a STRUCT when group names are passed.

Adds fixture coverage in annotate_functions.sql:
- cross-dialect FIRST_VALUE on BIGINT and STRING
- spark/databricks REGEXP_EXTRACT on STRING and BINARY input (proves the dialect's constant-STRING behavior, distinct from BigQuery's input-type overload)
- snowflake REGEXP_SUBSTR on STRING
- duckdb REGEXP_EXTRACT pinned at UNKNOWN to lock in the Hive-only scoping
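The contrast drawn in the commit message between a constant return type and an input-type overload can be illustrated with a small toy model. This is only a sketch: the rule functions, the `RULES` table, and `regexp_extract_type` are hypothetical names invented for illustration and are not part of sqlglot; types are plain strings rather than sqlglot data types.

```python
UNKNOWN = "UNKNOWN"


def constant_string(arg_type: str) -> str:
    """Hive-family rule in this sketch: the result ignores the input type."""
    return "STRING"


def same_as_input(arg_type: str) -> str:
    """BigQuery-style rule in this sketch: the result follows the input type."""
    return arg_type


# Hypothetical per-dialect rule table; dialects without an entry stay untyped.
RULES = {"spark": constant_string, "bigquery": same_as_input}


def regexp_extract_type(dialect: str, arg_type: str) -> str:
    """Return the toy result type of REGEXP_EXTRACT for a dialect and input type."""
    rule = RULES.get(dialect)
    return rule(arg_type) if rule else UNKNOWN


print(regexp_extract_type("spark", "BINARY"))    # STRING
print(regexp_extract_type("bigquery", "BYTES"))  # BYTES
print(regexp_extract_type("duckdb", "STRING"))   # UNKNOWN
```

In this model the Spark rule yields STRING even for BINARY input, which mirrors what the new spark/databricks fixtures assert, while a dialect with no rule falls through to UNKNOWN.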
1 parent fd48100 commit 1cf31d6

4 files changed

Lines changed: 20 additions & 1 deletion


sqlglot/typing/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -252,6 +252,7 @@
  exp.ArrayReverse,
  exp.ArraySlice,
  exp.Filter,
+ exp.FirstValue,
  exp.HavingMax,
  exp.LastValue,
  exp.Limit,
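As a rough illustration of what "first-arg propagation" means for this block, the sketch below models a registry in which every listed function takes the type of its first argument. The `Call` class, `REGISTRY`, and `annotate` are hypothetical stand-ins; the real module keys on expression classes such as `exp.FirstValue`, and its internals are assumed here rather than reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

UNKNOWN = "UNKNOWN"


@dataclass
class Call:
    """Hypothetical stand-in for an already-parsed function call."""
    name: str
    arg_type: str  # type previously inferred for the first argument


def first_arg_type(call: Call) -> str:
    """Toy annotator: the result type equals the first argument's type."""
    return call.arg_type


# Toy registry keyed by function name (an assumption made for brevity).
REGISTRY: Dict[str, Callable[[Call], str]] = {
    name: first_arg_type for name in ("FIRST_VALUE", "LAST_VALUE", "LIMIT")
}


def annotate(call: Call, registry: Optional[Dict[str, Callable[[Call], str]]] = None) -> str:
    """Look up the annotator for a call; unregistered functions stay UNKNOWN."""
    table = REGISTRY if registry is None else registry
    annotator = table.get(call.name)
    return annotator(call) if annotator else UNKNOWN


# Registry as it looked before the change: FIRST_VALUE is missing.
before = {k: v for k, v in REGISTRY.items() if k != "FIRST_VALUE"}

print(annotate(Call("FIRST_VALUE", "BIGINT"), before))  # UNKNOWN
print(annotate(Call("FIRST_VALUE", "BIGINT")))          # BIGINT
print(annotate(Call("LAST_VALUE", "BIGINT"), before))   # BIGINT
```

The one-line addition in the diff corresponds to adding the missing key, after which FIRST_VALUE behaves like its sibling LAST_VALUE.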

sqlglot/typing/bigquery.py

Lines changed: 0 additions & 1 deletion
@@ -178,7 +178,6 @@ def _annotate_array(self: TypeAnnotator, expression: exp.Array) -> exp.Array:
  exp.DateAdd,
  exp.DateTrunc,
  exp.DatetimeTrunc,
- exp.FirstValue,
  exp.GroupConcat,
  exp.IgnoreNulls,
  exp.JSONExtract,

sqlglot/typing/hive.py

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@
  exp.CurrentSchema,
  exp.Hex,
  exp.NextDay,
+ exp.RegexpExtract,
  exp.Repeat,
  exp.Replace,
  exp.Soundex,
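The scoping argument (register in Hive so the Spark family inherits it, while DuckDB does not) can be sketched with layered dictionaries. This assumes each dialect table is built by copying its parent's table and adding entries; the table names and the `BASE_ONLY_FUNC` placeholder are hypothetical, and the real modules may be organized differently.

```python
from typing import Dict

UNKNOWN = "UNKNOWN"

# Toy base table shared by every dialect; the entry is a placeholder.
BASE: Dict[str, str] = {"BASE_ONLY_FUNC": "INT"}

# Hive adds the constant-VARCHAR entry; its descendants copy the parent table.
HIVE = {**BASE, "REGEXP_EXTRACT": "VARCHAR"}
SPARK2 = {**HIVE}
SPARK = {**SPARK2}
DATABRICKS = {**SPARK}

# DuckDB derives from the base table only, so it never sees the Hive entry.
DUCKDB = {**BASE}

TABLES = {
    "hive": HIVE,
    "spark2": SPARK2,
    "spark": SPARK,
    "databricks": DATABRICKS,
    "duckdb": DUCKDB,
}


def return_type(dialect: str, func: str) -> str:
    """Return the registered constant type, or UNKNOWN if nothing is registered."""
    return TABLES[dialect].get(func, UNKNOWN)


for name in ("hive", "spark2", "spark", "databricks", "duckdb"):
    print(name, return_type(name, "REGEXP_EXTRACT"))
```

Had the entry been placed in `BASE` instead, `DUCKDB` would have picked it up as well, which is the leak the commit message sets out to avoid.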

tests/fixtures/optimizer/annotate_functions.sql

Lines changed: 18 additions & 0 deletions
@@ -91,6 +91,12 @@ BIGINT;
  LAST_VALUE(tbl.bigint_col) OVER (ORDER BY tbl.bigint_col);
  BIGINT;

+ FIRST_VALUE(tbl.bigint_col) OVER (ORDER BY tbl.bigint_col);
+ BIGINT;
+
+ FIRST_VALUE(tbl.str_col) OVER (ORDER BY tbl.str_col);
+ TEXT;
+
  TO_BASE32(tbl.bytes_col);
  VARCHAR;

@@ -199,6 +205,14 @@ STRING;
  SUBSTRING(tbl.bin_col, 0, 0);
  BINARY;

+ # dialect: spark2, spark, databricks
+ REGEXP_EXTRACT(tbl.str_col, pattern, 0);
+ STRING;
+
+ # dialect: spark2, spark, databricks
+ REGEXP_EXTRACT(tbl.bin_col, pattern, 0);
+ STRING;
+
  # dialect: spark2, spark, databricks
  CONCAT(tbl.bin_col, tbl.bin_col);
  BINARY;
@@ -2375,6 +2389,10 @@ BIGINT;
  ABS(tbl.bigint_col);
  BIGINT;

+ # dialect: snowflake
+ REGEXP_SUBSTR(tbl.str_col, pattern, 1);
+ VARCHAR;
+
  # dialect: snowflake
  ABS(tbl.double_col);
  DOUBLE;
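The fixture entries in this diff follow a visible pattern: an optional `# dialect:` comment, a SQL expression ending in `;`, then the expected type ending in `;`. The sketch below parses that pattern into cases. It is not the project's actual test loader; `Case` and `parse_fixtures` are hypothetical names, and the parser assumes every statement fits on a single line.

```python
from typing import List, NamedTuple, Optional


class Case(NamedTuple):
    dialects: List[str]  # empty list means no dialect comment preceded the case
    sql: str
    expected: str


def parse_fixtures(text: str) -> List[Case]:
    """Split fixture text into (dialects, sql, expected type) cases."""
    cases: List[Case] = []
    dialects: List[str] = []
    pending_sql: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# dialect:"):
            # The comment applies to the next SQL/type pair only.
            dialects = [d.strip() for d in line.split(":", 1)[1].split(",")]
            continue
        if line.startswith("#"):
            continue  # any other comment is ignored
        statement = line.rstrip(";")
        if pending_sql is None:
            pending_sql = statement  # first line of the pair: the SQL
        else:
            cases.append(Case(dialects, pending_sql, statement))
            pending_sql, dialects = None, []
    return cases


SAMPLE = """\
FIRST_VALUE(tbl.str_col) OVER (ORDER BY tbl.str_col);
TEXT;

# dialect: spark2, spark, databricks
REGEXP_EXTRACT(tbl.bin_col, pattern, 0);
STRING;
"""

for case in parse_fixtures(SAMPLE):
    print(case.dialects, case.sql, "->", case.expected)
```

Read this way, a case without a dialect comment (such as the FIRST_VALUE entries) is a cross-dialect expectation, while a commented case applies only to the dialects it lists.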
