fix(annotate): register typing for FirstValue and RegexpExtract

RichardHughes-amp · georgesittas · commit 1cf31d6bd00a · 2026-04-30T15:22:28.000+03:00

FirstValue had no entry in the base EXPRESSION_METADATA, so every dialect
except BigQuery annotated it as UNKNOWN, even though its sibling LastValue
was registered. Move it next to LastValue in the first-arg propagation
block and drop the now-redundant BigQuery duplicate.
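
A minimal sketch of what first-arg propagation means, as a toy model rather than sqlglot's actual annotator. The `annotate` helper and the string type names are hypothetical; only the FirstValue/LastValue names come from this change.

```python
UNKNOWN = "UNKNOWN"

# Hypothetical stand-in for the "first-arg propagation" block: functions
# listed here take the type of their first argument as their result type.
FIRST_ARG_FUNCTIONS_BEFORE = {"LastValue"}
FIRST_ARG_FUNCTIONS_AFTER = FIRST_ARG_FUNCTIONS_BEFORE | {"FirstValue"}


def annotate(func_name, arg_types, first_arg_functions):
    """Return the result type of a call, or UNKNOWN if it is unregistered."""
    if func_name in first_arg_functions and arg_types:
        return arg_types[0]
    return UNKNOWN


# Before the change FIRST_VALUE falls through to UNKNOWN; after it, the
# type of the first argument is propagated, exactly as for LAST_VALUE.
before = annotate("FirstValue", ["BIGINT"], FIRST_ARG_FUNCTIONS_BEFORE)
after = annotate("FirstValue", ["BIGINT"], FIRST_ARG_FUNCTIONS_AFTER)
```

In this model a single registration in the shared set is enough, which is why a per-dialect duplicate becomes redundant.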

RegexpExtract had no base entry either, so only BigQuery and Snowflake
typed it. Register it as constant-VARCHAR in the Hive typing module
(covers Hive/Spark2/Spark/Databricks through the existing chain). Keep
BigQuery's _annotate_by_args override since BigQuery genuinely overloads
on STRING vs BYTES input. Snowflake's existing entry is preserved.

Scoping the registration to Hive (not base) avoids leaking VARCHAR onto
dialects with different semantics, most notably DuckDB, where
REGEXP_EXTRACT can return a STRUCT when group names are passed.
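
The scoping argument can be illustrated with a toy lookup that walks a dialect's parent chain. This is a sketch under assumptions: the `PARENT` and `METADATA` tables and the `resolve` helper are hypothetical and do not mirror how sqlglot actually stores or merges its typing metadata. Only the Hive/Spark2/Spark/Databricks chain and the DuckDB outcome come from the description above.

```python
UNKNOWN = "UNKNOWN"

# Hypothetical child -> parent links. The Hive-family chain follows the
# commit text; DuckDB is assumed to inherit from base only.
PARENT = {
    "hive": "base",
    "spark2": "hive",
    "spark": "spark2",
    "databricks": "spark",
    "duckdb": "base",
}

# Hypothetical per-dialect registrations: RegexpExtract lives in Hive,
# not in base, so only Hive descendants can see it.
METADATA = {
    "base": {},
    "hive": {"RegexpExtract": "VARCHAR"},
}


def resolve(dialect, func_name):
    """Walk up the parent chain until some dialect registers func_name."""
    current = dialect
    while current is not None:
        entry = METADATA.get(current, {}).get(func_name)
        if entry is not None:
            return entry
        current = PARENT.get(current)
    return UNKNOWN


databricks_type = resolve("databricks", "RegexpExtract")
duckdb_type = resolve("duckdb", "RegexpExtract")
```

Had the entry been placed in `base`, the DuckDB lookup in this model would also return VARCHAR, which is the leak the Hive-only scoping avoids.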

Adds fixture coverage in annotate_functions.sql:
- cross-dialect FIRST_VALUE on BIGINT and STRING
- spark2/spark/databricks REGEXP_EXTRACT on STRING and BINARY input
  (pins the result to STRING regardless of input type, unlike BigQuery's
  input-type overload)
- snowflake REGEXP_SUBSTR on STRING
- duckdb REGEXP_EXTRACT pinned at UNKNOWN to lock in the Hive-only
  scoping
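
For readers unfamiliar with the fixture layout, the entries in the hunks of this commit pair a SQL expression with its expected type, optionally preceded by a `# dialect:` comment. The parser below is a hypothetical sketch of that layout, assuming one statement per line; it is not sqlglot's own fixture loader, which may behave differently.

```python
def parse_fixture(text):
    """Return (dialects, sql, expected_type) tuples from fixture text.

    Assumes each entry is a SQL line followed by a type line, both ending
    in ';', with an optional '# dialect: a, b' comment before the SQL.
    """
    entries = []
    dialects = None
    pending_sql = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# dialect:"):
            names = line.split(":", 1)[1]
            dialects = [name.strip() for name in names.split(",")]
            continue
        if line.startswith("#"):
            continue  # any other comment is ignored in this sketch
        statement = line.rstrip(";")
        if pending_sql is None:
            pending_sql = statement
        else:
            entries.append((dialects, pending_sql, statement))
            pending_sql = None
            dialects = None
    return entries


SAMPLE = """
FIRST_VALUE(tbl.bigint_col) OVER (ORDER BY tbl.bigint_col);
BIGINT;

# dialect: spark2, spark, databricks
REGEXP_EXTRACT(tbl.bin_col, pattern, 0);
STRING;
"""

entries = parse_fixture(SAMPLE)
```

An entry without a dialect comment carries `None` here, standing in for the cross-dialect cases such as the FIRST_VALUE fixtures.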
diff --git a/sqlglot/typing/__init__.py b/sqlglot/typing/__init__.py
@@ -252,6 +252,7 @@
             exp.ArrayReverse,
             exp.ArraySlice,
             exp.Filter,
+            exp.FirstValue,
             exp.HavingMax,
             exp.LastValue,
             exp.Limit,
diff --git a/sqlglot/typing/bigquery.py b/sqlglot/typing/bigquery.py
@@ -178,7 +178,6 @@ def _annotate_array(self: TypeAnnotator, expression: exp.Array) -> exp.Array:
             exp.DateAdd,
             exp.DateTrunc,
             exp.DatetimeTrunc,
-            exp.FirstValue,
             exp.GroupConcat,
             exp.IgnoreNulls,
             exp.JSONExtract,
diff --git a/sqlglot/typing/hive.py b/sqlglot/typing/hive.py
@@ -28,6 +28,7 @@
             exp.CurrentSchema,
             exp.Hex,
             exp.NextDay,
+            exp.RegexpExtract,
             exp.Repeat,
             exp.Replace,
             exp.Soundex,
diff --git a/tests/fixtures/optimizer/annotate_functions.sql b/tests/fixtures/optimizer/annotate_functions.sql
@@ -91,6 +91,12 @@ BIGINT;
 LAST_VALUE(tbl.bigint_col) OVER (ORDER BY tbl.bigint_col);
 BIGINT;
 
+FIRST_VALUE(tbl.bigint_col) OVER (ORDER BY tbl.bigint_col);
+BIGINT;
+
+FIRST_VALUE(tbl.str_col) OVER (ORDER BY tbl.str_col);
+TEXT;
+
 TO_BASE32(tbl.bytes_col);
 VARCHAR;
 
@@ -199,6 +205,14 @@ STRING;
 SUBSTRING(tbl.bin_col, 0, 0);
 BINARY;
 
+# dialect: spark2, spark, databricks
+REGEXP_EXTRACT(tbl.str_col, pattern, 0);
+STRING;
+
+# dialect: spark2, spark, databricks
+REGEXP_EXTRACT(tbl.bin_col, pattern, 0);
+STRING;
+
 # dialect: spark2, spark, databricks
 CONCAT(tbl.bin_col, tbl.bin_col);
 BINARY;
@@ -2375,6 +2389,10 @@ BIGINT;
 ABS(tbl.bigint_col);
 BIGINT;
 
+# dialect: snowflake
+REGEXP_SUBSTR(tbl.str_col, pattern, 1);
+VARCHAR;
+
 # dialect: snowflake
 ABS(tbl.double_col);
 DOUBLE;