Commit dea9df7

Document why PHP cannot expose PCRE2 callouts (the only way to get an AST)
Tested whether FFI to libpcre2-8 could supply a callout callback so a match could record (rule, offset) tuples. It cannot:

- pcre2_set_callout_8 takes a function pointer.
- PHP FFI does not allow PHP closures to be cast to C function pointers; libffi closure support is intentionally not enabled in PHP's FFI build.

So pure-PHP code can call pcre2_compile_8 / pcre2_match_8 via FFI but cannot supply a callout function. The (?C) callouts in the pattern have no observable effect.

Documents the surveyed paths to building a PCRE2-driven AST in PHP, all of which are blocked or worse than the existing parser:

1. Stock preg_*: ovector is last-match-wins per numbered group, even with (?J) duplicate names (each (?<name>...) occurrence has its own slot but each slot only retains the last match). Recursive named groups expose nothing about intermediate matches. (*MARK) only retains the last mark. PHP exposes no callout callback.
2. FFI to libpcre2: blocked as described above.
3. Multi-pass extraction with preg_match_all on simpler flat patterns: re-implements parsing with regex per layer; not faster than the recursive-descent interpreter.
4. preg_match validate + parser builds AST (exp-regex-hybrid.php): net loss because the parser still has to run on every valid query, and valid is the common case.
5. Custom PHP extension wrapping pcre2_set_callout: significant C work, out of scope.

Conclusion: in stock PHP the regex match is a fast yes/no validator (~92K QPS) and an upper bound on PHP-side parsing speed when an AST is not required (~100K QPS). It cannot replace the AST-producing parser the SQLite driver consumes.
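The last-match-wins behaviour described in item 1 can be seen with stock preg_match alone. The sketch below is illustrative only: the two patterns and subjects are made up for the demonstration and are not taken from the project's grammar.

```php
<?php
// Repeated group: the group matches three times ("a", "b", "c"), but its
// ovector slot only retains the final iteration.
preg_match( '/^(?:(\w+),?)+$/', 'a,b,c', $repeat );
var_dump( $repeat[1] ); // string(1) "c"

// Recursive group: the inner "(b)" is matched through (?1), but captures set
// during a recursion revert when it returns, so only the outermost span of
// group 1 is visible afterwards.
preg_match( '/^(\((?:[^()]++|(?1))*\))$/', '(a(b)c)', $nested );
var_dump( $nested ); // two entries, both "(a(b)c)"
```

Neither result says where the intermediate matches began or ended, which is the information an AST builder would need.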
1 parent 9d36df4 commit dea9df7

1 file changed

Lines changed: 164 additions & 0 deletions

@@ -0,0 +1,164 @@
<?php
/**
 * Probe whether PHP FFI can expose PCRE2 callouts so the regex match
 * can record (rule, offset) tuples that we then turn into an AST.
 *
 * Conclusion: NO.
 *
 * pcre2_set_callout_8 takes a function pointer. PHP FFI does not
 * support binding a PHP closure to a C function pointer; the libffi
 * closure feature is intentionally not enabled in PHP's FFI build.
 * That means even though we can call pcre2_compile_8 / pcre2_match_8
 * via FFI, we cannot supply a PHP-side callout callback - so the
 * (?C) callouts in the pattern have no observable effect.
 *
 * Without callouts, PCRE2's match data exposes only the ovector
 * (one offset pair per numbered group, last-match-wins), which is
 * what php_pcre.c projects into $matches. That isn't enough to
 * reconstruct a recursive parse tree.
 *
 * The remaining options:
 *   1. A custom PHP extension wrapping pcre2_set_callout (significant
 *      C work, out of scope).
 *   2. Multi-pass extraction with preg_match_all on flat sub-patterns
 *      - functionally a parser, performance similar to or worse than
 *      the existing recursive-descent interpreter.
 *   3. Use the regex purely as a yes/no validator, accept that the
 *      AST has to come from the parser. Tested in exp-regex-hybrid.php
 *      and shown to be a net loss for valid-heavy workloads.
 */

if ( ! extension_loaded( 'ffi' ) ) {
	echo "FFI extension not loaded\n";
	exit( 1 );
}

// Minimal subset of the PCRE2 8-bit C API we need to do a match with a
// callout callback. From pcre2.h.
$cdef = <<<'CDEF'
typedef unsigned char PCRE2_UCHAR8;
typedef const PCRE2_UCHAR8 *PCRE2_SPTR8;
typedef size_t PCRE2_SIZE;

typedef struct pcre2_real_compile_context_8 pcre2_compile_context_8;
typedef struct pcre2_real_match_context_8 pcre2_match_context_8;
typedef struct pcre2_real_general_context_8 pcre2_general_context_8;
typedef struct pcre2_real_code_8 pcre2_code_8;
typedef struct pcre2_real_match_data_8 pcre2_match_data_8;

typedef struct pcre2_callout_block_8 {
    uint32_t version;
    uint32_t callout_number;
    uint32_t capture_top;
    uint32_t capture_last;
    PCRE2_SIZE *offset_vector;
    PCRE2_SPTR8 mark;
    PCRE2_SPTR8 subject;
    PCRE2_SIZE subject_length;
    PCRE2_SIZE start_match;
    PCRE2_SIZE current_position;
    PCRE2_SIZE pattern_position;
    PCRE2_SIZE next_item_length;
    PCRE2_SIZE callout_string_offset;
    PCRE2_SIZE callout_string_length;
    PCRE2_SPTR8 callout_string;
    uint32_t callout_flags;
} pcre2_callout_block_8;

pcre2_code_8 *pcre2_compile_8(PCRE2_SPTR8 pattern, PCRE2_SIZE length,
    uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
    pcre2_compile_context_8 *ccontext);

void pcre2_code_free_8(pcre2_code_8 *code);

pcre2_match_data_8 *pcre2_match_data_create_from_pattern_8(
    const pcre2_code_8 *code, pcre2_general_context_8 *gcontext);

void pcre2_match_data_free_8(pcre2_match_data_8 *match_data);

pcre2_match_context_8 *pcre2_match_context_create_8(pcre2_general_context_8 *gcontext);
void pcre2_match_context_free_8(pcre2_match_context_8 *mcontext);

int pcre2_set_callout_8(pcre2_match_context_8 *mcontext,
    int (*callout_function)(pcre2_callout_block_8 *, void *),
    void *callout_data);

int pcre2_match_8(const pcre2_code_8 *code, PCRE2_SPTR8 subject,
    PCRE2_SIZE length, PCRE2_SIZE startoffset, uint32_t options,
    pcre2_match_data_8 *match_data, pcre2_match_context_8 *mcontext);

int pcre2_jit_compile_8(pcre2_code_8 *code, uint32_t options);

PCRE2_SIZE *pcre2_get_ovector_pointer_8(pcre2_match_data_8 *match_data);

void pcre2_get_error_message_8(int errorcode, PCRE2_UCHAR8 *buffer, PCRE2_SIZE bufflen);
CDEF;

// Homebrew location on Apple Silicon macOS; adjust for other systems.
$lib_path = '/opt/homebrew/lib/libpcre2-8.dylib';
$ffi = FFI::cdef( $cdef, $lib_path );

// Compile a tiny pattern with two numbered callouts. pcre2_compile_8 takes
// the raw pattern, so preg-style delimiters must not be included.
$pattern = '(?C1)foo(?C2)bar';
$err_code = FFI::new( 'int' );
$err_off = FFI::new( 'size_t' );

// Copy the pattern bytes into a C buffer before handing it to PCRE2.
$pat_arr = $ffi->new( 'char[' . strlen( $pattern ) . ']' );
FFI::memcpy( $pat_arr, $pattern, strlen( $pattern ) );
// Types declared in the cdef are only visible to the instance methods, so
// use $ffi->cast() rather than the static FFI::cast().
$code = $ffi->pcre2_compile_8(
	$ffi->cast( 'PCRE2_SPTR8', FFI::addr( $pat_arr ) ),
	strlen( $pattern ),
	0,
	FFI::addr( $err_code ),
	FFI::addr( $err_off ),
	null
);
if ( null === $code ) {
	$buf = $ffi->new( 'char[256]' );
	$ffi->pcre2_get_error_message_8( $err_code->cdata, $ffi->cast( 'PCRE2_UCHAR8 *', FFI::addr( $buf ) ), 256 );
	echo 'compile failed: code=', $err_code->cdata, ' offset=', $err_off->cdata, ' msg=', FFI::string( $buf ), "\n";
	exit( 1 );
}
echo "Pattern compiled OK\n";

// Try setting up a callout via FFI.
$callout_log = array();
$mctx = $ffi->pcre2_match_context_create_8( null );
$callout_cb = function ( $blockptr, $data ) use ( &$callout_log ) {
	// $blockptr is an FFI\CData of type pcre2_callout_block_8*.
	$blk = $blockptr;
	$callout_log[] = array(
		'num' => $blk->callout_number,
		'pos' => $blk->current_position,
		'mat' => $blk->start_match,
	);
	return 0; // Continue matching.
};
// Try to obtain a C function pointer of the callout's type that the PHP
// closure could be bound to.
$cb_type = 'int (*)(pcre2_callout_block_8 *, void *)';
echo "Trying to bind callout callback...\n";
try {
	$cb_ffi = $ffi->new( $cb_type );
	echo "Callback type created.\n";
	// PHP FFI does not directly support binding a closure to a function
	// pointer in arbitrary C signatures - this typically needs a Zend
	// FFI extension feature or libffi closures.
} catch ( \Throwable $e ) {
	echo 'Could not bind: ', $e->getMessage(), "\n";
}

// Even attempting to call pcre2_set_callout_8 with a closure tends to
// fail. Document and stop.
echo "\nConclusion: PHP FFI cannot bind a PHP callback to a C function pointer in stock PHP, so it cannot supply a PCRE2 callout function.\n";
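
For comparison, the multi-pass option listed in the docblock (preg_match_all on flat sub-patterns) can record (rule, offset) tuples, but only for a flat token layer. The sketch below is a minimal illustration under stated assumptions: the `$rules` table and the `record_tuples()` helper are hypothetical names invented here, not the project's grammar or code.

```php
<?php
// Hypothetical flat token rules, for illustration only.
$rules = array(
	'keyword' => '(?:SELECT|FROM|WHERE)\b',
	'ident'   => '[A-Za-z_][A-Za-z0-9_]*',
	'number'  => '\d+',
	'symbol'  => '[*,=]',
);

/**
 * Tokenize $input in one flat pass and return array( rule, offset ) tuples.
 */
function record_tuples( array $rules, string $input ): array {
	// Build one alternation with a named group per rule.
	$parts = array();
	foreach ( $rules as $name => $regex ) {
		$parts[] = "(?<{$name}>{$regex})";
	}
	$pattern = '/' . implode( '|', $parts ) . '/i';

	preg_match_all( $pattern, $input, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE );

	$tuples = array();
	foreach ( $sets as $set ) {
		foreach ( array_keys( $rules ) as $name ) {
			// Groups that did not participate are absent or carry offset -1.
			if ( isset( $set[ $name ] ) && $set[ $name ][1] >= 0 ) {
				$tuples[] = array( $name, $set[ $name ][1] );
				break;
			}
		}
	}
	return $tuples;
}

$tuples = record_tuples( $rules, 'SELECT a FROM t WHERE x = 1' );
print_r( $tuples );
```

This yields one tuple per token. Nesting (subqueries, parenthesised expressions) would still need a further pass per layer, which is why the docblock describes this route as functionally a parser.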
