Fix inconsistent index scan behavior due to char signedness on different architectures. by MasahikoSawada · Pull Request #29 · pgbigm/pg_bigm

MasahikoSawada · 2025-09-05T17:49:57Z

GIN indexes utilizing pg_bigm's opclasses store sorted bigrams within index tuples. When comparing and sorting each bigram, pg_bigm treats each character as a 'char[3]' type in C. However, the char type in C can be interpreted as either signed char or unsigned char, depending on the platform, if the signedness is not explicitly specified. Consequently, as reported in issue #15, during replication between different CPU architectures, there was an issue where index scans on standby servers could not locate matching index tuples due to the differing treatment of character signedness. This problem can also happen in cases where the database cluster was initialized on one platform and later migrated to a different platform.

This change introduces comparison functions for bigm that explicitly handle signed char and unsigned char. The appropriate comparison function will be dynamically selected based on the character signedness stored in the control file. Therefore, upgraded clusters can utilize the indexes without rebuilding, provided the cluster upgrade occurs on platforms with the same character signedness as the original cluster initialization.

This commit is inspired by pg_trgm's change: dfd8e6c.

Issue #15

…ent architectures. GIN indexes utilizing pg_bigm's opclasses store sorted bigrams within index tuples. When comparing and sorting each bigram, pg_bigm treats each character as a 'char[3]' type in C. However, the char type in C can be interpreted as either signed char or unsigned char, depending on the platform, if the signedness is not explicitly specified. Consequently, as reported in issue pgbigm#15, during replication between different CPU architectures, there was an issue where index scans on standby servers could not locate matching index tuples due to the differing treatment of character signedness. This problem can also happen in cases where the database cluster was initialized on one platform and later migrated to a different platform. This change introduces comparison functions for bigm that explicitly handle signed char and unsigned char. The appropriate comparison function will be dynamically selected based on the character signedness stored in the control file. Therefore, upgraded clusters can utilize the indexes without rebuilding, provided the cluster upgrade occurs on platforms with the same character signedness as the original cluster initialization. This commit is inspired by pg_trgm's change: dfd8e6c. Issue pgbigm#15 Reviewed-by:

MasahikoSawada · 2025-09-05T17:56:04Z

One concern about this approach is that we are no longer able to inline bigmstrcmp(), leading performance regressions, but I don't have better ideas how to fix this issue. If we see performance regressions we can provide a macro, say DISABLE_CMPBIGM_RUNTIME_SELECTION, to use the current behavior for users who are sure that databases are never migrated or replicated to another architecture platform.

MasaoFujii · 2025-12-25T05:43:21Z

@MasahikoSawada

One concern about this approach is that we are no longer able to inline bigmstrcmp(), leading performance regressions, but I don't have better ideas how to fix this issue.

Just idea would be to keep the current pg_bigm opclass as-is and introduce a new opclass that uses CMPBIGM selected based on the platform. Users who need pg_bigm indexes to work across different architectures could use the new opclass, while others could continue using the existing one. However, this approach may be more complicated and potentially confusing for users.

If we see performance regressions we can provide a macro

Are you planning to measure the performance impact with and without this patch?

MasahikoSawada · 2026-04-17T16:46:20Z

Just idea would be to keep the current pg_bigm opclass as-is and introduce a new opclass that uses CMPBIGM selected based on the platform. Users who need pg_bigm indexes to work across different architectures could use the new opclass, while others could continue using the existing one. However, this approach may be more complicated and potentially confusing for users.

Yeah, I'm concerned that it could be difficult for users to choose the correct opclass on their environment. If they choose the wrong one, they would have inconsistent index scan results on their replicas or have to rebuild the whole index.

I've conducted the performance test results on my environment (RHEL 10.0, m6i.metal. I used PG18 and pg_bigm with and without this patch, and run pgbench -t 10000 -f bench.sql, where bench.sql has a simple query select bigmtextcmp('hello', 'hello');. Here are results:

w/ patch: 0.163 ms
w/o patch: 0.146 ms

I observed about 11% performance regressions.

Given these results, I'm inclined to have a knob to enable this behavior only where required.

MasahikoSawada · 2026-04-17T20:55:59Z

I've conducted the performance test results on my environment (RHEL 10.0, m6i.metal. I used PG18 and pg_bigm with and without this patch, and run pgbench -t 10000 -f bench.sql, where bench.sql has a simple query select bigmtextcmp('hello', 'hello');. Here are results:

I found a mistake in my test script so let me share that test results that I run on another environment:

w/ patch: 0.006 ms
w/o patch: 0.006 ms

When using a longer text with show_bigm(), which uses CMPBIGM() in unique_array(), I got:

w/ patch: 0.059 ms
w/o patch: 0.046 ms

I still see some performance regression although there is not huge.

This change is required since bigmtextcmp() needs to directly call bigmstrcmp().

MasahikoSawada · 2026-06-03T23:28:49Z

What do you think about this change? @MasaoFujii

MasaoFujii · 2026-06-04T03:14:08Z

@MasahikoSawada

Yeah, I'm concerned that it could be difficult for users to choose the correct opclass on their environment. If they choose the wrong one, they would have inconsistent index scan results on their replicas or have to rebuild the whole index.

How about making the new opclass the default and keeping the current one as an optional alternative? That would avoid architecture-dependent inconsistencies for most users, since they would naturally use the new default opclass. Users who prioritize performance and understand the risk of cross-architecture inconsistencies could still choose the current opclass explicitly. Would that address your concern?

I still see some performance regression although there is not huge.

Thanks for benchmarking the patch performance impact.

Based on the results, TBH, at this point, my preference would be either to split pg_bigm into two opclasses - one using the new comparison function and one preserving the current behavior - or, as you suggested, add a knob to enable the new behavior only when needed.

MasahikoSawada added 2 commits April 17, 2026 15:36

Use bigmstrcmp instead of CMPBIGM.

6cce087

This change is required since bigmtextcmp() needs to directly call bigmstrcmp().

Fix a typo.

41e3879

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix inconsistent index scan behavior due to char signedness on different architectures.#29

Fix inconsistent index scan behavior due to char signedness on different architectures.#29
MasahikoSawada wants to merge 3 commits into
pgbigm:masterfrom
MasahikoSawada:signedness

MasahikoSawada commented Sep 5, 2025

Uh oh!

MasahikoSawada commented Sep 5, 2025

Uh oh!

MasaoFujii commented Dec 25, 2025

Uh oh!

MasahikoSawada commented Apr 17, 2026

Uh oh!

MasahikoSawada commented Apr 17, 2026

Uh oh!

MasahikoSawada commented Jun 3, 2026

Uh oh!

MasaoFujii commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MasahikoSawada commented Sep 5, 2025

Uh oh!

MasahikoSawada commented Sep 5, 2025

Uh oh!

MasaoFujii commented Dec 25, 2025

Uh oh!

MasahikoSawada commented Apr 17, 2026

Uh oh!

MasahikoSawada commented Apr 17, 2026

Uh oh!

MasahikoSawada commented Jun 3, 2026

Uh oh!

MasaoFujii commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants