Frequent failover errors: nodeKindToString: unknown node kind xx #1107

@tboevil

My long-running test environment performs manual failovers repeatedly. After a while, it starts throwing the error: nodeKindToString: unknown node kind xx. After adding diagnostic logs, I found that once the timeline number exceeds 1024, pgKind (the node kind) gets overwritten by bytes written past the end of a neighboring buffer, i.e. a memory overflow.

code:

	/* Diagnostic helper: log pgKind and the 8 bytes stored just before it. */
	static void
	diag_log_pgkind_snapshot(const char *tag, LocalPostgresServer *postgres)
	{
		unsigned char *pgKindPtr = (unsigned char *) &(postgres->pgKind);

		log_info("DIAGNOSTIC [%s] pgKind snapshot: postgres->pgKind=%d (0x%02x) "
				 "bytes=[%02x %02x %02x %02x %02x %02x %02x %02x]",
				 tag,
				 postgres->pgKind, postgres->pgKind,
				 pgKindPtr[-8], pgKindPtr[-7], pgKindPtr[-6], pgKindPtr[-5],
				 pgKindPtr[-4], pgKindPtr[-3], pgKindPtr[-2], pgKindPtr[-1]);
	}

        ......
	diag_log_pgkind_snapshot("RPL-SNAP-03b1 enter-pgctl_identify_system", postgres);

	if (!prepare_primary_conninfo(primaryConnInfo,
								  MAXCONNINFO,
								  primaryNode->host,
								  primaryNode->port,
								  replicationSource->userName,
								  NULL, /* no database */
								  replicationSource->password,
								  replicationSource->applicationName,
								  replicationSource->sslOptions,
								  false)) /* no need for escaping */
	{
		/* errors have already been logged. */
		return false;
	}

	/*
	 * Per https://www.postgresql.org/docs/12/protocol-replication.html:
	 *
	 * To initiate streaming replication, the frontend sends the replication
	 * parameter in the startup message. A Boolean value of true (or on, yes,
	 * 1) tells the backend to go into physical replication walsender mode,
	 * wherein a small set of replication commands, shown below, can be issued
	 * instead of SQL statements.
	 */
	int len = sformat(primaryConnInfoReplication, MAXCONNINFO,
					  "%s replication=1",
					  primaryConnInfo);

	if (len >= MAXCONNINFO)
	{
		log_warn("Failed to call IDENTIFY_SYSTEM: primary_conninfo too large");
		return false;
	}

	diag_log_pgkind_snapshot("RPL-SNAP-03b2 after-sformat-replicationConninfo", postgres);

log:

Dec 16 20:44:49 xxx-1 pg_autoctl[3163148]: 20:44:49 3163148 INFO  DIAGNOSTIC [RPL-SNAP-03b1 enter-pgctl_identify_system] pgKind snapshot: postgres->pgKind=1 (0x01) bytes=[00 00 00 00 00 00 00 00]
Dec 16 20:44:49 xxx-1 pg_autoctl[3163148]: 20:44:49 3163148 INFO  DIAGNOSTIC [RPL-SNAP-03b2 after-sformat-replicationConninfo] pgKind snapshot: postgres->pgKind=1 (0x01) bytes=[00 00 00 00 00 00 00 00]
Dec 16 20:44:49 xxx-1 pg_autoctl[3163148]: 20:44:49 3163148 INFO  DIAGNOSTIC [RPL-SNAP-03b3 after-pgsql_identify_system] pgKind snapshot: postgres->pgKind=66 (0x42) bytes=[42 00 00 00 98 9f 7a 4c]

Could you please check whether my analysis is correct? Also, why is PG_AUTOCTL_MAX_TIMELINES set to 1024? Is there anything special about this value?
