Merging with PostgreSQL

Greenplum is a fork of PostgreSQL. Since it was forked from PostgreSQL 8.2, a lot of new features have been added to PostgreSQL. We'd like to have them in Greenplum as well.

Plan

The plan is to merge PostgreSQL 8.3, then 8.4, then 9.0, and so forth into Greenplum. In theory, this means doing "git merge pgsql/REL8_3_0", and fixing all the merge conflicts until the regression tests pass again. In practice, of course, it's a bit more complicated.

##Tracking which upstream patches are applicable or not: https://docs.google.com/spreadsheets/d/1dDDDqV4ysVmDHjzLPjZGO6IMJO5DKldbgIZv8x3uMJQ/edit?usp=sharing

When we start doing the merge, we will update this spreadsheet as we go.

Status

The current status is that we have merged all of 8.3 (including the final minor release, 8.3.23) into Greenplum. This merge has not yet shipped in an official Greenplum release, but it is in the master branch.

Things we can do before we start merging

##src/backend/nodes

Every time a new field is added/removed in one of the parse or plan tree structs, its read/out/equal functions need to be adjusted accordingly. This is laborious, and often produces bugs of omission in PostgreSQL too. For Greenplum, we just need to remember to update these functions whenever we modify Greenplum-specific structs, like upstream does.

There's one extra twist that presents a merge hazard, however: in addition to the out and read functions for each struct, Greenplum has a copy of each out and read function, in outfast.c and readfast.c. The copies need to be manually kept in sync with the originals.

Let's try to refactor outfast.c and readfast.c so that we don't need to update these files whenever a new field is added in upstream. Perhaps we can put some #ifdefs in outfuncs.c and readfuncs.c so that we can compile those files twice, with different READ_* and WRITE_* macros, to produce the same code that we have in outfast.c and readfast.c.

CaQL

Most syscache lookups in Greenplum have been converted to use the so-called caql interface, with simplified SQL queries against the catalog tables. The CaQL system performs the same syscache lookups, or scans the catalog tables with systable_beginscan.

That's going to create a lot of merge conflicts. There is no particular advantage to using CaQL over direct SearchSysCache calls, so in new code, there's no point in converting SearchSysCache calls to CaQL anymore. But we also don't need to rip out and replace all existing CaQL calls right now. I suggest that we replace CaQL lookups with the upstream version whenever we get a merge conflict and have to touch a piece of code, to make future merges easier. But no need to touch lookups when there's no conflict.

##CVS tags and Copyright notices

There are a lot of trivial conflicts from the $PostgreSQL$ tags in file headers:

@@@ -4,7 -4,7 +4,11 @@@
  #    Makefile for access/common
  #
  # IDENTIFICATION
++<<<<<<< HEAD
 +#    $PostgreSQL: pgsql/src/backend/access/common/Makefile,v 1.25 2008/02/19 11:49:12 petere Exp $
++=======
+ #    $PostgreSQL: pgsql/src/backend/access/common/Makefile,v 1.23 2007/01/20 17:16:10 petere Exp $
++>>>>>>> REL8_3_0
  #

These tags were removed in PostgreSQL 9.1, when PostgreSQL migrated to Git. I guess the best approach is to take the upstream versions of these tags, so that the next merge will not conflict on the same lines. The same goes for copyright notices. Once we catch up with 9.1, the problem will go away.

I wrote a small Perl script to resolve these kinds of conflicts automatically (attached). You still need to run the script, but it helps a little bit.
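The attached Perl script is not reproduced on this page. As a rough illustration of the same idea, here is a minimal sketch in Python (not the original script). It assumes the default two-way conflict markers in the working-tree file, and the function name resolve_cvs_tag_conflicts is made up for this example. A conflict hunk is resolved to the upstream side only when the two sides are identical once every expanded $PostgreSQL: ...$ keyword is collapsed. Anything else is left alone for manual resolution.

```python
import re

# Matches an expanded CVS keyword such as "$PostgreSQL: pgsql/...,v 1.23 ... Exp $".
CVS_TAG = re.compile(r"\$PostgreSQL:[^$]*\$")


def _normalize(line):
    """Collapse any expanded $PostgreSQL: ...$ keyword to a fixed placeholder."""
    return CVS_TAG.sub("$PostgreSQL$", line)


def resolve_cvs_tag_conflicts(text):
    """Resolve conflict hunks whose two sides differ only in $PostgreSQL$ tags.

    Such hunks are replaced by the upstream ("theirs") side. Every other
    hunk is kept as-is, markers included, so it can be resolved by hand.
    """
    lines = text.splitlines(keepends=True)
    out = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("<<<<<<< "):
            out.append(lines[i])
            i += 1
            continue
        # Locate the separator and the end marker of this hunk.
        try:
            sep = next(j for j in range(i + 1, len(lines))
                       if lines[j].startswith("======="))
            end = next(j for j in range(sep + 1, len(lines))
                       if lines[j].startswith(">>>>>>> "))
        except StopIteration:
            # Malformed hunk: copy the rest of the file unchanged.
            out.extend(lines[i:])
            break
        ours = lines[i + 1:sep]
        theirs = lines[sep + 1:end]
        if [_normalize(l) for l in ours] == [_normalize(l) for l in theirs]:
            out.extend(theirs)               # tag-only conflict: take upstream
        else:
            out.extend(lines[i:end + 1])     # real conflict: leave untouched
        i = end + 1
    return "".join(out)
```

Applied to the Makefile conflict shown above, this keeps the REL8_3_0 line and drops the markers. Copyright notices are not handled by this sketch and would need a second pattern.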

tidycat/catullus

We've entirely replaced the way built-in functions and types are stored in the source code. In PostgreSQL, they are stored in pg_proc.h and pg_type.h as DATA rows. In GPDB, they are stored as DDL-like CREATE commands in pg_proc.sql and pg_type.sql.

While that's a nice system on its own, it's going to cause major merge conflicts. Every modification to the catalogs in the upstream will conflict.

We need to refactor the build scripts so that upstream catalog entries can live in pg_proc.h and pg_type.h, as they are in the upstream. We can keep the tidycat/catullus mechanism for GPDB-added entries, however.

MirroredLock

Every place where a buffer is accessed, there are calls to MIRROREDLOCK_BUFMGR_LOCK and similar macros, to protect the buffer lookups with MirroredLock. That's pretty troublesome: whenever PostgreSQL introduces a new ReadBuffer() call, it needs to be decorated with the MIRROREDLOCK macros to work correctly, so there will be bugs of omission. Direct filesystem access with open/write/etc. has also been replaced with MirroredFile_Open etc.

It would be nice to get rid of this mechanism, but I don't know where to start. Any ideas?

One idea would be to replace the whole file mirroring and change tracking mechanism with streaming replication. That's a rather large undertaking, so I'm inclined not to make it a prerequisite to merging. But it would certainly help a lot. It's something that a separate team could work on, in parallel with the merge. It might also be helpful to get to PostgreSQL 9.0 first, so that you could rely on streaming replication.

Another approach would be to leave the system as it is, and go ahead with the merge without worrying about missing MIRROREDLOCKs. After the merge, perform a scan through the whole codebase, grepping for ReadBuffer and similar calls, and add any missing MIRROREDLOCKs.
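The post-merge scan could start from something like the following Python sketch. It is only a heuristic and rests on assumptions that are not part of the plan above: "similar calls" is taken to mean any function whose name starts with ReadBuffer, and a call counts as guarded if MIRROREDLOCK_BUFMGR_LOCK appears within a fixed number of preceding lines. The function name find_unguarded_reads and the window size are made up for this example. The output is a list of candidates for manual review, not a verdict.

```python
import re

# Assumption: "ReadBuffer and similar calls" means names starting with ReadBuffer.
CALL = re.compile(r"\bReadBuffer\w*\s*\(")
LOCK = "MIRROREDLOCK_BUFMGR_LOCK"


def find_unguarded_reads(source, window=40):
    """Return 1-based line numbers of ReadBuffer-style calls that have no
    MIRROREDLOCK_BUFMGR_LOCK on the same line or in the `window` lines before.
    """
    lines = source.splitlines()
    suspects = []
    for idx, line in enumerate(lines):
        if not CALL.search(line):
            continue
        start = max(0, idx - window)
        # Look at the call line itself plus the preceding window.
        if not any(LOCK in prev for prev in lines[start:idx + 1]):
            suspects.append(idx + 1)
    return suspects
```

Run over each .c file in the tree, this will also flag function definitions and calls whose lock is taken by a caller, so every hit still needs a human to decide whether a MIRROREDLOCK is really missing.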
