Skip to content

Linkage state machine + set_rust_owned for orphan ownership#195

Merged
dginev merged 7 commits into
masterfrom
rust-owned-orphans
May 9, 2026
Merged

Linkage state machine + set_rust_owned for orphan ownership#195
dginev merged 7 commits into
masterfrom
rust-owned-orphans

Conversation

@dginev
Copy link
Copy Markdown
Member

@dginev dginev commented May 9, 2026

Summary

Replaces the _Node.unlinked: bool field with a 3-variant Linkage enum
(Linked / Unlinked / RustOwned) that drives _Node::Drop. The old
"free if unlinked" heuristic conflated tree-detachment with Rust ownership
and could fire xmlFreeNode against memory the source document still
considered live — corrupting the dictionary on the next xmlCopyNode
against a sibling, with no last-error.

The new Drop rules:

  • Linked — source document owns; never free.
  • UnlinkedxmlUnlinkNode ran but node->doc is still set; free
    only if node->doc == NULL (true C-level orphan).
  • RustOwned — caller transferred ownership via set_rust_owned;
    always free.

New API

  • Node::set_rust_owned(&self) — transfer the C allocation's lifetime to
    the wrapper. After this, the last Node clone's Drop calls
    xmlFreeNode, regardless of node->doc. Sticky across clones.
  • Node::is_rust_owned(&self) -> bool (also on RoNode, always false).
  • Document::dup_node_into_new_doc(&Node) -> Result<Document, ()> — deep
    copy a subtree into a fresh, independent document. Works around
    xmlDocCopyNode returning NULL on the second call within one source
    doc.

Misuse-mode guards

  • set_rust_owned debug-asserts on Linked input.
  • set_linked (crate-internal, called from add_child / add_*_sibling)
    debug-asserts on RustOwned input to catch re-link-after-takeover.
  • Document::import_node returns Err(()) on RustOwned source —
    release-mode guard since the set_linked debug-assert is compiled out.

debug_assert! is fully elided in release builds, so hot paths keep
their prior cost.

Test plan

  • cargo test --tests — 98 tests across 11 files, all pass
  • set_rust_owned happy paths: default state, idempotent, sticky
    across clones, post-dup_node_into_new_doc, drop-before-source-doc,
    mixed reattach, descendant-wrappers-drop-before-parent, 16-section
    stress
  • Misuse #[should_panic] tests in debug builds
  • import_node rejects RustOwned source (release-mode regression
    test)
  • ASAN (-Z sanitizer=address): zero leaks across all rust_owned_*
    tests; the pre-existing node_can_unbind leak (379 B / 6 allocs) is
    preserved as the documented "unlinked-and-forgotten" baseline that
    set_rust_owned is the cure for

Note on documented constraints

Three lifetime constraints are documented but not enforced in release:

  1. A RustOwned orphan must drop before its source document is freed
    (libxml2's xmlFreeNode walks node->doc->dict).
  2. Don't call set_rust_owned on a Linked node.
  3. Don't re-link a RustOwned node via add_child / add_*_sibling /
    import_node.

Each is debug-asserted; misuse in release is silent UB. This matches the
prior shape of the crate's safety contracts.

Motivation

The latexml-oxide post-processing pipeline implements --split by walking
the source document and, for each chapter or section, unlinking the
subtree, deep-copying it into a fresh sub-document, serializing it as a
standalone page, and discarding the source-side detached node. This
exercises every edge in the new state machine: the Unlinked → RustOwned
transition replaces the prior UAF (the second xmlCopyNode returned NULL
because the first wrapper drop had freed memory the source dict still
referenced), dup_node_into_new_doc is the per-page extraction primitive,
and set_rust_owned reclaims each extracted node so the per-document
allocation cost stays bounded across hundreds of split pages.

dginev and others added 7 commits May 9, 2026 07:36
Replace the `_Node.unlinked: bool` field with a 3-variant `Linkage` enum
(`Linked` / `Unlinked` / `RustOwned`) that drives `_Node::Drop`. The old
"free if unlinked" heuristic conflated tree-detachment with Rust
ownership and could free memory the source document still considered
live (corrupting the dictionary on the next `xmlCopyNode` against a
sibling).

The new Drop rules:
  * `Linked` — source document owns; never free.
  * `Unlinked` — `xmlUnlinkNode` ran but `node->doc` is set; free only
    if `node->doc == NULL` (true C-level orphan).
  * `RustOwned` — caller transferred ownership via `set_rust_owned`;
    always free.

Public API additions: `Node::set_rust_owned`, `Node::is_rust_owned`.
`set_rust_owned` debug-asserts on `Linked`-input. `set_linked` (crate
internal, called from `add_child` etc.) debug-asserts on `RustOwned`-
input to catch re-link-after-takeover misuse. `debug_assert!` is fully
elided in release builds, so hot paths keep their prior cost.

`unlink_node`'s rustdoc spells out the three valid post-unlink
dispositions: re-attach, copy out via `Document::dup_node_into_new_doc`,
or call `set_rust_owned` to free. An unlinked node that does none of
these leaks until process exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `Document::dup_node_into_new_doc(&Node) -> Result<Document, ()>`:
build a fresh document whose root is a deep copy of the given node's
subtree. Unlike `import_node`, it does not require the source to be
unlinked and does not mutate the source wrapper's state. This is the
canonical path for repeatedly extracting subtrees from a single source
document — `xmlDocCopyNode(src, dst, 1)` returns NULL on the second
sibling within one source doc; this method works around that by
copying into a fresh doc per call.

Drop the legacy `node.set_linked()` workaround inside
`dup_node_into_new_doc`. With the new `_Node::Drop` rules, a detached
subtree whose `node->doc` still points at the source is treated as
doc-owned and not freed by the wrapper; the workaround is unneeded.

Harden `Document::import_node` to reject `RustOwned` source nodes.
Without the early return, a release build would silently no-op on
the internal `set_linked` transition (debug-assert is compiled out)
and set up a later double-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`RoNode` is a `Copy` borrow into a document tree owned by a parent
`Document`. It has no `Drop` and cannot take ownership of the C
allocation, so `is_rust_owned` is unconditionally `false`. Mirrors
`Node::is_rust_owned` so callers can write code generic over the two
node types without special-casing `RoNode`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new test files exercising the wrapper's lifetime model:

* tests/dup_node_into_new_doc_tests.rs (9 tests) — basic, multi-sibling
  extraction, source-dropped-first, post-unlink chain, post-XPath
  mutation, repeated-namespaces, large-doc stress (uses the new
  resources/large_doc.xml fixture), and mixed-XPath at scale.

* tests/xml_copy_invariant_tests.rs (10 tests) — direct libxml2 FFI
  invariants. Verify that source-doc state remains intact across
  xmlCopyDoc / xmlCopyNode round-trips, that consecutive xmlCopyNode
  calls against the same source don't corrupt the dictionary, and
  that the split-extract pattern lifecycles cleanly.

* tests/rust_owned_tests.rs (11 tests) — `set_rust_owned` happy paths
  (default state, idempotent, sticky-across-clones, post-dup,
  drop-before-source-doc, mixed reattach, descendant-wrappers-drop-
  before-parent, 16-section stress), plus misuse-mode debug-assert
  guards (`#[should_panic]` tests for `set_rust_owned` on Linked
  input and re-link via add_child) and the `import_node` rejection
  test.

All 49 tests pass; the rust_owned suite is leak-clean under
`-Z sanitizer=address`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Patch-version bump to bindgen 0.72.1 for an upstream codegen fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`cargo clippy --fix --all-targets` plus a few manual touch-ups:

* `bool_assert_comparison` — auto-fixed in tests/c14n.rs, tests/tree_tests.rs,
  tests/xml_copy_invariant_tests.rs, tests/xpath_tests.rs.
* `assertions_on_constants` — replaced `assert!(false)` with `unreachable!`
  in tests/tree_tests.rs and dropped a no-op `assert!(true, ...)` in
  tests/xpath_tests.rs.
* `manual_c_str_literals` — switched to `c"..."` literals in
  tests/xml_copy_invariant_tests.rs.
* `doc_lazy_continuation` — added paragraph-break blank `///` lines after
  embedded bullet/numbered lists in src/tree/node.rs and
  tests/rust_owned_tests.rs rustdoc.

Remaining warnings (13) all originate in the bindgen-generated
`bindings.rs` under target/; they regenerate on every build and are not
editable from this crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dginev dginev merged commit b61a809 into master May 9, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant