[Detail Bug] Java docs ingestion stores incorrect package names for Java 9+ modular Javadoc URLs #65

Description

@detail-app

Detail Bug Report

https://app.detail.dev/org_befd6425-a158-4e24-9d4d-1e5c08769515/bugs/bug_9c08315a-f502-4836-83fd-e350a8e09fbb

Introduced in 6f249e0 by @WilliamAGH on Sep 1, 2025

Summary

  • Context: JavaPackageExtractor extracts Java package names from Javadoc URLs for ingestion metadata, used downstream for citation URL anchor generation and ingestion deduplication.
  • Bug: The URL-based extraction strategy incorrectly includes the JPMS module name (e.g., java.base, java.sql) as part of the package name for Java 9+ modular Javadoc URLs.
  • Actual vs. expected: For the URL .../api/java.base/java/lang/String.html, the extractor returns java.base.java.lang instead of java.lang.
  • Impact: Incorrect anchor generation for methods with non-hardcoded java.lang parameter types (CharSequence, StringBuilder, etc.), and incorrect package metadata stored in Qdrant.

Code with Bug

if (url.contains(API_PATH_SEGMENT)) {
    int apiPathOffset = url.indexOf(API_PATH_SEGMENT) + API_PATH_SEGMENT.length();
    String pathAfterApi = url.substring(apiPathOffset);
    String[] pathSegments = pathAfterApi.split("/");
    StringBuilder packageBuilder = new StringBuilder();
    for (String pathSegment : pathSegments) {
        if (pathSegment.endsWith(".html")) break;
        if (packageBuilder.length() > 0) packageBuilder.append('.');
        packageBuilder.append(pathSegment); // <-- BUG 🔴 includes JPMS module segment as package component
    }
    String packageName = packageBuilder.toString();
    if (packageName.contains(".")) return packageName;
}

Explanation

Modern Oracle Javadoc URLs are modular (Java 9+): the first segment after /api/ is a JPMS module name (e.g., java.base), followed by the package path (e.g., java/lang). The extractor joins all pre-.html segments into the “package”, so it incorrectly returns java.base.java.lang instead of java.lang.

This was verified end-to-end using real Oracle Java 25 pages where the configured HTML selectors (.subTitle, .package) match no elements, forcing the URL-based path. Examples:

  • .../api/java.base/java/lang/String.html → extracted java.base.java.lang (expected java.lang)
  • .../api/java.sql/java/sql/Connection.html → extracted java.sql.java.sql (expected java.sql)
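The two results above can be reproduced in isolation with a small standalone sketch. It copies the joining logic from the snippet under "Code with Bug" into a hypothetical class, ModularUrlRepro. The value "/api/" for API_PATH_SEGMENT, the empty-string fallback, and the example.invalid host are assumptions made for illustration. The real extractor has other strategies and a different fallback.

```java
// Hypothetical standalone reproduction; not the project's actual class.
final class ModularUrlRepro {
    // Assumed value: the issue shows only the constant's name.
    static final String API_PATH_SEGMENT = "/api/";

    // Same segment-joining logic as the snippet under "Code with Bug".
    static String extractFromUrl(String url) {
        if (url.contains(API_PATH_SEGMENT)) {
            int apiPathOffset = url.indexOf(API_PATH_SEGMENT) + API_PATH_SEGMENT.length();
            String[] pathSegments = url.substring(apiPathOffset).split("/");
            StringBuilder packageBuilder = new StringBuilder();
            for (String pathSegment : pathSegments) {
                if (pathSegment.endsWith(".html")) break;
                if (packageBuilder.length() > 0) packageBuilder.append('.');
                // The module segment (java.base, java.sql) is appended here too.
                packageBuilder.append(pathSegment);
            }
            String packageName = packageBuilder.toString();
            if (packageName.contains(".")) return packageName;
        }
        return ""; // placeholder fallback, assumed for this sketch
    }

    public static void main(String[] args) {
        // Placeholder host; only the path after /api/ matters here.
        String base = "https://example.invalid/api/";
        System.out.println(extractFromUrl(base + "java.base/java/lang/String.html")); // java.base.java.lang
        System.out.println(extractFromUrl(base + "java.sql/java/sql/Connection.html")); // java.sql.java.sql
    }
}
```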

The contaminated package is stored as document metadata and later reused for citation/member anchor refinement, producing incorrect anchors for simple, non-hardcoded java.lang types (e.g., CharSequence).

Codebase Inconsistency

Production defaults to Oracle Java 25 docs (modular URLs), making this bug deterministic under default config:

app.docs.root-url=${DOCS_ROOT_URL:https://docs.oracle.com/en/java/javase/25/}

Recommended Fix

Skip the first path segment after /api/ when it is a JPMS module name (e.g., starts with java., jdk., javafx.) before building the package name.
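A minimal sketch of this fix follows, written as a standalone class rather than as a patch to JavaPackageExtractor. The class name, the helper looksLikeModuleName, the "/api/" value of API_PATH_SEGMENT, the empty-string fallback, and the example.invalid host are all assumptions. The prefix list is taken from the examples above and may not cover every module a docs site publishes.

```java
// Hypothetical sketch of the recommended fix; names are illustrative only.
final class ModularJavadocPackageSketch {
    // Assumed value: the issue shows only the constant's name.
    static final String API_PATH_SEGMENT = "/api/";
    // Prefixes taken from the recommendation; extend if other modules appear.
    static final String[] MODULE_PREFIXES = {"java.", "jdk.", "javafx."};

    // A package directory never contains a dot, so a dotted first segment
    // with a known prefix is treated as a JPMS module name.
    static boolean looksLikeModuleName(String segment) {
        if (segment.endsWith(".html")) return false;
        for (String prefix : MODULE_PREFIXES) {
            if (segment.startsWith(prefix)) return true;
        }
        return false;
    }

    static String extractPackage(String url) {
        int apiIndex = url.indexOf(API_PATH_SEGMENT);
        if (apiIndex < 0) return ""; // placeholder fallback for this sketch
        String[] pathSegments = url.substring(apiIndex + API_PATH_SEGMENT.length()).split("/");
        // Skip the module segment only when it is the first one after /api/.
        int start = looksLikeModuleName(pathSegments[0]) ? 1 : 0;
        StringBuilder packageBuilder = new StringBuilder();
        for (int i = start; i < pathSegments.length; i++) {
            if (pathSegments[i].endsWith(".html")) break;
            if (packageBuilder.length() > 0) packageBuilder.append('.');
            packageBuilder.append(pathSegments[i]);
        }
        String packageName = packageBuilder.toString();
        // Same dotted-name check as the original snippet.
        return packageName.contains(".") ? packageName : "";
    }

    public static void main(String[] args) {
        String base = "https://example.invalid/api/"; // placeholder host
        System.out.println(extractPackage(base + "java.base/java/lang/String.html")); // java.lang
        System.out.println(extractPackage(base + "java.sql/java/sql/Connection.html")); // java.sql
        System.out.println(extractPackage(base + "java/lang/String.html")); // java.lang
    }
}
```

The non-modular form (/api/java/lang/String.html) is unaffected because its first segment, java, matches none of the prefixes.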

History

This bug was introduced in commit 6f249e0. The original implementation joined all URL path segments after /api/ into a package name, assuming a non-modular URL structure (/api/java/lang/String.html) rather than modular (/api/java.base/java/lang/String.html).

Labels: bug