Detail Bug Report
https://app.detail.dev/org_befd6425-a158-4e24-9d4d-1e5c08769515/bugs/bug_9c08315a-f502-4836-83fd-e350a8e09fbb
Introduced in 6f249e0 by @WilliamAGH on Sep 1, 2025
Summary
- Context: JavaPackageExtractor extracts Java package names from Javadoc URLs for ingestion metadata, used downstream for citation URL anchor generation and ingestion deduplication.
- Bug: The URL-based extraction strategy incorrectly includes the JPMS module name (e.g., java.base, java.sql) as part of the package name for Java 9+ modular Javadoc URLs.
- Actual vs. expected: For the URL .../api/java.base/java/lang/String.html, it returns java.base.java.lang instead of java.lang.
- Impact: Incorrect anchor generation for methods with non-hardcoded java.lang parameter types (CharSequence, StringBuilder, etc.), and incorrect package metadata stored in Qdrant.
Code with Bug
if (url.contains(API_PATH_SEGMENT)) {
    int apiPathOffset = url.indexOf(API_PATH_SEGMENT) + API_PATH_SEGMENT.length();
    String pathAfterApi = url.substring(apiPathOffset);
    String[] pathSegments = pathAfterApi.split("/");
    StringBuilder packageBuilder = new StringBuilder();
    for (String pathSegment : pathSegments) {
        if (pathSegment.endsWith(".html")) break;
        if (packageBuilder.length() > 0) packageBuilder.append('.');
        packageBuilder.append(pathSegment); // <-- BUG 🔴 includes JPMS module segment as package component
    }
    String packageName = packageBuilder.toString();
    if (packageName.contains(".")) return packageName;
}
Explanation
Modern Oracle Javadoc URLs are modular (Java 9+): the first segment after /api/ is a JPMS module name (e.g., java.base), followed by the package path (e.g., java/lang). The extractor joins all pre-.html segments into the “package”, so it incorrectly returns java.base.java.lang instead of java.lang.
This was verified end-to-end using real Oracle Java 25 pages where the configured HTML selectors (.subTitle, .package) match no elements, forcing the URL-based path. Examples:
.../api/java.base/java/lang/String.html → extracted java.base.java.lang (expected java.lang)
.../api/java.sql/java/sql/Connection.html → extracted java.sql.java.sql (expected java.sql)
The contaminated package is stored as document metadata and later reused for citation/member anchor refinement, producing incorrect anchors for simple, non-hardcoded java.lang types (e.g., CharSequence).
Codebase Inconsistency
Production defaults to Oracle Java 25 docs (modular URLs), making this bug deterministic under default config:
app.docs.root-url=${DOCS_ROOT_URL:https://docs.oracle.com/en/java/javase/25/}
Recommended Fix
Skip the first path segment after /api/ when it is a JPMS module name (e.g., starts with java., jdk., javafx.) before building the package name.
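A minimal sketch of that fix, not the project's actual patch. The class name ModularJavadocPackageSketch, the helper looksLikeModuleName, the "/api/" value for API_PATH_SEGMENT, and the empty-string return for "no package found" are all assumptions made for illustration; the real JavaPackageExtractor may fall through to other strategies instead.

```java
import java.util.List;

// Hypothetical standalone sketch; names and return conventions are assumptions.
final class ModularJavadocPackageSketch {
    // Assumed value; the report does not show the real constant.
    private static final String API_PATH_SEGMENT = "/api/";
    // Prefixes named in the recommended fix.
    private static final List<String> MODULE_PREFIXES = List.of("java.", "jdk.", "javafx.");

    // True when the segment looks like a JPMS module name such as java.base.
    static boolean looksLikeModuleName(String segment) {
        for (String prefix : MODULE_PREFIXES) {
            if (segment.startsWith(prefix)) return true;
        }
        return false;
    }

    // Returns the dotted package name, or "" when none can be derived from the URL.
    static String extractPackageFromUrl(String url) {
        int apiIndex = url.indexOf(API_PATH_SEGMENT);
        if (apiIndex < 0) return "";
        String pathAfterApi = url.substring(apiIndex + API_PATH_SEGMENT.length());
        String[] pathSegments = pathAfterApi.split("/");
        StringBuilder packageBuilder = new StringBuilder();
        for (int i = 0; i < pathSegments.length; i++) {
            String pathSegment = pathSegments[i];
            if (pathSegment.endsWith(".html")) break;
            // Skip only the first segment, and only when it looks like a module name.
            if (i == 0 && looksLikeModuleName(pathSegment)) continue;
            if (packageBuilder.length() > 0) packageBuilder.append('.');
            packageBuilder.append(pathSegment);
        }
        String packageName = packageBuilder.toString();
        // Same guard as the original code: accept only dotted names.
        return packageName.contains(".") ? packageName : "";
    }
}
```

With this sketch, .../api/java.base/java/lang/String.html and the non-modular .../api/java/lang/String.html both yield java.lang, and .../api/java.sql/java/sql/Connection.html yields java.sql. The prefix list only covers the module families named above. Since a package directory segment can never contain a dot, treating any dotted first segment as a module name may be a broader alternative, at the cost of relying on that URL layout.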
History
This bug was introduced in commit 6f249e0. The original implementation joined all URL path segments after /api/ into a package name, assuming a non-modular URL structure (/api/java/lang/String.html) rather than modular (/api/java.base/java/lang/String.html).