CaptionForge — experimental ComfyUI developer preview

CaptionForge is an experimental ComfyUI captioning framework for LoRA dataset preparation.

The current v0.1.0 preview is ready for ComfyUI users and node developers to install, test, compare, critique, and improve.

CaptionForge runs as a practical multi-pass captioning workflow: raw caption witnesses, text-LLM distillation, image-aware VLM validation, and auditable TXT/JSONL exports.

This is still an early public preview, so node names, JSONL schemas, prompts, model defaults, and documentation may evolve before a more formal release. Caption quality may also vary by dataset and model choice.

The main question this preview is trying to answer is simple: can multi-engine witness captions, text-LLM distillation, image-aware VLM validation, and JSONL audit trails produce better LoRA training captions than a single strong captioner alone?

Feedback, caption-quality comparisons, model recommendations, bug reports, and workflow suggestions are welcome.

What CaptionForge is

CaptionForge is a local, model-agnostic captioning framework for building richer, more auditable captions for LoRA dataset preparation inside ComfyUI.

A single vision-language model caption can be useful, but it is not always reliable. CaptionForge treats raw captions as imperfect witness statements, then uses a multi-pass process to merge, validate, correct, and export final training captions.

The current developer preview is tuned primarily for character and style LoRA captioning: detailed human, humanoid, illustrated, rendered, doll-like, pageant, cosplay, fashion, portrait, and stylized subjects where clothing, pose, body shape, facial traits, hair, eyes, makeup, materials, color, lighting, framing, and visible style cues matter.

Other image domains may work, but will likely require custom prompts, different model choices, or workflow tuning. Support for additional domains is in the works and will be added in the near future.

Current status

CaptionForge v0.1.0 is an experimental developer preview.

That means:

the project is meant to be installed and taken for a test drive
the workflow may still change
node names and categories may still be refined
JSONL schemas and audit fields may evolve
model behavior depends heavily on the selected backend models
it may or may not outperform a strong standalone captioner for your dataset
bug reports and comparison feedback are valuable

CaptionForge does not ship model weights. Large Joy, Qwen, Ollama, and other model downloads remain user-controlled.

Experimental or unsupported branches should remain unregistered by default and should not appear as active ComfyUI nodes unless explicitly imported by the package.

Main workflow

The active CaptionForge workflow is a multi-pass pipeline:

Pass A: Raw witness captions
  Joy Python caption witness xN
  Qwen Python caption witness xN
  optional Ollama VLM caption witness xN

Pass B: Text-LLM distillation
  a text-only Ollama LLM treats raw captions as witness ballots
  it emits accepted claims, plausible singleton candidates, rejected conflicts,
  and rich/taggy draft captions

Pass C: Image-aware VLM validation
  an Ollama VLM checks the Pass B evidence against the actual image
  it removes unsupported details, corrects visible errors, preserves supported
  LoRA-relevant detail, and writes validated final-caption candidates

Pass D: Export
  natural captions and taggy/comma-style captions are exported as TXT/JSONL sidecars

The intended behavior is not merely “summarize several captions.” The goal is to preserve useful trainable visual detail while reducing unsupported hallucinations through later image-aware validation.
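The "witness ballot" idea behind Pass B can be illustrated with a deterministic toy. This is not CaptionForge's distiller, which prompts a text LLM over free-form captions. The sketch assumes the witness captions have already been reduced to attribute/value pairs, and the function name `tally_ballots` and the two-vote threshold are illustrative assumptions.

```python
from collections import Counter, defaultdict


def tally_ballots(witness_claims):
    """Sort attribute claims into accepted, singleton, and conflicting groups.

    witness_claims: list of dicts (one per witness) mapping attribute -> value.
    Hypothetical structure, used only to illustrate the ballot idea.
    """
    votes = defaultdict(Counter)
    for claims in witness_claims:
        for attribute, value in claims.items():
            votes[attribute][value.strip().lower()] += 1

    accepted, singletons, conflicts = {}, {}, {}
    for attribute, counter in votes.items():
        ranked = counter.most_common()
        top_value, top_count = ranked[0]
        if len(ranked) == 1:
            # Only one value was ever proposed: accept it if repeated,
            # otherwise keep it as a plausible singleton candidate.
            target = accepted if top_count >= 2 else singletons
            target[attribute] = top_value
        elif top_count >= 2 and top_count > ranked[1][1]:
            # A clear majority wins; the losing values are recorded as rejected.
            accepted[attribute] = top_value
            conflicts[attribute] = [value for value, _ in ranked[1:]]
        else:
            # No clear winner: leave the decision to image-aware validation.
            conflicts[attribute] = [value for value, _ in ranked]
    return accepted, singletons, conflicts


witnesses = [
    {"hair color": "red", "eye color": "green", "outfit": "blue dress"},
    {"hair color": "Red", "eye color": "blue", "earrings": "gold hoops"},
    {"hair color": "red", "outfit": "blue dress"},
]
accepted, singletons, conflicts = tally_ballots(witnesses)
```

In this toy run the repeated hair color and outfit are accepted, the earrings remain a singleton candidate, and the disputed eye color is exactly the kind of claim that Pass C would need to check against the image.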

Why use this instead of standalone Joy or another standard captioning approach?

You may not need to.

JoyCaption can be excellent by itself, especially with a good local setup or a strong hosted implementation. CaptionForge is an experiment in whether a structured local pipeline can do better often enough to justify the extra workflow complexity.

CaptionForge may be useful when:

one raw captioner is strong but misses details
different captioners notice different useful details
you want JSONL audit trails of intermediate records
you want a text LLM to consolidate repeated witness evidence
you want a VLM to validate the draft against the source image
you want natural and taggy exports from the same audited run
you are preparing LoRA training captions and care about visible, trainable detail

A useful negative result is still useful: if a dataset is better served by standalone Joy, Qwen, or another captioner, that is exactly the kind of comparison this preview is meant to surface.

What CaptionForge tries to optimize

CaptionForge currently favors captions that are:

richer than a single generic image caption
less hallucinated than unvalidated text-only synthesis
useful for LoRA training
auditable through JSONL sidecars
locally runnable
prompt-configurable
model-agnostic enough to swap better witnesses, distillers, validators, and formatters over time

Useful caption details often include:

subject type and visible style
face shape and facial traits
hair color and hairstyle
eye color and makeup as separate details
expression and pose
hands and body position
body shape and visible proportions when relevant
clothing construction, layers, and fit
accessories, jewelry, nails, props, and distinctive details
colors, materials, textures, lighting, background, framing, and crop

Visible sensual, glamour, swimwear, lingerie, revealing clothing, cleavage, side openings, exposed midriff, or similar styling may be described neutrally when it is actually visible and relevant to the dataset. CaptionForge prompts should not invent hidden anatomy, unseen clothing, explicit acts, or contradicted details.

Active node families

Node categories are being normalized under:

Captioning/CaptionForge

with active caption nodes under:

Captioning/CaptionForge/Caption Nodes

Pipeline and orchestration

JLC CaptionForge Pipeline Planner

The central planning node for normal runs.

It coordinates:

input image path or direct image passthrough
recursive folder traversal
filename glob filtering
output directory
run name
overwrite behavior
Pass A witness run counts
seed schedules
sampling schedules
max image size
max token budget
LoRA trigger word
user caption anchor
distiller settings
validator settings
final export settings
derived JSONL/TXT/config paths
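As a rough illustration of the folder traversal and filename glob filtering listed above, the following sketch collects image files with the standard library. The function name `find_images` and the suffix set are assumptions made for the example; the Planner's actual implementation and supported formats may differ.

```python
from fnmatch import fnmatch
from pathlib import Path

# Assumed set of image suffixes, chosen only for this example.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def find_images(root, pattern="*", recursive=True):
    """Return sorted image paths under root whose filename matches pattern."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path
        for path in candidates
        if path.is_file()
        and path.suffix.lower() in IMAGE_SUFFIXES
        and fnmatch(path.name, pattern)
    )
```

Sorting the result keeps the processing order stable between runs, which makes JSONL records easier to compare.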

JLC CaptionForge

The main capstone/orchestration node.

Current target behavior:

consume Pass A raw caption records
run Pass B text-LLM distillation
run Pass C image-aware VLM validation
export final natural and taggy captions

The natural caption should come from the image-aware VLM validation pass. A text-only stage may be used for formatting/taggy output, but it should not blindly rewrite the natural final caption.

Caption witnesses

JLC CaptionForge Joy Caption

Python-based JoyCaption/LLaVA-family Pass A witness.

Joy remains one of the strongest raw caption witnesses and is treated as a major first-class CaptionForge caption source.

JLC CaptionForge Qwen Caption

Python-based Qwen-family Pass A witness.

Qwen is useful as a second caption voice, especially when its model behavior complements Joy. Optional 8-bit loading may be available where supported.

JLC CaptionForge Ollama Caption

Ollama-backed VLM Pass A witness.

This node delegates model execution to a local Ollama server rather than loading Hugging Face/PyTorch model weights inside ComfyUI. It can use configured Ollama VLM tags such as:

gemma4:26b
qwen3.6:35B-A3B
huihui_ai/gemma-4-abliterated:26b

The Ollama Caption node is an optional raw-caption witness. It does not replace the later VLM validator/capstone role.
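For orientation, delegating a caption request to a local Ollama server can look roughly like the sketch below, which uses Ollama's `/api/generate` endpoint with a base64-encoded image. The helper names, the default prompt, and the timeout are assumptions for illustration and are not taken from the CaptionForge node code.

```python
import base64
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_caption_request(image_bytes, model, prompt):
    """Build the JSON payload for a single non-streaming caption request."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def caption_image(image_path, model="gemma4:26b",
                  prompt="Describe the visible details of this image.",
                  timeout=300):
    """Send one image to the local Ollama server and return the caption text."""
    with open(image_path, "rb") as handle:
        payload = build_caption_request(handle.read(), model, prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `caption_image` requires a running Ollama server with the chosen model tag already pulled.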

Prompt and option helpers

JLC CaptionForge Template Options

Shared prompt-option sidecar for caption nodes.

It is intended to help request consistent LoRA-relevant detail across caption witnesses without forcing every backend model into the same prompt implementation.

CaptionForge Ollama model dropdown config

The file:

config/captionforge_ollama_models.json

defines user-editable Ollama model tags for dropdowns used by distiller, validator, formatter, and Ollama caption-witness nodes.

Output files

CaptionForge writes auditable sidecars during planned runs. Current conventions include:

<run_name>__A_RAW_CAPTIONS.jsonl
<run_name>__B_DISTILL.jsonl
<run_name>__B_DISTILL_readable.jsonl
<run_name>__B_DISTILL_readable.json
<run_name>__B_DISTILL_prompts.jsonl
<run_name>__C_VLM_VALIDATED.jsonl
<run_name>__C_VLM_VALIDATED_readable/
<run_name>__C_VLM_VALIDATOR_prompts.jsonl
<run_name>__D_FINAL_EXPORT.jsonl
<run_name>__TXT/
<run_name>__output_paths.json
<run_name>__run_config.json

Exact filenames and schemas may change during the developer-preview phase.
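The naming convention above can be reproduced with a small helper like the following sketch. The dictionary keys and the function name `derive_paths` are hypothetical, and only a subset of the listed sidecars is shown.

```python
from pathlib import Path

# Suffixes copied from the current naming convention; the keys are illustrative.
SIDECAR_SUFFIXES = {
    "raw_captions": "__A_RAW_CAPTIONS.jsonl",
    "distill": "__B_DISTILL.jsonl",
    "validated": "__C_VLM_VALIDATED.jsonl",
    "final_export": "__D_FINAL_EXPORT.jsonl",
    "txt_dir": "__TXT",
    "output_paths": "__output_paths.json",
    "run_config": "__run_config.json",
}


def derive_paths(output_dir, run_name):
    """Map each sidecar key to its path under output_dir for the given run."""
    base = Path(output_dir)
    return {
        key: base / f"{run_name}{suffix}"
        for key, suffix in SIDECAR_SUFFIXES.items()
    }


paths = derive_paths("output", "demo_run")
print(paths["distill"].name)
```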

Final outputs are expected to include:

Natural caption:  VLM-validated prose
Taggy caption:    comma-separated LoRA-style caption
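To show how a final export record might turn into a per-image TXT sidecar, here is a minimal sketch. The record fields `image` and `natural_caption` are hypothetical placeholders, since the JSONL schema is still evolving; check the actual export file before relying on any field name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_txt_sidecars(records, txt_dir, field="natural_caption"):
    """Write one <image stem>.txt file per record and return the written paths.

    Assumes each record has an "image" path and the requested caption field;
    both names are illustrative, not the project's confirmed schema.
    """
    txt_dir = Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        target = txt_dir / (Path(record["image"]).stem + ".txt")
        target.write_text(record[field].strip() + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Passing a different `field` value would write the taggy variant from the same records.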

Model configuration

CaptionForge uses two model ecosystems:

Python / Hugging Face model folders for Joy and Qwen witness engines.
Ollama models for text-LLM distillation, image-aware VLM validation, optional formatting, and Ollama-backed caption witnesses.

A typical Ollama dropdown config:

{
  "_meta": {
    "name": "CaptionForge Ollama Model Dropdowns",
    "version": "0.1.0",
    "description": "User-editable Ollama model dropdown configuration for CaptionForge nodes and engines.",
    "consumed_by": [
      "nodes/captionforge_ollama_model_dropdowns.py",
      "CaptionForge Pipeline Planner",
      "JLC CaptionForge capstone",
      "JLC CaptionForge Ollama Caption"
    ],
    "notes": [
      "Values should be concrete Ollama model tags used exactly as written.",
      "distiller_models are used for text-only LLM distillation and formatting stages.",
      "validator_models are used for image-aware VLM validation.",
      "caption_models are used by Ollama-backed Pass A caption witness nodes.",
      "Set include_custom to true to expose a custom model-tag entry in supported nodes."
    ]
  },
  "distiller_models": [
    "mistral-small:24b",
    "VladimirGav/gemma4-26b-16GB-VRAM-Uncensored",
    "deepseek-r1:32b",
    "tarruda/neuraldaredevil-8b-abliterated:fp16",
    "gpt-oss:20b"
  ],
  "validator_models": [
    "gemma4:26b",
    "qwen3.6:35B-A3B",
    "huihui_ai/gemma-4-abliterated:26b"
  ],
  "format_models": [
    "mistral-small:24b",
    "VladimirGav/gemma4-26b-16GB-VRAM-Uncensored",
    "gpt-oss:20b",
    "deepseek-r1:32b"
  ],
  "caption_models": [
    "gemma4:26b",
    "qwen3.6:35B-A3B",
    "huihui_ai/gemma-4-abliterated:26b"
  ],
  "defaults": {
    "distiller_model": "mistral-small:24b",
    "validator_model": "gemma4:26b",
    "format_model": "mistral-small:24b",
    "caption_model": "gemma4:26b"
  },
  "include_custom": true
}

Terminology:

distiller_model   text-only LLM for Pass B distillation
validator_model   image-aware VLM for Pass C validation
format_model      text-only LLM for formatting/taggy conversion when used
caption_model     Ollama-backed Pass A image-caption witness model
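A node could turn this config into dropdown choices along the lines of the sketch below. The function name `dropdown_options` and the literal "custom" entry are assumptions; the real dropdown helper in nodes/captionforge_ollama_model_dropdowns.py may behave differently.

```python
import json

# Hypothetical label for the free-text entry enabled by include_custom.
CUSTOM_ENTRY = "custom"


def dropdown_options(config, role):
    """Return (options, default) for role: distiller, validator, format, or caption."""
    options = list(config.get(f"{role}_models", []))
    default = config.get("defaults", {}).get(f"{role}_model")
    if default and default not in options:
        # Make sure the configured default is always selectable.
        options.insert(0, default)
    if config.get("include_custom"):
        options.append(CUSTOM_ENTRY)
    return options, default


config = json.loads("""
{
  "validator_models": ["gemma4:26b", "qwen3.6:35B-A3B"],
  "defaults": {"validator_model": "gemma4:26b"},
  "include_custom": true
}
""")
options, default = dropdown_options(config, "validator")
```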

Model locations

Large model weights are intentionally not stored in this repository.

Python-based witness models are expected under ComfyUI model folders, for example:

ComfyUI/models/LLM/JLC_QwenCaption/
ComfyUI/models/LLM/JLC_JoyCaption/

Ollama models must be installed and runnable through Ollama outside this repository.
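One way to confirm that the configured tags are actually installed is to compare them with the list returned by Ollama's `/api/tags` endpoint. The sketch below assumes the default local server address, and the helper names are illustrative.

```python
import json
import urllib.request


def installed_ollama_tags(base_url="http://localhost:11434", timeout=10):
    """Ask the local Ollama server which model tags are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return {model["name"] for model in json.loads(response.read())["models"]}


def normalize(tag):
    """Ollama treats a name without an explicit tag as ':latest'."""
    name = tag.rsplit("/", 1)[-1]
    return tag if ":" in name else f"{tag}:latest"


def missing_tags(wanted, installed):
    """Return the wanted tags that are not present in the installed set."""
    return [tag for tag in wanted if normalize(tag) not in installed]
```

A typical check would be `missing_tags(config["validator_models"], installed_ollama_tags())`, followed by `ollama pull` for anything reported missing.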

CaptionForge does not require every supported backend to be installed for every workflow. Users can test smaller subsets first.

Installation

ComfyUI custom node install

Clone or copy CaptionForge into your ComfyUI custom nodes directory:

ComfyUI/custom_nodes/CaptionForge

Then restart ComfyUI.

Dependencies

Python dependencies are declared in pyproject.toml.

Typical local use may involve:

torch
transformers
accelerate
huggingface-hub
pillow
numpy
safetensors
qwen-vl-utils

Optional quantization support may involve:

bitsandbytes

Ollama-backed stages require a working local Ollama installation and installed Ollama model tags.
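Before a first run, it can help to check which of these packages are importable in the Python environment that ComfyUI uses. The sketch below maps each distribution name to its import name; treat the grouping into typical and optional packages as a reading of the lists above, not as the contents of pyproject.toml.

```python
from importlib.util import find_spec

# Distribution name -> import name (they differ for a few packages).
TYPICAL = {
    "torch": "torch",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "huggingface-hub": "huggingface_hub",
    "pillow": "PIL",
    "numpy": "numpy",
    "safetensors": "safetensors",
    "qwen-vl-utils": "qwen_vl_utils",
}
OPTIONAL = {"bitsandbytes": "bitsandbytes"}


def missing_packages(packages):
    """Return distribution names whose top-level module cannot be found."""
    return [name for name, module in packages.items() if find_spec(module) is None]


if __name__ == "__main__":
    print("missing typical packages:", missing_packages(TYPICAL))
    print("missing optional packages:", missing_packages(OPTIONAL))
```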

Hardware notes

CaptionForge is designed for local workflows, but strong results may require large local models.

Practical performance depends on:

GPU VRAM
system RAM
model size
quantization mode
Ollama version
context length
image size
number of Pass A witness runs
whether models are kept loaded or unloaded between runs

The author's active development environment includes an RTX 4090 Laptop GPU with 16 GB VRAM. Larger models may be slow, may require careful quantization, or may need more capable hardware.
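A back-of-the-envelope estimate shows why quantization matters on a 16 GB card. The sketch counts weight memory only, using decimal gigabytes, and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The bit widths are nominal; actual quantization formats often use slightly more bits per weight.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"26B model at {bits}-bit: ~{weight_memory_gb(26, bits):.0f} GB of weights")
```

By this estimate a 26B model needs roughly 52 GB at 16-bit, 26 GB at 8-bit, and 13 GB at 4-bit for weights alone, which leaves little headroom on 16 GB of VRAM even at 4-bit.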

Experimental branches

Some experimental or unsupported code may exist in the repository for future A/B testing or research.

Experimental branches should be:

clearly labeled
kept out of the normal ComfyUI registration path
not imported by __init__.py
not shown as mainline nodes unless deliberately enabled
treated as unsupported starting points rather than stable user features
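ComfyUI discovers nodes through the `NODE_CLASS_MAPPINGS` dictionary exported by a package's `__init__.py`, so keeping an experimental node out of that dictionary keeps it out of the node menu. The sketch below illustrates one possible gating pattern. The environment variable `CAPTIONFORGE_ENABLE_EXPERIMENTAL`, the helper `build_node_mappings`, and both example node classes are hypothetical and are not part of CaptionForge.

```python
import os


def build_node_mappings(mainline, experimental=None, enable_experimental=False):
    """Return the registration dict; experimental nodes are opt-in only."""
    mappings = dict(mainline)
    if enable_experimental and experimental:
        mappings.update(experimental)
    return mappings


class ExampleMainlineNode:
    """Minimal ComfyUI-style node used only for this illustration."""

    CATEGORY = "Captioning/CaptionForge"
    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"text": ("STRING", {"default": ""})}}

    def run(self, text):
        return (text,)


class ExampleExperimentalNode(ExampleMainlineNode):
    CATEGORY = "Captioning/CaptionForge/Experimental"


# Hypothetical opt-in switch; unset by default, so only mainline nodes register.
NODE_CLASS_MAPPINGS = build_node_mappings(
    {"ExampleMainlineNode": ExampleMainlineNode},
    {"ExampleExperimentalNode": ExampleExperimentalNode},
    enable_experimental=os.environ.get("CAPTIONFORGE_ENABLE_EXPERIMENTAL") == "1",
)
```

In a real package the experimental module would be imported only inside the opt-in branch, so that unsupported code is never loaded during a normal ComfyUI start.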

The active public workflow should be the main Planner → Pass A witnesses → Distiller → VLM Validator → Export path.

Development principles

CaptionForge currently prioritizes:

local execution
auditable intermediate records
JSONL sidecars
reusable engines separated from ComfyUI node wrappers
planner-driven workflows
model cache and VRAM hygiene
strong defaults for LoRA captioning
explicit prompt roles
model-agnostic backends
visible, trainable detail over generic caption prose
practical feedback from real datasets

Feedback wanted

Useful feedback includes:

comparisons against standalone JoyCaption, Qwen, or other captioners
examples where CaptionForge improves caption quality
examples where CaptionForge makes captions worse
hallucination reports
missed-detail reports
model recommendations
prompt improvements
broken node reports
workflow usability feedback
VRAM/performance observations
JSONL/audit trail suggestions

Please include enough context to reproduce the issue or evaluate the result: selected nodes, model tags, relevant settings, whether the run used direct IMAGE input or a folder path, and a small sample of generated captions when possible.

Attribution & License

Concept and implementation by J. L. Córdova, with development assistance from ChatGPT (OpenAI).

CaptionForge's Joy/template-option workflow is locally adapted and was inspired in part by the practical template interface pattern used by the public JoyCaption Beta One Hugging Face Space:

https://huggingface.co/spaces/fffiloni/JoyCaption-Beta-One

Released under the MIT License. See LICENSE for details.
