Source code for the paper "From Emergence to Control: Probing and Modulating Self-Reflection in Language Models"
This project investigates self-reflection in Large Language Models (LLMs) through probing and steering techniques. We explore how reflection behaviors emerge and how they can be controlled through vector-based interventions.
- Probing Vectors: Techniques to detect and measure self-reflection patterns in model activations
- Model Insertion: Methods for injecting steering vectors to modulate reflection behavior
- Reflection Analysis: Frameworks for evaluating and understanding model self-reflection
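As a rough illustration of the probing and steering ideas listed above, here is a minimal sketch in plain Python. It is an assumption for illustration only, not the paper's method or this package's API: it uses a simple difference-of-means direction as the probing vector and adds a scaled copy of it to a hidden state as the steering step. The function names (`difference_of_means`, `steer`, `projection`) and the toy activations are hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def difference_of_means(reflective, non_reflective):
    """Hypothetical probing vector: mean(reflective) - mean(non_reflective)."""
    mu_pos = mean_vector(reflective)
    mu_neg = mean_vector(non_reflective)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha):
    """Hypothetical steering step: add alpha * direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Dot product of a hidden state with the probing direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy 2-dimensional "activations" (made-up numbers, not real model outputs).
reflective = [[1.0, 2.0], [3.0, 2.0]]
non_reflective = [[0.0, 2.0], [0.0, 2.0]]

direction = difference_of_means(reflective, non_reflective)
hidden = [1.0, 1.0]
steered = steer(hidden, direction, alpha=0.5)

print(direction)
print(projection(hidden, direction), projection(steered, direction))
```

With a positive `alpha` the steered state projects more strongly onto the direction, and a negative `alpha` would push it the other way. In a real model the same arithmetic would be applied to hidden states at a chosen layer rather than to hand-written lists.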
uv sync
# Lint check
uv run ruff check src/ tests/
# Format
uv run ruff format src/ tests/
# Type check
uv run mypy src/
# Run tests
uv run pytest
.
├── src/probing_reflection/  # Source code
│   ├── __init__.py
│   └── py.typed
├── tests/                   # Test files
├── docs/                    # Documentation
│   └── design-docs/         # Design documents
├── AGENTS.md                # AI agent instructions
├── ARCHITECTURE.md          # System architecture
└── pyproject.toml           # Project configuration
This project follows an AI-assisted research workflow. See AGENTS.md for detailed instructions on how AI agents should work in this repository.
If you use this code, please cite:
@article{probing_reflection_2024,
title={From Emergence to Control: Probing and Modulating Self-Reflection in Language Models},
author={[Authors]},
year={2024}
}
[Add your license here]