Add a research-backed acceptance harness for dspy-micro-agent (minimal agent runtime built with DSPy modules and a thin Python loop)

## Summary

Convert the repo's latent product contract into a repeatable benchmark suite with explicit pass/fail evidence.

This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.

## Repo Evidence

- Repository description: Minimal agent runtime built with DSPy modules and a thin Python loop. Includes CLI, FastAPI server, and eval harness with OpenAI/Ollama support.
- Tree signals: 0 docs files, 2 workflows, 0 proto files, 4 test-like files.
- `AGENTS.md:3` includes latent-spec language: This repository implements a minimal agent runtime on top of DSPy. It is intentionally small and opinionated: use DSPy modules for planning/tool-calls and a thin Python loop for orchestration, tracing, and evaluation.
- `AGENTS.md:48` includes latent-spec language: Evals - `evals/run_evals.py` runs a small dataset and prints:
- `AGENTS.md:49` includes latent-spec language: Evals - `evals/run_evals.py` runs a small dataset and prints: - `success_rate`, `contains_hit_rate`, `key_hit_rate`, `avg_latency_sec`, `avg_lm_calls`, `avg_tool_calls`, `avg_steps`, `avg_cost_usd`, `n`.
- `AGENTS.md:51` includes latent-spec language: - `success_rate`, `contains_hit_rate`, `key_hit_rate`, `avg_latency_sec`, `avg_lm_calls`, `avg_tool_calls`, `avg_steps`, `avg_cost_usd`, `n`. - Scoring combines substring hits and key-in-observation hits per `evals/rubrics.yaml`.
- `AGENTS.md:61` includes latent-spec language: - `micro_agent/optimize.py` — compile few‑shot demos and save for the OpenAI planner. - `evals/run_evals.py` — metrics harness with cost/latency/usage.
- `README.md:8` includes latent-spec language: - Thin runtime (`agent.py`) handles looping, tool routing, and trace persistence. - CLI and FastAPI server, plus a tiny eval harness.
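For orientation, the scoring described in the `AGENTS.md` snippets (substring hits combined with key-in-observation hits) might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the rubric fields `contains` and `keys`, the shape of an observation, and the rule that a case succeeds only when both checks pass. None of it is taken from `evals/run_evals.py` or `evals/rubrics.yaml`, so confirm against `main` before relying on it.

```python
from typing import Any, Dict, List


def contains_hit(answer: str, expected: List[str]) -> bool:
    """True when every expected substring occurs in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(item.lower() in lowered for item in expected)


def key_hit(observations: List[Dict[str, Any]], expected_keys: List[str]) -> bool:
    """True when every expected key shows up in at least one tool observation."""
    seen = set()
    for obs in observations:
        seen.update(obs.keys())
    return all(key in seen for key in expected_keys)


def score_case(
    answer: str,
    observations: List[Dict[str, Any]],
    rubric: Dict[str, List[str]],
) -> Dict[str, bool]:
    """Score one eval case against a hypothetical rubric entry."""
    c = contains_hit(answer, rubric.get("contains", []))
    k = key_hit(observations, rubric.get("keys", []))
    # Assumption: success requires both checks; the real harness may weigh them differently.
    return {"contains_hit": c, "key_hit": k, "success": c and k}
```

Under these assumptions, `score_case("The answer is 4.", [{"tool": "calculator", "result": 4}], {"contains": ["4"], "keys": ["result"]})` reports both hits and `success` as `True`.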

## Research Grounding

Repo axes: research, evaluation, tooling, security

Search keywords: evals, tool, micro_agent, json, default, openai, tools, small, dspy, jsonl, ollama, ask

- [arXiv:2506.19773v2](https://arxiv.org/abs/2506.19773v2) Automatic Prompt Optimization for Knowledge Graph Construction: Insights from an Empirical Study (Nandana Mihindukulasooriya, Niharika S. D'Souza, Faisal Chowdhury, Horst Samulowitz), 2025.
- [arXiv:2507.03620v1](https://arxiv.org/abs/2507.03620v1) Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy (Francisca Lemos, Victor Alves, Filipa Ferraz), 2025.
- [arXiv:2412.15298v1](https://arxiv.org/abs/2412.15298v1) A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation (Bhaskarjit Sarmah, Kriti Dutta, Anna Grigoryan, Sachin Tiwari, Stefano Pasquali, Dhagash Mehta), 2024.
- [arXiv:2503.11118v1](https://arxiv.org/abs/2503.11118v1) UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning (Kristin Qi, Youxiang Zhu, Xiaohui Liang), 2025.
- [arXiv:2604.04869v1](https://arxiv.org/abs/2604.04869v1) Optimizing LLM Prompt Engineering with DSPy Based Declarative Learning (Shiek Ruksana, Sailesh Kiran Kurra, Thipparthi Sanjay Baradwaj), 2026.
- [arXiv:2508.04660v1](https://arxiv.org/abs/2508.04660v1) Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs (Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian), 2025.
- [arXiv:2310.03714v1](https://arxiv.org/abs/2310.03714v1) DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan), 2023.
- [arXiv:2507.14241v3](https://arxiv.org/abs/2507.14241v3) Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models (Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke), 2025.
- [arXiv:2601.22129v2](https://arxiv.org/abs/2601.22129v2) SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents (Yifeng Ding, Lingming Zhang), 2026.
- [arXiv:2602.00997v1](https://arxiv.org/abs/2602.00997v1) Error Taxonomy-Guided Prompt Optimization (Mayank Singh, Vikas Yadav, Eduardo Blanco), 2026.

## What To Build

- Define the smallest representative `dspy-micro-agent` golden workflow and capture expected inputs, outputs, and evidence artifacts.
- Add fixtures for a successful path, an ambiguous/degraded path, and a failure path.
- Publish a command that local agents and CI can run before shipping related changes.
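One possible shape for the three fixtures and the runnable command is sketched below. It is a minimal illustration, not the repo's implementation: `Fixture`, `AgentResult`, `classify`, `run_suite`, and `stub_agent` are hypothetical names, and the stub stands in for the real agent so the sketch runs without an LM backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Fixture:
    name: str
    question: str
    expected: str  # one of "pass", "degraded", "fail"


@dataclass
class AgentResult:
    answer: Optional[str]
    error: Optional[str] = None


def classify(result: AgentResult) -> str:
    """Map a raw result to a verdict: errors fail, empty answers are degraded."""
    if result.error is not None:
        return "fail"
    if not result.answer:
        return "degraded"
    return "pass"


def run_suite(
    agent: Callable[[str], AgentResult], fixtures: List[Fixture]
) -> List[Dict[str, object]]:
    """Run every fixture and record the expected versus observed verdict."""
    rows: List[Dict[str, object]] = []
    for fx in fixtures:
        try:
            result = agent(fx.question)
        except Exception as exc:  # a crash is recorded as evidence, not swallowed
            result = AgentResult(answer=None, error=repr(exc))
        observed = classify(result)
        rows.append(
            {
                "fixture": fx.name,
                "expected": fx.expected,
                "observed": observed,
                "ok": observed == fx.expected,
            }
        )
    return rows


def stub_agent(question: str) -> AgentResult:
    """Hypothetical stand-in for the real agent, used only by this sketch."""
    if "2 + 2" in question:
        return AgentResult(answer="4")
    if "ambiguous" in question:
        return AgentResult(answer="")
    raise RuntimeError("tool unavailable")


FIXTURES = [
    Fixture("happy_path", "What is 2 + 2?", "pass"),
    Fixture("degraded_path", "An ambiguous request", "degraded"),
    Fixture("failure_path", "Use the missing tool", "fail"),
]


def main() -> int:
    """Return 0 when every fixture matches its expected verdict, else 1."""
    rows = run_suite(stub_agent, FIXTURES)
    for row in rows:
        print(row)
    return 0 if all(row["ok"] for row in rows) else 1
```

A CI job or local agent could wrap this as `sys.exit(main())` so that any mismatch between expected and observed verdicts produces a non-zero exit code.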

## Acceptance Criteria

- [ ] A short design note names the repo-specific workflow, threat or correctness model, and the research assumptions being adopted.
- [ ] A runnable check, fixture, or verifier exercises the new contract in CI or an equivalent local command documented in the repo.
- [ ] The implementation emits or stores enough evidence for a downstream agent/operator to cite inputs, decisions, and outputs.
- [ ] At least one negative/degraded-mode case is covered so failures are observable rather than silently accepted.
- [ ] Documentation links the new behavior to the relevant EvalOps platform primitive or explicitly records why this repo remains standalone.
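To make the evidence criterion concrete, a record that a downstream agent or operator could cite might be emitted as one JSON line per fixture run. The sketch below is an assumption about what such a record could contain; the field names and the helpers `build_evidence` and `append_jsonl` are hypothetical and do not describe an existing EvalOps format.

```python
import hashlib
import io
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, TextIO


def build_evidence(
    fixture: str,
    inputs: Dict[str, Any],
    decisions: List[str],
    output: str,
    verdict: str,
) -> Dict[str, Any]:
    """Assemble one citable record covering inputs, decisions, and output."""
    # Hash a canonical (sorted-key) encoding so equal inputs give equal digests.
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return {
        "fixture": fixture,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(canonical).hexdigest(),
        "inputs": inputs,
        "decisions": decisions,
        "output": output,
        "verdict": verdict,
    }


def append_jsonl(stream: TextIO, record: Dict[str, Any]) -> None:
    """Write the record as a single JSON line to an open text stream."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


# Usage: an in-memory buffer stands in for an opened evidence file.
buffer = io.StringIO()
append_jsonl(
    buffer,
    build_evidence("happy_path", {"question": "What is 2 + 2?"}, ["call calculator"], "4", "pass"),
)
```

Hashing the inputs lets a reviewer check that two runs exercised the same fixture content without diffing the full payload.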

## Notes

- Generated issue 1/5 for `evalops/dspy-micro-agent` by `evalops_org_miner.py`.
- Before implementation, confirm the sampled latent-spec snippets still match `main`; this issue intentionally cites exact file paths/lines where the mining pass saw them.

