Summary
Convert the repo's latent product contract into a repeatable benchmark suite with explicit pass/fail evidence.
This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.
Repo Evidence
- Repository description: Minimal agent runtime built with DSPy modules and a thin Python loop. Includes CLI, FastAPI server, and eval harness with OpenAI/Ollama support.
- Tree signals: 0 docs files, 2 workflows, 0 proto files, 4 test-like files.
- AGENTS.md:3 includes latent-spec language: This repository implements a minimal agent runtime on top of DSPy. It is intentionally small and opinionated: use DSPy modules for planning/tool-calls and a thin Python loop for orchestration, tracing, and evaluation.
- AGENTS.md:48 includes latent-spec language: Evals - evals/run_evals.py runs a small dataset and prints:
- AGENTS.md:49 includes latent-spec language: Evals - evals/run_evals.py runs a small dataset and prints: - success_rate, contains_hit_rate, key_hit_rate, avg_latency_sec, avg_lm_calls, avg_tool_calls, avg_steps, avg_cost_usd, n.
- AGENTS.md:51 includes latent-spec language: - success_rate, contains_hit_rate, key_hit_rate, avg_latency_sec, avg_lm_calls, avg_tool_calls, avg_steps, avg_cost_usd, n. - Scoring combines substring hits and key-in-observation hits per evals/rubrics.yaml.
- AGENTS.md:61 includes latent-spec language: - micro_agent/optimize.py — compile few‑shot demos and save for the OpenAI planner. - evals/run_evals.py — metrics harness with cost/latency/usage.
- README.md:8 includes latent-spec language: - Thin runtime (agent.py) handles looping, tool routing, and trace persistence. - CLI and FastAPI server, plus a tiny eval harness.
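The scoring contract quoted from AGENTS.md:49 and AGENTS.md:51 (substring hits plus key-in-observation hits, rolled up into rates) could be pinned down with something like the sketch below. It is not the code in evals/run_evals.py: the `RunRecord` fields, the rubric keys `contains` and `keys`, and the rule that success requires both checks to pass are all assumptions made for illustration, since the snippets do not say how the two hit types are combined or how evals/rubrics.yaml is laid out.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class RunRecord:
    """One agent run. Field names are hypothetical, not the repo's schema."""
    answer: str
    observations: list = field(default_factory=list)  # dicts returned by tool calls
    latency_sec: float = 0.0


def contains_hit(answer: str, substrings: list) -> bool:
    """True when every expected substring appears in the answer (case-insensitive)."""
    text = answer.lower()
    return all(s.lower() in text for s in substrings)


def key_hit(observations: list, keys: list) -> bool:
    """True when every expected key appears in at least one tool observation."""
    seen = set()
    for obs in observations:
        seen.update(obs.keys())
    return all(k in seen for k in keys)


def summarize(records: list, rubrics: list) -> dict:
    """Aggregate per-run hits into rates.

    Assumption: a run succeeds only when both checks pass. The real
    combination rule lives in the repo and may differ.
    """
    if len(records) != len(rubrics):
        raise ValueError("each record needs exactly one rubric entry")
    contains = [contains_hit(r.answer, rb.get("contains", []))
                for r, rb in zip(records, rubrics)]
    keys = [key_hit(r.observations, rb.get("keys", []))
            for r, rb in zip(records, rubrics)]
    success = [c and k for c, k in zip(contains, keys)]
    n = len(records)
    return {
        "success_rate": sum(success) / n if n else 0.0,
        "contains_hit_rate": sum(contains) / n if n else 0.0,
        "key_hit_rate": sum(keys) / n if n else 0.0,
        "avg_latency_sec": mean(r.latency_sec for r in records) if n else 0.0,
        "n": n,
    }
```

Writing the combination rule down as a function like `summarize` makes the latent contract testable: a benchmark can assert on the exact rates for a fixed set of records instead of eyeballing printed output.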
Research Grounding
Repo axes: research, evaluation, tooling, security
Search keywords: evals, tool, micro_agent, json, default, openai, tools, small, dspy, jsonl, ollama, ask
- arXiv:2506.19773v2 Automatic Prompt Optimization for Knowledge Graph Construction: Insights from an Empirical Study (Nandana Mihindukulasooriya, Niharika S. D'Souza, Faisal Chowdhury, Horst Samulowitz), 2025.
- arXiv:2507.03620v1 Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy (Francisca Lemos, Victor Alves, Filipa Ferraz), 2025.
- arXiv:2412.15298v1 A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation (Bhaskarjit Sarmah, Kriti Dutta, Anna Grigoryan, Sachin Tiwari, Stefano Pasquali, Dhagash Mehta), 2024.
- arXiv:2503.11118v1 UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning (Kristin Qi, Youxiang Zhu, Xiaohui Liang), 2025.
- arXiv:2604.04869v1 Optimizing LLM Prompt Engineering with DSPy Based Declarative Learning (Shiek Ruksana, Sailesh Kiran Kurra, Thipparthi Sanjay Baradwaj), 2026.
- arXiv:2508.04660v1 Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs (Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian), 2025.
- arXiv:2310.03714v1 DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan), 2023.
- arXiv:2507.14241v3 Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models (Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke), 2025.
- arXiv:2601.22129v2 SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents (Yifeng Ding, Lingming Zhang), 2026.
- arXiv:2602.00997v1 Error Taxonomy-Guided Prompt Optimization (Mayank Singh, Vikas Yadav, Eduardo Blanco), 2026.
What To Build
- Define the smallest representative dspy-micro-agent golden workflow and capture expected inputs, outputs, and evidence artifacts.
- Add fixtures for a successful path, an ambiguous/degraded path, and a failure path.
- Publish a command that local agents and CI can run before shipping related changes.
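As a starting point for the three items above, here is a minimal sketch of a golden-workflow runner with one fixture per path and a JSONL evidence artifact. Everything named here is hypothetical: the fixture schema, the `status` values (`ok`, `clarify`, `error`), the default file name `golden_evidence.jsonl`, and `stub_agent`, which stands in for a call into the real runtime so the example runs without an OpenAI or Ollama backend.

```python
import json
import sys
from pathlib import Path

# Hypothetical fixtures: one successful, one ambiguous/degraded, one failure path.
FIXTURES = [
    {"name": "success", "question": "What is 2+2?",
     "expect": {"status": "ok", "contains": ["4"]}},
    {"name": "degraded", "question": "Do the thing",
     "expect": {"status": "clarify", "contains": []}},
    {"name": "failure", "question": "",
     "expect": {"status": "error", "contains": []}},
]


def stub_agent(question: str) -> dict:
    """Stand-in for the real runtime; swap in a call to the agent loop."""
    if not question.strip():
        return {"status": "error", "answer": ""}
    if "2+2" in question:
        return {"status": "ok", "answer": "2+2 is 4"}
    return {"status": "clarify", "answer": "Which thing do you mean?"}


def check(fixture: dict, result: dict) -> dict:
    """Compare one result to its fixture and return an evidence row."""
    expect = fixture["expect"]
    status_ok = result.get("status") == expect["status"]
    text = result.get("answer", "").lower()
    contains_ok = all(s.lower() in text for s in expect["contains"])
    return {
        "fixture": fixture["name"],
        "passed": status_ok and contains_ok,
        "expected_status": expect["status"],
        "actual_status": result.get("status"),
        "answer": result.get("answer", ""),
    }


def run_suite(agent, fixtures: list, evidence_path: str) -> list:
    """Run every fixture and write one JSON line of evidence per fixture."""
    rows = [check(f, agent(f["question"])) for f in fixtures]
    with Path(evidence_path).open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return rows


def main(argv=None) -> int:
    """Return 0 when every fixture passes, 1 otherwise (usable as an exit code)."""
    argv = sys.argv[1:] if argv is None else argv
    evidence_path = argv[0] if argv else "golden_evidence.jsonl"
    rows = run_suite(stub_agent, FIXTURES, evidence_path)
    for row in rows:
        print(f"{'PASS' if row['passed'] else 'FAIL'} {row['fixture']}")
    return 0 if all(row["passed"] for row in rows) else 1
```

To expose this as the command that local agents and CI run, a module could end with `raise SystemExit(main())` under the usual `if __name__ == "__main__":` guard. The non-zero exit code then fails the workflow, and the JSONL file can be uploaded as the evidence artifact.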
Acceptance Criteria
Notes
- Generated issue 1/5 for evalops/dspy-micro-agent by evalops_org_miner.py.
- Before implementation, confirm the sampled latent-spec snippets still match main; this issue intentionally cites exact file paths/lines where the mining pass saw them.