01/2026
AI Agents
PM-Tools | Workflows & Skills
Toolkit for PM work in Claude Code (partly built, partly curated from the best)
What is it about?
A library of reusable PM workflows, prompts and Claude Code skills for the full product lifecycle. Part built by me, part curated from people who know their craft.
Covers:
- Discovery & research (Mom Test interviews, JTBD, opportunity mapping)
- Strategy & positioning (competitor analysis, April Dunford’s framework)
- Prioritization (RICE, Shape Up appetite-first, bet sizing)
- PRD writing (structured prompts, agentic multi-round generation)
- Launch & measure (metrics framework, experiment design, roadmap planning)
- Evals (golden sets, LLM-as-judge, prompt testing, RAG evaluation)
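To make the prioritization bucket concrete, here is a minimal RICE sketch. The bet names, weights, and field choices are my own illustration, not taken from the repo:

```python
from dataclasses import dataclass

@dataclass
class Bet:
    name: str
    reach: int         # users affected per quarter
    impact: float      # 0.25 = minimal .. 3 = massive
    confidence: float  # 0..1
    effort: float      # person-months

    @property
    def rice(self) -> float:
        # RICE = (Reach * Impact * Confidence) / Effort
        return self.reach * self.impact * self.confidence / self.effort

bets = [
    Bet("onboarding revamp", reach=4000, impact=2.0, confidence=0.8, effort=4),
    Bet("dark mode", reach=9000, impact=0.5, confidence=0.9, effort=2),
]
for b in sorted(bets, key=lambda b: b.rice, reverse=True):
    print(f"{b.name}: {b.rice:.0f}")
```

Shape Up's appetite-first sizing inverts this: fix the effort budget up front, then ask what fits inside it.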
Why am I building it?
I’m building a toolkit for my own product work. I haven’t yet tested many of these in action, so for now it’s more of a “test this” repo for me.
How am I building it?
Mix of approaches:
- Workflows I’ve used in practice, translated to Claude Code skills
- Best frameworks from the field (Dunford, Shape Up, Mom Test) wrapped in opinionated prompts
- Eval skills from Hamel Husain’s work, adapted for product use cases
Markdown-based, Claude Code native. Works as standalone prompts or as part of agentic pipelines.
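As a sketch of what one of these skills might look like: Claude Code skills live in a SKILL.md file with YAML frontmatter (`name`, `description`) followed by free-form instructions. The skill name and body below are invented for illustration, not copied from the repo:

```markdown
---
name: mom-test-interview
description: Plan and debrief customer interviews using Mom Test rules.
---

# Mom Test Interview

Ask about the user's past behavior, not hypotheticals or compliments.

## Output format
- Three concrete facts learned
- One commitment or advancement secured
- The next assumption to validate
```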
What’s my goal?
Have the right tool at hand at every stage of product work — and not have to rebuild it every time I start something new.
Tech Corner
// STATUS ──────────────────────────────────────────────────────
collected; needs more validation
lines_of_code: 35000 (mostly markdown) | mature_estimate: 100%
// PIPELINE ────────────────────────────────────────────────────
approach prompt library + Claude Code skills;
no runtime pipeline
deterministic template structure, output schemas, checklists
— scaffolding is fixed, LLM fills the reasoning
llm Claude for all skills; gpt-4o-mini in eval scripts
check scripts; LLM-as-judge for output evals
reasoning every tool wraps an opinionated PM framework
(Mom Test, Shape Up, Dunford) — structure is
deterministic, domain reasoning is LLM's job
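The fixed-scaffolding split can be sketched as a template whose structure is deterministic while the LLM only fills the slots. The section names and field names here are hypothetical, chosen to illustrate the pattern:

```python
from string import Template

# Structure is fixed in the skill file; only the $slots are LLM output.
PRD_TEMPLATE = Template(
    "# $title\n\n"
    "## Problem\n$problem\n\n"
    "## Appetite\n$appetite\n\n"
    "## Success metrics\n$metrics\n"
)

doc = PRD_TEMPLATE.substitute(
    title="Example feature",
    problem="(LLM-written problem statement)",
    appetite="2 weeks",  # Shape Up: a time budget, not an estimate
    metrics="(LLM-proposed metrics)",
)
print(doc)
```

Because the scaffolding never varies, downstream checks can assert on structure without parsing free-form LLM prose.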
// HUMAN IN THE LOOP ───────────────────────────────────────────
where every tool is human-in-the-loop by design;
PRD workflows have explicit review gates
between generation rounds
why product decisions have real consequences —
tools surface options, judgment stays with PM
// EVALUATION ──────────────────────────────────────────────────
how golden set rubric (manual scoring);
LLM-as-judge (write-judge-prompt skill);
validate-evaluator to calibrate judges;
red-team adversarial scenarios;
automated pass/fail checks (Python)
traces trace schema skill captures execution details;
calibration tracker logs estimate accuracy
over time; synthetic data generation for
regression testing
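A minimal sketch of what one of the automated pass/fail checks could look like, run as a deterministic gate before any LLM-as-judge scoring. The required section names and the function are invented for illustration:

```python
import re

REQUIRED_SECTIONS = ["Problem", "Goals", "Non-goals", "Success metrics"]

def check_prd(draft: str) -> dict:
    """Deterministic structural check on a markdown PRD draft."""
    headings = {m.group(1).strip()
                for m in re.finditer(r"^#+\s+(.+)$", draft, re.M)}
    missing = [s for s in REQUIRED_SECTIONS if s not in headings]
    return {"pass": not missing, "missing": missing}

draft = "# Problem\n...\n# Goals\n...\n"
print(check_prd(draft))
```

Cheap checks like this filter out structurally broken outputs so the (slower, noisier) judge prompts only score drafts worth scoring.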
// CONTEXT ─────────────────────────────────────────────────────
approach each SKILL.md is self-contained — full
instructions and output format in one file;
no shared state between skills
why skills are used across different projects and
contexts; can't depend on shared state;
self-contained = works as cold start anywhere
// INTEGRATIONS ─────────────────────────────────────────────────
claude_code native SKILL.md format — skills discovered,
versioned and invocable from Claude Code CLI
openai_api gpt-4o-mini in eval scripts — cheaper than
Claude for high-volume judge prompt runs
// OTHER TOOLS ──────────────────────────────────────────────────
skill_md vs plain prompts — Claude Code discovers and
invokes natively; versioned in git; shareable
hamel_evals adapted from hamelsmu/evals-skills — best
existing eval patterns; no point rebuilding