01/2026

AI Agents

PM-Tools | Workflows & Skills

Toolkit for PM work in Claude Code (partly built, partly copied from the best)

What is it about?

A library of reusable PM workflows, prompts and Claude Code skills for the full product lifecycle. Part built by me, part curated from people who know their craft.

Covers:

  • Discovery & research (Mom Test interviews, JTBD, opportunity mapping)
  • Strategy & positioning (competitor analysis, April Dunford’s framework)
  • Prioritization (RICE, Shape Up appetite-first, bet sizing)
  • PRD writing (structured prompts, agentic multi-round generation)
  • Launch & measure (metrics framework, experiment design, roadmap planning)
  • Evals (golden sets, LLM-as-judge, prompt testing, RAG evaluation)

Why am I building it?

I’m building a toolkit for my own product work. I haven’t yet tested many of these in action, so it’s more of a “test this” repo for me.

How am I building it?

Mix of approaches:

  • Workflows I’ve used in practice, translated to Claude Code skills
  • Best frameworks from the field (Dunford, Shape Up, Mom Test) wrapped in opinionated prompts
  • Eval skills from Hamel Husain’s work, adapted for product use cases

Markdown-based, Claude Code native. Works as standalone prompts or as part of agentic pipelines.

What’s my goal?

Have the right tool at hand at every stage of product work — and not have to rebuild it every time I start something new.

Tech Corner

// STATUS ──────────────────────────────────────────────────────
collected; needs further validation

lines_of_code: 35000 (mostly markdown)   |   mature_estimate: 100%

// PIPELINE ────────────────────────────────────────────────────
approach        prompt library + Claude Code skills;
                no runtime pipeline

deterministic   template structure, output schemas, checklists
                — scaffolding is fixed, LLM fills the reasoning

llm             Claude for all skills; gpt-4o-mini in eval
                check scripts; LLM-as-judge for output evals

reasoning       every tool wraps an opinionated PM framework
                (Mom Test, Shape Up, Dunford) — structure is
                deterministic, domain reasoning is LLM's job

// HUMAN IN THE LOOP ───────────────────────────────────────────
where           every tool is human-in-the-loop by design;
                PRD workflows have explicit review gates
                between generation rounds

why             product decisions have real consequences —
                tools surface options, judgment stays with PM

// EVALUATION ──────────────────────────────────────────────────
how             golden set rubric (manual scoring);
                LLM-as-judge (write-judge-prompt skill);
                validate-evaluator to calibrate judges;
                red-team adversarial scenarios;
                automated pass/fail checks (Python)
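The automated pass/fail checks can be very small. A minimal sketch, assuming a PRD skill whose output schema requires a fixed set of markdown sections (the section names here are illustrative, not the repo's actual schema):

```python
import re

# Hypothetical required sections for a generated PRD; the repo's
# actual output schemas may differ.
REQUIRED_SECTIONS = ["Problem", "Goals", "Non-Goals", "Success Metrics"]

def check_prd(markdown: str) -> dict:
    """Automated pass/fail: does each required section appear as a heading?"""
    headings = re.findall(r"^#{1,3}\s+(.+)$", markdown, re.MULTILINE)
    per_section = {
        section: any(section.lower() in h.lower() for h in headings)
        for section in REQUIRED_SECTIONS
    }
    return {"sections": per_section, "passed": all(per_section.values())}

sample = "# Problem\n...\n## Goals\n...\n## Non-Goals\n...\n## Success Metrics\n..."
print(check_prd(sample)["passed"])  # True
```

Checks like this run before any LLM-as-judge pass: schema violations fail fast and cheaply, and only structurally valid outputs are worth spending judge tokens on.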

traces          trace schema skill captures execution details;
                calibration tracker logs estimate accuracy
                over time; synthetic data generation for
                regression testing
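A calibration tracker of this shape reduces to logging (estimate, actual) pairs and reporting how far off the estimates run. A sketch under assumed field names (for example, Shape Up appetite vs. real cycle time), not the repo's actual tracker:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class CalibrationTracker:
    """Logs estimate/actual pairs and reports average estimate error."""
    records: list = field(default_factory=list)

    def log(self, label: str, estimated: float, actual: float) -> None:
        self.records.append((label, estimated, actual))

    def mean_abs_error_pct(self) -> float:
        """Average |estimated - actual| / actual across logged records."""
        return mean(abs(e - a) / a for _, e, a in self.records)

tracker = CalibrationTracker()
tracker.log("feature-x", estimated=2.0, actual=4.0)  # 50% under
tracker.log("feature-y", estimated=3.0, actual=3.0)  # spot on
print(tracker.mean_abs_error_pct())  # 0.25
```

Tracked over time, the error trend shows whether bet-sizing estimates are actually getting better.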

// CONTEXT ─────────────────────────────────────────────────────
approach        each SKILL.md is self-contained — full
                instructions and output format in one file;
                no shared state between skills

why             skills are used across different projects and
                contexts; can't depend on shared state;
                self-contained = works as cold start anywhere

// INTEGRATIONS ─────────────────────────────────────────────────
claude_code     native SKILL.md format — skills discovered,
                versioned and invocable from Claude Code CLI

openai_api      gpt-4o-mini in eval scripts — cheaper than
                Claude for high-volume judge prompt runs
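A judge run against gpt-4o-mini reduces to building a rubric prompt and one chat-completions call. A minimal sketch: the rubric wording and helper name are assumptions; only the client call pattern is real OpenAI API surface.

```python
def build_judge_messages(rubric: str, output: str) -> list[dict]:
    """Wrap a rubric and a candidate output into judge-prompt messages."""
    return [
        {
            "role": "system",
            "content": (
                f"You are an evaluator. Rubric:\n{rubric}\n"
                "Answer PASS or FAIL with one sentence of reasoning."
            ),
        },
        {"role": "user", "content": output},
    ]

# Actual call (needs OPENAI_API_KEY; commented out so the sketch runs offline):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=build_judge_messages(rubric, candidate_output),
# )
# verdict = resp.choices[0].message.content
```

Because judge runs are high-volume (every golden-set item, every regression pass), the per-call cost difference between gpt-4o-mini and Claude compounds quickly.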

// OTHER TOOLS ──────────────────────────────────────────────────
skill_md        vs plain prompts — Claude Code discovers and
                invokes natively; versioned in git; shareable

hamel_evals     adapted from hamelsmu/evals-skills — best
                existing eval patterns; no point rebuilding