# Capstone Grading Rubric

## Scoring Overview

- Total: 100 points
- Passing threshold: 70 points
- Strong completion: 85 points or higher
- Exemplary completion: 95 points or higher
The capstone is graded on working behavior, agent design, safety, evaluation quality, and clarity of communication. Award partial credit for visible, tested work. Do not award points for claims that are not supported by files, demo output, or eval records.
## 1. End-to-End Workflow: 20 Points
| Criteria | Points |
|---|---|
| `/review-pr` accepts a PR number, URL, or fixture path and routes the task to the orchestrator | 4 |
| The workflow reviews all three local fixtures without manual prompt rewriting | 4 |
| Final output includes verdict, summary, findings, test impact, and follow-up questions | 4 |
| Findings include evidence such as file names, line references, snippets, or clear fixture references | 3 |
| The workflow produces useful results without editing source code or changing repo state | 3 |
| README documents setup and usage clearly enough for another learner to run it | 2 |
Full credit requires a repeatable path from command invocation to final review packet.
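For reference, a runnable `/review-pr` command can be a Markdown file under `.claude/commands/`; the prompt wording below is an illustrative sketch, not a required template:

```markdown
<!-- .claude/commands/review-pr.md (illustrative sketch) -->
Review the pull request identified by: $ARGUMENTS

1. If the argument is a local fixture path, read the fixture directly;
   otherwise fetch the PR metadata and diff via the configured GitHub MCP server.
2. Hand the diff to the reviewer-orchestrator agent.
3. Return the final review packet: verdict, summary, findings with evidence,
   test impact, and follow-up questions.
```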
## 2. Orchestrator Design: 15 Points
| Criteria | Points |
|---|---|
| `reviewer-orchestrator` has a clear role, scope, and output contract | 3 |
| It delegates to all required specialist agents with explicit handoff instructions | 4 |
| It synthesizes specialist findings instead of merely pasting raw outputs together | 3 |
| It distinguishes blockers, non-blocking notes, questions, and residual risk | 2 |
| It handles missing MCP access or fixture-only operation gracefully | 2 |
| It avoids unsupported claims and asks for human review when evidence is insufficient | 1 |
Strong orchestrators make review decisions traceable to specialist evidence.
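As a point of comparison, an orchestrator definition with an explicit role and output contract might look like the following sketch; the field values and prompt text are assumptions, not requirements:

```markdown
---
name: reviewer-orchestrator
description: Coordinates PR review. Delegates to security-checker,
  style-linter, and test-impact, then synthesizes one review packet.
  Never edits code.
tools: Read, Grep, Glob
---
You are the review orchestrator. Delegate each concern to the matching
specialist, then merge their findings into: verdict, summary, blockers,
non-blocking notes, open questions, and residual risk. If evidence is
insufficient, say so and request human review instead of guessing.
```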
## 3. Specialist Agents: 20 Points
| Criteria | Points |
|---|---|
| `security-checker` identifies concrete security issues and avoids vague fear-based findings | 5 |
| `style-linter` applies project conventions without treating preferences as blockers | 4 |
| `test-impact` maps changes to relevant tests and reports commands/results or limitations | 5 |
| Each specialist has a narrow role and does not duplicate the orchestrator's job | 2 |
| Specialist outputs use consistent severity and evidence conventions | 2 |
| Subagent prompts include clear non-goals and refusal boundaries | 2 |
Deduct heavily if a specialist can make edits, run unsafe commands, or invent findings.
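One way to satisfy the "no edits, no unsafe commands" bar is to give each specialist a read-only toolset with stated non-goals; the sketch below is illustrative, and the tool names are assumptions to adapt to your setup:

```markdown
---
name: security-checker
description: Finds concrete, evidence-backed security issues in a diff.
  Reports file, line, snippet, and severity. Does not modify anything.
tools: Read, Grep, Glob
---
Non-goals: style feedback, refactoring, and speculative "could be risky"
findings without a concrete code path. If you cannot show evidence,
report "no finding" rather than inventing one.
```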
## 4. Skills and Reusable Knowledge: 10 Points
| Criteria | Points |
|---|---|
| Includes an `owasp-review` skill with a practical security review checklist | 3 |
| Includes a `project-conventions` skill with local review standards and severity labels | 3 |
| Skill descriptions are specific enough to support on-demand discovery | 2 |
| Agents use skills at appropriate moments rather than embedding all knowledge in one prompt | 2 |
Skills should be short, targeted, and reusable across future repos.
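Skills live in their own directories with a short `SKILL.md`; a discoverable description names when to use the skill. The checklist wording below is an illustrative sketch, not a required list:

```markdown
<!-- .claude/skills/owasp-review/SKILL.md (illustrative sketch) -->
---
name: owasp-review
description: Checklist for reviewing PR diffs against common OWASP risks
  (injection, broken auth, secrets in code, unsafe deserialization).
  Use when a diff touches auth, input handling, or configuration.
---
- Flag string-built SQL or shell commands; prefer parameterized calls.
- Flag hardcoded credentials, tokens, or keys, including in fixtures.
- Flag disabled TLS verification or weakened auth checks.
```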
## 5. Permissions and Safety: 15 Points
| Criteria | Points |
|---|---|
| Permissions follow least privilege for each agent | 4 |
| The repo documents permission rationale for all agents | 3 |
| Review agents do not auto-merge, approve, push, or modify code | 3 |
| `test-impact` is limited to safe local test commands and reports failures honestly | 2 |
| The workflow includes secret-handling guidance and does not place real secrets in fixtures | 2 |
| MCP or GitHub access is scoped and documented | 1 |
A capstone that can change production state during review cannot receive more than 70 points, even if other parts work.
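Least privilege is easiest to verify when it is written down; a minimal sketch of a project `.claude/settings.json`, where the specific test command and deny rules are assumptions for one hypothetical repo:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Grep",
      "Glob",
      "Bash(pytest:*)"
    ],
    "deny": [
      "Edit",
      "Write",
      "Bash(git push:*)",
      "Bash(gh pr merge:*)"
    ]
  }
}
```

Documenting why each rule exists, next to the rule, covers the permission-rationale criterion above.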
## 6. MCP Integration: 8 Points
| Criteria | Points |
|---|---|
| Documents or implements GitHub MCP PR metadata/diff retrieval | 3 |
| Explains required setup, scopes, and expected inputs | 2 |
| Maintains a local fixture fallback for offline evaluation | 2 |
| Clearly states data privacy limits for external tool calls | 1 |
If the learner cannot run MCP in their environment, award up to 6 points here for a complete, credible integration plan plus working local fixtures.
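A project-scoped `.mcp.json` keeps GitHub access documented and reviewable. The server command, package name, and environment variable below are assumptions; check the GitHub MCP server's own documentation for the current invocation:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}
```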
## 7. Evals: 10 Points
| Criteria | Points |
|---|---|
| Provides `good-pr.md`, `risky-pr.md`, and `stylistic-pr.md` fixtures | 2 |
| Each fixture has an expected verdict and review target | 2 |
| The system correctly classifies the risky fixture as `request-changes` | 2 |
| The system avoids blocking the good fixture | 1 |
| The system treats stylistic issues proportionally | 1 |
| Eval results are recorded with actual verdicts, misses, and follow-up changes | 2 |
The risky fixture is the most important eval. A system that misses a clear secret or auth bypass cannot receive full eval credit.
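Recording eval results can be as simple as comparing actual verdicts against the expected verdict for each fixture. A minimal sketch: the fixture names match the table above, while the verdict labels (`approve`, `request-changes`, `comment`) are assumptions to replace with your own contract:

```python
# Expected verdict per fixture; labels are illustrative assumptions.
EXPECTED = {
    "good-pr.md": "approve",
    "risky-pr.md": "request-changes",
    "stylistic-pr.md": "comment",
}


def score_evals(actual: dict[str, str]) -> list[str]:
    """Return one human-readable PASS/MISS line per fixture."""
    lines = []
    for fixture, expected in EXPECTED.items():
        got = actual.get(fixture, "<no verdict>")
        status = "PASS" if got == expected else "MISS"
        lines.append(f"{status} {fixture}: expected {expected}, got {got}")
    return lines


if __name__ == "__main__":
    # Example run; the stylistic verdict here deliberately shows a miss.
    results = score_evals({
        "good-pr.md": "approve",
        "risky-pr.md": "request-changes",
        "stylistic-pr.md": "request-changes",
    })
    print("\n".join(results))
```

Committing this output per run gives the "actual verdicts, misses, and follow-up changes" record the rubric asks for.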
## 8. Demo and Reflection: 2 Points
| Criteria | Points |
|---|---|
| Demo shows the workflow on at least two fixtures, including the risky fixture | 1 |
| Reflection explains one design tradeoff, one failure mode, and one future improvement | 1 |
## Automatic Caps
Apply these caps after totaling points:
| Condition | Maximum Score |
|---|---|
| No runnable `/review-pr` command | 65 |
| Missing orchestrator agent | 60 |
| Missing two or more required specialist agents | 70 |
| No eval fixtures | 75 |
| Unsafe permissions allow review agents to edit, push, approve, merge, or deploy | 70 |
| Real secrets are committed in fixtures or docs | 60 |
| Work cannot be inspected because required files are missing from the repo | 50 |
## Grading Notes for Instructors
Look for evidence before polish. A plain but reliable pipeline should score higher than a flashy demo that cannot be rerun.
Recommended grading process:
- Read the README and run the documented local fixture flow.
- Inspect agent, command, and skill files for scope and permissions.
- Run or review outputs for all three eval fixtures.
- Check that the risky fixture produces a blocking finding with evidence.
- Apply automatic caps last.
## Learner Self-Check
Before submission, learners should confirm:
- I can run `/review-pr evals/fixtures/good-pr.md` and get a non-blocking verdict.
- I can run `/review-pr evals/fixtures/risky-pr.md` and get `request-changes`.
- I can run `/review-pr evals/fixtures/stylistic-pr.md` and get proportional style feedback.
- My orchestrator calls all three specialist agents.
- My skills are discoverable and not just unused documents.
- My permissions are documented and least-privilege.
- My README includes setup, usage, architecture, eval results, and known limits.