# Capstone Grading Rubric

## Scoring Overview

- Total: 100 points
- Passing threshold: 70 points
- Strong completion: 85 points or higher
- Exemplary completion: 95 points or higher
The capstone is graded on working behavior, agent design, safety, evaluation quality, and clarity of communication. Award partial credit for visible, tested work. Do not award points for claims that are not supported by files, demo output, or eval records.
## 1. End-to-End Workflow: 20 Points
| Criteria | Points |
|---|---|
| `/review-pr` accepts a PR number, URL, or fixture path and routes the task to the orchestrator | 4 |
| The workflow reviews all three local fixtures without manual prompt rewriting | 4 |
| Final output includes verdict, summary, findings, test impact, and follow-up questions | 4 |
| Findings include evidence such as file names, line references, snippets, or clear fixture references | 3 |
| The workflow produces useful results without editing source code or changing repo state | 3 |
| README documents setup and usage clearly enough for another learner to run it | 2 |
Full credit requires a repeatable path from command invocation to final review packet.
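For reference, a runnable `/review-pr` command can be a Markdown file under `.claude/commands/`; the prompt wording below is an illustrative sketch, not a required template:

```markdown
<!-- .claude/commands/review-pr.md (illustrative sketch) -->
Review the pull request identified by: $ARGUMENTS

1. If the argument is a local fixture path, read the fixture directly;
   otherwise fetch the PR metadata and diff via the configured GitHub MCP server.
2. Hand the diff to the reviewer-orchestrator agent.
3. Return the final review packet: verdict, summary, findings with evidence,
   test impact, and follow-up questions.
```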
## 2. Orchestrator Design: 15 Points
| Criteria | Points |
|---|---|
| `reviewer-orchestrator` has a clear role, scope, and output contract | 3 |
| It delegates to all required specialist agents with explicit handoff instructions | 4 |
| It synthesizes specialist findings instead of merely pasting raw outputs together | 3 |
| It distinguishes blockers, non-blocking notes, questions, and residual risk | 2 |
| It handles missing MCP access or fixture-only operation gracefully | 2 |
| It avoids unsupported claims and asks for human review when evidence is insufficient | 1 |
Strong orchestrators make review decisions traceable to specialist evidence.
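As a point of comparison, an orchestrator definition with an explicit role and output contract might look like the following sketch; the field values and prompt text are assumptions, not requirements:

```markdown
---
name: reviewer-orchestrator
description: Coordinates PR review. Delegates to security-checker,
  style-linter, and test-impact, then synthesizes one review packet.
  Never edits code.
tools: Read, Grep, Glob
---
You are the review orchestrator. Delegate each concern to the matching
specialist, then merge their findings into: verdict, summary, blockers,
non-blocking notes, open questions, and residual risk. If evidence is
insufficient, say so and request human review instead of guessing.
```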
## 3. Specialist Agents: 20 Points
| Criteria | Points |
|---|---|
| `security-checker` identifies concrete security issues and avoids vague fear-based findings | 5 |
| `style-linter` applies project conventions without treating preferences as blockers | 4 |
| `test-impact` maps changes to relevant tests and reports commands/results or limitations | 5 |
| Each specialist has a narrow role and does not duplicate the orchestrator's job | 2 |
| Specialist outputs use consistent severity and evidence conventions | 2 |
| Subagent prompts include clear non-goals and refusal boundaries | 2 |
Deduct heavily if a specialist can make edits, run unsafe commands, or invent findings.
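One way to satisfy the "no edits, no unsafe commands" bar is to give each specialist a read-only toolset with stated non-goals; the sketch below is illustrative, and the tool names are assumptions to adapt to your setup:

```markdown
---
name: security-checker
description: Finds concrete, evidence-backed security issues in a diff.
  Reports file, line, snippet, and severity. Does not modify anything.
tools: Read, Grep, Glob
---
Non-goals: style feedback, refactoring, and speculative "could be risky"
findings without a concrete code path. If you cannot show evidence,
report "no finding" rather than inventing one.
```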
## 4. Skills and Reusable Knowledge: 10 Points
| Criteria | Points |
|---|---|
| Includes an `owasp-review` skill with a practical security review checklist | 3 |
| Includes a `project-conventions` skill with local review standards and severity labels | 3 |
| Skill descriptions are specific enough to support on-demand discovery | 2 |
| Agents use skills at appropriate moments rather than embedding all knowledge in one prompt | 2 |
Skills should be short, targeted, and reusable across future repos.
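Skills live in their own directories with a short `SKILL.md`; a discoverable description names when to use the skill. The checklist wording below is an illustrative sketch, not a required list:

```markdown
<!-- .claude/skills/owasp-review/SKILL.md (illustrative sketch) -->
---
name: owasp-review
description: Checklist for reviewing PR diffs against common OWASP risks
  (injection, broken auth, secrets in code, unsafe deserialization).
  Use when a diff touches auth, input handling, or configuration.
---
- Flag string-built SQL or shell commands; prefer parameterized calls.
- Flag hardcoded credentials, tokens, or keys, including in fixtures.
- Flag disabled TLS verification or weakened auth checks.
```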
## 5. Permissions and Safety: 15 Points
| Criteria | Points |
|---|---|
| Permissions follow least privilege for each agent | 4 |
| The repo documents permission rationale for all agents | 3 |
| Review agents do not auto-merge, approve, push, or modify code | 3 |
| `test-impact` is limited to safe local test commands and reports failures honestly | 2 |
| The workflow includes secret-handling guidance and does not place real secrets in fixtures | 2 |
| MCP or GitHub access is scoped and documented | 1 |
A capstone that can change production state during review cannot receive more than 70 points, even if other parts work.
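Least privilege is easiest to verify when it is written down; a minimal sketch of a project `.claude/settings.json`, where the specific test command and deny rules are assumptions for one hypothetical repo:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Grep",
      "Glob",
      "Bash(pytest:*)"
    ],
    "deny": [
      "Edit",
      "Write",
      "Bash(git push:*)",
      "Bash(gh pr merge:*)"
    ]
  }
}
```

Documenting why each rule exists, next to the rule, covers the permission-rationale criterion above.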
## 6. MCP Integration: 8 Points
| Criteria | Points |
|---|---|
| Documents or implements GitHub MCP PR metadata/diff retrieval | 3 |
| Explains required setup, scopes, and expected inputs | 2 |
| Maintains a local fixture fallback for offline evaluation | 2 |
| Clearly states data privacy limits for external tool calls | 1 |
If the learner cannot run MCP in their environment, award up to 6 points here for a complete, credible integration plan plus working local fixtures.
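A project-scoped `.mcp.json` keeps GitHub access documented and reviewable. The server command, package name, and environment variable below are assumptions; check the GitHub MCP server's own documentation for the current invocation:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}
```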
## 7. Evals: 10 Points
| Criteria | Points |
|---|---|
| Provides `good-pr.md`, `risky-pr.md`, and `stylistic-pr.md` fixtures | 2 |
| Each fixture has an expected verdict and review target | 2 |
| The system correctly classifies the risky fixture as `request-changes` | 2 |
| The system avoids blocking the good fixture | 1 |
| The system treats stylistic issues proportionally | 1 |
| Eval results are recorded with actual verdicts, misses, and follow-up changes | 2 |
The risky fixture is the most important eval. A system that misses a clear secret or auth bypass cannot receive full eval credit.
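Recording eval results can be as simple as comparing actual verdicts against the expected verdict for each fixture. A minimal sketch: the fixture names match the table above, while the verdict labels (`approve`, `request-changes`, `comment`) are assumptions to replace with your own contract:

```python
# Expected verdict per fixture; labels are illustrative assumptions.
EXPECTED = {
    "good-pr.md": "approve",
    "risky-pr.md": "request-changes",
    "stylistic-pr.md": "comment",
}


def score_evals(actual: dict[str, str]) -> list[str]:
    """Return one human-readable PASS/MISS line per fixture."""
    lines = []
    for fixture, expected in EXPECTED.items():
        got = actual.get(fixture, "<no verdict>")
        status = "PASS" if got == expected else "MISS"
        lines.append(f"{status} {fixture}: expected {expected}, got {got}")
    return lines


if __name__ == "__main__":
    # Example run; the stylistic verdict here deliberately shows a miss.
    results = score_evals({
        "good-pr.md": "approve",
        "risky-pr.md": "request-changes",
        "stylistic-pr.md": "request-changes",
    })
    print("\n".join(results))
```

Committing this output per run gives the "actual verdicts, misses, and follow-up changes" record the rubric asks for.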
## 8. Demo and Reflection: 2 Points
| Criteria | Points |
|---|---|
| Demo shows the workflow on at least two fixtures, including the risky fixture | 1 |
| Reflection explains one design tradeoff, one failure mode, and one future improvement | 1 |
## Automatic Caps
Apply these caps after totaling points:
| Condition | Maximum Score |
|---|---|
| No runnable `/review-pr` command | 65 |
| Missing orchestrator agent | 60 |
| Missing two or more required specialist agents | 70 |
| No eval fixtures | 75 |
| Unsafe permissions allow review agents to edit, push, approve, merge, or deploy | 70 |
| Real secrets are committed in fixtures or docs | 60 |
| Work cannot be inspected because required files are missing from the repo | 50 |
## Grading Notes for Instructors
Look for evidence before polish. A plain but reliable pipeline should score higher than a flashy demo that cannot be rerun.
Recommended grading process:
- Read the README and run the documented local fixture flow.
- Inspect agent, command, and skill files for scope and permissions.
- Run or review outputs for all three eval fixtures.
- Check that the risky fixture produces a blocking finding with evidence.
- Apply automatic caps last.
## Learner Self-Check
Before submission, learners should confirm:
- I can run `/review-pr evals/fixtures/good-pr.md` and get a non-blocking verdict.
- I can run `/review-pr evals/fixtures/risky-pr.md` and get `request-changes`.
- I can run `/review-pr evals/fixtures/stylistic-pr.md` and get proportional style feedback.
- My orchestrator calls all three specialist agents.
- My skills are discoverable and not just unused documents.
- My permissions are documented and least-privilege.
- My README includes setup, usage, architecture, eval results, and known limits.