Capstone Grading Rubric

Scoring Overview

Total: 100 points

Passing threshold: 70 points

Strong completion: 85 points or higher

Exemplary completion: 95 points or higher

The capstone is graded on working behavior, agent design, safety, evaluation quality, and clarity of communication. Award partial credit for visible, tested work. Do not award points for claims that are not supported by files, demo output, or eval records.

1. End-to-End Workflow: 20 Points

| Criteria | Points |
| --- | --- |
| /review-pr accepts a PR number, URL, or fixture path and routes the task to the orchestrator | 4 |
| The workflow reviews all three local fixtures without manual prompt rewriting | 4 |
| Final output includes verdict, summary, findings, test impact, and follow-up questions | 4 |
| Findings include evidence such as file names, line references, snippets, or clear fixture references | 3 |
| The workflow produces useful results without editing source code or changing repo state | 3 |
| README documents setup and usage clearly enough for another learner to run it | 2 |

Full credit requires a repeatable path from command invocation to final review packet.
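
As a point of reference when grading the command row, here is a minimal sketch of what such a command file can look like in a Claude Code style setup (a markdown file under .claude/commands/, using the $ARGUMENTS placeholder). The frontmatter fields and prompt wording are illustrative assumptions, not requirements of the rubric.

```markdown
---
description: Review a PR (number, URL, or local fixture path) and emit a review packet
argument-hint: <pr-number | pr-url | fixture-path>
---

Review the pull request identified by: $ARGUMENTS

If the argument is a local path such as evals/fixtures/good-pr.md, read
that fixture instead of fetching anything from GitHub. Hand the change
set to the reviewer-orchestrator agent and return its final review
packet unchanged: verdict, summary, findings with evidence, test
impact, and follow-up questions.
```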

2. Orchestrator Design: 15 Points

| Criteria | Points |
| --- | --- |
| reviewer-orchestrator has a clear role, scope, and output contract | 3 |
| It delegates to all required specialist agents with explicit handoff instructions | 4 |
| It synthesizes specialist findings instead of merely pasting raw outputs together | 3 |
| It distinguishes blockers, non-blocking notes, questions, and residual risk | 2 |
| It handles missing MCP access or fixture-only operation gracefully | 2 |
| It avoids unsupported claims and asks for human review when evidence is insufficient | 1 |

Strong orchestrators make review decisions traceable to specialist evidence.
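
For illustration only, an orchestrator definition along these lines would satisfy most of the rows above. This is a sketch assuming a Claude Code style agent file (.claude/agents/reviewer-orchestrator.md); the delegation mechanics are an assumption to adapt to however your environment hands work to subagents.

```markdown
---
name: reviewer-orchestrator
description: Coordinates PR reviews by delegating to security-checker,
  style-linter, and test-impact, then synthesizing one review packet.
---

You coordinate pull request reviews. You never edit files or change
repo state.

1. Send the change set to security-checker, style-linter, and
   test-impact, telling each exactly what to return.
2. Synthesize their findings into one packet. Label every item as a
   blocker, a non-blocking note, or a question, and state any residual
   risk you could not verify.
3. Trace every decision to specialist evidence (file, line, snippet).
   If evidence is missing, or MCP access is unavailable and only
   fixtures are on hand, say so and request human review rather than
   guessing.
```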

3. Specialist Agents: 20 Points

| Criteria | Points |
| --- | --- |
| security-checker identifies concrete security issues and avoids vague fear-based findings | 5 |
| style-linter applies project conventions without treating preferences as blockers | 4 |
| test-impact maps changes to relevant tests and reports the commands and results it ran, or states its limitations | 5 |
| Each specialist has a narrow role and does not duplicate the orchestrator's job | 2 |
| Specialist outputs use consistent severity and evidence conventions | 2 |
| Subagent prompts include clear non-goals and refusal boundaries | 2 |

Deduct heavily if a specialist can make edits, run unsafe commands, or invent findings.
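
A specialist that earns these points is narrow by construction. The sketch below shows one plausible security-checker definition, again assuming a Claude Code style agent file; the read-only tool names are assumptions.

```markdown
---
name: security-checker
description: Reviews a diff for concrete security issues such as
  injection, missing auth checks, and leaked secrets. Read-only.
tools: Read, Grep, Glob
---

You review code changes for security issues only.

For every finding, report a severity label (from the
project-conventions skill), the file and line, an evidence snippet,
and why the issue is exploitable.

Non-goals: style, test coverage, and overall verdicts; those belong
to other agents. Refuse to edit files or run commands. If you find
nothing concrete, say so plainly instead of inventing vague risks.
```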

4. Skills and Reusable Knowledge: 10 Points

| Criteria | Points |
| --- | --- |
| Includes an owasp-review skill with a practical security review checklist | 3 |
| Includes a project-conventions skill with local review standards and severity labels | 3 |
| Skill descriptions are specific enough to support on-demand discovery | 2 |
| Agents use skills at appropriate moments rather than embedding all knowledge in one prompt | 2 |

Skills should be short, targeted, and reusable across future repos.
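
The description field is what makes a skill discoverable, so it should name the situations that trigger it. Below is a compressed sketch of an owasp-review skill, assuming the SKILL.md layout with name/description frontmatter; the checklist items are illustrative.

```markdown
---
name: owasp-review
description: Security review checklist based on the OWASP Top 10. Use
  when reviewing a PR diff for injection, broken auth, secret exposure,
  or unsafe handling of untrusted input.
---

# OWASP-Oriented Review Checklist

- Injection: is user input parameterized, escaped, or validated?
- Broken auth: do changed routes keep their authentication checks?
- Secrets: does the diff add tokens, keys, or passwords?
- Untrusted input: is deserialization or file handling guarded?

Report each hit with file, line, severity, and an evidence snippet.
```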

5. Permissions and Safety: 15 Points

| Criteria | Points |
| --- | --- |
| Permissions follow least privilege for each agent | 4 |
| The repo documents permission rationale for all agents | 3 |
| Review agents do not auto-merge, approve, push, or modify code | 3 |
| test-impact is limited to safe local test commands and reports failures honestly | 2 |
| The workflow includes secret-handling guidance and does not place real secrets in fixtures | 2 |
| MCP or GitHub access is scoped and documented | 1 |

A capstone that can change production state during review cannot receive more than 70 points, even if other parts work.
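
One way to make least privilege inspectable is a checked-in settings file. Below is a sketch in the .claude/settings.json permission-rule style; JSON cannot carry comments, so note here that the rule patterns (tool names and command matchers) are assumptions to adapt to the project's actual toolchain.

```json
{
  "permissions": {
    "deny": [
      "Edit",
      "Write",
      "Bash(git push:*)",
      "Bash(gh pr merge:*)"
    ],
    "allow": [
      "Read(./evals/**)",
      "Bash(npm test:*)"
    ]
  }
}
```

The deny list is the part to check first: it is what prevents a review agent from editing, pushing, or merging regardless of how a prompt is phrased.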

6. MCP Integration: 8 Points

| Criteria | Points |
| --- | --- |
| Documents or implements PR metadata and diff retrieval through the GitHub MCP server | 3 |
| Explains required setup, scopes, and expected inputs | 2 |
| Maintains a local fixture fallback for offline evaluation | 2 |
| Clearly states data privacy limits for external tool calls | 1 |

If the learner cannot run MCP in their environment, award up to 6 points here for a complete, credible integration plan plus working local fixtures.
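
The documentation rows are easiest to satisfy with a checked-in server config plus a README note on scopes. A plausible .mcp.json sketch follows; the package name, env-var expansion, and token variable are all assumptions (the GitHub MCP server has shipped under different names and runtimes), so substitute whatever server you actually use and grant it a read-only, minimally scoped token.

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}
```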

7. Evals: 10 Points

| Criteria | Points |
| --- | --- |
| Provides good-pr.md, risky-pr.md, and stylistic-pr.md fixtures | 2 |
| Each fixture has an expected verdict and review target | 2 |
| The system correctly classifies the risky fixture as request-changes | 2 |
| The system avoids blocking the good fixture | 1 |
| The system treats stylistic issues proportionally | 1 |
| Eval results are recorded with actual verdicts, misses, and follow-up changes | 2 |

The risky fixture is the most important eval. A system that misses a clear secret or auth bypass cannot receive full eval credit.
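
The fixture format is the learner's choice; what matters is that the expected verdict is checkable and the planted issue is unambiguous. One workable shape for risky-pr.md, sketched below with an invented vulnerability and a deliberately fake credential (never commit a real one):

```markdown
---
expected_verdict: request-changes
review_target: hardcoded credential and unauthenticated admin route
---

# Risky PR: add admin export endpoint

Changed file: app/routes/admin.py

    + API_KEY = "sk-FAKE-FIXTURE-KEY"   # fake value, for this eval only
    + @app.route("/admin/export")       # new route, no auth check
    + def export_all():
    +     return dump_users()
```

Recording the actual verdict next to the expected one after each run keeps misses and follow-up changes visible to the grader.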

8. Demo and Reflection: 2 Points

| Criteria | Points |
| --- | --- |
| Demo shows the workflow on at least two fixtures, including the risky fixture | 1 |
| Reflection explains one design tradeoff, one failure mode, and one future improvement | 1 |

Automatic Caps

Apply these caps after totaling points:

| Condition | Maximum Score |
| --- | --- |
| No runnable /review-pr command | 65 |
| Missing orchestrator agent | 60 |
| Missing two or more required specialist agents | 70 |
| No eval fixtures | 75 |
| Unsafe permissions allow review agents to edit, push, approve, merge, or deploy | 70 |
| Real secrets are committed in fixtures or docs | 60 |
| Work cannot be inspected because required files are missing from the repo | 50 |

Grading Notes for Instructors

Look for evidence before polish. A plain but reliable pipeline should score higher than a flashy demo that cannot be rerun.

Recommended grading process:

  1. Read the README and run the documented local fixture flow.
  2. Inspect agent, command, and skill files for scope and permissions.
  3. Run or review outputs for all three eval fixtures.
  4. Check that the risky fixture produces a blocking finding with evidence.
  5. Apply automatic caps last.

Learner Self-Check

Before submission, learners should confirm:

  • I can run /review-pr evals/fixtures/good-pr.md and get a non-blocking verdict.
  • I can run /review-pr evals/fixtures/risky-pr.md and get request-changes.
  • I can run /review-pr evals/fixtures/stylistic-pr.md and get proportional style feedback.
  • My orchestrator calls all three specialist agents.
  • My skills are discoverable and not just unused documents.
  • My permissions are documented and least-privilege.
  • My README includes setup, usage, architecture, eval results, and known limits.