# Capstone: Multi-Agent Code-Review Pipeline

## Outcome
Build, evaluate, and present a working OpenCode system that reviews pull requests by coordinating multiple specialized agents. By the end, your repo should contain a reusable /review-pr <pr-number-or-fixture> command, a primary orchestrator agent, three specialist subagents, two custom skills, scoped permissions, and three eval fixtures that prove the workflow can distinguish clean, risky, and stylistic PRs.
This is the course's final integration task. It asks you to combine prompting, commands, skills, agents, subagent delegation, permissions, MCP usage, and evaluation into one production-shaped workflow.
## Scenario
Your team wants a first-pass PR review assistant. It should not approve code or replace human review. It should collect evidence, classify risk, and produce a concise review packet a developer can act on.
The pipeline must answer four questions for every PR:
- What changed?
- Is anything security-sensitive or unsafe?
- Does it follow the project's style and conventions?
- What tests are affected, and what should be run before merge?
## Required Build
Create the capstone inside the provided starter repo. You may adapt file names only if your instructor approves, but the required OpenCode concepts must remain visible in version control.
### 1. Primary Agent: reviewer-orchestrator
The orchestrator owns the overall review. It must:
- Accept a PR number, URL, or local fixture path.
- Fetch or read the PR diff.
- Decide which specialist agents to call.
- Pass each specialist a clear handoff contract.
- Combine specialist findings into one final review.
- Separate evidence from recommendations.
- Avoid making code changes during review.
Minimum final output shape:
```markdown
# PR Review

## Verdict
<approve-with-notes | request-changes | needs-human-review>

## Summary
<2-5 bullets describing the change>

## Findings
<ordered by severity, each with file/line evidence when available>

## Test Impact
<tests run, tests recommended, untested risk>

## Follow-Up Questions
<questions for the author, or "None">
```
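If you want a concrete starting point, the orchestrator can be sketched as an agent file like the one below. The path, frontmatter keys, and tool names follow the agent format used earlier in the course and are assumptions to adapt, not a fixed schema.

```markdown
---
# Assumed path: .opencode/agent/reviewer-orchestrator.md (adjust to your setup)
description: Coordinates first-pass PR review and delegates to specialist subagents
mode: primary
tools:
  write: false   # review only; never edits code
  edit: false
  bash: false    # keep disabled unless your fixture flow needs it
---
You run a first-pass PR review. Given a PR number, URL, or local fixture
path, fetch or read the diff, delegate to security-checker, style-linter,
and test-impact with a clear handoff (what changed, where, what to check),
then merge their findings into the "PR Review" format above. Keep evidence
separate from recommendations and never modify code.
```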
### 2. Subagent: security-checker
The security checker is read-only. It must look for:
- Hardcoded secrets, tokens, private keys, and credentials.
- Authentication or authorization bypasses.
- Unsafe input handling and injection risks.
- Sensitive logging or data exposure.
- Dependency or supply-chain concerns visible in the diff.
- OWASP-style risks relevant to the changed code.
It should report only evidence-backed findings. If a risk is speculative, label it as a question or residual risk.
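A read-only subagent can follow the same file shape. The sketch below is illustrative, and style-linter can reuse it with a different description and prompt.

```markdown
---
# Assumed path: .opencode/agent/security-checker.md (adjust to your setup)
description: Read-only security review of a PR diff; reports only evidence-backed risks
mode: subagent
tools:
  write: false
  edit: false
  bash: false
---
Review only the diff you are given. Look for hardcoded secrets, auth or
access-control bypasses, injection risks, sensitive logging, and visible
dependency or supply-chain concerns. Cite file and line for every finding;
label anything speculative as a question or residual risk.
```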
### 3. Subagent: style-linter
The style linter is read-only. It must look for:
- Violations of local project conventions.
- Naming, formatting, and organization issues.
- Overly complex functions or unclear control flow.
- Missing docs or comments where the code is hard to understand.
- Inconsistent error handling.
It should not block a PR for personal preference. It should distinguish project conventions from optional polish.
### 4. Subagent: test-impact
The test-impact agent may run safe, local test commands. It must:
- Identify which behavior changed.
- Map changes to existing tests when possible.
- Run the smallest relevant test command available.
- Report commands run and results.
- Recommend missing tests.
- State when tests could not be run and why.
This agent must not install packages, modify source files, or run networked/deploy commands.
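Because this is the only agent that needs any shell access, its sketch differs from the read-only ones mainly in the tools block; the narrow command allow-list itself is shown in the permissions section below. As before, the path and keys are assumptions.

```markdown
---
# Assumed path: .opencode/agent/test-impact.md (adjust to your setup)
description: Maps a PR diff to affected tests and runs the smallest safe local test command
mode: subagent
tools:
  write: false
  edit: false
  bash: true   # restricted to safe local test commands via the permission config in section 8
---
Identify which behavior changed, map it to existing tests, and run the
smallest relevant local test command. Report the commands you ran and
their results, recommend missing tests, and say clearly when tests could
not be run and why. Never install packages or touch the network.
```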
### 5. Custom Command: /review-pr
Create a reusable command that invokes the orchestrator with a PR number, PR URL, or fixture file. Its description should state which inputs are accepted and what output to expect.
Example use:
```
/review-pr 42
/review-pr https://github.com/example/app/pull/42
/review-pr evals/fixtures/risky-pr.md
```
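A minimal command definition might look like the sketch below, assuming the markdown command format and $ARGUMENTS placeholder covered earlier in the course; the exact keys are illustrative.

```markdown
---
# Assumed path: .opencode/command/review-pr.md (adjust to your setup)
description: First-pass PR review; accepts a PR number, PR URL, or local fixture path
agent: reviewer-orchestrator
---
Review the pull request or fixture identified by: $ARGUMENTS

Accepted inputs: a PR number, a full PR URL, or a path under evals/fixtures/.
Produce the standard "PR Review" output format.
```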
### 6. Custom Skills
Create at least two skills that agents can load on demand:
- owasp-review: a concise security review checklist grounded in OWASP-style risk categories.
- project-conventions: the repo's review standards, severity labels, test policy, and output expectations.
Your skill descriptions matter. They should be specific enough that agents load the right skill without being explicitly told every time.
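As one possible layout, assuming skills live in their own folder with a short frontmatter description (adjust the path and fields to match how skills were set up earlier in the course):

```markdown
---
# Assumed path: .opencode/skill/owasp-review/SKILL.md (adjust to your setup)
name: owasp-review
description: Use when reviewing a PR diff for security issues; a concise OWASP-style checklist covering injection, auth, secrets, data exposure, and dependencies
---
# OWASP-Style Review Checklist
- Injection: is user input validated, parameterized, or escaped?
- Broken auth or access control: are any checks added, changed, or weakened?
- Secrets: any credentials, tokens, or keys visible in the diff?
- Sensitive data exposure: logging, error messages, API responses.
- Vulnerable dependencies: new or bumped packages worth a closer look.
```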
### 7. MCP Integration
Wire the workflow to GitHub through MCP so that, in a real project, the orchestrator can fetch PR metadata and diffs directly.
At minimum, document:
- Which GitHub MCP server or GitHub tool integration you used.
- What permissions/scopes it needs.
- How the orchestrator falls back to local fixture review when MCP is unavailable.
- What data must never be sent to external services.
If your environment cannot run MCP during the capstone, your starter repo must still include a clear integration plan and your demo must use the local fixtures.
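For the documentation step, a short config fragment is enough to show intent. The fragment below follows the common shape of an MCP entry in opencode.json; the server package name and field names are assumptions, so verify them against the GitHub MCP server you actually use, and remove the comments if your config must be strict JSON.

```jsonc
// Fragment of opencode.json (assumed shape; verify against your OpenCode version)
{
  "mcp": {
    "github": {
      "type": "local",
      // The package name is an assumption; substitute the GitHub MCP server you chose.
      "command": ["npx", "-y", "@modelcontextprotocol/server-github"],
      "environment": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<read from your environment; never commit it>"
      },
      "enabled": true
    }
  }
}
```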
### 8. Permissions Rationale
Document permissions for every agent. The goal is least privilege:
| Agent | File Read | File Edit | Bash | Network/MCP | Rationale |
|---|---|---|---|---|---|
| reviewer-orchestrator | Yes | No | No by default | GitHub MCP only | Coordinates review and fetches diff. |
| security-checker | Yes | No | No | No | Reviews evidence without side effects. |
| style-linter | Yes | No | No | No | Checks conventions without side effects. |
| test-impact | Yes | No | Safe local tests only | No | Verifies impact without changing code. |
Your implementation may differ, but every permission must be justified.
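One way to express this table in configuration is sketched below. Key names and pattern support vary by OpenCode version, and per-agent overrides may live in agent frontmatter instead, so treat it as a starting point and remove the comments for strict JSON.

```jsonc
// Fragment of opencode.json (assumed shape; the bash patterns are illustrative)
{
  "permission": {
    "edit": "deny",           // no agent edits files during review
    "webfetch": "ask",
    "bash": {
      "*": "deny",            // default: no shell access
      "npm test*": "allow",   // allow only the project's safe, local test commands
      "pytest*": "allow"
    }
  }
}
```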
### 9. Evals
Provide three PR fixtures and make the system classify them correctly:
| Fixture | Expected Classification | Required Behavior |
|---|---|---|
| good-pr.md | approve-with-notes | Finds no blocking issues and recommends reasonable tests. |
| risky-pr.md | request-changes | Flags a concrete security risk with evidence. |
| stylistic-pr.md | approve-with-notes or needs-human-review | Identifies style issues without overstating severity. |
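If you are unsure what a fixture should contain, a small markdown file that carries the same information the orchestrator would get from GitHub is enough; the layout below is illustrative and the field names are up to you.

````markdown
<!-- evals/fixtures/risky-pr.md (illustrative layout) -->
# PR: Add login endpoint

## Description
Adds a /login route and a helper for issuing session tokens.

## Diff
```diff
+ SECRET_KEY = "sk-live-FAKE-FOR-EVAL"  # hardcoded credential, planted for the eval
+ query = f"SELECT * FROM users WHERE name = '{username}'"  # string-built SQL
```
````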
You may add a simple manual eval checklist or script, but no package installs are required.
### 10. Demo and Writeup
Submit:
- A 10-minute recorded demo.
- A README explaining setup, usage, architecture, permissions, eval results, and known limits.
- A final review output for each fixture.
## Suggested Work Plan

### Step 1: Start With the Output Contract
Before writing agents, draft the final review format and severity labels. This keeps all subagents aligned.
Recommended severities:
- blocker: must fix before merge.
- high: likely serious bug, security risk, or data loss.
- medium: correctness or maintainability risk.
- low: polish, clarity, or minor convention issue.
- question: needs author clarification.
### Step 2: Build the Local Fixture Path
Make /review-pr evals/fixtures/good-pr.md work before connecting GitHub MCP. Local fixtures are easier to debug and make your evals repeatable.
### Step 3: Add Specialist Agents One at a Time
Add security-checker first, then style-linter, then test-impact. After each agent, run all three fixtures and compare output to the expected classifications.
### Step 4: Add GitHub MCP
Once local review works, add live PR fetching. Keep the same internal review contract so the rest of the pipeline does not change.
### Step 5: Harden Permissions
Try to make each agent do something outside its role. Confirm it refuses or lacks permission.
### Step 6: Run Evals and Record Results
Run all three fixtures and record:
- Input fixture.
- Expected verdict.
- Actual verdict.
- Blocking findings found.
- False positives or misses.
- Changes made after the eval.
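A plain markdown table in the README is enough for this record; fill in the actual results from real runs rather than from expectations.

```markdown
| Fixture | Expected verdict | Actual verdict | Blocking findings | False positives / misses | Changes made after eval |
|---|---|---|---|---|---|
| good-pr.md | approve-with-notes | ... | ... | ... | ... |
| risky-pr.md | request-changes | ... | ... | ... | ... |
| stylistic-pr.md | approve-with-notes or needs-human-review | ... | ... | ... | ... |
```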
## Definition of Done
Your capstone is complete when:
- /review-pr works on all three local fixtures.
- The orchestrator delegates to all three subagents.
- The two required skills are present and used where relevant.
- Permissions are scoped and documented.
- The GitHub MCP integration is implemented or clearly documented with a local fallback.
- All three eval fixtures receive the expected classification.
- The README lets another learner run the project without your help.
- The demo shows at least one clean PR, one risky PR, and one style-heavy PR.
## Constraints
- Do not let the system auto-merge, approve, push, or edit code as part of review.
- Do not expose secrets in prompts, fixtures, screenshots, or recordings.
- Do not require paid services beyond the model provider and GitHub access already used in the course.
- Do not rely on hidden local files. The submitted repo must contain all required prompts, skills, commands, fixtures, and documentation.
## Stretch Goals
- Add confidence scores for each finding.
- Add a fourth specialist for accessibility, performance, or dependency review.
- Produce a machine-readable JSON summary in addition to Markdown.
- Compare two model choices for the same fixture and document tradeoffs.
- Add a lightweight eval runner that checks whether verdict labels match expected labels.