# Reflection
Reflection phase domain knowledge including pattern evaluation, observation analysis, and evidence-based learning. Use when analyzing task outcomes, evaluating pattern effectiveness, or updating knowledge bases with learnings. Activates for retrospective and learning tasks. DO NOT use for research, planning, or implementation.
This skill provides declarative knowledge for the reflection phase: methodology, criteria, schemas, and principles. For the procedural workflow (orchestration steps), see the /reflect command.
## Evidence-Based Learning (CRITICAL)
All pattern evaluations MUST be backed by observation evidence.
Reflection is NOT about opinions or intuitions; it is about extracting learnings from captured task trajectories:
- "Pattern X seems useful" - **NOT VALID** (no evidence)
- "Pattern X applied in implementation observations (via task_get/timeline, event 142), resulting in 40% token savings (measured)" - **VALID** (evidence-based)
### Why Observations Matter
| Purpose | How Observations Help |
|---|---|
| Concrete evidence | Pattern effectiveness backed by real application |
| Challenge documentation | Shows problems encountered and how resolved |
| Pattern tracking | Shows which patterns were actually applied |
| Impact measurement | Time saved, errors avoided, quality improved |
| Playbook evolution | Data-driven counter updates |
### Observation Requirements
Observations are recorded via observe() during KLI commands and surfaced via:
- task_get() — shows the last 3 observations
- timeline(limit=50) — shows all events including observations
Evidence categories:
- Research observations — research iteration trajectory, findings, agent effectiveness
- Planning observations — planning decisions, phase design rationale
- Implementation observations — TDD cycles, challenges, verification results

**No Observations = No Reflection**: Cannot evaluate patterns without evidence, cannot update counters without justification.
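Before reflecting, evidence can be pulled with the calls above. A minimal gathering sketch in the same illustrative notation used later in this skill (the arguments are examples, not required values):

```
task_get()                     # task metadata plus the last 3 observations
timeline(limit=50)             # full event stream, including all observations
obs_search(query="pattern X")  # targeted evidence for a specific pattern
```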
## Sequential Execution Pattern
**reflector → curator** (MUST be sequential, NOT parallel)
The agents have dependencies:
1. Reflector reads the event stream (via task_get + timeline) and produces reflection.md
2. Curator reads reflection.md and updates playbooks via PQ mutations ((:feedback! ...), (add! ...), (:evolve! ...))
**Why Sequential:** Each agent depends on the previous agent's output. Spawning them in parallel means later agents have no input to work with.
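A minimal sketch of that dependency, assuming a hypothetical `run_agent` spawn helper (not a documented tool):

```
# Sequential: the curator has no input until the reflector has written reflection.md.
reflection = run_agent("reflector", task_id)  # hypothetical helper; reflector reads task_get + timeline
run_agent("curator", reflection)              # curator reads reflection.md, emits PQ mutations
```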
## Pattern Evaluation Methodology
### What to Extract from Observations
Use task_get() + timeline(limit=50) to surface observations. Look for:
**From research phase observations:**
- Which agents were effective?
- Iteration count (fewer is better)
- Gap analysis and resolution
- Exit criteria achievement
- Evidence: Agent effectiveness notes, iteration counts

**From planning phase observations:**
- Research artifact reuse (token savings from reusing research.md)
- Plan iteration count, phase design decisions
- Clarifying questions asked
- Evidence: Research reuse notes, iteration counts

**From implementation phase observations:**
- TDD discipline (Red → Green → Refactor documented)
- Design principles applied (Extensibility, Composability, Parametricity)
- Verification attempts per phase
- Challenges encountered and resolved
- Evidence: TDD iterations, refactoring notes, verification results
### Pattern Effectiveness Criteria
**Helpful Indicators:**
- ✅ Pattern was applied (documented in observations)
- ✅ Led to positive outcome (faster, fewer errors, cleaner code)
- ✅ Had measurable impact (X% faster, Y fewer iterations)
- ✅ Matched intended use case
- ✅ Would recommend using again

**Harmful Indicators:**
- ❌ Pattern was applied (documented in observations)
- ❌ Led to negative outcome (slower, more errors, confusion)
- ❌ Had measurable negative impact
- ❌ Mismatched use case or misleading
- ❌ Would NOT recommend using again

**Neutral (No Counter Change):**
- ⚪ Pattern mentioned but not actually applied
- ⚪ Applied but no observable impact
- ⚪ Insufficient evidence to evaluate
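A minimal sketch of how these criteria combine into a counter recommendation (an illustrative encoding, not a documented API):

```python
def classify(applied: bool, outcome: str) -> str:
    """Map observation evidence to a counter recommendation.

    outcome: "positive", "negative", or "unclear" (insufficient evidence).
    """
    if not applied or outcome == "unclear":
        return "neutral"   # mentioned-only, no observable impact, or not enough evidence
    return "helpful" if outcome == "positive" else "harmful"
```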
### Evaluation Process
1. Find pattern reference in observations (e.g., "artifact reuse saved tokens")
2. Extract context: What was done? What was the outcome?
3. Look for measurable evidence: time saved, errors avoided, quality improved
4. Classify: Helpful, Harmful, or Neutral
5. Document evidence in reflection.md
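A worked example of one evaluated pattern, in the reflection.md shape defined below (the pattern id and numbers are hypothetical):

```markdown
### [ace-012]: Research Artifact Reuse
- **Applied in**: timeline event 142
- **Context**: Planner reused research.md instead of re-running research agents
- **Outcome**: ~40% token savings in the planning phase (measured in observations)
- **Effectiveness**: Helpful
- **Recommendation**: increment helpful
```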
## Harm Signal Tier Definitions
Pattern harm is classified into tiers for appropriate response:
### Tier 1: Auto-Action (Definitive Harm)
**Signals:**
- outcome=FAILURE recorded
- Git reverts of pattern application
- Explicit user rejection ("that didn't work")
- Test failures directly caused by pattern

**Response:** Auto-increment harmful counter. Clear evidence of damage.
### Tier 2: Flag for Review (Probable Harm)
**Signals:**
- Excessive iterations (>5 for simple task)
- Implicit correction (user redoes work differently)
- Confusion requiring clarification
- Time wasted on wrong approach

**Response:** Increment harmful counter with review note. Needs human judgment.
### Tier 3: Track Only (Uncertain)
**Signals:**
- Minor iterations (normal debugging)
- Context mismatch (pattern applied to wrong domain)
- Partial success (worked but not optimal)

**Response:** Track in reflection.md but no counter change. Insufficient evidence.
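The tier-to-response mapping, condensed (an illustrative encoding of the rules above):

```python
# Harm tier -> curator response (see tier definitions above).
HARM_TIER_RESPONSE = {
    1: "auto-increment harmful counter (definitive harm)",
    2: "increment harmful counter with review note (needs human judgment)",
    3: "track in reflection.md only, no counter change (insufficient evidence)",
}
```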
## New Pattern Discovery
### When to Identify a New Pattern
- Novel approach used that isn't in playbooks
- Recurring solution that proved effective (seen 2+ times)
- Workaround for common issue
- Integration technique that worked well
### New Pattern Documentation Template
```markdown
### New Pattern Discovered
**Pattern**: [temp-id] <Short description>
**Context**: <When this pattern applies>
**Approach**: <What to do>
**Outcome**: <Result with evidence from this task>
**Recommendation**: <Add to [playbook-name]>
**Domain**: <ace | nix | lisp | depot-organization>
```
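A filled-in example of the template (all values are hypothetical):

```markdown
### New Pattern Discovered
**Pattern**: [tmp-001] Reuse research.md during planning instead of re-running research agents
**Context**: Planning phase of tasks that already have a completed research artifact
**Approach**: Read the existing research.md before spawning any research agents
**Outcome**: ~40% token savings in the planning phase (measured in observations)
**Recommendation**: Add to the ace playbook
**Domain**: ace
```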
### Quality Criteria for New Patterns
| Criterion | Requirement |
|---|---|
| Specific | Actionable advice, not vague guidance |
| Evidence | Effectiveness proven in observations |
| Reusable | Applicable in similar future contexts |
| Novel | Not covered by existing patterns |
## Agent Responsibilities
### Reflector Agent
| Aspect | Details |
|---|---|
| Input | Task ID — uses task_get() + timeline() for observations, reads artifacts |
| Process | Analyze observations from event stream, evaluate patterns, classify harm signals, discover new patterns |
| Output | reflection.md artifact with recommendations |
| Tools | mcp__task__*, Read, Grep, Search |
Note: The event stream (observations from observe()) is the source of truth for pattern evidence.
### Curator Agent
| Aspect | Details |
|---|---|
| Input | reflection.md artifact |
| Process | Update playbook via MCP tools, process harm signals by tier, add new patterns |
| Output | Playbook updates via PQ mutations ((:feedback! ...), (add! ...), (:evolve! ...)) |
| Tools | mcp__playbook__pq_query, Read |
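Issued through the curator's pq_query tool (mcp__playbook__pq_query), the mutations look roughly like this; passing the forms as query strings is an assumption here, and the elided arguments are left as in the forms above:

```
pq_query("(:feedback! ...)")                      # adjust a pattern's helpful/harmful counters
pq_query("(add! :domain :ace :content \"...\")")  # add a newly discovered pattern
pq_query("(:evolve! ...)")                        # evolve an existing pattern entry
```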
## Reflection Artifact Structure
### reflection.md Template
```markdown
---
date: <ISO timestamp>
task: <task directory>
status: complete
---
# Reflection: <Task Name>
## Patterns Applied
### [pattern-id]: <Pattern Name>
- **Applied in**: timeline event N
- **Context**: <What was done>
- **Outcome**: <Result with evidence>
- **Effectiveness**: Helpful | Harmful | Neutral
- **Recommendation**: increment helpful | increment harmful | no change
## Harm Signals
### Tier 1 (Auto-Action)
- [pattern-id]: <evidence of definitive harm>
### Tier 2 (Flagged for Review)
- [pattern-id]: <evidence of probable harm>
### Tier 3 (Tracked Only)
- [pattern-id]: <uncertain signal>
## New Patterns Discovered
### [temp-id]: <Pattern Name>
- **Context**: <When this applies>
- **Approach**: <What to do>
- **Outcome**: <Result with evidence>
- **Recommendation**: Add to <playbook-name>
## Challenges & Resolutions
### Challenge: <Description>
- **Context**: <When encountered>
- **Resolution**: <How resolved>
- **Pattern**: <Existing or new>
## Playbook Update Recommendations
**Playbook Updates:**
- [pattern-NNN]: increment helpful (evidence: <ref>)
- [pattern-MMM]: increment harmful (evidence: <ref>)
- Add new pattern: [temp-id] via `(add! :domain :ace :content "...")`
## Summary
**Patterns Evaluated:** <N>
**Helpful:** <count> | **Harmful:** <count> | **Neutral:** <count>
**New Patterns:** <count>
**Harm Signals:** Tier 1: <N>, Tier 2: <M>, Tier 3: <K>
**Key Learnings:** <3-5 bullet points>
```
## TQ and Observation Tools for Reflection
Use these to gather evidence for pattern evaluation:
```
obs_search(query="pattern X applied")        # Find observations mentioning a pattern
enriched_retrieve(k=10)                      # Context-aware retrieval for current task
obs_feedback(text="...", outcome="success")  # Record observation quality feedback
task_query("(query \"plan\")")               # Review plan phases and their status
task_query("(query \"busy\")")               # Tasks with most observations (richest evidence)
```
## Reference
### Core Principles
- Document what IS, not what SHOULD BE
- Reflector agent analyzes observations
- Add explicit tracing for debugging
### Playbook Access
Patterns are managed via PQ queries, not file paths:
- All patterns: `(-> :all (:group-by :domain))`
- Proven patterns: `(-> (proven :min 3) (:take 10))`
- Domain-specific: `(-> :all (:where (domain= :lisp)) :ids)`
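As with task_query above, these queries are presumably passed as strings to mcp__playbook__pq_query; for example:

```
pq_query("(-> (proven :min 3) (:take 10))")  # surface proven patterns before evaluating this task's evidence
```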