AI Model Delegation Strategy¶
Version: 1.3 Date: December 25, 2025 Status: Draft - Pending Final Review
Related Documentation¶
| Document | Purpose |
|---|---|
| COST_DETAIL.md | Full cost breakdown, token usage, SaaS pricing model |
| PROCESS_FLOWS.md | Workflow states including PendingFinalQA |
| ARCHITECTURE.md | System architecture, service centralization |
| CLIENT_FOLLOWUP.md | FUT-002: Tiered AI Validation Strategy discussion |
Executive Summary¶
This document defines the AI model delegation strategy for the Tax Practice AI system. It specifies which Claude model (Haiku, Sonnet, or Opus) handles each AI touchpoint, the routing rules between models, and the human-in-the-loop checkpoints required by IRS Circular 230 and firm quality standards.
AI Cost Target (SaaS provider COGS): $0.12/return (vs $0.40 baseline) through strategic model delegation and optimization. This is our internal cost per return, not client pricing.
Model Distribution Target: Haiku 60% | Sonnet 35% | Opus 5%
Table of Contents¶
- Model Capabilities and Pricing
- 1.1 Model Comparison
- 1.2 Batch API Discount
- 1.3 Cost Model at Scale
- AI Touchpoints and Model Assignments
- 2.1 Document Processing
- 2.2 Preliminary Analysis
- 2.3 Interactive Q&A
- 2.4 Review and Validation
- Human-in-the-Loop Checkpoints
- 3.1 Regulatory Requirements
- 3.2 Document Processing
- 3.3 Workflow Checkpoints
- 3.4 Process Flow Diagram
- Resolved Ambiguous Scenarios
- 4.1 Preliminary Analysis Threshold
- 4.2 Interactive Q&A Model Selection
- 4.3 Extraction Confidence Threshold
- 4.4 Opus Usage: Interactive vs Batch
- 4.5 Multi-Model Pipelines
- Additional Scenarios
- 5.1 Batch API Optimization
- 5.2 Context Window Management
- 5.3 Retry and Fallback Logic
- 5.4 Caching Strategy
- Quality Metrics and Ground Truth
- 6.1 Ground Truth Sources
- 6.2 Metric Definitions
- 6.3 Quality Feedback Loops
- Usage Caps and Reporting
- 7.1 Opus Usage Caps
- 7.2 Cap Granularity
- 7.3 Usage Visibility
- Implementation Phases
- 8.1 Phase 1: Foundation
- 8.2 Phase 2: Optimization
- 8.3 Phase 3: Enhancement
1. Model Capabilities and Pricing¶
1.1 Model Comparison¶
| Model | Strengths | Weaknesses | Cost (per 1M tokens) |
|---|---|---|---|
| Haiku 3 | Fast, cheap, structured tasks | Limited reasoning | In: $0.25, Out: $1.25 |
| Sonnet 4 | Balanced, strong extraction | Slower than Haiku | In: $3.00, Out: $15.00 |
| Opus 4.5 | Best reasoning, nuanced judgment | Slowest, highest cost | In: $5.00, Out: $25.00 |
Pricing as of Dec 2025 via AWS Bedrock. See AWS Bedrock Pricing.
1.2 Batch API Discount¶
- 50% cost reduction for batch processing (non-real-time)
- Target: 80% of non-interactive workloads through batch API
- Processing window: overnight via Airflow scheduling
1.3 Cost Model at Scale¶
| Returns/Year | AI Cost/Return | Annual AI Cost |
|---|---|---|
| 1,000 | $0.33 | ~$330/year |
| 5,000 | $0.33 | ~$1,650/year |
| 10,000 | $0.33 | ~$3,300/year |
Based on ~126,500 tokens/return lifecycle with 60/35/5 model delegation. See COST_DETAIL.md for token breakdown. The $0.33 figure is the delegation-only baseline; batch processing (Section 1.2) and caching (Section 5.4) are expected to bring the internal cost toward the $0.12/return target stated in the Executive Summary.
2. AI Touchpoints and Model Assignments¶
2.1 Document Processing Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-001 | Document Classification | Haiku | None | Yes |
| TP-002 | W-2 Extraction | Haiku | Sonnet if confidence <90% | Yes |
| TP-003 | 1099 Extraction | Haiku | Sonnet if confidence <90% | Yes |
| TP-004 | K-1 Extraction | Sonnet | Opus if confidence <85% | Yes |
| TP-005 | Brokerage Statement Extraction | Sonnet | None | Yes |
| TP-006 | Other Document Extraction | Sonnet | None | Yes |
Rationale:
- Standard forms (W-2, 1099) are highly structured → Haiku sufficient
- K-1s have variable formats and complex pass-through data → Sonnet default, Opus for low confidence
- Brokerage statements require table parsing and multi-page handling → Sonnet
Image Quality Escalation:
Document source affects model selection independent of document type:
| Source | Default Adjustment | Rationale |
|---|---|---|
| Native PDF | Use table above | Embedded text, reliable |
| High-quality scan | Use table above | Clean OCR |
| Phone photo | +1 model tier | Skew, lighting, blur |
| Low-res/blurry | +1 model tier | OCR uncertainty |
| HEIC (iPhone) | Convert first, then assess | Format handling |
| Handwritten content | → Human review | Too unreliable for AI |
Implementation:
ESCALATION = {"haiku": "sonnet", "sonnet": "opus", "opus": "opus"}

def escalate(base_model):
    return ESCALATION[base_model]  # haiku→sonnet, sonnet→opus

def adjust_for_image_quality(base_model, doc_metadata):
    # Handwritten content always routes to human review (see table above)
    if doc_metadata.has_handwriting:
        return "human_review"
    if doc_metadata.is_native_pdf:
        return base_model
    if doc_metadata.source == "phone_photo":
        return escalate(base_model)
    if doc_metadata.resolution < 150 or doc_metadata.blur_score > 0.3:
        return escalate(base_model)
    return base_model
2.2 Preliminary Analysis Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-007 | Simple Return Analysis (≤2 schedules, no business) | Haiku | Sonnet on anomaly | Yes |
| TP-008 | Standard Return Analysis (>2 schedules OR business/investment) | Sonnet | Opus for complex entities | Yes |
| TP-009 | Prior Year Comparison | Haiku | Sonnet if significant changes | Yes |
| TP-010 | Missing Document Detection | Haiku | None | Yes |
| TP-011 | Client Question Generation | Haiku | Sonnet for complex situations | Yes |
Decision (from client): Schedule type matters as much as count. Haiku for ≤2 schedules AND no business income; Sonnet for >2 schedules OR any business/investment complexity.
TP-007/TP-009 Anomaly Thresholds (Haiku → Sonnet):
Note: We add value on top of tax software variance tools rather than replacing them; these thresholds trigger deeper AI analysis, not duplicates of existing variance reports.
| Anomaly Type | Threshold | Rationale |
|---|---|---|
| Income change YoY | >25% | Matches IRS DIF flags |
| Deductions change | >20% or >$5K | IRS audit trigger |
| New schedule | Any not in PY | Life event likely |
| Missing PY source | W-2/1099 gone | Job loss, needs verify |
| Charitable giving | >10% of AGI | IRS scrutiny area |
| Schedule C loss | Any | Hobby loss rules |
| Home office | New claim | Documentation needed |
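A minimal sketch of how these thresholds could drive the Haiku → Sonnet escalation for TP-007/TP-009. The field names on the `current` and `prior_year` summary objects are illustrative, not an existing schema:

def should_escalate_analysis(current, prior_year):
    # Income change YoY >25%
    if prior_year.total_income and abs(current.total_income - prior_year.total_income) / prior_year.total_income > 0.25:
        return True
    # Deductions change >20% or >$5K
    delta = abs(current.total_deductions - prior_year.total_deductions)
    if delta > 5_000 or (prior_year.total_deductions and delta / prior_year.total_deductions > 0.20):
        return True
    # New schedule not present in prior year (likely life event)
    if set(current.schedules) - set(prior_year.schedules):
        return True
    # Prior-year W-2/1099 source missing this year (job loss, needs verification)
    if set(prior_year.income_sources) - set(current.income_sources):
        return True
    # Charitable giving >10% of AGI (IRS scrutiny area)
    if current.agi and current.charitable_contributions / current.agi > 0.10:
        return True
    # Any Schedule C loss (hobby loss rules) or a new home office claim
    if current.schedule_c_loss or current.new_home_office_claim:
        return True
    return False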
TP-011 Question Complexity Routing:
Haiku generates (simple data collection):
- Missing document requests
- Address/contact confirmations
- "Did you receive [expected doc]?"
Sonnet generates (judgment required):
- Life events (marriage, divorce, new child, death)
- Business vs hobby determination
- Multi-state residency questions
- Unusual income sources
- "Help us understand..." questions
2.3 Interactive Q&A Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-012 | FAQ-Pattern Matches | Haiku | None | No |
| TP-013 | Standard Q&A (staff) | Sonnet | Opus via "Request Expert Research" | No (interactive) |
| TP-014 | Complex Research Request | Opus | None | Yes (preferred) |
| TP-015 | Batch Overnight Research | Sonnet (batch) | None | Yes |
Decision (from client): Hybrid UX approach:
- Default to Sonnet for interactive Q&A
- Auto-downgrade to Haiku for FAQ-pattern matches
- "Request Expert Research" link for Opus (batch, positioned as more thorough)
- Admin-only "Escalate to Expert" for Opus interactive
2.4 Review and Validation Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-016 | Preparer Summary Generation | Haiku | None | Yes |
| TP-017 | Pre-Review Validation | Sonnet | Opus if issues found | Yes |
| TP-018 | Final Validation (Opus QA) | Opus (metadata only, batch) | None | Yes |
| TP-019 | E-file Rejection Analysis | Haiku | Sonnet for complex rejections | Yes |
Validation Strategy (see FUT-002 for discussion):
- Tiered self-validation - each model QAs its own work inline
- TP-017 - Sonnet pre-review escalates to Opus if issues found
- TP-018 - Opus reviews cached metadata only (not full docs), runs overnight batch
- Workflow - adds "Pending Final QA" status before client delivery
TP-018 Opus QA Detail:
Workflow integration (see PROCESS_FLOWS.md):
Approved → PendingFinalQA → (overnight batch)
↓
┌───────────┴───────────┐
↓ ↓
No issues Issues found
↓ ↓
PendingSignature RevisionsNeeded
What Opus reviews (metadata only, ~750 tokens):
- Key figures summary (income, deductions, refund/due)
- Flags from earlier analysis
- Prior year comparison highlights
- Confidence scores from extraction
- Any anomalies detected
What Opus does NOT review:
- Full document images
- Raw extraction output
- Complete prior year data
UX framing: "While reviewing our work, we noticed..."
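A sketch of the metadata-only payload assembled for the TP-018 batch job, assuming the per-return summary described in Section 5.2; attribute names are illustrative:

def build_final_qa_payload(return_summary):
    # ~750 tokens of cached metadata; full documents are never sent to Opus
    return {
        "key_figures": {
            "total_income": return_summary.total_income,
            "total_deductions": return_summary.total_deductions,
            "refund_or_balance_due": return_summary.refund_or_balance_due,
        },
        "analysis_flags": return_summary.flags,
        "prior_year_highlights": return_summary.prior_year_comparison,
        "extraction_confidence": return_summary.field_confidences,
        "anomalies": return_summary.anomalies,
    }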
Expedited QA Option (fee-based):
| Mode | Wait Time | Our Cost | Client Fee |
|---|---|---|---|
| Standard (batch) | 12-24 hrs | ~$0.005 | Included |
| Expedited (real-time) | ~2 min | ~$0.03 | $5-10/return |
Use cases for expedited:
- Filing deadline day
- Rush returns (already paying rush fee)
- Client-requested priority
Implementation:
- Button: "Expedite Final Review (+$X)"
- Skips PendingFinalQA queue
- Runs Opus immediately
- Logs for billing
Cost comparison at 1K returns/year:
| Approach | Cost |
|---|---|
| Full-context Opus every return | ~$30/year |
| Summarized data Opus every return | ~$15/year |
| Metadata-only Opus batch | ~$5/year |
| Combined tiered approach | ~$8-10/year |
| Expedited (10% of returns) | +$3/year |
E-file Rejection Routing (TP-019):
Haiku handles simple/technical fixes:
| Code | Issue | Resolution |
|---|---|---|
| IND-181 | IP PIN missing | Prompt for IP PIN |
| Format errors | Schema validation | Auto-correct format |
| Math errors | Calculation mismatch | Recalculate |
| Missing fields | Required field blank | Prompt for data |
Sonnet escalation (investigation needed):
| Code | Issue | Why Complex |
|---|---|---|
| IND-031/032 | AGI mismatch | Amended return, paper filed, IRS lag |
| R0000-500/503 | SSN/name mismatch | Typo vs legal name change |
| IND-524 | DOB mismatch | Data entry vs SSA records |
| F8962-070 | Marketplace insurance | ACA reconciliation needed |
Human escalation (fraud/conflict indicators):
| Code | Issue | Action |
|---|---|---|
| IND-507 | Dependent already claimed | Family dispute or ID theft |
| IND-516 | SSN claimed elsewhere | Possible fraud |
| R0000-902 | Duplicate return filed | Identity theft concern |
def route_rejection(error_code):
    # Human escalation - fraud/conflict
    if error_code in ['IND-507', 'IND-516', 'R0000-902']:
        return "human_review"
    # Sonnet - investigation needed
    if error_code.startswith(('IND-031', 'IND-032',
                              'R0000-500', 'R0000-503',
                              'IND-524', 'F8962')):
        return "sonnet"
    # Haiku - simple/technical fixes
    return "haiku"
3. Human-in-the-Loop Checkpoints¶
3.1 Regulatory Requirements (Circular 230)¶
| Checkpoint | Requirement | Implementation |
|---|---|---|
| Preparer Review | All returns must be reviewed by PTIN holder | AI Analysis → InPrep transition always requires human |
| EA/CPA Final Approval | Licensed professional sign-off | InReview → Approved requires EA/CPA action |
| Due Diligence Documentation | Document basis for positions | AI logs all analysis; preparer confirms |
3.2 Document Processing Checkpoints¶
| State Transition | Human Action | Trigger |
|---|---|---|
| UnknownType → Classified | Staff manually classifies | AI cannot determine document type |
| MediumConfidence → Verified | Staff reviews extraction | Confidence 70-94% |
| LowConfidence → ManualEntry | Staff enters data manually | Confidence <70% |
| HighConfidence → Verified | Staff spot-checks | Non-native PDF with 95%+ confidence |
Confidence Thresholds (refined per client input):
| Document Type | Auto-Verify | Review | Manual Entry |
|---|---|---|---|
| W-2 | 95% | 85-94% | <85% |
| 1099 series | 95% | 85-94% | <85% |
| K-1 | 98% | 90-97% | <90% |
| Dollar amounts | 98% | 90-97% | <90% |
| Name/address | 90% | 80-89% | <80% |
Confidence Calculation Method:
Confidence is calculated from multiple sources:
- OCR engine - SurePrep/Textract character-level scores
- AI extraction - Claude outputs confidence per field
- Format validation - SSN format, EIN format, checksums
- Cross-source agreement - OCR vs AI match
def calculate_field_confidence(field, ocr_result, ai_result):
    # Start with OCR confidence (0.0-1.0)
    base = ocr_result.confidence
    # Boost if AI extraction agrees
    if ocr_result.value == ai_result.value:
        base = min(base + 0.10, 1.0)
    else:
        base = base * 0.7  # Disagreement penalty
    # Boost if format validates
    if passes_format_check(field.type, ocr_result.value):
        base = min(base + 0.05, 1.0)
    # Penalty for known-difficult fields
    if field.type in ['k1_box', 'handwritten']:
        base = base * 0.9
    return base
Document-level confidence = minimum of all field confidences (weakest link).
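A sketch of how the document-type thresholds and the weakest-link rule might combine into a routing decision; the dictionary keys and returned state names are illustrative:

# (auto-verify floor, review floor) per document type; below the review floor → manual entry
CONFIDENCE_THRESHOLDS = {
    "w2": (0.95, 0.85),
    "1099": (0.95, 0.85),
    "k1": (0.98, 0.90),
}

def route_extraction(doc_type, field_confidences):
    doc_confidence = min(field_confidences)  # weakest link
    auto_verify, review = CONFIDENCE_THRESHOLDS[doc_type]
    if doc_confidence >= auto_verify:
        return "verified"        # HighConfidence (spot-check if not a native PDF)
    if doc_confidence >= review:
        return "pending_review"  # MediumConfidence → staff reviews extraction
    return "manual_entry"        # LowConfidence → staff enters data manually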
3.3 Workflow Checkpoints¶
| Workflow State | Human Actor | Decision Point |
|---|---|---|
| AIAnalysis → InPrep | Preparer | Review AI output, apply judgment |
| AIAnalysis → NeedsReview | Staff | Exception requires human decision |
| NeedsReview → InPrep | Staff | Resolve exception before continuing |
| InPrep → ReadyForReview | Preparer | Confirm preparation complete |
| InReview → Approved | Reviewer (EA/CPA) | Final quality gate |
| InReview → RevisionsNeeded | Reviewer | Issues found, return to preparer |
| Rejected → InPrep | Preparer | Fix e-file rejection |
| FraudReview → EACPAReview | EA/CPA | Escalation for duplicate/fraud |
3.4 Process Flow Diagram: AI to Human Handoffs¶
┌─────────────────────────────────────────────────────────────────┐
│ DOCUMENT PROCESSING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Upload → [Haiku: Classify] → Classification │
│ │ │
│ ├── Success → [Haiku/Sonnet: Extract] │
│ │ │ │
│ │ ├── High (95%+) → Auto ✓ │
│ │ ├── Med (70-94%) → HUMAN │
│ │ └── Low (<70%) → HUMAN │
│ │ │
│ └── Fail → HUMAN (manual classify) │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PRELIMINARY ANALYSIS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ All Docs Ready → [Haiku/Sonnet: Analyze] │
│ │ │
│ ├── Normal → Summary → PREPARER │
│ │ │
│ └── Exception → NeedsReview → STAFF │
│ │
│ *** AI NEVER proceeds to filing without human review *** │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ REVIEW & APPROVAL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PREPARER completes → ReadyForReview → EA/CPA REVIEWER │
│ │ │
│ ├── Approved ✓ │
│ │ │
│ └── Revisions → │
│ back to │
│ PREPARER │
│ │
└─────────────────────────────────────────────────────────────────┘
4. Resolved Ambiguous Scenarios¶
4.1 Preliminary Analysis Threshold¶
Question: When does a return qualify for Haiku (cheap) vs Sonnet (standard) preliminary analysis?
Resolution:
- Haiku: ≤2 schedules AND no business income AND no investment complexity
- Sonnet: >2 schedules OR any business/investment income OR K-1 present
Implementation:
def select_preliminary_model(return_data):
    schedule_count = len(return_data.schedules)
    has_business = return_data.has_schedule_c or return_data.has_schedule_e
    has_investment = return_data.has_schedule_d or return_data.has_k1
    if schedule_count <= 2 and not has_business and not has_investment:
        return "haiku"
    return "sonnet"
Assumption: Thresholds based on industry norms; validate with client after first tax season.
4.2 Interactive Q&A Model Selection¶
Question: How do we route staff Q&A to the right model without training them to always click "Expert"?
Resolution: Hybrid approach with intelligent defaults:
| Scenario | Model | UX Element |
|---|---|---|
| FAQ pattern match | Haiku | Instant response, no indicator |
| Standard question | Sonnet | Default, "Get Answer Now" |
| Complex research | Opus (batch) | "Request Expert Research" button |
| Urgent escalation | Opus (interactive) | Admin-only "Escalate to Expert" |
UX Positioning:
- "Request Expert Research" framed as more thorough, not cheaper
- Results delivered next business day
- Usage cap per firm with visibility reporting
FAQ Pattern Detection:
FAQ patterns = predictable questions Haiku can answer instantly.
Examples (route to Haiku):
- "What's the standard deduction for 2024?"
- "When is the filing deadline?"
- "What's the mileage rate?"
- "What documents do I need for a W-2?"
Not FAQ (route to Sonnet):
- "Should this client itemize?"
- "Is this income taxable?"
- "How do I handle this K-1?"
Detection: Embedding similarity to curated FAQ database.
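A sketch of the detection step, assuming questions and curated FAQ entries are embedded as unit-normalized vectors by whatever embedding model the platform standardizes on; the 0.85 cutoff is a placeholder to tune against logged Q&A:

import numpy as np

FAQ_SIMILARITY_THRESHOLD = 0.85  # placeholder cutoff, tuned against logged Q&A

def is_faq_match(question_vector, faq_matrix):
    # Cosine similarity reduces to a dot product for unit-normalized vectors
    similarities = faq_matrix @ question_vector
    return float(np.max(similarities)) >= FAQ_SIMILARITY_THRESHOLD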
Tax Knowledge Skills Architecture:
Skills = curated knowledge + prompt templates, versioned by tax year.
skills/
├── federal/
│ ├── individual_2024.md
│ └── business_2024.md
├── states/
│ ├── FL_2024.md
│ ├── GA_2024.md
│ └── ... (design for 50)
└── common/
├── deadlines_2024.md
├── forms.md
└── rates_2024.md
Skills rebuilt annually when tax code updates (public/predictable).
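A sketch of how skill files could be assembled into prompt context for a given return; file names follow the layout above, and the helper itself is illustrative:

from pathlib import Path

def load_skills(tax_year, states, base_dir="skills"):
    base = Path(base_dir)
    paths = [
        base / "federal" / f"individual_{tax_year}.md",
        base / "common" / f"deadlines_{tax_year}.md",
        base / "common" / f"rates_{tax_year}.md",
        base / "common" / "forms.md",
    ]
    paths += [base / "states" / f"{state}_{tax_year}.md" for state in states]
    # Missing state files are skipped until that state's skill is authored
    return [p.read_text() for p in paths if p.exists()]

For example, load_skills(2024, ["FL", "GA"]) would pull the federal, common, and Florida/Georgia skills for a two-state return.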
4.3 Extraction Confidence Threshold¶
Question: What confidence thresholds trigger human review?
Resolution: Document-type and field-level thresholds:
| Factor | Auto-Verify | Review | Manual Entry |
|---|---|---|---|
| W-2, 1099 (standard) | ≥95% | 85-94% | <85% |
| K-1 (complex) | ≥98% | 90-97% | <90% |
| Dollar amounts | ≥98% | 90-97% | <90% |
| Name/address | ≥90% | 80-89% | <80% |
Rationale: Dollar amounts have higher materiality; K-1s have more variable formats and require stricter thresholds.
4.4 Opus Usage: Interactive vs Batch¶
Question: When should Opus be available interactively vs forced to batch?
Resolution:
| Mode | Availability | Use Case |
|---|---|---|
| Batch (default) | All staff | "Request Expert Research" → overnight |
| Interactive | Admin only | "Escalate to Expert" → immediate |
UX Copy:
- Batch: "Submit to Expert" → "Your research request has been queued. Results will be ready by 8 AM tomorrow."
- Interactive: (admin only) "Escalate to Expert" → immediate response
Reporting: Usage caps with per-firm visibility (see Section 7).
4.5 Multi-Model Pipelines¶
Question: Should we chain models (Haiku → Sonnet → Opus) or use single model per task?
Resolution: Single model per task for V1, with architecture supporting future chaining.
V1 Implementation:
- Each AI touchpoint uses one model based on routing rules
- Output includes confidence score
- Routing logic is a separate layer (not embedded in prompts)
- All interactions logged for future routing model training
Future Enhancement:
- Train routing model on logged data
- Implement Haiku-first with auto-escalation
- A/B test single vs chained approaches
5. Additional Scenarios¶
5.1 Batch API Optimization¶
Not all AI tasks require real-time response. Batch processing via Anthropic's batch API is 50% cheaper.
Task Classification:
| Task | Interactive? | Batch Eligible? | Notes |
|---|---|---|---|
| Document classification | Partial | Yes | Batch overnight for folder uploads |
| Data extraction | Partial | Yes | Batch overnight for large uploads |
| Prior year comparison | Optional | Yes | Run nightly for next day review |
| Missing doc detection | Optional | Yes | Run nightly after new docs processed |
| Preparer Q&A | Yes | No | Real-time conversation required |
| Worksheet generation | Optional | Yes | Pre-generate overnight |
| Rejection analysis | No | Yes | Run when rejection received |
| Final QA (TP-018) | No | Yes | Overnight batch |
| Tax reminders | No | Yes | Scheduled batch job |
Key Insight: Only Preparer Q&A truly requires real-time response.
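A sketch of the batch-vs-interactive routing implied by the table; the task identifiers are illustrative, not an existing enum:

BATCH_ELIGIBLE = {
    "document_classification", "data_extraction", "prior_year_comparison",
    "missing_doc_detection", "worksheet_generation", "rejection_analysis",
    "final_qa", "tax_reminders",
}

def choose_processing_mode(task_type, live_session_requested=False):
    # Only preparer Q&A (or an explicit live session) needs a real-time response
    if task_type == "preparer_qa" or live_session_requested:
        return "interactive"
    if task_type in BATCH_ELIGIBLE:
        return "batch"  # queued into the overnight Airflow window at 50% cost
    return "interactive"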
UX Strategy: Default to Batch
When documents are uploaded:
"Thanks! I'll analyze these documents and have results ready for your review tomorrow morning. Need to discuss now? [Start Live Session]"
Benefits:
- Default path is cheapest (batch)
- User opts into real-time only when needed
- Sets expectation that AI "works overnight"
- Live sessions can be premium feature
Implementation: Airflow schedules batch jobs (see ARCHITECTURE.md):
- Queue document processing overnight after SmartVault sync
- Pre-generate worksheets for morning review
- Run analysis batch jobs during off-peak hours
For cost savings projections, see COST_DETAIL.md.
5.2 Context Window Management¶
Scenario: Complex returns with 50+ documents may exceed context limits.
Strategy:
| Approach | When to Use | Implementation |
|---|---|---|
| Metadata Caching | All returns | Extract once, cache as MD (<350 lines) |
| Document Summarization | Large doc sets | Haiku summarizes each doc first |
| Chunked Processing | Very large returns | Process in batches, aggregate results |
| Per-Return Summary | Every return | Maintain running summary file |
Token Budget per Return:
- Target: 15,000 tokens average (fits Haiku/Sonnet comfortably)
- Maximum: 50,000 tokens (large returns with summarization)
- Opus validation: 5,000 tokens (summarized data only)
Metadata Cache Structure:
/returns/{return_id}/
├── metadata.md # Extracted data summary (<350 lines)
├── prior_year_summary.md # Prior year comparison
├── flags.md # Anomalies and issues
└── documents/
├── w2_001_summary.md
├── 1099_001_summary.md
└── ...
Per-Document Metadata Format:
File: docs/{client_id}/{return_year}/{doc_id}_metadata.md
# Document Metadata
- **Type:** W-2
- **Source:** employer_acme_corp.pdf
- **Uploaded:** 2024-02-15
- **Confidence:** 98%
## Extracted Values
| Field | Value | Box |
|-------|-------|-----|
| Employer | Acme Corp | c |
| EIN | 12-3456789 | b |
| Wages | $85,432.00 | 1 |
| Federal Withholding | $12,543.00 | 2 |
## AI Notes
- Wages increased 8% from prior year ($79,100)
- Withholding rate: 14.7% (typical)
## Flags
- None
Per-Return Summary Format:
File: docs/{client_id}/{return_year}/return_summary.md
# Return Summary: John Smith (2024)
## Documents Received (12 of 15 expected)
| Doc | Type | Status | Key Value |
|-----|------|--------|-----------|
| W-2 Acme Corp | W-2 | ✓ Complete | $85,432 wages |
| 1099-INT Chase | 1099-INT | ✓ Complete | $1,234 interest |
## Missing Documents
- 1098 Mortgage Interest (expected based on prior year)
## Prior Year Comparison
| Item | 2023 | 2024 | Change |
|------|------|------|--------|
| Total Income | $82,500 | $87,233 | +5.7% |
## Flags & Questions
1. Large charitable contribution ($5,000) - verify documentation
Implementation:
- Batch extract → Create metadata MD on first document scan
- Store in S3 → Alongside original document
- Index in Aurora → Track metadata file location
- AI reads cache first → Only re-scan if metadata missing or stale
- Refresh trigger → Re-extract if document updated or confidence < 90%
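A sketch of the read-cache-first flow above, assuming hypothetical `cache` and `extract_document` helpers backed by the S3/Aurora storage described in ARCHITECTURE.md:

CONFIDENCE_REFRESH_THRESHOLD = 0.90

def get_document_metadata(doc, cache, extract_document):
    entry = cache.get(doc.id)
    stale = entry is not None and entry.updated_at < doc.updated_at
    low_confidence = entry is not None and entry.confidence < CONFIDENCE_REFRESH_THRESHOLD
    if entry is None or stale or low_confidence:
        entry = extract_document(doc)  # full re-extraction, batched where possible
        cache.put(doc.id, entry)
    return entry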
For cost savings projections, see COST_DETAIL.md.
5.3 Retry and Fallback Logic¶
Scenario: What happens when an AI call fails or returns low-confidence output?
Strategy:
| Failure Type | Retry? | Fallback | Human Escalation |
|---|---|---|---|
| API timeout | Yes, 3x with backoff | Same model | After 3 failures |
| Rate limit | Yes, with delay | Same model | Never (wait) |
| Low confidence (Haiku) | No | Escalate to Sonnet | After Sonnet |
| Low confidence (Sonnet) | No | Escalate to Opus | After Opus |
| Low confidence (Opus) | No | None | Immediate |
| Malformed output | Yes, 1x | Same model | After retry |
Backoff Schedule:
- Retry 1: 2 seconds
- Retry 2: 4 seconds
- Retry 3: 8 seconds
- Then: Queue for human review
Confidence Escalation Flow:
Haiku (conf < threshold)
→ Sonnet (conf < threshold)
→ Opus (conf < threshold)
→ HUMAN REVIEW (always terminal)
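A sketch combining the backoff schedule and the escalation chain; `call_model` stands in for the actual Bedrock client wrapper, and the confidence threshold comes from the routing rules for the touchpoint:

import time

BACKOFF_SECONDS = [2, 4, 8]
ESCALATION_CHAIN = {"haiku": "sonnet", "sonnet": "opus", "opus": None}

def call_with_retry(call_model, model, request):
    # Initial attempt plus three retries at 2s/4s/8s; give up after that
    for delay in BACKOFF_SECONDS + [None]:
        try:
            return call_model(model, request)
        except TimeoutError:
            if delay is None:
                return None  # retries exhausted → queue for human review
            time.sleep(delay)

def run_with_escalation(call_model, request, confidence_threshold, model="haiku"):
    while model is not None:
        result = call_with_retry(call_model, model, request)
        if result is None:
            return "human_review"  # repeated API failures
        if result.confidence >= confidence_threshold:
            return result
        model = ESCALATION_CHAIN[model]  # haiku → sonnet → opus → terminal
    return "human_review"  # low confidence even from Opus → always human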
5.4 Caching Strategy¶
Scenario: Same questions asked across multiple clients (e.g., "What's the standard deduction for 2024?").
Strategy:
| Cache Type | Scope | TTL | Use Case |
|---|---|---|---|
| FAQ Cache | Global | Tax year | Standard deduction, filing deadlines |
| Regulation Cache | Global | Tax year | IRS rules, state requirements |
| Firm Guidelines Cache | Firm | Until updated | Firm-specific policies |
| Prior Year Cache | Per-client | Permanent | Client's historical data |
| Session Cache | Per-session | 1 hour | Avoid re-asking same question |
Cache Key Structure:
faq:{tax_year}:{normalized_question_hash}
reg:{tax_year}:{topic}:{jurisdiction}
firm:{firm_id}:{guideline_type}
client:{client_id}:{tax_year}:summary
session:{session_id}:{query_hash}
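A sketch of key construction for two of the cache types above; the normalization-and-hash step is an assumption intended to let phrasing variants hit the same FAQ entry:

import hashlib

def faq_cache_key(tax_year, question):
    # Normalize and hash so phrasing variants map to the same cached answer
    normalized = " ".join(question.lower().split())
    digest = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    return f"faq:{tax_year}:{digest}"

def client_summary_key(client_id, tax_year):
    return f"client:{client_id}:{tax_year}:summary"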
Cost Savings: 20-30% reduction in duplicate queries.
6. Quality Metrics and Ground Truth¶
6.1 Ground Truth Sources¶
| Metric | Ground Truth Source | Feedback Loop |
|---|---|---|
| Extraction accuracy | Staff corrections in PendingReview | Retrain extraction prompts |
| Classification accuracy | Staff manual classifications | Update classification rules |
| Analysis quality | Reviewer corrections (RevisionsNeeded) | Refine analysis prompts |
| Q&A quality | Preparer ratings (thumbs up/down) | Prompt optimization |
| Validation effectiveness | IRS rejections | Improve pre-file checks |
| Overall accuracy | IRS notices (CP2000, etc.) | Long-term quality signal |
6.2 Metric Definitions¶
| Metric | Calculation | Target |
|---|---|---|
| Extraction Accuracy | (Auto-verified + Spot-check passed) / Total extractions | >95% |
| Classification Accuracy | Auto-classified / (Auto + Manual classified) | >98% |
| First-Pass Approval Rate | Approved / (Approved + RevisionsNeeded) | >90% |
| E-file Acceptance Rate | Accepted / (Accepted + Rejected) | >98% |
| AI-Assisted Time Savings | (Manual time - AI-assisted time) / Manual time | >40% |
| Cost per Return | Total AI costs / Returns processed | <$0.15 |
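The same definitions expressed as simple calculations over workflow counts; the counts themselves would come from the workflow tables in Aurora, and the function names are illustrative:

def extraction_accuracy(auto_verified, spot_check_passed, total_extractions):
    return (auto_verified + spot_check_passed) / total_extractions  # target >95%

def first_pass_approval_rate(approved, revisions_needed):
    return approved / (approved + revisions_needed)  # target >90%

def cost_per_return(total_ai_cost, returns_processed):
    return total_ai_cost / returns_processed  # target <$0.15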
6.3 Quality Feedback Loops¶
┌─────────────────────────────────────────────────────────────────┐
│ EXTRACTION FEEDBACK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ AI Extraction → Staff Review → Correction │
│ │ │ │
│ │ └──→ Log correction type │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ Weekly analysis: │ │
│ │ │ - Common errors │ │
│ │ │ - Doc type fails │ │
│ │ │ - Field failures │ │
│ │ └──────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ Prompt/threshold tuning │
│ │ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ LONG-TERM QUALITY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ IRS Notices (CP2000, etc.) │
│ │ │
│ ▼ │
│ Link to original return → Identify AI touchpoint involved │
│ │ │
│ ▼ │
│ Root cause analysis → Systemic fix │
│ │
│ *** This is the ultimate ground truth *** │
│ *** But has 6-18 month lag *** │
│ │
└─────────────────────────────────────────────────────────────────┘
7. Usage Caps and Reporting¶
7.1 Opus Usage Caps¶
Decision: Implement per-firm monthly caps with transparency.
| Tier | Opus Interactive | Opus Batch | Rationale |
|---|---|---|---|
| Base | 0 (admin only) | 50/month | Encourage batch |
| Standard | 10/month | 100/month | Most firms |
| Premium | 25/month | 250/month | High-volume firms |
7.2 Cap Granularity¶
| Dimension | Tracking | Reporting |
|---|---|---|
| Per-firm monthly | Hard cap | Dashboard + email at 80% |
| Per-preparer | Soft tracking | Manager visibility |
| Per-return | Logging only | Audit trail |
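A sketch of the enforcement point for an Opus request, assuming hypothetical `usage` and `notify_admin` helpers; only the per-firm monthly count is a hard cap, per the tables above:

TIER_CAPS = {  # from Section 7.1: (interactive, batch) Opus requests per month
    "base": (0, 50),
    "standard": (10, 100),
    "premium": (25, 250),
}
ALERT_FRACTION = 0.80  # email the firm admin at 80% of cap

def check_opus_request(firm, mode, usage, notify_admin):
    interactive_cap, batch_cap = TIER_CAPS[firm.tier]
    cap = interactive_cap if mode == "interactive" else batch_cap
    used = usage.opus_requests_this_month(firm.id, mode)
    if used >= cap:
        return "denied"  # hard cap reached; resets at the start of the month
    if (used + 1) >= ALERT_FRACTION * cap:
        notify_admin(firm, used + 1, cap)
    return "allowed"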
7.3 Usage Visibility¶
Staff Dashboard:
┌─────────────────────────────────────────────────────────────────┐
│ AI Usage This Month │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Expert Research Requests: 38 of 50 remaining                    │
│ ███████████████░░░░░ 76% available                              │
│ │
│ Resets: January 1, 2026 │
│ │
└─────────────────────────────────────────────────────────────────┘
Admin Dashboard:
┌─────────────────────────────────────────────────────────────────┐
│ AI Costs - December 2025 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Total Cost: $127.45 │
│ Returns Processed: 89 │
│ Cost per Return: $1.43 │
│ │
│ By Model: │
│ - Haiku: $18.20 (62% of calls) │
│ - Sonnet: $84.50 (33% of calls) │
│ - Opus: $24.75 (5% of calls) │
│ │
│ By Touchpoint: │
│ - Extraction: $45.00 │
│ - Analysis: $52.00 │
│ - Q&A: $30.45 │
│ │
│ Top Opus Users: │
│ - Jane Smith: 8 requests │
│ - John Doe: 4 requests │
│ │
└─────────────────────────────────────────────────────────────────┘
8. Implementation Phases¶
8.1 Phase 1: Foundation (Pre-Tax Season)¶
| Item | Description | Priority |
|---|---|---|
| Model routing layer | Implement touchpoint → model mapping | P0 |
| Confidence scoring | Output confidence with every AI call | P0 |
| Basic escalation | Haiku → Sonnet escalation on low confidence | P0 |
| Logging infrastructure | Log all AI calls with timing, cost, confidence | P0 |
| Human review queues | PendingReview, ManualEntry workflows | P0 |
8.2 Phase 2: Optimization (During Tax Season)¶
| Item | Description | Priority |
|---|---|---|
| FAQ caching | Cache common questions | P1 |
| Batch API integration | Overnight processing for non-urgent | P1 |
| Threshold tuning | Adjust confidence thresholds based on data | P1 |
| Usage dashboards | Staff and admin visibility | P1 |
8.3 Phase 3: Enhancement (Post-Tax Season)¶
| Item | Description | Priority |
|---|---|---|
| Trained routing model | ML model for optimal model selection | P2 |
| A/B testing framework | Compare single vs chained approaches | P2 |
| Long-term quality tracking | IRS notice correlation | P2 |
| Context window optimization | Smarter summarization | P2 |
Appendix A: Model Selection Decision Tree¶
START: New AI Request
│
├── Is this document extraction?
│ │
│ ├── W-2/1099 (standard form)? → HAIKU
│ │ └── Confidence <90%? → SONNET
│ │
│ ├── K-1? → SONNET
│ │ └── Confidence <85%? → OPUS
│ │
│ └── Other? → SONNET
│
├── Is this preliminary analysis?
│ │
│ ├── ≤2 schedules AND no business/investment? → HAIKU
│ │ └── Anomaly detected? → SONNET
│ │
│ └── >2 schedules OR business/investment? → SONNET
│ └── Complex entity (trust, partnership)? → OPUS
│
├── Is this interactive Q&A?
│ │
│ ├── Matches FAQ pattern? → HAIKU
│ │
│ ├── Standard question? → SONNET
│ │
│ ├── "Request Expert Research"? → OPUS (batch)
│ │
│ └── Admin escalation? → OPUS (interactive)
│
└── Is this validation?
│
├── Pre-review check? → SONNET
│
└── Final QA (summarized data)? → OPUS (batch)
Appendix B: Open Questions for Future Resolution¶
| Question | Context | Proposed Resolution |
|---|---|---|
| Optimal Opus validation frequency | Every return vs sampling? | Start with every return, measure value |
| Multi-model chaining ROI | Does Haiku → Sonnet save money? | Collect V1 data, analyze post-season |
| Preparer training on model selection | Do they need to understand? | Hide complexity, surface "Expert Research" only |
| State-specific model adjustments | CA vs FL complexity? | Single national model for V1 |
Document History¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.3 | December 25, 2025 | Claude | Fixed Section 1.3: Corrected token estimate from 2,000 to ~126,500/return, fixed annual cost calculations to align with COST_DETAIL.md ($0.33/return). |
| 1.2 | December 25, 2025 | Claude | Document restructure: Absorbed batch API strategy (5.1) and expanded metadata caching (5.2) from COST_DETAIL.md. Updated section numbering. |
| 1.1 | December 25, 2025 | Claude | Reconciliation pass: Cross-referenced with ARCHITECTURE.md, PROCESS_FLOWS.md, COST_DETAIL.md, requirements. Added PendingFinalQA to PROCESS_FLOWS.md. Added Related Documentation section. |
| 1.0 | December 25, 2025 | Claude | Initial draft incorporating client decisions |
This document should be reviewed and approved before implementation.