
AI Model Delegation Strategy

Version: 1.3 | Date: December 25, 2025 | Status: Draft - Pending Final Review


Related Documentation

| Document | Contents |
|----------|----------|
| COST_DETAIL.md | Full cost breakdown, token usage, SaaS pricing model |
| PROCESS_FLOWS.md | Workflow states including PendingFinalQA |
| ARCHITECTURE.md | System architecture, service centralization |
| CLIENT_FOLLOWUP.md | FUT-002: Tiered AI Validation Strategy discussion |

Executive Summary

This document defines the AI model delegation strategy for the Tax Practice AI system. It specifies which Claude model (Haiku, Sonnet, or Opus) handles each AI touchpoint, the routing rules between models, and the human-in-the-loop checkpoints required by IRS Circular 230 and firm quality standards.

AI Cost Target (SaaS provider COGS): $0.12/return (vs $0.40 baseline) through strategic model delegation and optimization. This is our internal cost per return, not client pricing.

Model Distribution Target: Haiku 60% | Sonnet 35% | Opus 5%


Table of Contents

  1. Model Capabilities and Pricing
     1.1 Model Comparison
     1.2 Batch API Discount
     1.3 Cost Model at Scale
  2. AI Touchpoints and Model Assignments
     2.1 Document Processing Touchpoints
     2.2 Preliminary Analysis Touchpoints
     2.3 Interactive Q&A Touchpoints
     2.4 Review and Validation Touchpoints
  3. Human-in-the-Loop Checkpoints
     3.1 Regulatory Requirements
     3.2 Document Processing Checkpoints
     3.3 Workflow Checkpoints
     3.4 Process Flow Diagram
  4. Resolved Ambiguous Scenarios
     4.1 Preliminary Analysis Threshold
     4.2 Interactive Q&A Model Selection
     4.3 Extraction Confidence Threshold
     4.4 Opus Usage: Interactive vs Batch
     4.5 Multi-Model Pipelines
  5. Additional Scenarios
     5.1 Batch API Optimization
     5.2 Context Window Management
     5.3 Retry and Fallback Logic
     5.4 Caching Strategy
  6. Quality Metrics and Ground Truth
     6.1 Ground Truth Sources
     6.2 Metric Definitions
     6.3 Quality Feedback Loops
  7. Usage Caps and Reporting
     7.1 Opus Usage Caps
     7.2 Cap Granularity
     7.3 Usage Visibility
  8. Implementation Phases
     8.1 Phase 1: Foundation
     8.2 Phase 2: Optimization
     8.3 Phase 3: Enhancement

1. Model Capabilities and Pricing

1.1 Model Comparison

| Model | Strengths | Weaknesses | Cost (per 1M tokens) |
|-------|-----------|------------|----------------------|
| Haiku 3 | Fast, cheap, structured tasks | Limited reasoning | In: $0.25, Out: $1.25 |
| Sonnet 4 | Balanced, strong extraction | Slower than Haiku | In: $3.00, Out: $15.00 |
| Opus 4.5 | Best reasoning, nuanced judgment | Slowest, highest cost | In: $5.00, Out: $25.00 |

Pricing as of Dec 2025 via AWS Bedrock. See AWS Bedrock Pricing.

1.2 Batch API Discount

  • 50% cost reduction for batch processing (non-real-time)
  • Target: 80% of non-interactive workloads through batch API
  • Processing window: overnight via Airflow scheduling

1.3 Cost Model at Scale

| Returns/Year | AI Cost/Return | Annual AI Cost |
|--------------|----------------|----------------|
| 1,000 | $0.33 | ~$330/year |
| 5,000 | $0.33 | ~$1,650/year |
| 10,000 | $0.33 | ~$3,300/year |

Based on ~126,500 tokens/return lifecycle with 60/35/5 model delegation. See COST_DETAIL.md for token breakdown.
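
For reference, a minimal sketch of how a blended per-return figure is derived. The per-model token shares and the 80/20 input/output split below are placeholder assumptions, not the COST_DETAIL.md breakdown:

PRICING = {  # USD per 1M tokens (input, output); Dec 2025 Bedrock list prices from 1.1
    "haiku":  (0.25, 1.25),
    "sonnet": (3.00, 15.00),
    "opus":   (5.00, 25.00),
}

def blended_cost_per_return(total_tokens=126_500, input_ratio=0.80,
                            model_share=(("haiku", 0.60), ("sonnet", 0.35), ("opus", 0.05))):
    cost = 0.0
    for model, share in model_share:
        tokens = total_tokens * share
        in_price, out_price = PRICING[model]
        cost += tokens * input_ratio * in_price / 1_000_000
        cost += tokens * (1 - input_ratio) * out_price / 1_000_000
    return cost

# With these placeholder assumptions the on-demand figure lands near $0.33/return;
# the 50% batch discount (Section 1.2) applies on top for batch-eligible work.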


2. AI Touchpoints and Model Assignments

2.1 Document Processing Touchpoints

| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|----|------------|---------------|------------|----------------|
| TP-001 | Document Classification | Haiku | None | Yes |
| TP-002 | W-2 Extraction | Haiku | Sonnet if confidence <90% | Yes |
| TP-003 | 1099 Extraction | Haiku | Sonnet if confidence <90% | Yes |
| TP-004 | K-1 Extraction | Sonnet | Opus if confidence <85% | Yes |
| TP-005 | Brokerage Statement Extraction | Sonnet | None | Yes |
| TP-006 | Other Document Extraction | Sonnet | None | Yes |

Rationale:

  • Standard forms (W-2, 1099) are highly structured → Haiku sufficient
  • K-1s have variable formats and complex pass-through data → Sonnet default, Opus for low confidence
  • Brokerage statements require table parsing and multi-page handling → Sonnet
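
The table above can be carried in code as a simple routing map. A minimal sketch (dictionary layout and function names are illustrative; thresholds mirror the table):

# Default model and escalation rule per document-processing touchpoint (TP-001..TP-006).
DOC_PROCESSING_ROUTES = {
    "TP-001": {"model": "haiku",  "escalate_to": None,     "threshold": None},  # classification
    "TP-002": {"model": "haiku",  "escalate_to": "sonnet", "threshold": 0.90},  # W-2
    "TP-003": {"model": "haiku",  "escalate_to": "sonnet", "threshold": 0.90},  # 1099
    "TP-004": {"model": "sonnet", "escalate_to": "opus",   "threshold": 0.85},  # K-1
    "TP-005": {"model": "sonnet", "escalate_to": None,     "threshold": None},  # brokerage
    "TP-006": {"model": "sonnet", "escalate_to": None,     "threshold": None},  # other
}

def route_extraction(touchpoint, confidence=None):
    route = DOC_PROCESSING_ROUTES[touchpoint]
    if route["escalate_to"] and confidence is not None and confidence < route["threshold"]:
        return route["escalate_to"]
    return route["model"]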

Image Quality Escalation:

Document source affects model selection independent of document type:

| Source | Default Adjustment | Rationale |
|--------|--------------------|-----------|
| Native PDF | Use table above | Embedded text, reliable |
| High-quality scan | Use table above | Clean OCR |
| Phone photo | +1 model tier | Skew, lighting, blur |
| Low-res/blurry | +1 model tier | OCR uncertainty |
| HEIC (iPhone) | Convert first, then assess | Format handling |
| Handwritten content | → Human review | Too unreliable for AI |

Implementation:

def adjust_for_image_quality(base_model, doc_metadata):
    if doc_metadata.is_native_pdf:
        return base_model
    if doc_metadata.source == "phone_photo":
        return escalate(base_model)  # haiku→sonnet, sonnet→opus
    if doc_metadata.resolution < 150 or doc_metadata.blur_score > 0.3:
        return escalate(base_model)
    if doc_metadata.has_handwriting:
        return "human_review"
    return base_model

2.2 Preliminary Analysis Touchpoints

| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|----|------------|---------------|------------|----------------|
| TP-007 | Simple Return Analysis (≤2 schedules, no business) | Haiku | Sonnet on anomaly | Yes |
| TP-008 | Standard Return Analysis (>2 schedules OR business/investment) | Sonnet | Opus for complex entities | Yes |
| TP-009 | Prior Year Comparison | Haiku | Sonnet if significant changes | Yes |
| TP-010 | Missing Document Detection | Haiku | None | Yes |
| TP-011 | Client Question Generation | Haiku | Sonnet for complex situations | Yes |

Decision (from client): Schedule type matters as much as count. Haiku for ≤2 schedules AND no business income; Sonnet for >2 schedules OR any business/investment complexity.

TP-007/TP-009 Anomaly Thresholds (Haiku → Sonnet):

Note: We add value on top of tax software variance tools, not replace them. These thresholds trigger deeper AI analysis, not duplicate existing reports.

| Anomaly Type | Threshold | Rationale |
|--------------|-----------|-----------|
| Income change YoY | >25% | Matches IRS DIF flags |
| Deductions change | >20% or >$5K | IRS audit trigger |
| New schedule | Any not in PY | Life event likely |
| Missing PY source | W-2/1099 gone | Job loss, needs verification |
| Charitable giving | >10% of AGI | IRS scrutiny area |
| Schedule C loss | Any | Hobby loss rules |
| Home office | New claim | Documentation needed |
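
A sketch of how the thresholds above could trigger the Haiku → Sonnet escalation; the field names on the return objects are hypothetical:

def needs_sonnet_analysis(curr, prior):
    """Return True when any TP-007/TP-009 anomaly threshold is crossed."""
    checks = [
        abs(curr.total_income - prior.total_income) / max(prior.total_income, 1) > 0.25,
        (abs(curr.deductions - prior.deductions) / max(prior.deductions, 1) > 0.20
         or abs(curr.deductions - prior.deductions) > 5_000),
        bool(set(curr.schedules) - set(prior.schedules)),            # new schedule this year
        bool(set(prior.income_sources) - set(curr.income_sources)),  # PY W-2/1099 missing
        curr.charitable > 0.10 * curr.agi,
        curr.schedule_c_loss,                                        # any Schedule C loss
        curr.home_office and not prior.home_office,                  # new home office claim
    ]
    return any(checks)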

TP-011 Question Complexity Routing:

Haiku generates (simple data collection):

  • Missing document requests
  • Address/contact confirmations
  • "Did you receive [expected doc]?"

Sonnet generates (judgment required):

  • Life events (marriage, divorce, new child, death)
  • Business vs hobby determination
  • Multi-state residency questions
  • Unusual income sources
  • "Help us understand..." questions

2.3 Interactive Q&A Touchpoints

| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|----|------------|---------------|------------|----------------|
| TP-012 | FAQ-Pattern Matches | Haiku | None | No |
| TP-013 | Standard Q&A (staff) | Sonnet | Opus via "Request Expert Research" | No (interactive) |
| TP-014 | Complex Research Request | Opus | None | Yes (preferred) |
| TP-015 | Batch Overnight Research | Sonnet (batch) | None | Yes |

Decision (from client): Hybrid UX approach:

  • Default to Sonnet for interactive Q&A
  • Auto-downgrade to Haiku for FAQ-pattern matches
  • "Request Expert Research" link for Opus (batch, positioned as more thorough)
  • Admin-only "Escalate to Expert" for Opus interactive

2.4 Review and Validation Touchpoints

| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|----|------------|---------------|------------|----------------|
| TP-016 | Preparer Summary Generation | Haiku | None | Yes |
| TP-017 | Pre-Review Validation | Sonnet | Opus if issues found | Yes |
| TP-018 | Final Validation (Opus QA) | Opus (metadata only, batch) | None | Yes |
| TP-019 | E-file Rejection Analysis | Haiku | Sonnet for complex rejections | Yes |

Validation Strategy (see FUT-002 for discussion):

  1. Tiered self-validation - each model QAs its own work inline
  2. TP-017 - Sonnet pre-review escalates to Opus if issues found
  3. TP-018 - Opus reviews cached metadata only (not full docs), runs overnight batch
  4. Workflow - adds "Pending Final QA" status before client delivery

TP-018 Opus QA Detail:

Workflow integration (see PROCESS_FLOWS.md):

Approved → PendingFinalQA → (overnight batch)
    ┌───────────┴───────────┐
    ↓                       ↓
No issues                Issues found
    ↓                       ↓
PendingSignature       RevisionsNeeded

What Opus reviews (metadata only, ~750 tokens):

  • Key figures summary (income, deductions, refund/due)
  • Flags from earlier analysis
  • Prior year comparison highlights
  • Confidence scores from extraction
  • Any anomalies detected

What Opus does NOT review:

  • Full document images
  • Raw extraction output
  • Complete prior year data

UX framing: "While reviewing our work, we noticed..."

Expedited QA Option (fee-based):

| Mode | Wait Time | Our Cost | Client Fee |
|------|-----------|----------|------------|
| Standard (batch) | 12-24 hrs | ~$0.005 | Included |
| Expedited (real-time) | ~2 min | ~$0.03 | $5-10/return |

Use cases for expedited:

  • Filing deadline day
  • Rush returns (already paying rush fee)
  • Client-requested priority

Implementation:

  • Button: "Expedite Final Review (+$X)"
  • Skips PendingFinalQA queue
  • Runs Opus immediately
  • Logs for billing

Cost comparison at 1K returns/year:

| Approach | Cost |
|----------|------|
| Full-context Opus every return | ~$30/year |
| Summarized data Opus every return | ~$15/year |
| Metadata-only Opus batch | ~$5/year |
| Combined tiered approach | ~$8-10/year |
| Expedited (10% of returns) | +$3/year |

E-file Rejection Routing (TP-019):

Haiku handles simple/technical fixes:

| Code | Issue | Resolution |
|------|-------|------------|
| IND-181 | IP PIN missing | Prompt for IP PIN |
| Format errors | Schema validation | Auto-correct format |
| Math errors | Calculation mismatch | Recalculate |
| Missing fields | Required field blank | Prompt for data |

Sonnet escalation (investigation needed):

| Code | Issue | Why Complex |
|------|-------|-------------|
| IND-031/032 | AGI mismatch | Amended return, paper filed, IRS lag |
| R0000-500/503 | SSN/name mismatch | Typo vs legal name change |
| IND-524 | DOB mismatch | Data entry vs SSA records |
| F8962-070 | Marketplace insurance | ACA reconciliation needed |

Human escalation (fraud/conflict indicators):

| Code | Issue | Action |
|------|-------|--------|
| IND-507 | Dependent already claimed | Family dispute or ID theft |
| IND-516 | SSN claimed elsewhere | Possible fraud |
| R0000-902 | Duplicate return filed | Identity theft concern |

def route_rejection(error_code):
    # Human escalation - fraud/conflict
    if error_code in ['IND-507', 'IND-516', 'R0000-902']:
        return "human_review"

    # Sonnet - investigation needed
    if error_code.startswith(('IND-031', 'IND-032',
                              'R0000-500', 'R0000-503',
                              'IND-524', 'F8962')):
        return "sonnet"

    # Haiku - simple/technical fixes
    return "haiku"

3. Human-in-the-Loop Checkpoints

3.1 Regulatory Requirements (Circular 230)

| Checkpoint | Requirement | Implementation |
|------------|-------------|----------------|
| Preparer Review | All returns must be reviewed by PTIN holder | AI Analysis → InPrep transition always requires human |
| EA/CPA Final Approval | Licensed professional sign-off | InReview → Approved requires EA/CPA action |
| Due Diligence Documentation | Document basis for positions | AI logs all analysis; preparer confirms |

3.2 Document Processing Checkpoints

| State Transition | Human Action | Trigger |
|------------------|--------------|---------|
| UnknownType → Classified | Staff manually classifies | AI cannot determine document type |
| MediumConfidence → Verified | Staff reviews extraction | Confidence 70-94% |
| LowConfidence → ManualEntry | Staff enters data manually | Confidence <70% |
| HighConfidence → Verified | Staff spot-checks | Non-native PDF with 95%+ confidence |

Confidence Thresholds (refined per client input):

| Document Type | Auto-Verify | Review | Manual Entry |
|---------------|-------------|--------|--------------|
| W-2 | 95% | 85-94% | <85% |
| 1099 series | 95% | 85-94% | <85% |
| K-1 | 98% | 90-97% | <90% |
| Dollar amounts | 98% | 90-97% | <90% |
| Name/address | 90% | 80-89% | <80% |

Confidence Calculation Method:

Confidence is calculated from multiple sources:

  1. OCR engine - SurePrep/Textract character-level scores
  2. AI extraction - Claude outputs confidence per field
  3. Format validation - SSN format, EIN format, checksums
  4. Cross-source agreement - OCR vs AI match
def calculate_field_confidence(field, ocr_result, ai_result):
    # Start with OCR confidence (0.0-1.0)
    base = ocr_result.confidence

    # Boost if AI extraction agrees
    if ocr_result.value == ai_result.value:
        base = min(base + 0.10, 1.0)
    else:
        base = base * 0.7  # Disagreement penalty

    # Boost if format validates
    if passes_format_check(field.type, ocr_result.value):
        base = min(base + 0.05, 1.0)

    # Penalty for known-difficult fields
    if field.type in ['k1_box', 'handwritten']:
        base = base * 0.9

    return base

Document-level confidence = minimum of all field confidences (weakest link).
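
A sketch tying the weakest-link rule to the thresholds in the table above; the threshold map and routing labels are illustrative (per-field overrides for dollar amounts and name/address fields omitted for brevity):

THRESHOLDS = {       # (auto_verify, review_floor) per document type, from the table above
    "w2":   (0.95, 0.85),
    "1099": (0.95, 0.85),
    "k1":   (0.98, 0.90),
}

def route_document(doc_type, field_confidences):
    doc_conf = min(field_confidences)                # weakest link
    auto, review_floor = THRESHOLDS.get(doc_type, (0.95, 0.85))
    if doc_conf >= auto:
        return "auto_verify"                         # HighConfidence → staff spot-check only
    if doc_conf >= review_floor:
        return "pending_review"                      # MediumConfidence → staff reviews extraction
    return "manual_entry"                            # LowConfidence → staff enters data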

3.3 Workflow Checkpoints

| Workflow State | Human Actor | Decision Point |
|----------------|-------------|----------------|
| AIAnalysis → InPrep | Preparer | Review AI output, apply judgment |
| AIAnalysis → NeedsReview | Staff | Exception requires human decision |
| NeedsReview → InPrep | Staff | Resolve exception before continuing |
| InPrep → ReadyForReview | Preparer | Confirm preparation complete |
| InReview → Approved | Reviewer (EA/CPA) | Final quality gate |
| InReview → RevisionsNeeded | Reviewer | Issues found, return to preparer |
| Rejected → InPrep | Preparer | Fix e-file rejection |
| FraudReview → EACPAReview | EA/CPA | Escalation for duplicate/fraud |

3.4 Process Flow Diagram: AI to Human Handoffs

┌─────────────────────────────────────────────────────────────────┐
│                    DOCUMENT PROCESSING                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Upload → [Haiku: Classify] → Classification                   │
│                │                                                │
│                ├── Success → [Haiku/Sonnet: Extract]            │
│                │                    │                           │
│                │                    ├── High (95%+) → Auto ✓    │
│                │                    ├── Med (70-94%) → HUMAN    │
│                │                    └── Low (<70%) → HUMAN      │
│                │                                                │
│                └── Fail → HUMAN (manual classify)               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    PRELIMINARY ANALYSIS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  All Docs Ready → [Haiku/Sonnet: Analyze]                       │
│                          │                                      │
│                          ├── Normal → Summary → PREPARER        │
│                          │                                      │
│                          └── Exception → NeedsReview → STAFF    │
│                                                                 │
│  *** AI NEVER proceeds to filing without human review ***       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    REVIEW & APPROVAL                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PREPARER completes → ReadyForReview → EA/CPA REVIEWER          │
│                                              │                  │
│                                              ├── Approved ✓     │
│                                              │                  │
│                                              └── Revisions →    │
│                                                   back to       │
│                                                   PREPARER      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

4. Resolved Ambiguous Scenarios

4.1 Preliminary Analysis Threshold

Question: When does a return qualify for Haiku (cheap) vs Sonnet (standard) preliminary analysis?

Resolution:

  • Haiku: ≤2 schedules AND no business income AND no investment complexity
  • Sonnet: >2 schedules OR any business/investment income OR K-1 present

Implementation:

def select_preliminary_model(return_data):
    schedule_count = len(return_data.schedules)
    has_business = return_data.has_schedule_c or return_data.has_schedule_e
    has_investment = return_data.has_schedule_d or return_data.has_k1

    if schedule_count <= 2 and not has_business and not has_investment:
        return "haiku"
    return "sonnet"

Assumption: Thresholds based on industry norms; validate with client after first tax season.

4.2 Interactive Q&A Model Selection

Question: How do we route staff Q&A to the right model without training them to always click "Expert"?

Resolution: Hybrid approach with intelligent defaults:

| Scenario | Model | UX Element |
|----------|-------|------------|
| FAQ pattern match | Haiku | Instant response, no indicator |
| Standard question | Sonnet | Default, "Get Answer Now" |
| Complex research | Opus (batch) | "Request Expert Research" button |
| Urgent escalation | Opus (interactive) | Admin-only "Escalate to Expert" |

UX Positioning:

  • "Request Expert Research" framed as more thorough, not cheaper
  • Results delivered next business day
  • Usage cap per firm with visibility reporting

FAQ Pattern Detection:

FAQ patterns = predictable questions Haiku can answer instantly.

Examples (route to Haiku):

  • "What's the standard deduction for 2024?"
  • "When is the filing deadline?"
  • "What's the mileage rate?"
  • "What documents do I need for a W-2?"

Not FAQ (route to Sonnet):

  • "Should this client itemize?"
  • "Is this income taxable?"
  • "How do I handle this K-1?"

Detection: Embedding similarity to curated FAQ database.
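
A minimal sketch of the pattern match, assuming the FAQ set has been embedded ahead of time; embed() and the 0.85 cutoff are placeholders:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_faq(question, faq_embeddings, embed, cutoff=0.85):
    q = embed(question)
    return max(cosine(q, f) for f in faq_embeddings) >= cutoff

# Routing: is_faq(...) → Haiku (often answerable from the FAQ cache); otherwise Sonnet.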

Tax Knowledge Skills Architecture:

Skills = curated knowledge + prompt templates, versioned by tax year.

skills/
├── federal/
│   ├── individual_2024.md
│   └── business_2024.md
├── states/
│   ├── FL_2024.md
│   ├── GA_2024.md
│   └── ... (design for 50)
└── common/
    ├── deadlines_2024.md
    ├── forms.md
    └── rates_2024.md

Skills rebuilt annually when tax code updates (public/predictable).
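
A sketch of assembling the relevant skill files for a prompt; the directory layout mirrors the tree above, and the selection logic is illustrative:

from pathlib import Path

def load_skills(tax_year, states, has_business, root="skills"):
    files = [Path(root, "federal", f"individual_{tax_year}.md"),
             Path(root, "common", f"deadlines_{tax_year}.md"),
             Path(root, "common", f"rates_{tax_year}.md"),
             Path(root, "common", "forms.md")]
    if has_business:
        files.append(Path(root, "federal", f"business_{tax_year}.md"))
    files += [Path(root, "states", f"{state}_{tax_year}.md") for state in states]
    return "\n\n".join(p.read_text() for p in files if p.exists())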

4.3 Extraction Confidence Threshold

Question: What confidence thresholds trigger human review?

Resolution: Document-type and field-level thresholds:

| Factor | Auto-Verify | Review | Manual Entry |
|--------|-------------|--------|--------------|
| W-2, 1099 (standard) | ≥95% | 85-94% | <85% |
| K-1 (complex) | ≥98% | 90-97% | <90% |
| Dollar amounts | ≥98% | 90-97% | <90% |
| Name/address | ≥90% | 80-89% | <80% |

Rationale: Dollar amounts have higher materiality; K-1s have more variable formats and require stricter thresholds.

4.4 Opus Usage: Interactive vs Batch

Question: When should Opus be available interactively vs forced to batch?

Resolution:

| Mode | Availability | Use Case |
|------|--------------|----------|
| Batch (default) | All staff | "Request Expert Research" → overnight |
| Interactive | Admin only | "Escalate to Expert" → immediate |

UX Copy:

  • Batch: "Submit to Expert" → "Your research request has been queued. Results will be ready by 8 AM tomorrow."
  • Interactive: (admin only) "Escalate to Expert" → immediate response

Reporting: Usage caps with per-firm visibility (see Section 7).

4.5 Multi-Model Pipelines

Question: Should we chain models (Haiku → Sonnet → Opus) or use single model per task?

Resolution: Single model per task for V1, with architecture supporting future chaining.

V1 Implementation:

  • Each AI touchpoint uses one model based on routing rules
  • Output includes confidence score
  • Routing logic is a separate layer (not embedded in prompts)
  • All interactions logged for future routing model training

Future Enhancement:

  • Train routing model on logged data
  • Implement Haiku-first with auto-escalation
  • A/B test single vs chained approaches


5. Additional Scenarios

5.1 Batch API Optimization

Not all AI tasks require real-time response. Batch processing via Anthropic's batch API is 50% cheaper.

Task Classification:

| Task | Interactive? | Batch Eligible? | Notes |
|------|--------------|-----------------|-------|
| Document classification | Partial | Yes | Batch overnight for folder uploads |
| Data extraction | Partial | Yes | Batch overnight for large uploads |
| Prior year comparison | Optional | Yes | Run nightly for next-day review |
| Missing doc detection | Optional | Yes | Run nightly after new docs processed |
| Preparer Q&A | Yes | No | Real-time conversation required |
| Worksheet generation | Optional | Yes | Pre-generate overnight |
| Rejection analysis | No | Yes | Run when rejection received |
| Final QA (TP-018) | No | Yes | Overnight batch |
| Tax reminders | No | Yes | Scheduled batch job |

Key Insight: Only Preparer Q&A truly requires real-time response.

UX Strategy: Default to Batch

When documents are uploaded:

"Thanks! I'll analyze these documents and have results ready for your review tomorrow morning. Need to discuss now? [Start Live Session]"

Benefits:

  • Default path is cheapest (batch)
  • User opts into real-time only when needed
  • Sets expectation that AI "works overnight"
  • Live sessions can be a premium feature

Implementation: Airflow schedules batch jobs (see ARCHITECTURE.md and the DAG sketch below):

  • Queue document processing overnight after SmartVault sync
  • Pre-generate worksheets for morning review
  • Run analysis batch jobs during off-peak hours
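
A minimal Airflow DAG sketch for that overnight window; the dag_id, task names, and queue_* callables are hypothetical:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def queue_extraction_batch(**_):
    """Collect documents synced since the last run and submit them via the batch API."""
    ...

def pregenerate_worksheets(**_):
    """Build preparer worksheets so they are ready for morning review."""
    ...

with DAG(
    dag_id="nightly_ai_batch",          # hypothetical name
    schedule_interval="0 2 * * *",      # off-peak window
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="queue_extraction_batch",
                             python_callable=queue_extraction_batch)
    worksheets = PythonOperator(task_id="pregenerate_worksheets",
                                python_callable=pregenerate_worksheets)
    extract >> worksheets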

For cost savings projections, see COST_DETAIL.md.

5.2 Context Window Management

Scenario: Complex returns with 50+ documents may exceed context limits.

Strategy:

| Approach | When to Use | Implementation |
|----------|-------------|----------------|
| Metadata Caching | All returns | Extract once, cache as MD (<350 lines) |
| Document Summarization | Large doc sets | Haiku summarizes each doc first |
| Chunked Processing | Very large returns | Process in batches, aggregate results |
| Per-Return Summary | Every return | Maintain running summary file |

Token Budget per Return:

  • Target: 15,000 tokens average (fits Haiku/Sonnet comfortably)
  • Maximum: 50,000 tokens (large returns with summarization)
  • Opus validation: 5,000 tokens (summarized data only)

Metadata Cache Structure:

/returns/{return_id}/
├── metadata.md           # Extracted data summary (<350 lines)
├── prior_year_summary.md # Prior year comparison
├── flags.md              # Anomalies and issues
└── documents/
    ├── w2_001_summary.md
    ├── 1099_001_summary.md
    └── ...

Per-Document Metadata Format:

File: docs/{client_id}/{return_year}/{doc_id}_metadata.md

# Document Metadata
- **Type:** W-2
- **Source:** employer_acme_corp.pdf
- **Uploaded:** 2024-02-15
- **Confidence:** 98%

## Extracted Values
| Field | Value | Box |
|-------|-------|-----|
| Employer | Acme Corp | c |
| EIN | 12-3456789 | b |
| Wages | $85,432.00 | 1 |
| Federal Withholding | $12,543.00 | 2 |

## AI Notes
- Wages increased 8% from prior year ($79,100)
- Withholding rate: 14.7% (typical)

## Flags
- None

Per-Return Summary Format:

File: docs/{client_id}/{return_year}/return_summary.md

# Return Summary: John Smith (2024)

## Documents Received (12 of 15 expected)
| Doc | Type | Status | Key Value |
|-----|------|--------|-----------|
| W-2 Acme Corp | W-2 | ✓ Complete | $85,432 wages |
| 1099-INT Chase | 1099-INT | ✓ Complete | $1,234 interest |

## Missing Documents
- 1098 Mortgage Interest (expected based on prior year)

## Prior Year Comparison
| Item | 2023 | 2024 | Change |
|------|------|------|--------|
| Total Income | $82,500 | $87,233 | +5.7% |

## Flags & Questions
1. Large charitable contribution ($5,000) - verify documentation

Implementation:

  1. Batch extract → Create metadata MD on first document scan
  2. Store in S3 → Alongside original document
  3. Index in Aurora → Track metadata file location
  4. AI reads cache first → Only re-scan if metadata missing or stale
  5. Refresh trigger → Re-extract if document updated or confidence < 90%
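
A sketch of steps 4-5 above (cache-first read with a staleness/confidence refresh); the cache helper and metadata fields are placeholders for the real S3/Aurora accessors:

def get_document_metadata(doc, cache, extract_with_ai):
    key = f"{doc.client_id}/{doc.return_year}/{doc.doc_id}_metadata.md"
    cached = cache.read(key)                               # parsed metadata or None
    if cached and cached.version == doc.version and cached.confidence >= 0.90:
        return cached                                      # cache hit: no AI call
    fresh = extract_with_ai(doc)                           # miss, stale, or low confidence
    cache.write(key, fresh)
    return fresh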

For cost savings projections, see COST_DETAIL.md.

5.3 Retry and Fallback Logic

Scenario: What happens when an AI call fails or returns low-confidence output?

Strategy:

| Failure Type | Retry? | Fallback | Human Escalation |
|--------------|--------|----------|------------------|
| API timeout | Yes, 3x with backoff | Same model | After 3 failures |
| Rate limit | Yes, with delay | Same model | Never (wait) |
| Low confidence (Haiku) | No | Escalate to Sonnet | After Sonnet |
| Low confidence (Sonnet) | No | Escalate to Opus | After Opus |
| Low confidence (Opus) | No | None | Immediate |
| Malformed output | Yes, 1x | Same model | After retry |

Backoff Schedule:

  • Retry 1: 2 seconds
  • Retry 2: 4 seconds
  • Retry 3: 8 seconds
  • Then: Queue for human review
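
A sketch of the retry wrapper implied by the schedule above; queue_for_human_review is a hypothetical handoff helper:

import time

def call_with_retry(invoke, queue_for_human_review, max_retries=3):
    for attempt in range(max_retries):
        try:
            return invoke()
        except TimeoutError:
            time.sleep(2 ** (attempt + 1))      # 2s, 4s, 8s
    return queue_for_human_review()             # terminal: human review queue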

Confidence Escalation Flow:

Haiku (conf < threshold)
    → Sonnet (conf < threshold)
        → Opus (conf < threshold)
            → HUMAN REVIEW (always terminal)

5.4 Caching Strategy

Scenario: Same questions asked across multiple clients (e.g., "What's the standard deduction for 2024?").

Strategy:

| Cache Type | Scope | TTL | Use Case |
|------------|-------|-----|----------|
| FAQ Cache | Global | Tax year | Standard deduction, filing deadlines |
| Regulation Cache | Global | Tax year | IRS rules, state requirements |
| Firm Guidelines Cache | Firm | Until updated | Firm-specific policies |
| Prior Year Cache | Per-client | Permanent | Client's historical data |
| Session Cache | Per-session | 1 hour | Avoid re-asking same question |

Cache Key Structure:

faq:{tax_year}:{normalized_question_hash}
reg:{tax_year}:{topic}:{jurisdiction}
firm:{firm_id}:{guideline_type}
client:{client_id}:{tax_year}:summary
session:{session_id}:{query_hash}

Cost Savings: 20-30% reduction in duplicate queries.
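
A sketch of building the FAQ key from a normalized question; the normalization rules and hash length are illustrative:

import hashlib

def faq_cache_key(question, tax_year):
    normalized = " ".join(question.lower().split())          # collapse whitespace, lowercase
    digest = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    return f"faq:{tax_year}:{digest}"

# faq_cache_key("What's the standard deduction for 2024?", 2024) → "faq:2024:<hash>"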


6. Quality Metrics and Ground Truth

6.1 Ground Truth Sources

| Metric | Ground Truth Source | Feedback Loop |
|--------|---------------------|---------------|
| Extraction accuracy | Staff corrections in PendingReview | Retrain extraction prompts |
| Classification accuracy | Staff manual classifications | Update classification rules |
| Analysis quality | Reviewer corrections (RevisionsNeeded) | Refine analysis prompts |
| Q&A quality | Preparer ratings (thumbs up/down) | Prompt optimization |
| Validation effectiveness | IRS rejections | Improve pre-file checks |
| Overall accuracy | IRS notices (CP2000, etc.) | Long-term quality signal |

6.2 Metric Definitions

| Metric | Calculation | Target |
|--------|-------------|--------|
| Extraction Accuracy | (Auto-verified + Spot-check passed) / Total extractions | >95% |
| Classification Accuracy | Auto-classified / (Auto + Manual classified) | >98% |
| First-Pass Approval Rate | Approved / (Approved + RevisionsNeeded) | >90% |
| E-file Acceptance Rate | Accepted / (Accepted + Rejected) | >98% |
| AI-Assisted Time Savings | (Manual time - AI-assisted time) / Manual time | >40% |
| Cost per Return | Total AI costs / Returns processed | <$0.15 |
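
A sketch of computing these metrics from workflow counts; the counter object and its field names are hypothetical:

def quality_metrics(c):
    return {
        "extraction_accuracy":     (c.auto_verified + c.spot_check_passed) / c.total_extractions,
        "classification_accuracy": c.auto_classified / (c.auto_classified + c.manually_classified),
        "first_pass_approval":     c.approved / (c.approved + c.revisions_needed),
        "efile_acceptance":        c.accepted / (c.accepted + c.rejected),
        "cost_per_return":         c.total_ai_cost / c.returns_processed,
    }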

6.3 Quality Feedback Loops

┌─────────────────────────────────────────────────────────────────┐
│                    EXTRACTION FEEDBACK                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  AI Extraction → Staff Review → Correction                      │
│                      │              │                           │
│                      │              └──→ Log correction type    │
│                      │                         │                │
│                      │                         ▼                │
│                      │              ┌──────────────────┐        │
│                      │              │ Weekly analysis: │        │
│                      │              │ - Common errors  │        │
│                      │              │ - Doc type fails │        │
│                      │              │ - Field failures │        │
│                      │              └──────────────────┘        │
│                      │                         │                │
│                      │                         ▼                │
│                      │              Prompt/threshold tuning     │
│                      │                                          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    LONG-TERM QUALITY                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  IRS Notices (CP2000, etc.)                                     │
│         │                                                       │
│         ▼                                                       │
│  Link to original return → Identify AI touchpoint involved      │
│         │                                                       │
│         ▼                                                       │
│  Root cause analysis → Systemic fix                             │
│                                                                 │
│  *** This is the ultimate ground truth ***                      │
│  *** But has 6-18 month lag ***                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

7. Usage Caps and Reporting

7.1 Opus Usage Caps

Decision: Implement per-firm monthly caps with transparency.

| Tier | Opus Interactive | Opus Batch | Rationale |
|------|------------------|------------|-----------|
| Base | 0 (admin only) | 50/month | Encourage batch |
| Standard | 10/month | 100/month | Most firms |
| Premium | 25/month | 250/month | High-volume firms |

7.2 Cap Granularity

| Dimension | Tracking | Reporting |
|-----------|----------|-----------|
| Per-firm monthly | Hard cap | Dashboard + email at 80% |
| Per-preparer | Soft tracking | Manager visibility |
| Per-return | Logging only | Audit trail |
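
A sketch of the cap check applied before any Opus call; the tier limits mirror 7.1, while usage_store and notify are placeholders for the real tracking and alerting services:

TIER_CAPS = {
    "base":     {"interactive": 0,  "batch": 50},
    "standard": {"interactive": 10, "batch": 100},
    "premium":  {"interactive": 25, "batch": 250},
}

def check_opus_cap(firm, mode, usage_store, notify):
    cap = TIER_CAPS[firm.tier][mode]
    used = usage_store.monthly_count(firm.id, model="opus", mode=mode)
    if used >= cap:
        return False                                       # hard cap: block and suggest batch
    if used + 1 >= 0.8 * cap:
        notify(firm.admin_email, f"Opus {mode} usage at 80% of monthly cap")
    return True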

7.3 Usage Visibility

Staff Dashboard:

┌─────────────────────────────────────────────────────────────────┐
│  AI Usage This Month                                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Expert Research Requests: 12 of 50 used (38 remaining)         │
│  ████████░░░░░░░░░░░░ 76% available                             │
│                                                                 │
│  Resets: January 1, 2026                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Admin Dashboard:

┌─────────────────────────────────────────────────────────────────┐
│  AI Costs - December 2025                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Total Cost: $127.45                                            │
│  Returns Processed: 89                                          │
│  Cost per Return: $1.43                                         │
│                                                                 │
│  By Model:                                                      │
│  - Haiku:  $18.20 (62% of calls)                                │
│  - Sonnet: $84.50 (33% of calls)                                │
│  - Opus:   $24.75 (5% of calls)                                 │
│                                                                 │
│  By Touchpoint:                                                 │
│  - Extraction:    $45.00                                        │
│  - Analysis:      $52.00                                        │
│  - Q&A:           $30.45                                        │
│                                                                 │
│  Top Opus Users:                                                │
│  - Jane Smith: 8 requests                                       │
│  - John Doe: 4 requests                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘


8. Implementation Phases

8.1 Phase 1: Foundation (Pre-Tax Season)

| Item | Description | Priority |
|------|-------------|----------|
| Model routing layer | Implement touchpoint → model mapping | P0 |
| Confidence scoring | Output confidence with every AI call | P0 |
| Basic escalation | Haiku → Sonnet escalation on low confidence | P0 |
| Logging infrastructure | Log all AI calls with timing, cost, confidence | P0 |
| Human review queues | PendingReview, ManualEntry workflows | P0 |

8.2 Phase 2: Optimization (During Tax Season)

| Item | Description | Priority |
|------|-------------|----------|
| FAQ caching | Cache common questions | P1 |
| Batch API integration | Overnight processing for non-urgent work | P1 |
| Threshold tuning | Adjust confidence thresholds based on data | P1 |
| Usage dashboards | Staff and admin visibility | P1 |

8.3 Phase 3: Enhancement (Post-Tax Season)

| Item | Description | Priority |
|------|-------------|----------|
| Trained routing model | ML model for optimal model selection | P2 |
| A/B testing framework | Compare single vs chained approaches | P2 |
| Long-term quality tracking | IRS notice correlation | P2 |
| Context window optimization | Smarter summarization | P2 |

Appendix A: Model Selection Decision Tree

START: New AI Request
    ├── Is this document extraction?
    │       │
    │       ├── W-2/1099 (standard form)? → HAIKU
    │       │       └── Confidence <90%? → SONNET
    │       │
    │       ├── K-1? → SONNET
    │       │       └── Confidence <85%? → OPUS
    │       │
    │       └── Other? → SONNET
    ├── Is this preliminary analysis?
    │       │
    │       ├── ≤2 schedules AND no business/investment? → HAIKU
    │       │       └── Anomaly detected? → SONNET
    │       │
    │       └── >2 schedules OR business/investment? → SONNET
    │               └── Complex entity (trust, partnership)? → OPUS
    ├── Is this interactive Q&A?
    │       │
    │       ├── Matches FAQ pattern? → HAIKU
    │       │
    │       ├── Standard question? → SONNET
    │       │
    │       ├── "Request Expert Research"? → OPUS (batch)
    │       │
    │       └── Admin escalation? → OPUS (interactive)
    └── Is this validation?
            ├── Pre-review check? → SONNET
            └── Final QA (summarized data)? → OPUS (batch)

Appendix B: Open Questions for Future Resolution

| Question | Context | Proposed Resolution |
|----------|---------|---------------------|
| Optimal Opus validation frequency | Every return vs sampling? | Start with every return, measure value |
| Multi-model chaining ROI | Does Haiku → Sonnet save money? | Collect V1 data, analyze post-season |
| Preparer training on model selection | Do they need to understand? | Hide complexity, surface "Expert Research" only |
| State-specific model adjustments | CA vs FL complexity? | Single national model for V1 |

Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.3 | December 25, 2025 | Claude | Fixed Section 1.3: Corrected token estimate from 2,000 to ~126,500/return, fixed annual cost calculations to align with COST_DETAIL.md ($0.33/return). |
| 1.2 | December 25, 2025 | Claude | Document restructure: Absorbed batch API strategy (5.1) and expanded metadata caching (5.2) from COST_DETAIL.md. Updated section numbering. |
| 1.1 | December 25, 2025 | Claude | Reconciliation pass: Cross-referenced with ARCHITECTURE.md, PROCESS_FLOWS.md, COST_DETAIL.md, requirements. Added PendingFinalQA to PROCESS_FLOWS.md. Added Related Documentation section. |
| 1.0 | December 25, 2025 | Claude | Initial draft incorporating client decisions |

This document should be reviewed and approved before implementation.