AI Model Delegation Strategy¶
Version: 1.3 Date: December 25, 2025 Status: Draft - Pending Final Review
Related Documentation¶
| Document | Purpose |
|---|---|
| COST_DETAIL.md | Full cost breakdown, token usage, SaaS pricing model |
| PROCESS_FLOWS.md | Workflow states including PendingFinalQA |
| ARCHITECTURE.md | System architecture, service centralization |
| CLIENT_FOLLOWUP.md | FUT-002: Tiered AI Validation Strategy discussion |
Executive Summary¶
This document defines the AI model delegation strategy for the Tax Practice AI system. It specifies which Claude model (Haiku, Sonnet, or Opus) handles each AI touchpoint, the routing rules between models, and the human-in-the-loop checkpoints required by IRS Circular 230 and firm quality standards.
AI Cost Target (SaaS provider COGS): $0.12/return (vs $0.40 baseline) through strategic model delegation and optimization. This is our internal cost per return, not client pricing.
Model Distribution Target: Haiku 60% | Sonnet 35% | Opus 5%
Table of Contents¶
- Model Capabilities and Pricing
- 1.1 Model Comparison
- 1.2 Batch API Discount
- 1.3 Cost Model at Scale
- AI Touchpoints and Model Assignments
- 2.1 Document Processing
- 2.2 Preliminary Analysis
- 2.3 Interactive Q&A
- 2.4 Review and Validation
- Human-in-the-Loop Checkpoints
- 3.1 Regulatory Requirements
- 3.2 Document Processing
- 3.3 Workflow Checkpoints
- 3.4 Process Flow Diagram
- Resolved Ambiguous Scenarios
- 4.1 Preliminary Analysis Threshold
- 4.2 Interactive Q&A Model Selection
- 4.3 Extraction Confidence Threshold
- 4.4 Opus Usage: Interactive vs Batch
- 4.5 Multi-Model Pipelines
- Additional Scenarios
- 5.1 Batch API Optimization
- 5.2 Context Window Management
- 5.3 Retry and Fallback Logic
- 5.4 Caching Strategy
- Quality Metrics and Ground Truth
- 6.1 Ground Truth Sources
- 6.2 Metric Definitions
- 6.3 Quality Feedback Loops
- Usage Caps and Reporting
- 7.1 Opus Usage Caps
- 7.2 Cap Granularity
- 7.3 Usage Visibility
- Implementation Phases
- 8.1 Phase 1: Foundation
- 8.2 Phase 2: Optimization
- 8.3 Phase 3: Enhancement
1. Model Capabilities and Pricing¶
1.1 Model Comparison¶
| Model | Strengths | Weaknesses | Cost (per 1M tokens) |
|---|---|---|---|
| Haiku 3 | Fast, cheap, structured tasks | Limited reasoning | In: $0.25, Out: $1.25 |
| Sonnet 4 | Balanced, strong extraction | Slower than Haiku | In: $3.00, Out: $15.00 |
| Opus 4.5 | Best reasoning, nuanced judgment | Slowest, highest cost | In: $5.00, Out: $25.00 |
Pricing as of Dec 2025 via AWS Bedrock. See AWS Bedrock Pricing.
1.2 Batch API Discount¶
- 50% cost reduction for batch processing (non-real-time)
- Target: 80% of non-interactive workloads through batch API
- Processing window: overnight via Airflow scheduling
1.3 Cost Model at Scale¶
| Returns/Year | AI Cost/Return | Annual AI Cost |
|---|---|---|
| 1,000 | $0.33 | ~$330/year |
| 5,000 | $0.33 | ~$1,650/year |
| 10,000 | $0.33 | ~$3,300/year |
Based on ~126,500 tokens/return lifecycle with 60/35/5 model delegation. See COST_DETAIL.md for token breakdown. The $0.33 figure is the delegation-only baseline; batch processing (Section 1.2) and caching (Section 5.4) are expected to bring the internal cost toward the $0.12/return target stated in the Executive Summary.
2. AI Touchpoints and Model Assignments¶
2.1 Document Processing Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-001 | Document Classification | Haiku | None | Yes |
| TP-002 | W-2 Extraction | Haiku | Sonnet if confidence <90% | Yes |
| TP-003 | 1099 Extraction | Haiku | Sonnet if confidence <90% | Yes |
| TP-004 | K-1 Extraction | Sonnet | Opus if confidence <85% | Yes |
| TP-005 | Brokerage Statement Extraction | Sonnet | None | Yes |
| TP-006 | Other Document Extraction | Sonnet | None | Yes |
Rationale:
- Standard forms (W-2, 1099) are highly structured → Haiku sufficient
- K-1s have variable formats and complex pass-through data → Sonnet default, Opus for low confidence
- Brokerage statements require table parsing and multi-page handling → Sonnet
Image Quality Escalation:
Document source affects model selection independent of document type:
| Source | Default Adjustment | Rationale |
|---|---|---|
| Native PDF | Use table above | Embedded text, reliable |
| High-quality scan | Use table above | Clean OCR |
| Phone photo | +1 model tier | Skew, lighting, blur |
| Low-res/blurry | +1 model tier | OCR uncertainty |
| HEIC (iPhone) | Convert first, then assess | Format handling |
| Handwritten content | → Human review | Too unreliable for AI |
Implementation:
ESCALATION = {"haiku": "sonnet", "sonnet": "opus", "opus": "opus"}

def escalate(base_model):
    return ESCALATION[base_model]  # haiku→sonnet, sonnet→opus

def adjust_for_image_quality(base_model, doc_metadata):
    # Handwritten content always routes to human review (see table above)
    if doc_metadata.has_handwriting:
        return "human_review"
    if doc_metadata.is_native_pdf:
        return base_model
    if doc_metadata.source == "phone_photo":
        return escalate(base_model)
    if doc_metadata.resolution < 150 or doc_metadata.blur_score > 0.3:
        return escalate(base_model)
    return base_model
2.2 Preliminary Analysis Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-007 | Simple Return Analysis (≤2 schedules, no business) | Haiku | Sonnet on anomaly | Yes |
| TP-008 | Standard Return Analysis (>2 schedules OR business/investment) | Sonnet | Opus for complex entities | Yes |
| TP-009 | Prior Year Comparison | Haiku | Sonnet if significant changes | Yes |
| TP-010 | Missing Document Detection | Haiku | None | Yes |
| TP-011 | Client Question Generation | Haiku | Sonnet for complex situations | Yes |
Decision (from client): Schedule type matters as much as count. Haiku for ≤2 schedules AND no business income; Sonnet for >2 schedules OR any business/investment complexity.
TP-007/TP-009 Anomaly Thresholds (Haiku → Sonnet):
Note: We add value on top of tax software variance tools rather than replacing them; these thresholds trigger deeper AI analysis, not duplicates of existing variance reports.
| Anomaly Type | Threshold | Rationale |
|---|---|---|
| Income change YoY | >25% | Matches IRS DIF flags |
| Deductions change | >20% or >$5K | IRS audit trigger |
| New schedule | Any not in PY | Life event likely |
| Missing PY source | W-2/1099 gone | Job loss, needs verify |
| Charitable giving | >10% of AGI | IRS scrutiny area |
| Schedule C loss | Any | Hobby loss rules |
| Home office | New claim | Documentation needed |
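A minimal sketch of how these thresholds could drive the Haiku → Sonnet escalation for TP-007/TP-009. The field names on the `current` and `prior_year` summary objects are illustrative, not an existing schema:

def should_escalate_analysis(current, prior_year):
    # Income change YoY >25%
    if prior_year.total_income and abs(current.total_income - prior_year.total_income) / prior_year.total_income > 0.25:
        return True
    # Deductions change >20% or >$5K
    delta = abs(current.total_deductions - prior_year.total_deductions)
    if delta > 5_000 or (prior_year.total_deductions and delta / prior_year.total_deductions > 0.20):
        return True
    # New schedule not present in prior year (likely life event)
    if set(current.schedules) - set(prior_year.schedules):
        return True
    # Prior-year W-2/1099 source missing this year (job loss, needs verification)
    if set(prior_year.income_sources) - set(current.income_sources):
        return True
    # Charitable giving >10% of AGI (IRS scrutiny area)
    if current.agi and current.charitable_contributions / current.agi > 0.10:
        return True
    # Any Schedule C loss (hobby loss rules) or a new home office claim
    if current.schedule_c_loss or current.new_home_office_claim:
        return True
    return False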
TP-011 Question Complexity Routing:
Haiku generates (simple data collection):
- Missing document requests
- Address/contact confirmations
- "Did you receive [expected doc]?"
Sonnet generates (judgment required):
- Life events (marriage, divorce, new child, death)
- Business vs hobby determination
- Multi-state residency questions
- Unusual income sources
- "Help us understand..." questions
2.3 Interactive Q&A Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-012 | FAQ-Pattern Matches | Haiku | None | No |
| TP-013 | Standard Q&A (staff) | Sonnet | Opus via "Request Expert Research" | No (interactive) |
| TP-014 | Complex Research Request | Opus | None | Yes (preferred) |
| TP-015 | Batch Overnight Research | Sonnet (batch) | None | Yes |
Decision (from client): Hybrid UX approach:
- Default to Sonnet for interactive Q&A
- Auto-downgrade to Haiku for FAQ-pattern matches
- "Request Expert Research" link for Opus (batch, positioned as more thorough)
- Admin-only "Escalate to Expert" for Opus interactive
2.4 Review and Validation Touchpoints¶
| ID | Touchpoint | Default Model | Escalation | Batch Eligible |
|---|---|---|---|---|
| TP-016 | Preparer Summary Generation | Haiku | None | Yes |
| TP-017 | Pre-Review Validation | Sonnet | Opus if issues found | Yes |
| TP-018 | Final Validation (Opus QA) | Opus (metadata only, batch) | None | Yes |
| TP-019 | E-file Rejection Analysis | Haiku | Sonnet for complex rejections | Yes |
Validation Strategy (see FUT-002 for discussion):
- Tiered self-validation - each model QAs its own work inline
- TP-017 - Sonnet pre-review escalates to Opus if issues found
- TP-018 - Opus reviews cached metadata only (not full docs), runs overnight batch
- Workflow - adds "Pending Final QA" status before client delivery
TP-018 Opus QA Detail:
Workflow integration (see PROCESS_FLOWS.md):
Approved → PendingFinalQA → (overnight batch)
↓
┌───────────┴───────────┐
↓ ↓
No issues Issues found
↓ ↓
PendingSignature RevisionsNeeded
What Opus reviews (metadata only, ~750 tokens):
- Key figures summary (income, deductions, refund/due)
- Flags from earlier analysis
- Prior year comparison highlights
- Confidence scores from extraction
- Any anomalies detected
What Opus does NOT review:
- Full document images
- Raw extraction output
- Complete prior year data
UX framing: "While reviewing our work, we noticed..."
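A sketch of the metadata-only payload assembled for the TP-018 batch job, assuming the per-return summary described in Section 5.2; attribute names are illustrative:

def build_final_qa_payload(return_summary):
    # ~750 tokens of cached metadata; full documents are never sent to Opus
    return {
        "key_figures": {
            "total_income": return_summary.total_income,
            "total_deductions": return_summary.total_deductions,
            "refund_or_balance_due": return_summary.refund_or_balance_due,
        },
        "analysis_flags": return_summary.flags,
        "prior_year_highlights": return_summary.prior_year_comparison,
        "extraction_confidence": return_summary.field_confidences,
        "anomalies": return_summary.anomalies,
    }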
Expedited QA Option (fee-based):
| Mode | Wait Time | Our Cost | Client Fee |
|---|---|---|---|
| Standard (batch) | 12-24 hrs | ~$0.005 | Included |
| Expedited (real-time) | ~2 min | ~$0.03 | $5-10/return |
Use cases for expedited:
- Filing deadline day
- Rush returns (already paying rush fee)
- Client-requested priority
Implementation:
- Button: "Expedite Final Review (+$X)"
- Skips PendingFinalQA queue
- Runs Opus immediately
- Logs for billing
Cost comparison at 1K returns/year:
| Approach | Cost |
|---|---|
| Full-context Opus every return | ~$30/year |
| Summarized data Opus every return | ~$15/year |
| Metadata-only Opus batch | ~$5/year |
| Combined tiered approach | ~$8-10/year |
| Expedited (10% of returns) | +$3/year |
E-file Rejection Routing (TP-019):
Haiku handles simple/technical fixes:
| Code | Issue | Resolution |
|---|---|---|
| IND-181 | IP PIN missing | Prompt for IP PIN |
| Format errors | Schema validation | Auto-correct format |
| Math errors | Calculation mismatch | Recalculate |
| Missing fields | Required field blank | Prompt for data |
Sonnet escalation (investigation needed):
| Code | Issue | Why Complex |
|---|---|---|
| IND-031/032 | AGI mismatch | Amended return, paper filed, IRS lag |
| R0000-500/503 | SSN/name mismatch | Typo vs legal name change |
| IND-524 | DOB mismatch | Data entry vs SSA records |
| F8962-070 | Marketplace insurance | ACA reconciliation needed |
Human escalation (fraud/conflict indicators):
| Code | Issue | Action |
|---|---|---|
| IND-507 | Dependent already claimed | Family dispute or ID theft |
| IND-516 | SSN claimed elsewhere | Possible fraud |
| R0000-902 | Duplicate return filed | Identity theft concern |
def route_rejection(error_code):
    # Human escalation - fraud/conflict
    if error_code in ['IND-507', 'IND-516', 'R0000-902']:
        return "human_review"
    # Sonnet - investigation needed
    if error_code.startswith(('IND-031', 'IND-032',
                              'R0000-500', 'R0000-503',
                              'IND-524', 'F8962')):
        return "sonnet"
    # Haiku - simple/technical fixes
    return "haiku"
3. Human-in-the-Loop Checkpoints¶
3.1 Regulatory Requirements (Circular 230)¶
| Checkpoint | Requirement | Implementation |
|---|---|---|
| Preparer Review | All returns must be reviewed by PTIN holder | AI Analysis → InPrep transition always requires human |
| EA/CPA Final Approval | Licensed professional sign-off | InReview → Approved requires EA/CPA action |
| Due Diligence Documentation | Document basis for positions | AI logs all analysis; preparer confirms |
3.2 Document Processing Checkpoints¶
| State Transition | Human Action | Trigger |
|---|---|---|
| UnknownType → Classified | Staff manually classifies | AI cannot determine document type |
| MediumConfidence → Verified | Staff reviews extraction | Confidence 70-94% |
| LowConfidence → ManualEntry | Staff enters data manually | Confidence <70% |
| HighConfidence → Verified | Staff spot-checks | Non-native PDF with 95%+ confidence |
Confidence Thresholds (refined per client input):
| Document Type | Auto-Verify | Review | Manual Entry |
|---|---|---|---|
| W-2 | 95% | 85-94% | <85% |
| 1099 series | 95% | 85-94% | <85% |
| K-1 | 98% | 90-97% | <90% |
| Dollar amounts | 98% | 90-97% | <90% |
| Name/address | 90% | 80-89% | <80% |
Confidence Calculation Method:
Confidence is calculated from multiple sources:
- OCR engine - SurePrep/Textract character-level scores
- AI extraction - Claude outputs confidence per field
- Format validation - SSN format, EIN format, checksums
- Cross-source agreement - OCR vs AI match
def calculate_field_confidence(field, ocr_result, ai_result):
    # Start with OCR confidence (0.0-1.0)
    base = ocr_result.confidence
    # Boost if AI extraction agrees
    if ocr_result.value == ai_result.value:
        base = min(base + 0.10, 1.0)
    else:
        base = base * 0.7  # Disagreement penalty
    # Boost if format validates
    if passes_format_check(field.type, ocr_result.value):
        base = min(base + 0.05, 1.0)
    # Penalty for known-difficult fields
    if field.type in ['k1_box', 'handwritten']:
        base = base * 0.9
    return base
Document-level confidence = minimum of all field confidences (weakest link).
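A sketch of how the document-type thresholds and the weakest-link rule might combine into a routing decision; the dictionary keys and returned state names are illustrative:

# (auto-verify floor, review floor) per document type; below the review floor → manual entry
CONFIDENCE_THRESHOLDS = {
    "w2": (0.95, 0.85),
    "1099": (0.95, 0.85),
    "k1": (0.98, 0.90),
}

def route_extraction(doc_type, field_confidences):
    doc_confidence = min(field_confidences)  # weakest link
    auto_verify, review = CONFIDENCE_THRESHOLDS[doc_type]
    if doc_confidence >= auto_verify:
        return "verified"        # HighConfidence (spot-check if not a native PDF)
    if doc_confidence >= review:
        return "pending_review"  # MediumConfidence → staff reviews extraction
    return "manual_entry"        # LowConfidence → staff enters data manually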
3.3 Workflow Checkpoints¶
| Workflow State | Human Actor | Decision Point |
|---|---|---|
| AIAnalysis → InPrep | Preparer | Review AI output, apply judgment |
| AIAnalysis → NeedsReview | Staff | Exception requires human decision |
| NeedsReview → InPrep | Staff | Resolve exception before continuing |
| InPrep → ReadyForReview | Preparer | Confirm preparation complete |
| InReview → Approved | Reviewer (EA/CPA) | Final quality gate |
| InReview → RevisionsNeeded | Reviewer | Issues found, return to preparer |
| Rejected → InPrep | Preparer | Fix e-file rejection |
| FraudReview → EACPAReview | EA/CPA | Escalation for duplicate/fraud |
3.4 Process Flow Diagram: AI to Human Handoffs¶
┌─────────────────────────────────────────────────────────────────┐
│ DOCUMENT PROCESSING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Upload → [Haiku: Classify] → Classification │
│ │ │
│ ├── Success → [Haiku/Sonnet: Extract] │
│ │ │ │
│ │ ├── High (95%+) → Auto ✓ │
│ │ ├── Med (70-94%) → HUMAN │
│ │ └── Low (<70%) → HUMAN │
│ │ │
│ └── Fail → HUMAN (manual classify) │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PRELIMINARY ANALYSIS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ All Docs Ready → [Haiku/Sonnet: Analyze] │
│ │ │
│ ├── Normal → Summary → PREPARER │
│ │ │
│ └── Exception → NeedsReview → STAFF │
│ │
│ *** AI NEVER proceeds to filing without human review *** │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ REVIEW & APPROVAL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PREPARER completes → ReadyForReview → EA/CPA REVIEWER │
│ │ │
│ ├── Approved ✓ │
│ │ │
│ └── Revisions → │
│ back to │
│ PREPARER │
│ │
└─────────────────────────────────────────────────────────────────┘
4. Resolved Ambiguous Scenarios¶
4.1 Preliminary Analysis Threshold¶
Question: When does a return qualify for Haiku (cheap) vs Sonnet (standard) preliminary analysis?
Resolution:
- Haiku: ≤2 schedules AND no business income AND no investment complexity
- Sonnet: >2 schedules OR any business/investment income OR K-1 present
Implementation:
def select_preliminary_model(return_data):
    schedule_count = len(return_data.schedules)
    has_business = return_data.has_schedule_c or return_data.has_schedule_e
    has_investment = return_data.has_schedule_d or return_data.has_k1
    if schedule_count <= 2 and not has_business and not has_investment:
        return "haiku"
    return "sonnet"
Assumption: Thresholds based on industry norms; validate with client after first tax season.
4.2 Interactive Q&A Model Selection¶
Question: How do we route staff Q&A to the right model without training them to always click "Expert"?
Resolution: Hybrid approach with intelligent defaults:
| Scenario | Model | UX Element |
|---|---|---|
| FAQ pattern match | Haiku | Instant response, no indicator |
| Standard question | Sonnet | Default, "Get Answer Now" |
| Complex research | Opus (batch) | "Request Expert Research" button |
| Urgent escalation | Opus (interactive) | Admin-only "Escalate to Expert" |
UX Positioning:
- "Request Expert Research" framed as more thorough, not cheaper
- Results delivered next business day
- Usage cap per firm with visibility reporting
FAQ Pattern Detection:
FAQ patterns = predictable questions Haiku can answer instantly.
Examples (route to Haiku):
- "What's the standard deduction for 2024?"
- "When is the filing deadline?"
- "What's the mileage rate?"
- "What documents do I need for a W-2?"
Not FAQ (route to Sonnet):
- "Should this client itemize?"
- "Is this income taxable?"
- "How do I handle this K-1?"
Detection: Embedding similarity to curated FAQ database.
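A sketch of the detection step, assuming questions and curated FAQ entries are embedded as unit-normalized vectors by whatever embedding model the platform standardizes on; the 0.85 cutoff is a placeholder to tune against logged Q&A:

import numpy as np

FAQ_SIMILARITY_THRESHOLD = 0.85  # placeholder cutoff, tuned against logged Q&A

def is_faq_match(question_vector, faq_matrix):
    # Cosine similarity reduces to a dot product for unit-normalized vectors
    similarities = faq_matrix @ question_vector
    return float(np.max(similarities)) >= FAQ_SIMILARITY_THRESHOLD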
Tax Knowledge Skills Architecture:
Skills = curated knowledge + prompt templates, versioned by tax year.
skills/
├── federal/
│ ├── individual_2024.md
│ └── business_2024.md
├── states/
│ ├── FL_2024.md
│ ├── GA_2024.md
│ └── ... (design for 50)
└── common/
├── deadlines_2024.md
├── forms.md
└── rates_2024.md
Skills rebuilt annually when tax code updates (public/predictable).
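A sketch of how skill files could be assembled into prompt context for a given return; file names follow the layout above, and the helper itself is illustrative:

from pathlib import Path

def load_skills(tax_year, states, base_dir="skills"):
    base = Path(base_dir)
    paths = [
        base / "federal" / f"individual_{tax_year}.md",
        base / "common" / f"deadlines_{tax_year}.md",
        base / "common" / f"rates_{tax_year}.md",
        base / "common" / "forms.md",
    ]
    paths += [base / "states" / f"{state}_{tax_year}.md" for state in states]
    # Missing state files are skipped until that state's skill is authored
    return [p.read_text() for p in paths if p.exists()]

For example, load_skills(2024, ["FL", "GA"]) would pull the federal, common, and Florida/Georgia skills for a two-state return.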
4.3 Extraction Confidence Threshold¶
Question: What confidence thresholds trigger human review?
Resolution: Document-type and field-level thresholds:
| Factor | Auto-Verify | Review | Manual Entry |
|---|---|---|---|
| W-2, 1099 (standard) | ≥95% | 85-94% | <85% |
| K-1 (complex) | ≥98% | 90-97% | <90% |
| Dollar amounts | ≥98% | 90-97% | <90% |
| Name/address | ≥90% | 80-89% | <80% |
Rationale: Dollar amounts have higher materiality; K-1s have more variable formats and require stricter thresholds.
4.4 Opus Usage: Interactive vs Batch¶
Question: When should Opus be available interactively vs forced to batch?
Resolution:
| Mode | Availability | Use Case |
|---|---|---|
| Batch (default) | All staff | "Request Expert Research" → overnight |
| Interactive | Admin only | "Escalate to Expert" → immediate |
UX Copy:
- Batch: "Submit to Expert" → "Your research request has been queued. Results will be ready by 8 AM tomorrow."
- Interactive: (admin only) "Escalate to Expert" → immediate response
Reporting: Usage caps with per-firm visibility (see Section 7).
4.5 Multi-Model Pipelines¶
Question: Should we chain models (Haiku → Sonnet → Opus) or use single model per task?
Resolution: Single model per task for V1, with architecture supporting future chaining.
V1 Implementation:
- Each AI touchpoint uses one model based on routing rules
- Output includes confidence score
- Routing logic is a separate layer (not embedded in prompts)
- All interactions logged for future routing model training
Future Enhancement:
- Train routing model on logged data
- Implement Haiku-first with auto-escalation
- A/B test single vs chained approaches
5. Additional Scenarios¶
5.1 Batch API Optimization¶
Not all AI tasks require real-time response. Batch processing via Anthropic's batch API is 50% cheaper.
Task Classification:
| Task | Interactive? | Batch Eligible? | Notes |
|---|---|---|---|
| Document classification | Partial | Yes | Batch overnight for folder uploads |
| Data extraction | Partial | Yes | Batch overnight for large uploads |
| Prior year comparison | Optional | Yes | Run nightly for next day review |
| Missing doc detection | Optional | Yes | Run nightly after new docs processed |
| Preparer Q&A | Yes | No | Real-time conversation required |
| Worksheet generation | Optional | Yes | Pre-generate overnight |
| Rejection analysis | No | Yes | Run when rejection received |
| Final QA (TP-018) | No | Yes | Overnight batch |
| Tax reminders | No | Yes | Scheduled batch job |
Key Insight: Only Preparer Q&A truly requires real-time response.
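A sketch of the batch-vs-interactive routing implied by the table; the task identifiers are illustrative, not an existing enum:

BATCH_ELIGIBLE = {
    "document_classification", "data_extraction", "prior_year_comparison",
    "missing_doc_detection", "worksheet_generation", "rejection_analysis",
    "final_qa", "tax_reminders",
}

def choose_processing_mode(task_type, live_session_requested=False):
    # Only preparer Q&A (or an explicit live session) needs a real-time response
    if task_type == "preparer_qa" or live_session_requested:
        return "interactive"
    if task_type in BATCH_ELIGIBLE:
        return "batch"  # queued into the overnight Airflow window at 50% cost
    return "interactive"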
UX Strategy: Default to Batch
When documents are uploaded:
"Thanks! I'll analyze these documents and have results ready for your review tomorrow morning. Need to discuss now? [Start Live Session]"
Benefits:
- Default path is cheapest (batch)
- User opts into real-time only when needed
- Sets expectation that AI "works overnight"
- Live sessions can be premium feature
Implementation: Airflow schedules batch jobs (see ARCHITECTURE.md):
- Queue document processing overnight after SmartVault sync
- Pre-generate worksheets for morning review
- Run analysis batch jobs during off-peak hours
For cost savings projections, see COST_DETAIL.md.
5.2 Context Window Management¶
Scenario: Complex returns with 50+ documents may exceed context limits.
Strategy:
| Approach | When to Use | Implementation |
|---|---|---|
| Metadata Caching | All returns | Extract once, cache as MD (<350 lines) |
| Document Summarization | Large doc sets | Haiku summarizes each doc first |
| Chunked Processing | Very large returns | Process in batches, aggregate results |
| Per-Return Summary | Every return | Maintain running summary file |
Token Budget per Return:
- Target: 15,000 tokens average (fits Haiku/Sonnet comfortably)
- Maximum: 50,000 tokens (large returns with summarization)
- Opus validation: 5,000 tokens (summarized data only)
Metadata Cache Structure:
/returns/{return_id}/
├── metadata.md # Extracted data summary (<350 lines)
├── prior_year_summary.md # Prior year comparison
├── flags.md # Anomalies and issues
└── documents/
├── w2_001_summary.md
├── 1099_001_summary.md
└── ...
Per-Document Metadata Format:
File: docs/{client_id}/{return_year}/{doc_id}_metadata.md
# Document Metadata
- **Type:** W-2
- **Source:** employer_acme_corp.pdf
- **Uploaded:** 2024-02-15
- **Confidence:** 98%
## Extracted Values
| Field | Value | Box |
|-------|-------|-----|
| Employer | Acme Corp | c |
| EIN | 12-3456789 | b |
| Wages | $85,432.00 | 1 |
| Federal Withholding | $12,543.00 | 2 |
## AI Notes
- Wages increased 8% from prior year ($79,100)
- Withholding rate: 14.7% (typical)
## Flags
- None
Per-Return Summary Format:
File: docs/{client_id}/{return_year}/return_summary.md
# Return Summary: John Smith (2024)
## Documents Received (12 of 15 expected)
| Doc | Type | Status | Key Value |
|-----|------|--------|-----------|
| W-2 Acme Corp | W-2 | ✓ Complete | $85,432 wages |
| 1099-INT Chase | 1099-INT | ✓ Complete | $1,234 interest |
## Missing Documents
- 1098 Mortgage Interest (expected based on prior year)
## Prior Year Comparison
| Item | 2023 | 2024 | Change |
|------|------|------|--------|
| Total Income | $82,500 | $87,233 | +5.7% |
## Flags & Questions
1. Large charitable contribution ($5,000) - verify documentation
Implementation:
- Batch extract → Create metadata MD on first document scan
- Store in S3 → Alongside original document
- Index in Aurora → Track metadata file location
- AI reads cache first → Only re-scan if metadata missing or stale
- Refresh trigger → Re-extract if document updated or confidence < 90%
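A sketch of the read-cache-first flow above, assuming hypothetical `cache` and `extract_document` helpers backed by the S3/Aurora storage described in ARCHITECTURE.md:

CONFIDENCE_REFRESH_THRESHOLD = 0.90

def get_document_metadata(doc, cache, extract_document):
    entry = cache.get(doc.id)
    stale = entry is not None and entry.updated_at < doc.updated_at
    low_confidence = entry is not None and entry.confidence < CONFIDENCE_REFRESH_THRESHOLD
    if entry is None or stale or low_confidence:
        entry = extract_document(doc)  # full re-extraction, batched where possible
        cache.put(doc.id, entry)
    return entry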
For cost savings projections, see COST_DETAIL.md.
5.3 Retry and Fallback Logic¶
Scenario: What happens when an AI call fails or returns low-confidence output?
Strategy:
| Failure Type | Retry? | Fallback | Human Escalation |
|---|---|---|---|
| API timeout | Yes, 3x with backoff | Same model | After 3 failures |
| Rate limit | Yes, with delay | Same model | Never (wait) |
| Low confidence (Haiku) | No | Escalate to Sonnet | After Sonnet |
| Low confidence (Sonnet) | No | Escalate to Opus | After Opus |
| Low confidence (Opus) | No | None | Immediate |
| Malformed output | Yes, 1x | Same model | After retry |
Backoff Schedule:
- Retry 1: 2 seconds
- Retry 2: 4 seconds
- Retry 3: 8 seconds
- Then: Queue for human review
Confidence Escalation Flow:
Haiku (conf < threshold)
→ Sonnet (conf < threshold)
→ Opus (conf < threshold)
→ HUMAN REVIEW (always terminal)
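A sketch combining the backoff schedule and the escalation chain; `call_model` stands in for the actual Bedrock client wrapper, and the confidence threshold comes from the routing rules for the touchpoint:

import time

BACKOFF_SECONDS = [2, 4, 8]
ESCALATION_CHAIN = {"haiku": "sonnet", "sonnet": "opus", "opus": None}

def call_with_retry(call_model, model, request):
    # Initial attempt plus three retries at 2s/4s/8s; give up after that
    for delay in BACKOFF_SECONDS + [None]:
        try:
            return call_model(model, request)
        except TimeoutError:
            if delay is None:
                return None  # retries exhausted → queue for human review
            time.sleep(delay)

def run_with_escalation(call_model, request, confidence_threshold, model="haiku"):
    while model is not None:
        result = call_with_retry(call_model, model, request)
        if result is None:
            return "human_review"  # repeated API failures
        if result.confidence >= confidence_threshold:
            return result
        model = ESCALATION_CHAIN[model]  # haiku → sonnet → opus → terminal
    return "human_review"  # low confidence even from Opus → always human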
5.4 Caching Strategy¶
Scenario: Same questions asked across multiple clients (e.g., "What's the standard deduction for 2024?").
Strategy:
| Cache Type | Scope | TTL | Use Case |
|---|---|---|---|
| FAQ Cache | Global | Tax year | Standard deduction, filing deadlines |
| Regulation Cache | Global | Tax year | IRS rules, state requirements |
| Firm Guidelines Cache | Firm | Until updated | Firm-specific policies |
| Prior Year Cache | Per-client | Permanent | Client's historical data |
| Session Cache | Per-session | 1 hour | Avoid re-asking same question |
Cache Key Structure:
faq:{tax_year}:{normalized_question_hash}
reg:{tax_year}:{topic}:{jurisdiction}
firm:{firm_id}:{guideline_type}
client:{client_id}:{tax_year}:summary
session:{session_id}:{query_hash}
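A sketch of key construction for two of the cache types above; the normalization-and-hash step is an assumption intended to let phrasing variants hit the same FAQ entry:

import hashlib

def faq_cache_key(tax_year, question):
    # Normalize and hash so phrasing variants map to the same cached answer
    normalized = " ".join(question.lower().split())
    digest = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    return f"faq:{tax_year}:{digest}"

def client_summary_key(client_id, tax_year):
    return f"client:{client_id}:{tax_year}:summary"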
Cost Savings: 20-30% reduction in duplicate queries.
6. Quality Metrics and Ground Truth¶
6.1 Ground Truth Sources¶
| Metric | Ground Truth Source | Feedback Loop |
|---|---|---|
| Extraction accuracy | Staff corrections in PendingReview | Retrain extraction prompts |
| Classification accuracy | Staff manual classifications | Update classification rules |
| Analysis quality | Reviewer corrections (RevisionsNeeded) | Refine analysis prompts |
| Q&A quality | Preparer ratings (thumbs up/down) | Prompt optimization |
| Validation effectiveness | IRS rejections | Improve pre-file checks |
| Overall accuracy | IRS notices (CP2000, etc.) | Long-term quality signal |
6.2 Metric Definitions¶
| Metric | Calculation | Target |
|---|---|---|
| Extraction Accuracy | (Auto-verified + Spot-check passed) / Total extractions | >95% |
| Classification Accuracy | Auto-classified / (Auto + Manual classified) | >98% |
| First-Pass Approval Rate | Approved / (Approved + RevisionsNeeded) | >90% |
| E-file Acceptance Rate | Accepted / (Accepted + Rejected) | >98% |
| AI-Assisted Time Savings | (Manual time - AI-assisted time) / Manual time | >40% |
| Cost per Return | Total AI costs / Returns processed | <$0.15 |
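The same definitions expressed as simple calculations over workflow counts; the counts themselves would come from the workflow tables in Aurora, and the function names are illustrative:

def extraction_accuracy(auto_verified, spot_check_passed, total_extractions):
    return (auto_verified + spot_check_passed) / total_extractions  # target >95%

def first_pass_approval_rate(approved, revisions_needed):
    return approved / (approved + revisions_needed)  # target >90%

def cost_per_return(total_ai_cost, returns_processed):
    return total_ai_cost / returns_processed  # target <$0.15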
6.3 Quality Feedback Loops¶
┌─────────────────────────────────────────────────────────────────┐
│ EXTRACTION FEEDBACK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ AI Extraction → Staff Review → Correction │
│ │ │ │
│ │ └──→ Log correction type │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ Weekly analysis: │ │
│ │ │ - Common errors │ │
│ │ │ - Doc type fails │ │
│ │ │ - Field failures │ │
│ │ └──────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ Prompt/threshold tuning │
│ │ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ LONG-TERM QUALITY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ IRS Notices (CP2000, etc.) │
│ │ │
│ ▼ │
│ Link to original return → Identify AI touchpoint involved │
│ │ │
│ ▼ │
│ Root cause analysis → Systemic fix │
│ │
│ *** This is the ultimate ground truth *** │
│ *** But has 6-18 month lag *** │
│ │
└─────────────────────────────────────────────────────────────────┘
7. Usage Caps and Reporting¶
7.1 Opus Usage Caps¶
Decision: Implement per-firm monthly caps with transparency.
| Tier | Opus Interactive | Opus Batch | Rationale |
|---|---|---|---|
| Base | 0 (admin only) | 50/month | Encourage batch |
| Standard | 10/month | 100/month | Most firms |
| Premium | 25/month | 250/month | High-volume firms |
7.2 Cap Granularity¶
| Dimension | Tracking | Reporting |
|---|---|---|
| Per-firm monthly | Hard cap | Dashboard + email at 80% |
| Per-preparer | Soft tracking | Manager visibility |
| Per-return | Logging only | Audit trail |
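A sketch of the enforcement point for an Opus request, assuming hypothetical `usage` and `notify_admin` helpers; only the per-firm monthly count is a hard cap, per the tables above:

TIER_CAPS = {  # from Section 7.1: (interactive, batch) Opus requests per month
    "base": (0, 50),
    "standard": (10, 100),
    "premium": (25, 250),
}
ALERT_FRACTION = 0.80  # email the firm admin at 80% of cap

def check_opus_request(firm, mode, usage, notify_admin):
    interactive_cap, batch_cap = TIER_CAPS[firm.tier]
    cap = interactive_cap if mode == "interactive" else batch_cap
    used = usage.opus_requests_this_month(firm.id, mode)
    if used >= cap:
        return "denied"  # hard cap reached; resets at the start of the month
    if (used + 1) >= ALERT_FRACTION * cap:
        notify_admin(firm, used + 1, cap)
    return "allowed"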
7.3 Usage Visibility¶
Staff Dashboard:
┌─────────────────────────────────────────────────────────────────┐
│ AI Usage This Month │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Expert Research Requests: 38 of 50 remaining                    │
│ ███████████████░░░░░ 76% available                              │
│ │
│ Resets: January 1, 2026 │
│ │
└─────────────────────────────────────────────────────────────────┘
Admin Dashboard:
┌─────────────────────────────────────────────────────────────────┐
│ AI Costs - December 2025 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Total Cost: $127.45 │
│ Returns Processed: 89 │
│ Cost per Return: $1.43 │
│ │
│ By Model: │
│ - Haiku: $18.20 (62% of calls) │
│ - Sonnet: $84.50 (33% of calls) │
│ - Opus: $24.75 (5% of calls) │
│ │
│ By Touchpoint: │
│ - Extraction: $45.00 │
│ - Analysis: $52.00 │
│ - Q&A: $30.45 │
│ │
│ Top Opus Users: │
│ - Jane Smith: 8 requests │
│ - John Doe: 4 requests │
│ │
└─────────────────────────────────────────────────────────────────┘
8. Implementation Phases¶
8.1 Phase 1: Foundation (Pre-Tax Season)¶
| Item | Description | Priority |
|---|---|---|
| Model routing layer | Implement touchpoint → model mapping | P0 |
| Confidence scoring | Output confidence with every AI call | P0 |
| Basic escalation | Haiku → Sonnet escalation on low confidence | P0 |
| Logging infrastructure | Log all AI calls with timing, cost, confidence | P0 |
| Human review queues | PendingReview, ManualEntry workflows | P0 |
8.2 Phase 2: Optimization (During Tax Season)¶
| Item | Description | Priority |
|---|---|---|
| FAQ caching | Cache common questions | P1 |
| Batch API integration | Overnight processing for non-urgent | P1 |
| Threshold tuning | Adjust confidence thresholds based on data | P1 |
| Usage dashboards | Staff and admin visibility | P1 |
8.3 Phase 3: Enhancement (Post-Tax Season)¶
| Item | Description | Priority |
|---|---|---|
| Trained routing model | ML model for optimal model selection | P2 |
| A/B testing framework | Compare single vs chained approaches | P2 |
| Long-term quality tracking | IRS notice correlation | P2 |
| Context window optimization | Smarter summarization | P2 |
Appendix A: Model Selection Decision Tree¶
START: New AI Request
│
├── Is this document extraction?
│ │
│ ├── W-2/1099 (standard form)? → HAIKU
│ │ └── Confidence <90%? → SONNET
│ │
│ ├── K-1? → SONNET
│ │ └── Confidence <85%? → OPUS
│ │
│ └── Other? → SONNET
│
├── Is this preliminary analysis?
│ │
│ ├── ≤2 schedules AND no business/investment? → HAIKU
│ │ └── Anomaly detected? → SONNET
│ │
│ └── >2 schedules OR business/investment? → SONNET
│ └── Complex entity (trust, partnership)? → OPUS
│
├── Is this interactive Q&A?
│ │
│ ├── Matches FAQ pattern? → HAIKU
│ │
│ ├── Standard question? → SONNET
│ │
│ ├── "Request Expert Research"? → OPUS (batch)
│ │
│ └── Admin escalation? → OPUS (interactive)
│
└── Is this validation?
│
├── Pre-review check? → SONNET
│
└── Final QA (summarized data)? → OPUS (batch)
Appendix B: Open Questions for Future Resolution¶
| Question | Context | Proposed Resolution |
|---|---|---|
| Optimal Opus validation frequency | Every return vs sampling? | Start with every return, measure value |
| Multi-model chaining ROI | Does Haiku → Sonnet save money? | Collect V1 data, analyze post-season |
| Preparer training on model selection | Do they need to understand? | Hide complexity, surface "Expert Research" only |
| State-specific model adjustments | CA vs FL complexity? | Single national model for V1 |
Document History¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.3 | December 25, 2025 | Claude | Fixed Section 1.3: Corrected token estimate from 2,000 to ~126,500/return, fixed annual cost calculations to align with COST_DETAIL.md ($0.33/return). |
| 1.2 | December 25, 2025 | Claude | Document restructure: Absorbed batch API strategy (5.1) and expanded metadata caching (5.2) from COST_DETAIL.md. Updated section numbering. |
| 1.1 | December 25, 2025 | Claude | Reconciliation pass: Cross-referenced with ARCHITECTURE.md, PROCESS_FLOWS.md, COST_DETAIL.md, requirements. Added PendingFinalQA to PROCESS_FLOWS.md. Added Related Documentation section. |
| 1.0 | December 25, 2025 | Claude | Initial draft incorporating client decisions |
This document should be reviewed and approved before implementation.