
Sequence 4: Document Management

Status: Complete
Depends On: Sequences 1-3 (Foundation, Client Identity, Engagement)
Last Updated: 2024-12-23


Overview

This sequence implements the document management system. Clients can upload tax documents via web portal or email. Documents are scanned for malware, classified by AI, and key data is extracted. Integration with SmartVault (client portal) and SurePrep (OCR/extraction) is included.

Stories in This Sequence

| Story | Title | Status | Priority |
|---|---|---|---|
| S4-001 | Document Upload via Portal | Done | P0 |
| S4-002 | Document Upload via Email | Done | P0 |
| S4-003 | Malware Scanning | Done | P0 |
| S4-004 | Document Classification and Extraction | Done | P0 |
| S4-005 | SmartVault Integration | Done | P1 |
| S4-006 | SurePrep Integration | Done | P1 |
| S4-007 | Document Checklist Management | Done | P1 |
| S4-008 | Manual Extraction Correction | Done | P1 |

S4-001: Document Upload via Portal

Story: As a client, I want to upload documents through the web portal, so that I can submit my tax documents securely.

Acceptance Criteria

  • Drag-and-drop upload interface
  • Support for PDF, JPEG, PNG, HEIC
  • Support for DOCX, XLSX, CSV
  • Progress indicator during upload
  • Confirmation receipt displayed and emailed
  • Mobile camera capture supported
  • File size limit enforced (50MB)
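The type and size criteria above can be sketched as a pre-upload validation step. This is a minimal illustration, not the actual implementation; the MIME types and the 50 MB limit mirror the acceptance criteria, and the function name `validate_upload` is hypothetical.

```python
# Hypothetical pre-upload validation for the acceptance criteria above.
ALLOWED_CONTENT_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/heic",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # DOCX
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",        # XLSX
    "text/csv",
}
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # 50 MB limit from the criteria

def validate_upload(content_type: str, file_size_bytes: int) -> list[str]:
    """Return a list of validation errors; an empty list means the upload is acceptable."""
    errors = []
    if content_type not in ALLOWED_CONTENT_TYPES:
        errors.append(f"unsupported content type: {content_type}")
    if file_size_bytes > MAX_FILE_SIZE_BYTES:
        errors.append(f"file exceeds 50 MB limit ({file_size_bytes} bytes)")
    return errors
```

Rejections would surface in the portal UI before any bytes reach the staging bucket.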

Technical Design

Domain Model

# src/domain/document.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID

class DocumentType(str, Enum):
    """Types of tax documents."""
    W2 = "w2"
    W2_G = "w2_g"
    FORM_1099_INT = "1099_int"
    FORM_1099_DIV = "1099_div"
    FORM_1099_B = "1099_b"
    FORM_1099_MISC = "1099_misc"
    FORM_1099_NEC = "1099_nec"
    FORM_1099_R = "1099_r"
    FORM_1099_G = "1099_g"
    FORM_1099_K = "1099_k"
    FORM_1098 = "1098"
    FORM_1098_E = "1098_e"
    FORM_1098_T = "1098_t"
    K1_1065 = "k1_1065"
    K1_1120S = "k1_1120s"
    K1_1041 = "k1_1041"
    BROKERAGE_STATEMENT = "brokerage_statement"
    BANK_STATEMENT = "bank_statement"
    DRIVERS_LICENSE = "drivers_license"
    PRIOR_YEAR_RETURN = "prior_year_return"
    PROPERTY_TAX = "property_tax"
    CHARITABLE_RECEIPT = "charitable_receipt"
    MEDICAL_EXPENSE = "medical_expense"
    BUSINESS_EXPENSE = "business_expense"
    OTHER = "other"
    UNKNOWN = "unknown"

class DocumentStatus(str, Enum):
    """Document processing status."""
    UPLOADING = "uploading"
    SCANNING = "scanning"           # Malware scan
    QUARANTINED = "quarantined"     # Failed malware scan
    CLASSIFYING = "classifying"     # AI classification
    EXTRACTING = "extracting"       # OCR/extraction
    REVIEW_REQUIRED = "review_required"  # Low confidence
    PROCESSED = "processed"         # Ready for use
    REJECTED = "rejected"           # Manual rejection
    ARCHIVED = "archived"           # Post-filing archive

@dataclass
class Document:
    """Client-uploaded document."""
    id: UUID
    client_id: UUID
    tax_year: int
    original_filename: str
    content_type: str               # MIME type
    file_size_bytes: int
    s3_key: str                     # S3 storage key
    document_type: DocumentType
    document_type_confidence: Optional[float]
    status: DocumentStatus
    upload_source: str              # "portal", "email", "smartvault"
    uploaded_at: datetime
    scanned_at: Optional[datetime]
    scan_result: Optional[str]      # "clean", "infected", "error"
    classified_at: Optional[datetime]
    extracted_at: Optional[datetime]
    extraction_id: Optional[UUID]   # Link to extraction record
    notes: Optional[str]
    created_at: datetime
    updated_at: datetime

Database Tables

-- document_type enum
CREATE TYPE document_type_enum AS ENUM (
    'w2', 'w2_g', '1099_int', '1099_div', '1099_b', '1099_misc',
    '1099_nec', '1099_r', '1099_g', '1099_k', '1098', '1098_e',
    '1098_t', 'k1_1065', 'k1_1120s', 'k1_1041', 'brokerage_statement',
    'bank_statement', 'drivers_license', 'prior_year_return',
    'property_tax', 'charitable_receipt', 'medical_expense',
    'business_expense', 'other', 'unknown'
);

-- document_status enum
CREATE TYPE document_status_enum AS ENUM (
    'uploading', 'scanning', 'quarantined', 'classifying',
    'extracting', 'review_required', 'processed', 'rejected', 'archived'
);

-- document table
CREATE TABLE document (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    client_id UUID NOT NULL REFERENCES client(id),
    tax_year INTEGER NOT NULL,
    original_filename VARCHAR(255) NOT NULL,
    content_type VARCHAR(100) NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    document_type document_type_enum NOT NULL DEFAULT 'unknown',
    document_type_confidence DECIMAL(5,4),
    status document_status_enum NOT NULL DEFAULT 'uploading',
    upload_source VARCHAR(50) NOT NULL,
    uploaded_at TIMESTAMP NOT NULL DEFAULT NOW(),
    scanned_at TIMESTAMP,
    scan_result VARCHAR(50),
    classified_at TIMESTAMP,
    extracted_at TIMESTAMP,
    extraction_id UUID,
    notes TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    deleted_at TIMESTAMP
);

CREATE INDEX idx_document_client_id ON document(client_id);
CREATE INDEX idx_document_tax_year ON document(client_id, tax_year);
CREATE INDEX idx_document_status ON document(status);
CREATE INDEX idx_document_type ON document(document_type);

S3 Storage Structure

documents/
└── {client_id}/
    └── {tax_year}/
        ├── uploads/
        │   └── {document_id}/{original_filename}
        └── processed/
            └── {document_id}/
                ├── original.pdf
                ├── thumbnail.jpg
                └── extracted.json
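The layout above maps directly to key-builder helpers. These function names are illustrative, not taken from the codebase; only the key layout comes from the diagram.

```python
# Hypothetical helpers that build S3 keys following the storage layout above.
from uuid import UUID

def upload_key(client_id: UUID, tax_year: int, document_id: UUID, filename: str) -> str:
    """Key for a raw client upload, preserving the original filename."""
    return f"documents/{client_id}/{tax_year}/uploads/{document_id}/{filename}"

def processed_key(client_id: UUID, tax_year: int, document_id: UUID, artifact: str) -> str:
    """Key for a processed artifact: original.pdf, thumbnail.jpg, or extracted.json."""
    return f"documents/{client_id}/{tax_year}/processed/{document_id}/{artifact}"
```

Centralizing key construction keeps the layout consistent across the upload, scan, and sync workflows.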

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/documents/upload | POST | Initiate multipart upload |
| /v1/documents/upload/{upload_id} | PUT | Upload file chunk |
| /v1/documents/upload/{upload_id}/complete | POST | Complete upload |
| /v1/documents/{id} | GET | Get document details |
| /v1/documents/{id}/download | GET | Get presigned download URL |
| /v1/documents/client/{client_id} | GET | List client documents |
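The three-step upload flow (initiate, upload chunks, complete) can be sketched as a small session manager with the storage backend held in memory, so the endpoint semantics are testable without S3. The class and method names are hypothetical illustrations of the endpoint table, not the actual route handlers.

```python
# Hypothetical in-memory model of the initiate/chunk/complete upload endpoints.
from dataclasses import dataclass, field
from uuid import UUID, uuid4

@dataclass
class UploadSession:
    upload_id: UUID
    filename: str
    chunks: list = field(default_factory=list)
    completed: bool = False

class UploadManager:
    def __init__(self):
        self._sessions: dict = {}

    def initiate(self, filename: str) -> UUID:
        # POST /v1/documents/upload
        session = UploadSession(upload_id=uuid4(), filename=filename)
        self._sessions[session.upload_id] = session
        return session.upload_id

    def put_chunk(self, upload_id: UUID, chunk: bytes) -> None:
        # PUT /v1/documents/upload/{upload_id}
        self._sessions[upload_id].chunks.append(chunk)

    def complete(self, upload_id: UUID) -> bytes:
        # POST /v1/documents/upload/{upload_id}/complete
        session = self._sessions[upload_id]
        session.completed = True
        return b"".join(session.chunks)
```

In production the chunks would go to S3 multipart-upload parts rather than memory; the session bookkeeping is the part this sketch illustrates.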

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/domain/document.py | Create | Document domain entities |
| src/repositories/document_repository.py | Create | Document data access |
| src/workflows/documents/upload_workflow.py | Create | Upload orchestration |
| src/api/routes/documents.py | Create | API endpoints |
| src/api/schemas/document_schemas.py | Create | Pydantic models |
| tests/integration/conftest.py | Modify | Add document schema |
| tests/e2e/conftest.py | Modify | Add document schema |

S4-002: Document Upload via Email

Story: As a client, I want to email documents to the firm, so that I can submit documents using familiar tools.

Acceptance Criteria

  • Dedicated intake email address monitored
  • Account number + tax year parsed from subject/body
  • Attachments extracted and processed
  • Unmatched emails queued for manual routing
  • Confirmation reply sent
  • MSG/EML format handling

Technical Design

Email Processing Flow

Incoming Email (SES)
    ├── 1. Store raw email in S3
    ├── 2. Parse subject/body for account number + tax year
    ├── 3. If matched → Extract attachments → Document upload workflow
    └── 4. If unmatched → Queue for manual routing
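Step 2 of the flow above hinges on pulling an account number and tax year out of free-form text. A minimal parsing sketch follows; the `ACCT-` prefix convention is an assumption for illustration, not a confirmed format.

```python
# Hypothetical subject/body parser for email intake (step 2 above).
import re
from typing import Optional, Tuple

ACCOUNT_RE = re.compile(r"\bACCT-(\d{4,10})\b", re.IGNORECASE)  # assumed format
TAX_YEAR_RE = re.compile(r"\b(20\d{2})\b")

def parse_intake_email(subject: str, body: str) -> Optional[Tuple[str, int]]:
    """Return (account_number, tax_year) if both are found; None routes to manual queue."""
    text = f"{subject}\n{body}"
    acct = ACCOUNT_RE.search(text)
    year = TAX_YEAR_RE.search(text)
    if acct and year:
        return acct.group(1), int(year.group(1))
    return None
```

Returning `None` rather than a partial match keeps ambiguous emails in the manual-routing queue, matching the unmatched branch of the flow.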

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/email_intake_service.py | Create | Email parsing and routing |
| src/workflows/documents/email_intake_workflow.py | Create | Email processing |
| src/api/routes/webhooks.py | Modify | Add SES webhook handler |

S4-003: Malware Scanning

Story: As a system administrator, I want all uploads scanned for malware, so that the system is protected from malicious files.

Acceptance Criteria

  • ClamAV or equivalent integration
  • Scan occurs before storage in production bucket
  • Infected files quarantined with alert
  • Scan results logged
  • Client notified if file rejected
  • Bypass not possible

Technical Design

Scan Flow

Upload → Staging Bucket → ClamAV Scan
    ├── Clean → Move to production bucket → Continue processing
    └── Infected → Quarantine bucket → Alert → Notify client
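The branch logic above can be isolated from the ClamAV invocation itself, so the routing is testable without a scanner. This is a sketch only; the error-handling policy (retry from staging) is an assumption, and the actual ClamAV call is out of scope here.

```python
# Hypothetical routing of scan verdicts per the scan flow above.
from enum import Enum

class ScanVerdict(str, Enum):
    CLEAN = "clean"
    INFECTED = "infected"
    ERROR = "error"

def route_scan_result(verdict: ScanVerdict) -> dict:
    """Map a scan verdict to the next bucket and side effects."""
    if verdict is ScanVerdict.CLEAN:
        return {"bucket": "production", "alert": False, "notify_client": False}
    if verdict is ScanVerdict.INFECTED:
        return {"bucket": "quarantine", "alert": True, "notify_client": True}
    # One plausible policy (assumption): scan errors stay in staging for retry,
    # with an operator alert, rather than being promoted or quarantined.
    return {"bucket": "staging", "alert": True, "notify_client": False}
```

Keeping the verdict-to-action mapping pure makes the "bypass not possible" criterion auditable: promotion to the production bucket only ever happens through the `CLEAN` branch.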

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/malware_service.py | Create | ClamAV integration |
| src/workflows/documents/scan_workflow.py | Create | Scan orchestration |

S4-004: Document Classification and Extraction

Story: As a preparer, I want uploaded documents automatically classified and key data extracted, so that I don't have to manually identify document types.

Acceptance Criteria

  • AI classifies document type (W-2, 1099 series, 1098 series, K-1, etc.)
  • All 26 document types from the schema supported
  • Confidence score stored
  • Low-confidence items flagged for review
  • Classification editable by preparer
  • Form-specific extraction (W-2, 1099s, 1098s, K-1s)
  • Low-confidence extractions flagged for human review

Technical Design

Domain Model

# src/domain/extraction.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from uuid import UUID

@dataclass
class Extraction:
    """Extracted data from a document."""
    id: UUID
    document_id: UUID
    extraction_source: str          # "internal_ai", "sureprep"
    raw_extraction: dict            # Full extraction JSON
    validated_extraction: Optional[dict]  # After human review
    confidence_score: float         # Overall confidence
    fields_requiring_review: List[str]
    reviewed_by: Optional[UUID]
    reviewed_at: Optional[datetime]
    created_at: datetime
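One way `fields_requiring_review` and `confidence_score` might be derived from per-field confidences is sketched below. The 0.85 threshold and the min-aggregate are assumptions for illustration, not values from the spec.

```python
# Hypothetical derivation of review flags from per-field confidences.
REVIEW_THRESHOLD = 0.85  # assumed cutoff, not from the spec

def fields_requiring_review(field_confidences: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return field names whose extraction confidence falls below the threshold."""
    return sorted(name for name, conf in field_confidences.items() if conf < threshold)

def overall_confidence(field_confidences: dict) -> float:
    """One plausible aggregate: the weakest field dominates the document score."""
    return min(field_confidences.values(), default=0.0)
```

A min-aggregate is conservative by design: a single doubtful field pushes the whole document into `review_required` rather than letting a strong average mask it.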

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/domain/extraction.py | Create | Extraction domain entities |
| src/repositories/extraction_repository.py | Create | Extraction data access |
| src/services/classification_service.py | Create | AI classification |
| src/workflows/documents/classification_workflow.py | Create | Classification orchestration |

S4-005: SmartVault Integration

Story: As a preparer, I want documents from SmartVault automatically retrieved, so that client uploads appear in our system without manual transfer.

Acceptance Criteria

  • OAuth 2.0 authentication to SmartVault API
  • Polling for new client uploads (or webhook if available)
  • Document retrieved and processed in our system
  • Completed documents pushed back to SmartVault
  • Folder structure maintained
  • Sync status visible to staff

Technical Design

See src/services/smartvault_service.py (service stub exists).

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/smartvault_service.py | Modify | Implement full API integration |
| src/workflows/documents/smartvault_sync_workflow.py | Create | Sync orchestration |

S4-006: SurePrep Integration

Story: As a preparer, I want document data extracted automatically via SurePrep, so that I don't have to manually key tax form data.

Acceptance Criteria

  • Documents pushed to SurePrep for OCR/extraction
  • Binder created per client/tax year
  • Extracted data retrieved as structured JSON
  • Confidence scores per field
  • Low-confidence fields flagged
  • Link maintained between extraction and source document

Technical Design

See src/services/sureprep_service.py (service stub exists).

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/sureprep_service.py | Modify | Implement full API integration |
| src/workflows/documents/sureprep_extraction_workflow.py | Create | Extraction orchestration |

S4-007: Document Checklist Management

Story: As a preparer, I want a checklist of expected documents based on prior year, so that I can track what's missing.

Acceptance Criteria

  • Prior year documents inform current year checklist
  • Checklist items marked as received when uploaded
  • Missing items visible to preparer and client
  • Automated reminders for missing documents
  • Checklist customizable per client
  • Completion percentage calculated

Technical Design

Domain Model

# src/domain/checklist.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from uuid import UUID

from src.domain.document import DocumentType

@dataclass
class ChecklistItem:
    """Expected document in checklist."""
    id: UUID
    client_id: UUID
    tax_year: int
    document_type: DocumentType
    description: str
    is_required: bool
    source: str                     # "prior_year", "manual", "rule"
    document_id: Optional[UUID]     # Linked when received
    received_at: Optional[datetime]
    reminder_sent_at: Optional[datetime]
    created_at: datetime
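The completion-percentage criterion follows directly from the model: an item counts as received once a `document_id` is linked. The helper below is a sketch with a trimmed-down item view; counting only required items is an assumption about how the percentage is defined.

```python
# Hypothetical completion-percentage calculation over checklist items.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChecklistItemView:
    """Minimal view of a checklist item for this sketch."""
    is_required: bool
    document_id: Optional[str] = None  # set once the document is received

def completion_percentage(items: list) -> float:
    """Percentage of required items received; 100.0 when nothing is required."""
    required = [i for i in items if i.is_required]
    if not required:
        return 100.0
    received = sum(1 for i in required if i.document_id is not None)
    return round(100.0 * received / len(required), 1)
```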

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/domain/checklist.py | Create | Checklist domain entities |
| src/repositories/checklist_repository.py | Create | Checklist data access |
| src/workflows/documents/checklist_workflow.py | Create | Checklist management |
| src/api/routes/checklist.py | Create | Checklist API endpoints |

S4-008: Manual Extraction Correction

Story: As a preparer, I want to correct extracted data, so that errors in OCR/extraction can be fixed.

Acceptance Criteria

  • Side-by-side view: source document + extracted fields
  • Inline editing of extracted values
  • Correction logged with before/after
  • AI feedback loop (corrections improve future extractions)
  • Bulk correction support for repeated errors
  • Corrected flag on extraction record
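The before/after logging criterion above amounts to a field-level diff between the raw extraction and the preparer's corrections. A minimal sketch, assuming both are flat JSON-style dicts:

```python
# Hypothetical field-level diff for the correction audit log.
def correction_diff(raw: dict, corrected: dict) -> dict:
    """Map each changed field to its before/after pair for the audit trail."""
    return {
        field: {"before": raw.get(field), "after": value}
        for field, value in corrected.items()
        if raw.get(field) != value
    }
```

The same diff record is a natural input for the AI feedback loop, since it pairs each model output with its human-validated value.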

Technical Design

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/api/routes/extraction.py | Create | Extraction review API |
| src/api/schemas/extraction_schemas.py | Create | Extraction schemas |
| src/workflows/documents/correction_workflow.py | Create | Correction handling |

Implementation Order

  1. S4-001: Document Upload via Portal (P0)
     • Document domain model
     • Repository layer
     • Upload workflow with S3 integration
     • API endpoints for upload/download
  2. S4-003: Malware Scanning (P0)
     • ClamAV integration (dry_run mode for development)
     • Scan workflow
     • Quarantine handling
  3. S4-004: Document Classification (P0)
     • Classification service using Bedrock
     • Extraction domain model
     • Classification workflow
  4. S4-002: Document Upload via Email (P0)
     • Email parsing service
     • SES webhook integration
     • Email intake workflow
  5. S4-005: SmartVault Integration (P1)
  6. S4-006: SurePrep Integration (P1)
  7. S4-007: Document Checklist (P1)
  8. S4-008: Manual Extraction Correction (P1)

Test Coverage Requirements

| Component | Unit Tests | Integration Tests | E2E Tests |
|---|---|---|---|
| Document domain | 10+ | - | - |
| Document repository | - | 15+ | - |
| Upload workflow | 15+ | - | - |
| Document API | - | - | 10+ |
| Malware service | 5+ | - | - |
| Classification service | 10+ | - | - |
| Email intake | 10+ | - | - |

Dependencies

External Services

| Service | Purpose | Config Key |
|---|---|---|
| S3 | Document storage | config.s3 |
| SES | Email intake | config.ses |
| ClamAV | Malware scanning | config.malware |
| Bedrock | AI classification | config.bedrock |
| SmartVault | Client portal sync | config.smartvault |
| SurePrep | OCR/extraction | config.sureprep |

Internal Dependencies

| Component | Dependency |
|---|---|
| UploadWorkflow | AuroraService, S3Service, MalwareService |
| ClassificationWorkflow | AuroraService, BedrockService |
| EmailIntakeWorkflow | AuroraService, S3Service, EmailService |
| SmartVaultSyncWorkflow | AuroraService, S3Service, SmartVaultService |
| SurePrepExtractionWorkflow | AuroraService, SurePrepService |

Completion Checklist

  • S4-001 implemented and tested
  • S4-002 implemented and tested
  • S4-003 implemented and tested
  • S4-004 implemented and tested
  • S4-005 implemented and tested
  • S4-006 implemented and tested
  • S4-007 implemented and tested
  • S4-008 implemented and tested
  • All tests passing (unit, integration, e2e)
  • ARCHITECTURE.md updated
  • This document updated with completion status