
Sequence 4: Document Management

Status: Complete
Depends On: Sequences 1-3 (Foundation, Client Identity, Engagement)
Last Updated: 2024-12-23


Overview

This sequence implements the document management system. Clients can upload tax documents via web portal or email. Documents are scanned for malware, classified by AI, and key data is extracted. Integration with SmartVault (client portal) and SurePrep (OCR/extraction) is included.

Stories in This Sequence

| Story | Title | Status | Priority |
|---|---|---|---|
| S4-001 | Document Upload via Portal | Done | P0 |
| S4-002 | Document Upload via Email | Done | P0 |
| S4-003 | Malware Scanning | Done | P0 |
| S4-004 | Document Classification and Extraction | Done | P0 |
| S4-005 | SmartVault Integration | Done | P1 |
| S4-006 | SurePrep Integration | Done | P1 |
| S4-007 | Document Checklist Management | Done | P1 |
| S4-008 | Manual Extraction Correction | Done | P1 |

S4-001: Document Upload via Portal

Story: As a client, I want to upload documents through the web portal, so that I can submit my tax documents securely.

Acceptance Criteria

  • Drag-and-drop upload interface
  • Support for PDF, JPEG, PNG, HEIC
  • Support for DOCX, XLSX, CSV
  • Progress indicator during upload
  • Confirmation receipt displayed and emailed
  • Mobile camera capture supported
  • File size limit enforced (50MB)
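The type and size criteria above can be sketched as a pre-upload validation step. This is a minimal illustration, not the actual implementation; the MIME types and the 50 MB limit mirror the acceptance criteria, and the function name `validate_upload` is hypothetical.

```python
# Hypothetical pre-upload validation for the acceptance criteria above.
ALLOWED_CONTENT_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/heic",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # DOCX
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",        # XLSX
    "text/csv",
}
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # 50 MB limit from the criteria

def validate_upload(content_type: str, file_size_bytes: int) -> list[str]:
    """Return a list of validation errors; an empty list means the upload is acceptable."""
    errors = []
    if content_type not in ALLOWED_CONTENT_TYPES:
        errors.append(f"unsupported content type: {content_type}")
    if file_size_bytes > MAX_FILE_SIZE_BYTES:
        errors.append(f"file exceeds 50 MB limit ({file_size_bytes} bytes)")
    return errors
```

Rejections would surface in the portal UI before any bytes reach the staging bucket.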

Technical Design

Domain Model

# src/domain/document.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID

class DocumentType(str, Enum):
    """Types of tax documents."""
    W2 = "w2"
    W2_G = "w2_g"
    FORM_1099_INT = "1099_int"
    FORM_1099_DIV = "1099_div"
    FORM_1099_B = "1099_b"
    FORM_1099_MISC = "1099_misc"
    FORM_1099_NEC = "1099_nec"
    FORM_1099_R = "1099_r"
    FORM_1099_G = "1099_g"
    FORM_1099_K = "1099_k"
    FORM_1098 = "1098"
    FORM_1098_E = "1098_e"
    FORM_1098_T = "1098_t"
    K1_1065 = "k1_1065"
    K1_1120S = "k1_1120s"
    K1_1041 = "k1_1041"
    BROKERAGE_STATEMENT = "brokerage_statement"
    BANK_STATEMENT = "bank_statement"
    DRIVERS_LICENSE = "drivers_license"
    PRIOR_YEAR_RETURN = "prior_year_return"
    PROPERTY_TAX = "property_tax"
    CHARITABLE_RECEIPT = "charitable_receipt"
    MEDICAL_EXPENSE = "medical_expense"
    BUSINESS_EXPENSE = "business_expense"
    OTHER = "other"
    UNKNOWN = "unknown"

class DocumentStatus(str, Enum):
    """Document processing status."""
    UPLOADING = "uploading"
    SCANNING = "scanning"           # Malware scan
    QUARANTINED = "quarantined"     # Failed malware scan
    CLASSIFYING = "classifying"     # AI classification
    EXTRACTING = "extracting"       # OCR/extraction
    REVIEW_REQUIRED = "review_required"  # Low confidence
    PROCESSED = "processed"         # Ready for use
    REJECTED = "rejected"           # Manual rejection
    ARCHIVED = "archived"           # Post-filing archive

@dataclass
class Document:
    """Client-uploaded document."""
    id: UUID
    client_id: UUID
    tax_year: int
    original_filename: str
    content_type: str               # MIME type
    file_size_bytes: int
    s3_key: str                     # S3 storage key
    document_type: DocumentType
    document_type_confidence: Optional[float]
    status: DocumentStatus
    upload_source: str              # "portal", "email", "smartvault"
    uploaded_at: datetime
    scanned_at: Optional[datetime]
    scan_result: Optional[str]      # "clean", "infected", "error"
    classified_at: Optional[datetime]
    extracted_at: Optional[datetime]
    extraction_id: Optional[UUID]   # Link to extraction record
    notes: Optional[str]
    created_at: datetime
    updated_at: datetime

Database Tables

-- document_type enum
CREATE TYPE document_type_enum AS ENUM (
    'w2', 'w2_g', '1099_int', '1099_div', '1099_b', '1099_misc',
    '1099_nec', '1099_r', '1099_g', '1099_k', '1098', '1098_e',
    '1098_t', 'k1_1065', 'k1_1120s', 'k1_1041', 'brokerage_statement',
    'bank_statement', 'drivers_license', 'prior_year_return',
    'property_tax', 'charitable_receipt', 'medical_expense',
    'business_expense', 'other', 'unknown'
);

-- document_status enum
CREATE TYPE document_status_enum AS ENUM (
    'uploading', 'scanning', 'quarantined', 'classifying',
    'extracting', 'review_required', 'processed', 'rejected', 'archived'
);

-- document table
CREATE TABLE document (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    client_id UUID NOT NULL REFERENCES client(id),
    tax_year INTEGER NOT NULL,
    original_filename VARCHAR(255) NOT NULL,
    content_type VARCHAR(100) NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    document_type document_type_enum NOT NULL DEFAULT 'unknown',
    document_type_confidence DECIMAL(5,4),
    status document_status_enum NOT NULL DEFAULT 'uploading',
    upload_source VARCHAR(50) NOT NULL,
    uploaded_at TIMESTAMP NOT NULL DEFAULT NOW(),
    scanned_at TIMESTAMP,
    scan_result VARCHAR(50),
    classified_at TIMESTAMP,
    extracted_at TIMESTAMP,
    extraction_id UUID,
    notes TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    deleted_at TIMESTAMP
);

CREATE INDEX idx_document_client_id ON document(client_id);
CREATE INDEX idx_document_tax_year ON document(client_id, tax_year);
CREATE INDEX idx_document_status ON document(status);
CREATE INDEX idx_document_type ON document(document_type);

S3 Storage Structure

documents/
└── {client_id}/
    └── {tax_year}/
        ├── uploads/
        │   └── {document_id}/{original_filename}
        └── processed/
            └── {document_id}/
                ├── original.pdf
                ├── thumbnail.jpg
                └── extracted.json
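The layout above maps directly to key-builder helpers. These function names are illustrative, not taken from the codebase; only the key layout comes from the diagram.

```python
# Hypothetical helpers that build S3 keys following the storage layout above.
from uuid import UUID

def upload_key(client_id: UUID, tax_year: int, document_id: UUID, filename: str) -> str:
    """Key for a raw client upload, preserving the original filename."""
    return f"documents/{client_id}/{tax_year}/uploads/{document_id}/{filename}"

def processed_key(client_id: UUID, tax_year: int, document_id: UUID, artifact: str) -> str:
    """Key for a processed artifact: original.pdf, thumbnail.jpg, or extracted.json."""
    return f"documents/{client_id}/{tax_year}/processed/{document_id}/{artifact}"
```

Centralizing key construction keeps the layout consistent across the upload, scan, and sync workflows.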

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/documents/upload | POST | Initiate multipart upload |
| /v1/documents/upload/{upload_id} | PUT | Upload file chunk |
| /v1/documents/upload/{upload_id}/complete | POST | Complete upload |
| /v1/documents/{id} | GET | Get document details |
| /v1/documents/{id}/download | GET | Get presigned download URL |
| /v1/documents/client/{client_id} | GET | List client documents |
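The three-step upload flow (initiate, upload chunks, complete) can be sketched as a small session manager with the storage backend held in memory, so the endpoint semantics are testable without S3. The class and method names are hypothetical illustrations of the endpoint table, not the actual route handlers.

```python
# Hypothetical in-memory model of the initiate/chunk/complete upload endpoints.
from dataclasses import dataclass, field
from uuid import UUID, uuid4

@dataclass
class UploadSession:
    upload_id: UUID
    filename: str
    chunks: list = field(default_factory=list)
    completed: bool = False

class UploadManager:
    def __init__(self):
        self._sessions: dict = {}

    def initiate(self, filename: str) -> UUID:
        # POST /v1/documents/upload
        session = UploadSession(upload_id=uuid4(), filename=filename)
        self._sessions[session.upload_id] = session
        return session.upload_id

    def put_chunk(self, upload_id: UUID, chunk: bytes) -> None:
        # PUT /v1/documents/upload/{upload_id}
        self._sessions[upload_id].chunks.append(chunk)

    def complete(self, upload_id: UUID) -> bytes:
        # POST /v1/documents/upload/{upload_id}/complete
        session = self._sessions[upload_id]
        session.completed = True
        return b"".join(session.chunks)
```

In production the chunks would go to S3 multipart-upload parts rather than memory; the session bookkeeping is the part this sketch illustrates.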

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/domain/document.py | Create | Document domain entities |
| src/repositories/document_repository.py | Create | Document data access |
| src/workflows/documents/upload_workflow.py | Create | Upload orchestration |
| src/api/routes/documents.py | Create | API endpoints |
| src/api/schemas/document_schemas.py | Create | Pydantic models |
| tests/integration/conftest.py | Modify | Add document schema |
| tests/e2e/conftest.py | Modify | Add document schema |

S4-002: Document Upload via Email

Story: As a client, I want to email documents to the firm, so that I can submit documents using familiar tools.

Acceptance Criteria

  • Dedicated intake email address monitored
  • Account number + tax year parsed from subject/body
  • Attachments extracted and processed
  • Unmatched emails queued for manual routing
  • Confirmation reply sent
  • MSG/EML format handling

Technical Design

Email Processing Flow

Incoming Email (SES)
    ├── 1. Store raw email in S3
    ├── 2. Parse subject/body for account number + tax year
    ├── 3. If matched → Extract attachments → Document upload workflow
    └── 4. If unmatched → Queue for manual routing
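Step 2 of the flow above hinges on pulling an account number and tax year out of free-form text. A minimal parsing sketch follows; the `ACCT-` prefix convention is an assumption for illustration, not a confirmed format.

```python
# Hypothetical subject/body parser for email intake (step 2 above).
import re
from typing import Optional, Tuple

ACCOUNT_RE = re.compile(r"\bACCT-(\d{4,10})\b", re.IGNORECASE)  # assumed format
TAX_YEAR_RE = re.compile(r"\b(20\d{2})\b")

def parse_intake_email(subject: str, body: str) -> Optional[Tuple[str, int]]:
    """Return (account_number, tax_year) if both are found; None routes to manual queue."""
    text = f"{subject}\n{body}"
    acct = ACCOUNT_RE.search(text)
    year = TAX_YEAR_RE.search(text)
    if acct and year:
        return acct.group(1), int(year.group(1))
    return None
```

Returning `None` rather than a partial match keeps ambiguous emails in the manual-routing queue, matching the unmatched branch of the flow.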

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/email_intake_service.py | Create | Email parsing and routing |
| src/workflows/documents/email_intake_workflow.py | Create | Email processing |
| src/api/routes/webhooks.py | Modify | Add SES webhook handler |

S4-003: Malware Scanning

Story: As a system administrator, I want all uploads scanned for malware, so that the system is protected from malicious files.

Acceptance Criteria

  • ClamAV or equivalent integration
  • Scan occurs before storage in production bucket
  • Infected files quarantined with alert
  • Scan results logged
  • Client notified if file rejected
  • Bypass not possible

Technical Design

Scan Flow

Upload → Staging Bucket → ClamAV Scan
    ├── Clean → Move to production bucket → Continue processing
    └── Infected → Quarantine bucket → Alert → Notify client
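The branch logic above can be isolated from the ClamAV invocation itself, so the routing is testable without a scanner. This is a sketch only; the error-handling policy (retry from staging) is an assumption, and the actual ClamAV call is out of scope here.

```python
# Hypothetical routing of scan verdicts per the scan flow above.
from enum import Enum

class ScanVerdict(str, Enum):
    CLEAN = "clean"
    INFECTED = "infected"
    ERROR = "error"

def route_scan_result(verdict: ScanVerdict) -> dict:
    """Map a scan verdict to the next bucket and side effects."""
    if verdict is ScanVerdict.CLEAN:
        return {"bucket": "production", "alert": False, "notify_client": False}
    if verdict is ScanVerdict.INFECTED:
        return {"bucket": "quarantine", "alert": True, "notify_client": True}
    # One plausible policy (assumption): scan errors stay in staging for retry,
    # with an operator alert, rather than being promoted or quarantined.
    return {"bucket": "staging", "alert": True, "notify_client": False}
```

Keeping the verdict-to-action mapping pure makes the "bypass not possible" criterion auditable: promotion to the production bucket only ever happens through the `CLEAN` branch.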

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/malware_service.py | Create | ClamAV integration |
| src/workflows/documents/scan_workflow.py | Create | Scan orchestration |

S4-004: Document Classification and Extraction

Story: As a preparer, I want uploaded documents automatically classified and key data extracted, so that I don't have to manually identify document types.

Acceptance Criteria

  • AI classifies document type (W-2, 1099 series, 1098 series, K-1, etc.)
  • All 26 document types from the schema supported
  • Confidence score stored
  • Low-confidence items flagged for review
  • Classification editable by preparer
  • Form-specific extraction (W-2, 1099s, 1098s, K-1s)
  • Low-confidence extractions flagged for human review

Technical Design

Domain Model

# src/domain/extraction.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from uuid import UUID

@dataclass
class Extraction:
    """Extracted data from a document."""
    id: UUID
    document_id: UUID
    extraction_source: str          # "internal_ai", "sureprep"
    raw_extraction: dict            # Full extraction JSON
    validated_extraction: Optional[dict]  # After human review
    confidence_score: float         # Overall confidence
    fields_requiring_review: List[str]
    reviewed_by: Optional[UUID]
    reviewed_at: Optional[datetime]
    created_at: datetime
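One way `fields_requiring_review` and `confidence_score` might be derived from per-field confidences is sketched below. The 0.85 threshold and the min-aggregate are assumptions for illustration, not values from the spec.

```python
# Hypothetical derivation of review flags from per-field confidences.
REVIEW_THRESHOLD = 0.85  # assumed cutoff, not from the spec

def fields_requiring_review(field_confidences: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return field names whose extraction confidence falls below the threshold."""
    return sorted(name for name, conf in field_confidences.items() if conf < threshold)

def overall_confidence(field_confidences: dict) -> float:
    """One plausible aggregate: the weakest field dominates the document score."""
    return min(field_confidences.values(), default=0.0)
```

A min-aggregate is conservative by design: a single doubtful field pushes the whole document into `review_required` rather than letting a strong average mask it.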

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/domain/extraction.py | Create | Extraction domain entities |
| src/repositories/extraction_repository.py | Create | Extraction data access |
| src/services/classification_service.py | Create | AI classification |
| src/workflows/documents/classification_workflow.py | Create | Classification orchestration |

S4-005: SmartVault Integration

Story: As a preparer, I want documents from SmartVault automatically retrieved, so that client uploads appear in our system without manual transfer.

Acceptance Criteria

  • OAuth 2.0 authentication to SmartVault API
  • Polling for new client uploads (or webhook if available)
  • Document retrieved and processed in our system
  • Completed documents pushed back to SmartVault
  • Folder structure maintained
  • Sync status visible to staff

Technical Design

See src/services/smartvault_service.py (service stub exists).

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/smartvault_service.py | Modify | Implement full API integration |
| src/workflows/documents/smartvault_sync_workflow.py | Create | Sync orchestration |

S4-006: SurePrep Integration

Story: As a preparer, I want document data extracted automatically via SurePrep, so that I don't have to manually key tax form data.

Acceptance Criteria

  • Documents pushed to SurePrep for OCR/extraction
  • Binder created per client/tax year
  • Extracted data retrieved as structured JSON
  • Confidence scores per field
  • Low-confidence fields flagged
  • Link maintained between extraction and source document

Technical Design

See src/services/sureprep_service.py (service stub exists).

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/services/sureprep_service.py | Modify | Implement full API integration |
| src/workflows/documents/sureprep_extraction_workflow.py | Create | Extraction orchestration |

S4-007: Document Checklist Management

Story: As a preparer, I want a checklist of expected documents based on prior year, so that I can track what's missing.

Acceptance Criteria

  • Prior year documents inform current year checklist
  • Checklist items marked as received when uploaded
  • Missing items visible to preparer and client
  • Automated reminders for missing documents
  • Checklist customizable per client
  • Completion percentage calculated

Technical Design

Domain Model

# src/domain/checklist.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from uuid import UUID

from src.domain.document import DocumentType

@dataclass
class ChecklistItem:
    """Expected document in checklist."""
    id: UUID
    client_id: UUID
    tax_year: int
    document_type: DocumentType
    description: str
    is_required: bool
    source: str                     # "prior_year", "manual", "rule"
    document_id: Optional[UUID]     # Linked when received
    received_at: Optional[datetime]
    reminder_sent_at: Optional[datetime]
    created_at: datetime
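The completion-percentage criterion follows directly from the model: an item counts as received once a `document_id` is linked. The helper below is a sketch with a trimmed-down item view; counting only required items is an assumption about how the percentage is defined.

```python
# Hypothetical completion-percentage calculation over checklist items.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChecklistItemView:
    """Minimal view of a checklist item for this sketch."""
    is_required: bool
    document_id: Optional[str] = None  # set once the document is received

def completion_percentage(items: list) -> float:
    """Percentage of required items received; 100.0 when nothing is required."""
    required = [i for i in items if i.is_required]
    if not required:
        return 100.0
    received = sum(1 for i in required if i.document_id is not None)
    return round(100.0 * received / len(required), 1)
```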

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/domain/checklist.py | Create | Checklist domain entities |
| src/repositories/checklist_repository.py | Create | Checklist data access |
| src/workflows/documents/checklist_workflow.py | Create | Checklist management |
| src/api/routes/checklist.py | Create | Checklist API endpoints |

S4-008: Manual Extraction Correction

Story: As a preparer, I want to correct extracted data, so that errors in OCR/extraction can be fixed.

Acceptance Criteria

  • Side-by-side view: source document + extracted fields
  • Inline editing of extracted values
  • Correction logged with before/after
  • AI feedback loop (corrections improve future extractions)
  • Bulk correction support for repeated errors
  • Corrected flag on extraction record
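The before/after logging criterion above amounts to a field-level diff between the raw extraction and the preparer's corrections. A minimal sketch, assuming both are flat JSON-style dicts:

```python
# Hypothetical field-level diff for the correction audit log.
def correction_diff(raw: dict, corrected: dict) -> dict:
    """Map each changed field to its before/after pair for the audit trail."""
    return {
        field: {"before": raw.get(field), "after": value}
        for field, value in corrected.items()
        if raw.get(field) != value
    }
```

The same diff record is a natural input for the AI feedback loop, since it pairs each model output with its human-validated value.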

Technical Design

Files to Create/Modify

| File | Action | Description |
|---|---|---|
| src/api/routes/extraction.py | Create | Extraction review API |
| src/api/schemas/extraction_schemas.py | Create | Extraction schemas |
| src/workflows/documents/correction_workflow.py | Create | Correction handling |

Implementation Order

  1. S4-001: Document Upload via Portal (P0)
     • Document domain model
     • Repository layer
     • Upload workflow with S3 integration
     • API endpoints for upload/download
  2. S4-003: Malware Scanning (P0)
     • ClamAV integration (dry_run mode for development)
     • Scan workflow
     • Quarantine handling
  3. S4-004: Document Classification (P0)
     • Classification service using Bedrock
     • Extraction domain model
     • Classification workflow
  4. S4-002: Document Upload via Email (P0)
     • Email parsing service
     • SES webhook integration
     • Email intake workflow
  5. S4-005: SmartVault Integration (P1)
  6. S4-006: SurePrep Integration (P1)
  7. S4-007: Document Checklist (P1)
  8. S4-008: Manual Extraction Correction (P1)

Test Coverage Requirements

| Component | Unit Tests | Integration Tests | E2E Tests |
|---|---|---|---|
| Document domain | 10+ | - | - |
| Document repository | - | 15+ | - |
| Upload workflow | 15+ | - | - |
| Document API | - | - | 10+ |
| Malware service | 5+ | - | - |
| Classification service | 10+ | - | - |
| Email intake | 10+ | - | - |

Dependencies

External Services

| Service | Purpose | Config Key |
|---|---|---|
| S3 | Document storage | config.s3 |
| SES | Email intake | config.ses |
| ClamAV | Malware scanning | config.malware |
| Bedrock | AI classification | config.bedrock |
| SmartVault | Client portal sync | config.smartvault |
| SurePrep | OCR/extraction | config.sureprep |

Internal Dependencies

| Component | Dependency |
|---|---|
| UploadWorkflow | AuroraService, S3Service, MalwareService |
| ClassificationWorkflow | AuroraService, BedrockService |
| EmailIntakeWorkflow | AuroraService, S3Service, EmailService |
| SmartVaultSyncWorkflow | AuroraService, S3Service, SmartVaultService |
| SurePrepExtractionWorkflow | AuroraService, SurePrepService |

Completion Checklist

  • S4-001 implemented and tested
  • S4-002 implemented and tested
  • S4-003 implemented and tested
  • S4-004 implemented and tested
  • S4-005 implemented and tested
  • S4-006 implemented and tested
  • S4-007 implemented and tested
  • S4-008 implemented and tested
  • All tests passing (unit, integration, e2e)
  • ARCHITECTURE.md updated
  • This document updated with completion status