Sequence 4: Document Management
Status: Complete
Depends On: Sequences 1-3 (Foundation, Client Identity, Engagement)
Last Updated: 2024-12-23
Overview
This sequence implements the document management system. Clients can upload tax documents through the web portal or by email. Each document is scanned for malware, classified by AI, and has its key data extracted. The sequence also integrates with SmartVault (client portal) and SurePrep (OCR/extraction).
Stories in This Sequence
| Story | Title | Status | Priority |
|---|---|---|---|
| S4-001 | Document Upload via Portal | Done | P0 |
| S4-002 | Document Upload via Email | Done | P0 |
| S4-003 | Malware Scanning | Done | P0 |
| S4-004 | Document Classification and Extraction | Done | P0 |
| S4-005 | SmartVault Integration | Done | P1 |
| S4-006 | SurePrep Integration | Done | P1 |
| S4-007 | Document Checklist Management | Done | P1 |
| S4-008 | Manual Extraction Correction | Done | P1 |
S4-001: Document Upload via Portal
Story: As a client, I want to upload documents through the web portal, so that I can submit my tax documents securely.
Acceptance Criteria
- Drag-and-drop upload interface
- Support for PDF, JPEG, PNG, HEIC
- Support for DOCX, XLSX, CSV
- Progress indicator during upload
- Confirmation receipt displayed and emailed
- Mobile camera capture supported
- File size limit enforced (50MB)
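The type and size criteria above could be enforced with a check along these lines. This is a minimal sketch, not the project's implementation: `validate_upload`, `ALLOWED_TYPES`, and the reading of 50MB as 50 × 1024 × 1024 bytes are all assumptions.

```python
from pathlib import PurePath

# Assumption: "50MB" is read as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024

# Hypothetical allow-list mapping each accepted extension to its MIME type.
ALLOWED_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".heic": "image/heic",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".csv": "text/csv",
}


def validate_upload(filename: str, file_size_bytes: int) -> str:
    """Return the expected MIME type, or raise ValueError if the upload is not allowed."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_TYPES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    if file_size_bytes <= 0:
        raise ValueError("File is empty")
    if file_size_bytes > MAX_FILE_SIZE_BYTES:
        raise ValueError("File exceeds the 50MB limit")
    return ALLOWED_TYPES[suffix]
```

An extension check only filters obvious mistakes; a real service would likely also inspect the file content, since a client controls the filename.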
Technical Design
Domain Model
# src/domain/document.py
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID


class DocumentType(str, Enum):
    """Types of tax documents."""

    W2 = "w2"
    W2_G = "w2_g"
    FORM_1099_INT = "1099_int"
    FORM_1099_DIV = "1099_div"
    FORM_1099_B = "1099_b"
    FORM_1099_MISC = "1099_misc"
    FORM_1099_NEC = "1099_nec"
    FORM_1099_R = "1099_r"
    FORM_1099_G = "1099_g"
    FORM_1099_K = "1099_k"
    FORM_1098 = "1098"
    FORM_1098_E = "1098_e"
    FORM_1098_T = "1098_t"
    K1_1065 = "k1_1065"
    K1_1120S = "k1_1120s"
    K1_1041 = "k1_1041"
    BROKERAGE_STATEMENT = "brokerage_statement"
    BANK_STATEMENT = "bank_statement"
    DRIVERS_LICENSE = "drivers_license"
    PRIOR_YEAR_RETURN = "prior_year_return"
    PROPERTY_TAX = "property_tax"
    CHARITABLE_RECEIPT = "charitable_receipt"
    MEDICAL_EXPENSE = "medical_expense"
    BUSINESS_EXPENSE = "business_expense"
    OTHER = "other"
    UNKNOWN = "unknown"


class DocumentStatus(str, Enum):
    """Document processing status."""

    UPLOADING = "uploading"
    SCANNING = "scanning"  # Malware scan
    QUARANTINED = "quarantined"  # Failed malware scan
    CLASSIFYING = "classifying"  # AI classification
    EXTRACTING = "extracting"  # OCR/extraction
    REVIEW_REQUIRED = "review_required"  # Low confidence
    PROCESSED = "processed"  # Ready for use
    REJECTED = "rejected"  # Manual rejection
    ARCHIVED = "archived"  # Post-filing archive


class Document:
    """Client-uploaded document."""

    id: UUID
    client_id: UUID
    tax_year: int
    original_filename: str
    content_type: str  # MIME type
    file_size_bytes: int
    s3_key: str  # S3 storage key
    document_type: DocumentType
    document_type_confidence: Optional[float]
    status: DocumentStatus
    upload_source: str  # "portal", "email", "smartvault"
    uploaded_at: datetime
    scanned_at: Optional[datetime]
    scan_result: Optional[str]  # "clean", "infected", "error"
    classified_at: Optional[datetime]
    extracted_at: Optional[datetime]
    extraction_id: Optional[UUID]  # Link to extraction record
    notes: Optional[str]
    created_at: datetime
    updated_at: datetime
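The status comments suggest a pipeline order, but the source does not state which transitions are legal. The sketch below shows one way a guard could look; the `ALLOWED_TRANSITIONS` map and the `transition` helper are assumptions made for illustration, and the enum is repeated only to keep the example self-contained.

```python
from enum import Enum
from typing import Dict, Set


class DocumentStatus(str, Enum):
    UPLOADING = "uploading"
    SCANNING = "scanning"
    QUARANTINED = "quarantined"
    CLASSIFYING = "classifying"
    EXTRACTING = "extracting"
    REVIEW_REQUIRED = "review_required"
    PROCESSED = "processed"
    REJECTED = "rejected"
    ARCHIVED = "archived"


S = DocumentStatus

# Assumed transition map: the source lists the statuses but not the allowed moves.
ALLOWED_TRANSITIONS: Dict[DocumentStatus, Set[DocumentStatus]] = {
    S.UPLOADING: {S.SCANNING},
    S.SCANNING: {S.CLASSIFYING, S.QUARANTINED},
    S.QUARANTINED: {S.REJECTED},
    S.CLASSIFYING: {S.EXTRACTING, S.REVIEW_REQUIRED},
    S.EXTRACTING: {S.PROCESSED, S.REVIEW_REQUIRED},
    S.REVIEW_REQUIRED: {S.EXTRACTING, S.PROCESSED, S.REJECTED},
    S.PROCESSED: {S.ARCHIVED},
    S.REJECTED: set(),
    S.ARCHIVED: set(),
}


def transition(current: DocumentStatus, new: DocumentStatus) -> DocumentStatus:
    """Return the new status, or raise ValueError if the move is not in the map."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {new.value}")
    return new
```

Centralizing the map in one place means a workflow cannot, for example, move a quarantined file straight to processed without the change being visible in review.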
Database Tables
-- document_type enum
CREATE TYPE document_type_enum AS ENUM (
    'w2', 'w2_g', '1099_int', '1099_div', '1099_b', '1099_misc',
    '1099_nec', '1099_r', '1099_g', '1099_k', '1098', '1098_e',
    '1098_t', 'k1_1065', 'k1_1120s', 'k1_1041', 'brokerage_statement',
    'bank_statement', 'drivers_license', 'prior_year_return',
    'property_tax', 'charitable_receipt', 'medical_expense',
    'business_expense', 'other', 'unknown'
);

-- document_status enum
CREATE TYPE document_status_enum AS ENUM (
    'uploading', 'scanning', 'quarantined', 'classifying',
    'extracting', 'review_required', 'processed', 'rejected', 'archived'
);

-- document table
CREATE TABLE document (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    client_id UUID NOT NULL REFERENCES client(id),
    tax_year INTEGER NOT NULL,
    original_filename VARCHAR(255) NOT NULL,
    content_type VARCHAR(100) NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    document_type document_type_enum NOT NULL DEFAULT 'unknown',
    document_type_confidence DECIMAL(5,4),
    status document_status_enum NOT NULL DEFAULT 'uploading',
    upload_source VARCHAR(50) NOT NULL,
    uploaded_at TIMESTAMP NOT NULL DEFAULT NOW(),
    scanned_at TIMESTAMP,
    scan_result VARCHAR(50),
    classified_at TIMESTAMP,
    extracted_at TIMESTAMP,
    extraction_id UUID,
    notes TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    deleted_at TIMESTAMP
);

CREATE INDEX idx_document_client_id ON document(client_id);
CREATE INDEX idx_document_tax_year ON document(client_id, tax_year);
CREATE INDEX idx_document_status ON document(status);
CREATE INDEX idx_document_type ON document(document_type);
S3 Storage Structure
documents/
└── {client_id}/
    └── {tax_year}/
        ├── uploads/
        │   └── {document_id}/{original_filename}
        └── processed/
            └── {document_id}/
                ├── original.pdf
                ├── thumbnail.jpg
                └── extracted.json
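Keys following this layout could be built by small helpers such as the ones below. This is a sketch under assumptions: the function names are hypothetical, and stripping directory components from the client-supplied filename is a precaution added here, not something the source specifies.

```python
from pathlib import PurePosixPath
from uuid import UUID

# Artifact names taken from the processed/ layout above.
PROCESSED_ARTIFACTS = {"original.pdf", "thumbnail.jpg", "extracted.json"}


def upload_key(client_id: UUID, tax_year: int, document_id: UUID, original_filename: str) -> str:
    """Build the key for a raw upload, keeping only the final filename component."""
    # Normalize Windows separators, then drop any directory parts the client sent.
    safe_name = PurePosixPath(original_filename.replace("\\", "/")).name
    if safe_name in ("", ".", ".."):
        raise ValueError("Filename has no usable final component")
    return f"documents/{client_id}/{tax_year}/uploads/{document_id}/{safe_name}"


def processed_key(client_id: UUID, tax_year: int, document_id: UUID, artifact: str) -> str:
    """Build the key for one of the known processed artifacts."""
    if artifact not in PROCESSED_ARTIFACTS:
        raise ValueError(f"Unknown artifact: {artifact}")
    return f"documents/{client_id}/{tax_year}/processed/{document_id}/{artifact}"
```

Because the `{document_id}` segment is unique per upload, two files with the same original name never collide under the same client and tax year.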
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/documents/upload | POST | Initiate multipart upload |
| /v1/documents/upload/{upload_id} | PUT | Upload file chunk |
| /v1/documents/upload/{upload_id}/complete | POST | Complete upload |
| /v1/documents/{id} | GET | Get document details |
| /v1/documents/{id}/download | GET | Get presigned download URL |
| /v1/documents/client/{client_id} | GET | List client documents |
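The three upload endpoints imply an initiate, chunk, complete sequence. The sketch below only plans that sequence; it makes no HTTP calls. The 5 MiB chunk size, the helper names, and the idea that `upload_id` comes back from the initiation call are assumptions, since the source does not define request or response bodies.

```python
from typing import List, Tuple

# Assumed chunk size; the source does not specify one.
DEFAULT_CHUNK_SIZE_BYTES = 5 * 1024 * 1024


def plan_chunks(file_size_bytes: int, chunk_size: int = DEFAULT_CHUNK_SIZE_BYTES) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets for each chunk; end is exclusive."""
    if file_size_bytes <= 0:
        raise ValueError("file_size_bytes must be positive")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, file_size_bytes))
        for start in range(0, file_size_bytes, chunk_size)
    ]


def request_sequence(upload_id: str, file_size_bytes: int) -> List[Tuple[str, str]]:
    """List the (method, path) calls a client would make to upload one file."""
    calls = [("POST", "/v1/documents/upload")]
    calls += [("PUT", f"/v1/documents/upload/{upload_id}") for _ in plan_chunks(file_size_bytes)]
    calls.append(("POST", f"/v1/documents/upload/{upload_id}/complete"))
    return calls
```

A 12 MiB file, for instance, would produce one initiation call, three chunk uploads, and one completion call under these assumptions.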
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/domain/document.py | Create | Document domain entities |
| src/repositories/document_repository.py | Create | Document data access |
| src/workflows/documents/upload_workflow.py | Create | Upload orchestration |
| src/api/routes/documents.py | Create | API endpoints |
| src/api/schemas/document_schemas.py | Create | Pydantic models |
| tests/integration/conftest.py | Modify | Add document schema |
| tests/e2e/conftest.py | Modify | Add document schema |
S4-002: Document Upload via Email
Story: As a client, I want to email documents to the firm, so that I can submit documents using familiar tools.
Acceptance Criteria
- Dedicated intake email address monitored
- Account number + tax year parsed from subject/body
- Attachments extracted and processed
- Unmatched emails queued for manual routing
- Confirmation reply sent
- MSG/EML format handling
Technical Design
Email Processing Flow
Incoming Email (SES)
│
├── 1. Store raw email in S3
│
├── 2. Parse subject/body for account number + tax year
│
├── 3. If matched → Extract attachments → Document upload workflow
│
└── 4. If unmatched → Queue for manual routing
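Steps 2 and 3 of the flow can be sketched with the standard library alone. The subject conventions encoded in `ACCOUNT_RE` and `TAX_YEAR_RE` are hypothetical, since the source does not define the account number format; a `None` result corresponds to the manual routing queue in step 4.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List, Optional, Tuple

# Hypothetical conventions: "Acct 123456", "Account #123456", "Account no. 123456".
ACCOUNT_RE = re.compile(
    r"\b(?:acct|account)\s*(?:number|no\.?|#)?\s*[:#]?\s*(\d{4,10})\b", re.IGNORECASE
)
# Assumed: the tax year is written as a standalone four-digit year in this century.
TAX_YEAR_RE = re.compile(r"\b(20\d{2})\b")


def parse_routing(subject: str, body: str) -> Optional[Tuple[str, int]]:
    """Return (account_number, tax_year), or None if either is missing."""
    text = f"{subject}\n{body}"
    account = ACCOUNT_RE.search(text)
    if account is None:
        return None
    # Drop the matched account text so its digits are not mistaken for a year.
    remainder = text[: account.start()] + text[account.end():]
    year = TAX_YEAR_RE.search(remainder)
    if year is None:
        return None
    return account.group(1), int(year.group(1))


def extract_attachments(raw_email: bytes) -> List[Tuple[str, bytes]]:
    """Return (filename, content) for each attachment in a raw RFC 5322 message."""
    message: EmailMessage = message_from_bytes(raw_email, policy=policy.default)
    attachments = []
    for part in message.iter_attachments():
        payload = part.get_payload(decode=True)
        if payload is None:
            # Nested messages (for example forwarded .eml files) need separate handling.
            continue
        attachments.append((part.get_filename() or "attachment", payload))
    return attachments
```

Each extracted attachment would then enter the same upload workflow as a portal upload, with `upload_source` set to "email".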
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/services/email_intake_service.py | Create | Email parsing and routing |
| src/workflows/documents/email_intake_workflow.py | Create | Email processing |
| src/api/routes/webhooks.py | Modify | Add SES webhook handler |
S4-003: Malware Scanning
Story: As a system administrator, I want all uploads scanned for malware, so that the system is protected from malicious files.
Acceptance Criteria
- ClamAV or equivalent integration
- Scan occurs before storage in production bucket
- Infected files quarantined with alert
- Scan results logged
- Client notified if file rejected
- Bypass not possible
Technical Design
Scan Flow
Upload → Staging Bucket → ClamAV Scan
│
├── Clean → Move to production bucket → Continue processing
│
└── Infected → Quarantine bucket → Alert → Notify client
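The branch in this flow could be driven by the exit code of the `clamscan` command-line scanner, which returns 0 when no virus is found, 1 when a virus is found, and 2 on error. The sketch below is one possible shape, not the project's `malware_service.py`: the bucket names and the `ScanOutcome` type are hypothetical, and treating errors as "stay in staging" is an assumption chosen to honor the no-bypass criterion.

```python
import subprocess
from dataclasses import dataclass

# Hypothetical bucket names, for illustration only.
STAGING_BUCKET = "documents-staging"
PRODUCTION_BUCKET = "documents-production"
QUARANTINE_BUCKET = "documents-quarantine"


@dataclass(frozen=True)
class ScanOutcome:
    scan_result: str    # "clean", "infected", or "error" (the values stored on Document)
    target_bucket: str  # Where the file should live after the scan
    notify_client: bool


def outcome_from_exit_code(exit_code: int) -> ScanOutcome:
    """Map a clamscan exit code to a routing decision."""
    if exit_code == 0:
        return ScanOutcome("clean", PRODUCTION_BUCKET, notify_client=False)
    if exit_code == 1:
        return ScanOutcome("infected", QUARANTINE_BUCKET, notify_client=True)
    # Fail closed: anything else leaves the file in staging so it cannot skip the scan.
    return ScanOutcome("error", STAGING_BUCKET, notify_client=False)


def scan_file(path: str) -> ScanOutcome:
    """Scan a local copy of the staged file; requires clamscan on the PATH."""
    completed = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
        check=False,  # Non-zero exit codes carry meaning, so do not raise on them
    )
    return outcome_from_exit_code(completed.returncode)
```

Keeping the exit-code mapping in its own pure function makes the routing decision testable without ClamAV installed.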
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/services/malware_service.py | Create | ClamAV integration |
| src/workflows/documents/scan_workflow.py | Create | Scan orchestration |
S4-004: Document Classification and Extraction
Story: As a preparer, I want uploaded documents automatically classified and key data extracted, so that I don't have to manually identify document types.
Acceptance Criteria
- AI classifies document type (W-2, 1099 series, 1098 series, K-1, etc.)
- All 26 document types defined in DocumentType supported
- Confidence score stored
- Low-confidence items flagged for review
- Classification editable by preparer
- Form-specific extraction (W-2, 1099s, 1098s, K-1s)
- Low-confidence extractions flagged for human review
Technical Design
Domain Model
# src/domain/extraction.py
from datetime import datetime
from typing import List, Optional
from uuid import UUID


class Extraction:
    """Extracted data from a document."""

    id: UUID
    document_id: UUID
    extraction_source: str  # "internal_ai", "sureprep"
    raw_extraction: dict  # Full extraction JSON
    validated_extraction: Optional[dict]  # After human review
    confidence_score: float  # Overall confidence
    fields_requiring_review: List[str]
    reviewed_by: Optional[UUID]
    reviewed_at: Optional[datetime]
    created_at: datetime
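The `confidence_score` and `fields_requiring_review` attributes could be derived from per-field confidences as sketched below. The 0.85 threshold, the use of the minimum field confidence as the overall score, and the `summarize_confidence` helper are all assumptions; the source specifies only that low-confidence items are flagged.

```python
from typing import Dict, List, Tuple

# Assumed cut-off; the source does not give a value.
FIELD_REVIEW_THRESHOLD = 0.85


def summarize_confidence(
    field_confidences: Dict[str, float],
    threshold: float = FIELD_REVIEW_THRESHOLD,
) -> Tuple[float, List[str], str]:
    """Return (confidence_score, fields_requiring_review, next_status)."""
    if not field_confidences:
        # Nothing was extracted, so a person has to look at the document.
        return 0.0, [], "review_required"
    fields_requiring_review = sorted(
        name for name, score in field_confidences.items() if score < threshold
    )
    # Assumption: the overall score is the weakest field, which is deliberately conservative.
    confidence_score = min(field_confidences.values())
    next_status = "review_required" if fields_requiring_review else "processed"
    return confidence_score, fields_requiring_review, next_status
```

Using the minimum rather than the mean means one badly read field, such as an employer EIN, cannot be hidden by several well-read ones.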
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/domain/extraction.py | Create | Extraction domain entities |
| src/repositories/extraction_repository.py | Create | Extraction data access |
| src/services/classification_service.py | Create | AI classification |
| src/workflows/documents/classification_workflow.py | Create | Classification orchestration |
S4-005: SmartVault Integration
Story: As a preparer, I want documents from SmartVault automatically retrieved, so that client uploads appear in our system without manual transfer.
Acceptance Criteria
- OAuth 2.0 authentication to SmartVault API
- Polling for new client uploads (or webhook if available)
- Document retrieved and processed in our system
- Completed documents pushed back to SmartVault
- Folder structure maintained
- Sync status visible to staff
Technical Design
See src/services/smartvault_service.py (service stub exists).
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/services/smartvault_service.py | Modify | Implement full API integration |
| src/workflows/documents/smartvault_sync_workflow.py | Create | Sync orchestration |
S4-006: SurePrep Integration
Story: As a preparer, I want document data extracted automatically via SurePrep, so that I don't have to manually key tax form data.
Acceptance Criteria
- Documents pushed to SurePrep for OCR/extraction
- Binder created per client/tax year
- Extracted data retrieved as structured JSON
- Confidence scores per field
- Low-confidence fields flagged
- Link maintained between extraction and source document
Technical Design
See src/services/sureprep_service.py (service stub exists).
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/services/sureprep_service.py | Modify | Implement full API integration |
| src/workflows/documents/sureprep_extraction_workflow.py | Create | Extraction orchestration |
S4-007: Document Checklist Management
Story: As a preparer, I want a checklist of expected documents based on prior year, so that I can track what's missing.
Acceptance Criteria
- Prior year documents inform current year checklist
- Checklist items marked as received when uploaded
- Missing items visible to preparer and client
- Automated reminders for missing documents
- Checklist customizable per client
- Completion percentage calculated
Technical Design
Domain Model
# src/domain/checklist.py
from datetime import datetime
from typing import Optional
from uuid import UUID

# Import path assumed from the file location src/domain/document.py
from src.domain.document import DocumentType


class ChecklistItem:
    """Expected document in checklist."""

    id: UUID
    client_id: UUID
    tax_year: int
    document_type: DocumentType
    description: str
    is_required: bool
    source: str  # "prior_year", "manual", "rule"
    document_id: Optional[UUID]  # Linked when received
    received_at: Optional[datetime]
    reminder_sent_at: Optional[datetime]
    created_at: datetime
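Seeding from the prior year, marking items received, and computing completion can be sketched as follows. `ChecklistEntry` is a trimmed-down stand-in for `ChecklistItem`, and the choices to count only required items and to report 100% for an empty checklist are assumptions, not rules from the source.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional
from uuid import UUID, uuid4


@dataclass
class ChecklistEntry:
    """Simplified stand-in for ChecklistItem, keeping only the fields used here."""

    document_type: str
    is_required: bool = True
    source: str = "prior_year"
    document_id: Optional[UUID] = None  # Set once a matching upload arrives


def seed_from_prior_year(prior_year_types: Iterable[str]) -> List[ChecklistEntry]:
    """Create one expected item per document received last year."""
    return [ChecklistEntry(document_type=t) for t in prior_year_types]


def mark_received(items: List[ChecklistEntry], document_type: str, document_id: UUID) -> bool:
    """Link an upload to the first open item of the same type; False if none matches."""
    for item in items:
        if item.document_type == document_type and item.document_id is None:
            item.document_id = document_id
            return True
    return False


def completion_percentage(items: List[ChecklistEntry]) -> float:
    """Percentage of required items received, rounded to one decimal place."""
    required = [item for item in items if item.is_required]
    if not required:
        return 100.0  # Assumption: nothing required counts as complete
    received = sum(1 for item in required if item.document_id is not None)
    return round(100.0 * received / len(required), 1)
```

Keeping one entry per prior-year document, rather than one per type, lets a client with two W-2s see both as separate expected items.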
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/domain/checklist.py | Create | Checklist domain entities |
| src/repositories/checklist_repository.py | Create | Checklist data access |
| src/workflows/documents/checklist_workflow.py | Create | Checklist management |
| src/api/routes/checklist.py | Create | Checklist API endpoints |
S4-008: Manual Extraction Correction
Story: As a preparer, I want to correct extracted data, so that errors in OCR/extraction can be fixed.
Acceptance Criteria
- Side-by-side view: source document + extracted fields
- Inline editing of extracted values
- Correction logged with before/after
- AI feedback loop (corrections improve future extractions)
- Bulk correction support for repeated errors
- Corrected flag on extraction record
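The before/after logging criterion might be met by recording one entry per changed field while leaving the raw extraction untouched. In this sketch the `FieldCorrection` record and the `apply_corrections` helper are hypothetical names, and the result would map onto `validated_extraction` only under that assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple
from uuid import UUID


@dataclass(frozen=True)
class FieldCorrection:
    """One logged change to an extracted field."""

    field: str
    before: Any
    after: Any
    corrected_by: UUID
    corrected_at: datetime


def apply_corrections(
    raw_extraction: Dict[str, Any],
    edits: Dict[str, Any],
    corrected_by: UUID,
) -> Tuple[Dict[str, Any], List[FieldCorrection]]:
    """Return the corrected copy plus a log entry for every value that actually changed."""
    validated = dict(raw_extraction)  # Copy so the raw extraction stays as extracted
    log: List[FieldCorrection] = []
    now = datetime.now(timezone.utc)
    for field, new_value in edits.items():
        old_value = raw_extraction.get(field)
        if old_value == new_value:
            continue  # Unchanged values are not corrections
        validated[field] = new_value
        log.append(FieldCorrection(field, old_value, new_value, corrected_by, now))
    return validated, log
```

A non-empty log doubles as the "corrected" flag, and the before/after pairs are the kind of data a feedback loop could later consume.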
Technical Design
Files to Create/Modify
| File | Action | Description |
|---|---|---|
| src/api/routes/extraction.py | Create | Extraction review API |
| src/api/schemas/extraction_schemas.py | Create | Extraction schemas |
| src/workflows/documents/correction_workflow.py | Create | Correction handling |
Implementation Order
- S4-001: Document Upload via Portal (P0)
  - Document domain model
  - Repository layer
  - Upload workflow with S3 integration
  - API endpoints for upload/download
- S4-003: Malware Scanning (P0)
  - ClamAV integration (dry_run mode for development)
  - Scan workflow
  - Quarantine handling
- S4-004: Document Classification (P0)
  - Classification service using Bedrock
  - Extraction domain model
  - Classification workflow
- S4-002: Document Upload via Email (P0)
  - Email parsing service
  - SES webhook integration
  - Email intake workflow
- S4-005: SmartVault Integration (P1)
- S4-006: SurePrep Integration (P1)
- S4-007: Document Checklist (P1)
- S4-008: Manual Extraction Correction (P1)
Test Coverage Requirements
| Component | Unit Tests | Integration Tests | E2E Tests |
|---|---|---|---|
| Document domain | 10+ | - | - |
| Document repository | - | 15+ | - |
| Upload workflow | 15+ | - | - |
| Document API | - | - | 10+ |
| Malware service | 5+ | - | - |
| Classification service | 10+ | - | - |
| Email intake | 10+ | - | - |
Dependencies
External Services
| Service | Purpose | Config Key |
|---|---|---|
| S3 | Document storage | config.s3 |
| SES | Email intake | config.ses |
| ClamAV | Malware scanning | config.malware |
| Bedrock | AI classification | config.bedrock |
| SmartVault | Client portal sync | config.smartvault |
| SurePrep | OCR/extraction | config.sureprep |
Internal Dependencies
| Component | Dependency |
|---|---|
| UploadWorkflow | AuroraService, S3Service, MalwareService |
| ClassificationWorkflow | AuroraService, BedrockService |
| EmailIntakeWorkflow | AuroraService, S3Service, EmailService |
| SmartVaultSyncWorkflow | AuroraService, S3Service, SmartVaultService |
| SurePrepExtractionWorkflow | AuroraService, SurePrepService |
Completion Checklist
- S4-001 implemented and tested
- S4-002 implemented and tested
- S4-003 implemented and tested
- S4-004 implemented and tested
- S4-005 implemented and tested
- S4-006 implemented and tested
- S4-007 implemented and tested
- S4-008 implemented and tested
- All tests passing (unit, integration, e2e)
- ARCHITECTURE.md updated
- This document updated with completion status