Introduction
A well-designed schema is crucial for accurate data extraction. This guide covers best practices, patterns, and anti-patterns for creating schemas that get the best results from Documind.
Schema Structure Basics
Minimum Viable Schema
Start simple and iterate:
{
"type": "object",
"named_entities": {
"field_name": {
"type": "string",
"description": "Clear description of what this field contains"
}
},
"required": ["field_name"]
}
Complete Schema Template
A production-ready schema with all recommended fields:
{
"type": "object",
"title": "Invoice Schema",
"description": "Schema for extracting invoice data",
"named_entities": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier (e.g., INV-2024-001)"
},
"invoice_date": {
"type": "string",
"description": "Invoice date in YYYY-MM-DD format"
},
"total_amount": {
"type": "number",
"description": "Total invoice amount in USD"
}
},
"required": ["invoice_number", "invoice_date", "total_amount"]
}
Best Practices
1. Write Descriptive Field Names
❌ Bad:
{
"num": {"type": "string"},
"amt": {"type": "number"},
"dt": {"type": "string"}
}
✅ Good:
{
"invoice_number": {"type": "string"},
"total_amount": {"type": "number"},
"invoice_date": {"type": "string"}
}
Why: Descriptive names help the AI understand what to extract and make your code more maintainable.
2. Always Include Descriptions
❌ Bad:
{
"vendor_name": {
"type": "string"
}
}
✅ Good:
{
"vendor_name": {
"type": "string",
"description": "The name of the vendor or supplier who issued the invoice"
}
}
Why: Descriptions provide crucial context that improves extraction accuracy, especially for ambiguous fields.
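One way to enforce this rule is a small check that walks a schema and lists every field without a description. This is a minimal sketch, not part of Documind: the helper name `fields_missing_descriptions` is hypothetical, and it assumes schemas are plain Python dicts using the `named_entities` and `items` keys shown in this guide.

```python
def fields_missing_descriptions(schema, path=""):
    """Return dotted paths of fields that have no non-empty description."""
    missing = []
    for name, spec in schema.get("named_entities", {}).items():
        field_path = f"{path}.{name}" if path else name
        if not spec.get("description", "").strip():
            missing.append(field_path)
        # Recurse into nested objects and into the item schema of arrays.
        if spec.get("type") == "object":
            missing.extend(fields_missing_descriptions(spec, field_path))
        elif spec.get("type") == "array":
            items = spec.get("items", {})
            if items.get("type") == "object":
                missing.extend(fields_missing_descriptions(items, field_path + "[]"))
    return missing


schema = {
    "type": "object",
    "named_entities": {
        "vendor_name": {"type": "string"},
        "invoice_date": {
            "type": "string",
            "description": "The date the invoice was issued",
        },
    },
}
print(fields_missing_descriptions(schema))  # ['vendor_name']
```

Running a check like this before submitting a schema catches undocumented fields early, including ones buried inside nested objects or array items.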
3. Use Specific Field Descriptions
❌ Bad:
{
"date": {
"type": "string",
"description": "A date"
}
}
✅ Good:
{
"invoice_date": {
"type": "string",
"description": "The date the invoice was issued, in MM/DD/YYYY format"
}
}
Why: Specific descriptions reduce ambiguity when documents contain multiple dates.
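When a description names a date format, you can confirm that extracted values actually follow it. The sketch below uses only the Python standard library; the helper name `matches_date_format` is hypothetical, and the default format string mirrors the MM/DD/YYYY wording in the description above.

```python
from datetime import datetime


def matches_date_format(value, fmt="%m/%d/%Y"):
    """Return True if value parses as a date in the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        # ValueError: wrong format or impossible date; TypeError: not a string.
        return False


print(matches_date_format("03/15/2024"))              # True
print(matches_date_format("2024-03-15"))              # False
print(matches_date_format("2024-03-15", "%Y-%m-%d"))  # True
```

Pass `"%Y-%m-%d"` for fields whose description asks for YYYY-MM-DD.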
4. Specify Data Types Correctly
{
"quantity": {
"type": "number", // Not "string"
"description": "Quantity ordered"
},
"is_paid": {
"type": "boolean", // Not "string"
"description": "Payment status"
},
"due_date": {
"type": "string",
"description": "Payment due date in YYYY-MM-DD format"
}
}
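To see whether extracted values match the declared types, you can compare each value against its schema type. This is a rough sketch under one assumption: the extraction result is available as a plain dict keyed by field name, which may not be how Documind actually returns it. The helper name `type_mismatches` is hypothetical.

```python
def type_mismatches(result, named_entities):
    """Return names of fields whose extracted value does not match its declared type."""
    checks = {
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "boolean": lambda v: isinstance(v, bool),
        "array": lambda v: isinstance(v, list),
        "object": lambda v: isinstance(v, dict),
    }
    mismatches = []
    for name, spec in named_entities.items():
        if result.get(name) is None:
            continue  # Missing values are a separate concern from wrong types.
        check = checks.get(spec.get("type"))
        if check and not check(result[name]):
            mismatches.append(name)
    return mismatches


fields = {
    "quantity": {"type": "number", "description": "Quantity ordered"},
    "is_paid": {"type": "boolean", "description": "Payment status"},
    "due_date": {"type": "string", "description": "Payment due date in YYYY-MM-DD format"},
}
result = {"quantity": "5", "is_paid": True, "due_date": "2024-04-01"}
print(type_mismatches(result, fields))  # ['quantity']
```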
5. Mark Critical Fields as Required
{
"type": "object",
"named_entities": {
"invoice_number": {"type": "string"},
"total_amount": {"type": "number"},
"notes": {"type": "string"} // Optional
},
"required": ["invoice_number", "total_amount"] // Only critical fields
}
Why: Required fields are flagged for review if confidence is low, ensuring accuracy where it matters most.
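A common slip is listing a name under `required` that is not defined in `named_entities`. The following sketch catches that before the schema is used; the helper name `undefined_required_fields` is hypothetical, and schemas are assumed to be plain Python dicts.

```python
def undefined_required_fields(schema):
    """Return names listed in "required" that are not defined in "named_entities"."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


schema = {
    "type": "object",
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "notes": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount", "invoice_total"],
}
print(undefined_required_fields(schema))  # ['invoice_total']
```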
Working with Arrays
Simple Arrays
For lists of primitive values:
{
"product_names": {
"type": "array",
"description": "List of product names mentioned",
"items": {
"type": "string"
}
}
}
Array of Objects (Tables)
For structured lists like line items:
{
"line_items": {
"type": "array",
"description": "Invoice line items",
"items": {
"type": "object",
"named_entities": {
"description": {
"type": "string",
"description": "Product or service description"
},
"quantity": {
"type": "number",
"description": "Quantity ordered"
},
"unit_price": {
"type": "number",
"description": "Price per unit"
},
"total": {
"type": "number",
"description": "Line total (quantity × unit_price)"
}
},
"required": ["description", "quantity", "unit_price"]
}
}
}
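Because each line item carries `quantity`, `unit_price`, and `total`, the extracted rows can be cross-checked arithmetically. This is an illustrative sketch, assuming line items arrive as a list of plain dicts; the helper name `inconsistent_line_items` and the one-cent tolerance are assumptions, not Documind behavior.

```python
import math


def inconsistent_line_items(line_items, tolerance=0.01):
    """Return indexes of items where quantity * unit_price differs from total."""
    bad = []
    for index, item in enumerate(line_items):
        qty = item.get("quantity")
        price = item.get("unit_price")
        total = item.get("total")
        if qty is None or price is None or total is None:
            continue  # Cannot check rows with missing values.
        if not math.isclose(qty * price, total, abs_tol=tolerance):
            bad.append(index)
    return bad


items = [
    {"description": "Widget", "quantity": 3, "unit_price": 19.99, "total": 59.97},
    {"description": "Gadget", "quantity": 2, "unit_price": 5.00, "total": 12.00},
]
print(inconsistent_line_items(items))  # [1]
```

Rows flagged this way are good candidates for manual review, since a mismatch usually means one of the three numbers was misread.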
Working with Nested Objects
Simple Nesting
Group related fields:
{
"vendor": {
"type": "object",
"description": "Vendor information",
"named_entities": {
"name": {
"type": "string",
"description": "Vendor company name"
},
"address": {
"type": "string",
"description": "Vendor mailing address"
},
"tax_id": {
"type": "string",
"description": "Vendor tax ID number"
}
},
"required": ["name"]
}
}
Deep Nesting
For complex structures:
{
"customer": {
"type": "object",
"description": "Customer information",
"named_entities": {
"company": {
"type": "object",
"description": "Company details",
"named_entities": {
"name": {"type": "string"},
"address": {
"type": "object",
"named_entities": {
"street": {"type": "string"},
"city": {"type": "string"},
"state": {"type": "string"},
"zip": {"type": "string"}
}
}
}
},
"contact": {
"type": "object",
"description": "Contact person",
"named_entities": {
"name": {"type": "string"},
"email": {"type": "string"},
"phone": {"type": "string"}
}
}
}
}
}
Avoid nesting deeper than 3-4 levels. Consider flattening or splitting into multiple extractions for very complex schemas.
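To keep an eye on nesting, you can measure how deep a schema goes. The sketch below counts each nested object as one level and treats the root object as level 1; that counting convention and the helper name `nesting_depth` are assumptions made for illustration.

```python
def nesting_depth(spec):
    """Count levels of nested objects; a flat object schema has depth 1."""
    if spec.get("type") == "array":
        # Arrays do not add a level themselves; look at their item schema.
        return nesting_depth(spec.get("items", {}))
    if spec.get("type") != "object":
        return 0
    children = spec.get("named_entities", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


schema = {
    "type": "object",
    "named_entities": {
        "customer": {
            "type": "object",
            "named_entities": {
                "company": {
                    "type": "object",
                    "named_entities": {
                        "name": {"type": "string"},
                        "address": {
                            "type": "object",
                            "named_entities": {"street": {"type": "string"}},
                        },
                    },
                }
            },
        }
    },
}
print(nesting_depth(schema))  # 4
```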
Common Patterns
Invoice Schema
{
"type": "object",
"named_entities": {
"invoice_number": {"type": "string", "description": "Invoice number"},
"invoice_date": {"type": "string", "description": "Invoice date"},
"due_date": {"type": "string", "description": "Payment due date"},
"vendor": {
"type": "object",
"named_entities": {
"name": {"type": "string"},
"address": {"type": "string"}
}
},
"customer": {
"type": "object",
"named_entities": {
"name": {"type": "string"},
"address": {"type": "string"}
}
},
"line_items": {
"type": "array",
"items": {
"type": "object",
"named_entities": {
"description": {"type": "string"},
"quantity": {"type": "number"},
"unit_price": {"type": "number"},
"total": {"type": "number"}
}
}
},
"subtotal": {"type": "number"},
"tax": {"type": "number"},
"total": {"type": "number"}
},
"required": ["invoice_number", "total"]
}
Receipt Schema
{
"type": "object",
"named_entities": {
"merchant_name": {"type": "string", "description": "Store or restaurant name"},
"transaction_date": {"type": "string", "description": "Transaction date"},
"transaction_time": {"type": "string", "description": "Transaction time"},
"items": {
"type": "array",
"items": {
"type": "object",
"named_entities": {
"name": {"type": "string"},
"price": {"type": "number"}
}
}
},
"subtotal": {"type": "number"},
"tax": {"type": "number"},
"tip": {"type": "number"},
"total": {"type": "number"},
"payment_method": {"type": "string", "description": "Payment method used"}
},
"required": ["merchant_name", "total"]
}
Application Form Schema
{
"type": "object",
"named_entities": {
"applicant": {
"type": "object",
"named_entities": {
"first_name": {"type": "string"},
"last_name": {"type": "string"},
"date_of_birth": {"type": "string", "description": "Date of birth in YYYY-MM-DD format"},
"ssn": {"type": "string", "description": "Social Security Number in format XXX-XX-XXXX"}
}
},
"address": {
"type": "object",
"named_entities": {
"street": {"type": "string"},
"city": {"type": "string"},
"state": {"type": "string"},
"zip": {"type": "string"}
}
},
"employment": {
"type": "object",
"named_entities": {
"employer_name": {"type": "string"},
"position": {"type": "string"},
"annual_income": {"type": "number"}
}
},
"signature_date": {"type": "string"},
"agreed_to_terms": {"type": "boolean"}
},
"required": ["applicant", "signature_date"]
}
Examples in Descriptions
Include examples to guide extraction:
{
"invoice_number": {
"type": "string",
"description": "Invoice number (e.g., INV-2024-001, INV-2024-002)"
},
"phone": {
"type": "string",
"description": "Phone number in format (123) 456-7890 or 123-456-7890"
}
}
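The formats named in these descriptions can also serve as post-extraction checks. This is a minimal sketch using Python's `re` module; the patterns are inferred from the example values above (they are assumptions, not formats Documind enforces), and the helper name `matches` is hypothetical.

```python
import re

# Patterns inferred from the example values in the descriptions above.
INVOICE_NUMBER = re.compile(r"INV-\d{4}-\d{3}")
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}")


def matches(pattern, value):
    """Return True if value is a string that matches the whole pattern."""
    return isinstance(value, str) and pattern.fullmatch(value) is not None


print(matches(INVOICE_NUMBER, "INV-2024-001"))  # True
print(matches(PHONE, "(123) 456-7890"))         # True
print(matches(PHONE, "123-456-7890"))           # True
print(matches(PHONE, "123.456.7890"))           # False
```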
Common Mistakes
❌ Too Many Optional Fields
{
"field1": {"type": "string"},
"field2": {"type": "string"},
// ... 50 more optional fields
"required": [] // Nothing required!
}
Problem: The review workflow won’t trigger, even for poor extractions.
Solution: Mark at least 2-3 critical fields as required.
❌ Ambiguous Field Names
{
"date": {"type": "string"}, // Which date?
"amount": {"type": "number"}, // Amount of what?
"number": {"type": "string"} // What number?
}
Problem: AI may extract the wrong data.
Solution: Use specific names: invoice_date, total_amount, invoice_number.
❌ Missing Descriptions
{
"tax_id": {"type": "string"}
// No description!
}
Problem: AI may confuse similar fields.
Solution: Always include descriptions.
❌ Wrong Data Types
{
"quantity": {"type": "string"}, // Should be number
"is_paid": {"type": "string"} // Should be boolean
}
Problem: You’ll get strings like “5” instead of numbers, making calculations fail.
Solution: Use number for numeric values and boolean for true/false values.
❌ Overly Complex Schemas
{
"deeply": {
"nested": {
"structure": {
"with": {
"many": {
"levels": {...} // 6+ levels deep
}
}
}
}
}
}
Problem: Deeply nested schemas are harder to extract accurately and slower to process.
Solution: Flatten or split into multiple extractions.
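Splitting a large schema into several smaller extractions might look like the sketch below. It groups top-level fields into separate schemas and carries over only the `required` entries that belong to each group. The helper name `split_schema` and the grouping itself are illustrative assumptions.

```python
def split_schema(schema, groups):
    """Build one smaller schema per group of top-level field names."""
    fields = schema.get("named_entities", {})
    required = schema.get("required", [])
    parts = []
    for group in groups:
        parts.append({
            "type": "object",
            "named_entities": {name: fields[name] for name in group if name in fields},
            # Keep only the required names that this part actually defines.
            "required": [name for name in required if name in group and name in fields],
        })
    return parts


schema = {
    "type": "object",
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "vendor": {"type": "object", "named_entities": {"name": {"type": "string"}}},
    },
    "required": ["invoice_number", "total"],
}
header, vendor = split_schema(schema, [["invoice_number", "total"], ["vendor"]])
print(sorted(header["named_entities"]))  # ['invoice_number', 'total']
print(vendor["required"])                # []
```

Each part can then be run as its own extraction and the results merged afterwards.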
Schema Testing
Iterative Development
- Start simple: Extract only 2-3 fields
- Test: Run on sample documents
- Validate: Check accuracy
- Expand: Add more fields
- Repeat: Until all needed data is extracted
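The validate step in the loop above can be made measurable with a simple field-level accuracy score against hand-labeled values. This is a sketch only: it assumes both the extraction result and the labeled values are plain dicts keyed by field name, and the helper name `field_accuracy` is hypothetical.

```python
def field_accuracy(extracted, expected):
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 1.0
    correct = sum(
        1 for name, value in expected.items() if extracted.get(name) == value
    )
    return correct / len(expected)


expected = {
    "invoice_number": "INV-2024-001",
    "total_amount": 150.0,
    "invoice_date": "2024-03-15",
}
extracted = {
    "invoice_number": "INV-2024-001",
    "total_amount": 150.0,
    "invoice_date": "2024-03-16",
}
print(round(field_accuracy(extracted, expected), 2))  # 0.67
```

Tracking this score across iterations shows whether adding fields or rewording descriptions helps or hurts.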
A/B Testing
Test different schema approaches:
# Version A: Flat structure
schema_a = {
    "vendor_name": {"type": "string"},
    "vendor_address": {"type": "string"}
}
# Version B: Nested structure
schema_b = {
    "vendor": {
        "type": "object",
        "named_entities": {
            "name": {"type": "string"},
            "address": {"type": "string"}
        }
    }
}
# Compare results on the same document.
# extract() and document_id are placeholders for your extraction call and document ID.
results_a = extract(document_id, schema_a)
results_b = extract(document_id, schema_b)
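Flat and nested results are not directly comparable, so one option is to flatten the nested result first and then list the fields that differ. The sketch below assumes both results are plain dicts and that nested keys map to flat names joined with an underscore (`vendor.name` becomes `vendor_name`); the helper names and sample values are hypothetical.

```python
def flatten(result, prefix=""):
    """Flatten nested dicts, joining keys with underscores."""
    flat = {}
    for key, value in result.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def differing_fields(result_a, result_b):
    """Return sorted field names whose flattened values differ between results."""
    flat_a, flat_b = flatten(result_a), flatten(result_b)
    return sorted(
        key for key in flat_a.keys() | flat_b.keys()
        if flat_a.get(key) != flat_b.get(key)
    )


# Made-up sample values standing in for two extraction results.
sample_a = {"vendor_name": "Acme Corp", "vendor_address": "1 Main St"}
sample_b = {"vendor": {"name": "Acme Corp", "address": "1 Main Street"}}
print(differing_fields(sample_a, sample_b))  # ['vendor_address']
```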
Next Steps