Introduction
A well-designed schema is crucial for accurate data extraction. This guide covers best practices, patterns, and anti-patterns for creating schemas that get the best results from Documind.
Schema Structure Basics
Minimum Viable Schema
Start simple and iterate:
{
"type" : "object" ,
"named_entities" : {
"field_name" : {
"type" : "string" ,
"description" : "Clear description of what this field contains"
}
},
"required" : [ "field_name" ]
}
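A quick consistency check can catch a "required" entry that names a field the schema never defines. The sketch below works on the schema as a plain Python dictionary; the helper name undefined_required_fields is hypothetical and not part of Documind.

```python
def undefined_required_fields(schema: dict) -> list[str]:
    """Return names listed in "required" that are not defined in "named_entities"."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


# Illustrative schema: "total_amount" is required but never defined.
schema = {
    "type": "object",
    "named_entities": {
        "invoice_number": {
            "type": "string",
            "description": "The unique invoice identifier",
        }
    },
    "required": ["invoice_number", "total_amount"],
}

print(undefined_required_fields(schema))  # ['total_amount']
```

Running a check like this before submitting a schema is cheap and avoids a required field that can never be extracted.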
Complete Schema Template
A production-ready schema with all recommended fields:
{
"type" : "object" ,
"title" : "Invoice Schema" ,
"description" : "Schema for extracting invoice data" ,
"named_entities" : {
"invoice_number" : {
"type" : "string" ,
"description" : "The unique invoice identifier (e.g., INV-2024-001)"
},
"invoice_date" : {
"type" : "string" ,
"description" : "Invoice date in YYYY-MM-DD format"
},
"total_amount" : {
"type" : "number" ,
"description" : "Total invoice amount in USD"
}
},
"required" : [ "invoice_number" , "invoice_date" , "total_amount" ]
}
Best Practices
1. Write Descriptive Field Names
❌ Bad:
{
"num" : { "type" : "string" },
"amt" : { "type" : "number" },
"dt" : { "type" : "string" }
}
✅ Good:
{
"invoice_number" : { "type" : "string" },
"total_amount" : { "type" : "number" },
"invoice_date" : { "type" : "string" }
}
Why: Descriptive names help the AI understand what to extract and make your code more maintainable.
2. Always Include Descriptions
❌ Bad:
{
"vendor_name" : {
"type" : "string"
}
}
✅ Good:
{
"vendor_name" : {
"type" : "string" ,
"description" : "The name of the vendor or supplier who issued the invoice"
}
}
Why: Descriptions provide crucial context that improves extraction accuracy, especially for ambiguous fields.
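Missing descriptions are easy to overlook in larger schemas, so it can help to scan for them automatically. The following is a minimal sketch in plain Python, assuming the schema layout used in this guide ("named_entities" for object fields, "items" for arrays); the function name fields_missing_description is hypothetical.

```python
def fields_missing_description(entities: dict, prefix: str = "") -> list[str]:
    """Recursively collect dotted paths of fields that have no description."""
    missing = []
    for name, spec in entities.items():
        path = f"{prefix}{name}"
        if not spec.get("description", "").strip():
            missing.append(path)
        # Descend into nested objects and into arrays of objects.
        if spec.get("type") == "object":
            missing += fields_missing_description(spec.get("named_entities", {}), path + ".")
        elif spec.get("type") == "array":
            items = spec.get("items", {})
            if items.get("type") == "object":
                missing += fields_missing_description(items.get("named_entities", {}), path + "[].")
    return missing


entities = {
    "vendor_name": {"type": "string"},
    "invoice_date": {"type": "string", "description": "The date the invoice was issued"},
    "vendor": {
        "type": "object",
        "description": "Vendor information",
        "named_entities": {"tax_id": {"type": "string"}},
    },
}

print(fields_missing_description(entities))  # ['vendor_name', 'vendor.tax_id']
```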
3. Use Specific Field Descriptions
❌ Bad:
{
"date" : {
"type" : "string" ,
"description" : "A date"
}
}
✅ Good:
{
"invoice_date" : {
"type" : "string" ,
"description" : "The date the invoice was issued, in YYYY-MM-DD format"
}
}
Why: Specific descriptions reduce ambiguity when documents contain multiple dates.
4. Specify Data Types Correctly
{
"quantity" : {
"type" : "number" , // Not "string"
"description" : "Quantity ordered"
},
"is_paid" : {
"type" : "boolean" , // Not "string"
"description" : "Payment status"
},
"due_date" : {
"type" : "string" ,
"description" : "Payment due date in YYYY-MM-DD format"
}
}
5. Mark Critical Fields as Required
{
"type" : "object" ,
"named_entities" : {
"invoice_number" : { "type" : "string" },
"total_amount" : { "type" : "number" },
"notes" : { "type" : "string" } // Optional
},
"required" : [ "invoice_number" , "total_amount" ] // Only critical fields
}
Why: Required fields are flagged for review if confidence is low, ensuring accuracy where it matters most.
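To make the idea concrete, here is a small sketch of how low-confidence required fields might be picked out for review. It does not reflect Documind's actual response format: the per-field confidence dictionary, the 0.8 threshold, and the function name fields_needing_review are all assumptions made for illustration.

```python
REVIEW_THRESHOLD = 0.8  # assumed cutoff, not a documented Documind value


def fields_needing_review(
    required: list[str],
    confidences: dict[str, float],
    threshold: float = REVIEW_THRESHOLD,
) -> list[str]:
    """Return required fields that are missing or scored below the threshold."""
    # A field with no confidence score is treated as 0.0, so it is always flagged.
    return [name for name in required if confidences.get(name, 0.0) < threshold]


required = ["invoice_number", "total_amount"]
# Hypothetical per-field confidence scores for one extraction.
confidences = {"invoice_number": 0.97, "total_amount": 0.62, "notes": 0.40}

print(fields_needing_review(required, confidences))  # ['total_amount']
```

Note that "notes" has the lowest score but is not flagged, because it is optional in the schema above.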
Working with Arrays
Simple Arrays
For lists of primitive values:
{
"product_names" : {
"type" : "array" ,
"description" : "List of product names mentioned" ,
"items" : {
"type" : "string"
}
}
}
Array of Objects (Tables)
For structured lists like line items:
{
"line_items" : {
"type" : "array" ,
"description" : "Invoice line items" ,
"items" : {
"type" : "object" ,
"named_entities" : {
"description" : {
"type" : "string" ,
"description" : "Product or service description"
},
"quantity" : {
"type" : "number" ,
"description" : "Quantity ordered"
},
"unit_price" : {
"type" : "number" ,
"description" : "Price per unit"
},
"total" : {
"type" : "number" ,
"description" : "Line total (quantity × unit_price)"
}
},
"required" : [ "description" , "quantity" , "unit_price" ]
}
}
}
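Because each line total is described as quantity × unit_price, extracted rows can be sanity-checked with simple arithmetic. The sketch below assumes the extraction result for "line_items" arrives as a list of plain dictionaries; the helper name mismatched_line_items and the sample rows are made up for illustration.

```python
import math


def mismatched_line_items(line_items: list[dict], tolerance: float = 0.01) -> list[int]:
    """Return indexes of line items whose total differs from quantity * unit_price."""
    bad = []
    for index, item in enumerate(line_items):
        total = item.get("total")
        if total is None:
            continue  # "total" is not required, so a missing value is not an error
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(total, expected, abs_tol=tolerance):
            bad.append(index)
    return bad


line_items = [
    {"description": "Widget", "quantity": 3, "unit_price": 4.50, "total": 13.50},
    {"description": "Gadget", "quantity": 2, "unit_price": 10.00, "total": 25.00},
    {"description": "Setup fee", "quantity": 1, "unit_price": 50.00},
]

print(mismatched_line_items(line_items))  # [1]
```

A mismatch does not always mean the extraction is wrong (discounts or rounding on the source document can cause it), so treat flagged rows as candidates for review.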
Working with Nested Objects
Simple Nesting
Group related fields:
{
"vendor" : {
"type" : "object" ,
"description" : "Vendor information" ,
"named_entities" : {
"name" : {
"type" : "string" ,
"description" : "Vendor company name"
},
"address" : {
"type" : "string" ,
"description" : "Vendor mailing address"
},
"tax_id" : {
"type" : "string" ,
"description" : "Vendor tax ID number"
}
},
"required" : [ "name" ]
}
}
Deep Nesting
For complex structures:
{
"customer" : {
"type" : "object" ,
"description" : "Customer information" ,
"named_entities" : {
"company" : {
"type" : "object" ,
"description" : "Company details" ,
"named_entities" : {
"name" : { "type" : "string" },
"address" : {
"type" : "object" ,
"named_entities" : {
"street" : { "type" : "string" },
"city" : { "type" : "string" },
"state" : { "type" : "string" },
"zip" : { "type" : "string" }
}
}
}
},
"contact" : {
"type" : "object" ,
"description" : "Contact person" ,
"named_entities" : {
"name" : { "type" : "string" },
"email" : { "type" : "string" },
"phone" : { "type" : "string" }
}
}
}
}
}
Avoid nesting deeper than 3-4 levels. Consider flattening or splitting into multiple extractions for very complex schemas.
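Nesting depth can be measured directly from the schema. The sketch below is one possible way to count it, assuming the top-level object counts as level 1 and that arrays add no level of their own; both the counting convention and the name nesting_depth are assumptions.

```python
def nesting_depth(spec: dict) -> int:
    """Count levels of nested objects; a flat schema has depth 1."""
    if spec.get("type") == "array":
        # An array adds no level itself; measure its item definition instead.
        return nesting_depth(spec.get("items", {}))
    if spec.get("type") != "object":
        return 0
    children = spec.get("named_entities", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


# Trimmed version of the customer example: root > customer > company > address.
schema = {
    "type": "object",
    "named_entities": {
        "customer": {
            "type": "object",
            "named_entities": {
                "company": {
                    "type": "object",
                    "named_entities": {
                        "name": {"type": "string"},
                        "address": {
                            "type": "object",
                            "named_entities": {"street": {"type": "string"}},
                        },
                    },
                }
            },
        }
    },
}

print(nesting_depth(schema))  # 4
```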
Common Patterns
Invoice Schema
{
"type" : "object" ,
"named_entities" : {
"invoice_number" : { "type" : "string" , "description" : "Invoice number" },
"invoice_date" : { "type" : "string" , "description" : "Invoice date" },
"due_date" : { "type" : "string" , "description" : "Payment due date" },
"vendor" : {
"type" : "object" ,
"named_entities" : {
"name" : { "type" : "string" },
"address" : { "type" : "string" }
}
},
"customer" : {
"type" : "object" ,
"named_entities" : {
"name" : { "type" : "string" },
"address" : { "type" : "string" }
}
},
"line_items" : {
"type" : "array" ,
"items" : {
"type" : "object" ,
"named_entities" : {
"description" : { "type" : "string" },
"quantity" : { "type" : "number" },
"unit_price" : { "type" : "number" },
"total" : { "type" : "number" }
}
}
},
"subtotal" : { "type" : "number" },
"tax" : { "type" : "number" },
"total" : { "type" : "number" }
},
"required" : [ "invoice_number" , "total" ]
}
Receipt Schema
{
"type" : "object" ,
"named_entities" : {
"merchant_name" : { "type" : "string" , "description" : "Store or restaurant name" },
"transaction_date" : { "type" : "string" , "description" : "Transaction date" },
"transaction_time" : { "type" : "string" , "description" : "Transaction time" },
"items" : {
"type" : "array" ,
"items" : {
"type" : "object" ,
"named_entities" : {
"name" : { "type" : "string" },
"price" : { "type" : "number" }
}
}
},
"subtotal" : { "type" : "number" },
"tax" : { "type" : "number" },
"tip" : { "type" : "number" },
"total" : { "type" : "number" },
"payment_method" : { "type" : "string" , "description" : "Payment method used" }
},
"required" : [ "merchant_name" , "total" ]
}
Application Form Schema
{
"type" : "object" ,
"named_entities" : {
"applicant" : {
"type" : "object" ,
"named_entities" : {
"first_name" : { "type" : "string" },
"last_name" : { "type" : "string" },
"date_of_birth" : { "type" : "string" , "description" : "Date of birth in YYYY-MM-DD format" },
"ssn" : { "type" : "string" , "description" : "Social Security Number in format XXX-XX-XXXX" }
}
},
"address" : {
"type" : "object" ,
"named_entities" : {
"street" : { "type" : "string" },
"city" : { "type" : "string" },
"state" : { "type" : "string" },
"zip" : { "type" : "string" }
}
},
"employment" : {
"type" : "object" ,
"named_entities" : {
"employer_name" : { "type" : "string" },
"position" : { "type" : "string" },
"annual_income" : { "type" : "number" }
}
},
"signature_date" : { "type" : "string" },
"agreed_to_terms" : { "type" : "boolean" }
},
"required" : [ "applicant" , "signature_date" ]
}
Examples in Descriptions
Include examples to guide extraction:
{
"invoice_number" : {
"type" : "string" ,
"description" : "Invoice number (e.g., INV-2024-001, INV-2024-002)"
},
"phone" : {
"type" : "string" ,
"description" : "Phone number in format (123) 456-7890 or 123-456-7890"
}
}
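The same example formats can be reused after extraction to check that returned values look right. The sketch below uses Python's standard re module; the patterns are inferred from the examples in the descriptions above and are assumptions about the exact format, so adjust them to your documents.

```python
import re

# Patterns inferred from the example values in the field descriptions.
INVOICE_NUMBER = re.compile(r"INV-\d{4}-\d{3}")
PHONE = re.compile(r"(\(\d{3}\) |\d{3}-)\d{3}-\d{4}")


def matches_format(pattern: re.Pattern, value: str) -> bool:
    """Return True only if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


print(matches_format(INVOICE_NUMBER, "INV-2024-001"))  # True
print(matches_format(PHONE, "(123) 456-7890"))         # True
print(matches_format(PHONE, "123-456-7890"))           # True
print(matches_format(PHONE, "1234567890"))             # False
```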
Common Mistakes
❌ Too Many Optional Fields
{
"type" : "object" ,
"named_entities" : {
"field1" : { "type" : "string" },
"field2" : { "type" : "string" }
// ... 50 more optional fields
},
"required" : [] // Nothing required!
}
Problem: The review workflow won’t trigger, even for poor extractions.
Solution: Mark at least 2-3 critical fields as required.
❌ Ambiguous Field Names
{
"date" : { "type" : "string" }, // Which date?
"amount" : { "type" : "number" }, // Amount of what?
"number" : { "type" : "string" } // What number?
}
Problem: AI may extract the wrong data.
Solution: Use specific names: invoice_date, total_amount, invoice_number.
❌ Missing Descriptions
{
"tax_id" : { "type" : "string" }
// No description!
}
Problem: AI may confuse similar fields.
Solution: Always include descriptions.
❌ Wrong Data Types
{
"quantity" : { "type" : "string" }, // Should be number
"is_paid" : { "type" : "string" } // Should be boolean
}
Problem: You’ll get strings like “5” instead of numbers, making calculations fail.
Solution: Use correct types.
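The failure mode is easy to reproduce in Python, where multiplying the string "5" by 2 silently yields "55" rather than 10. The sketch below compares extracted values against the types declared in the schema; it assumes the result is a plain dictionary keyed by field name, and the names EXPECTED_TYPES and type_mismatches are hypothetical.

```python
# Map schema type names to the Python types an extracted value should have.
EXPECTED_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def type_mismatches(entities: dict, extracted: dict) -> list[str]:
    """Return names of extracted fields whose value does not match the declared type."""
    mismatches = []
    for name, value in extracted.items():
        declared = entities.get(name, {}).get("type")
        expected = EXPECTED_TYPES.get(declared)
        if expected is None:
            continue  # undeclared field or a type this sketch does not cover
        # bool is a subclass of int in Python, so reject it explicitly for "number".
        if declared == "number" and isinstance(value, bool):
            mismatches.append(name)
        elif not isinstance(value, expected):
            mismatches.append(name)
    return mismatches


entities = {"quantity": {"type": "number"}, "is_paid": {"type": "boolean"}}
extracted = {"quantity": "5", "is_paid": True}

print("5" * 2)                               # 55 (string repetition, not arithmetic)
print(type_mismatches(entities, extracted))  # ['quantity']
```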
❌ Overly Complex Schemas
{
"deeply" : {
"nested" : {
"structure" : {
"with" : {
"many" : {
"levels" : { ... } // 6+ levels deep
}
}
}
}
}
}
Problem: Deeply nested schemas are harder to extract accurately and slower to process.
Solution: Flatten or split into multiple extractions.
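Flattening can be done mechanically by joining parent and child names. The following is a rough sketch under the schema layout used in this guide; the function name flatten_entities and the underscore-joined naming are assumptions, and the sketch leaves arrays untouched and drops descriptions attached to parent objects.

```python
def flatten_entities(entities: dict, prefix: str = "") -> dict:
    """Collapse nested objects into top-level fields with underscore-joined names."""
    flat = {}
    for name, spec in entities.items():
        full_name = f"{prefix}{name}"
        if spec.get("type") == "object":
            # Replace the object with its children, prefixed by the parent name.
            flat.update(flatten_entities(spec.get("named_entities", {}), full_name + "_"))
        else:
            flat[full_name] = spec
    return flat


nested = {
    "vendor": {
        "type": "object",
        "named_entities": {
            "name": {"type": "string", "description": "Vendor company name"},
            "address": {
                "type": "object",
                "named_entities": {"city": {"type": "string"}},
            },
        },
    }
}

print(sorted(flatten_entities(nested)))  # ['vendor_address_city', 'vendor_name']
```

After flattening, it is worth rewriting each description so it still makes sense without the parent context, for example "Vendor city" rather than just "City".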
Schema Testing
Iterative Development
1. Start simple: Extract only 2-3 fields
2. Test: Run on sample documents
3. Validate: Check accuracy
4. Expand: Add more fields
5. Repeat: Until all needed data is extracted
A/B Testing
Test different schema approaches:
# Version A: Flat structure
schema_a = {
    "vendor_name": {"type": "string"},
    "vendor_address": {"type": "string"},
}

# Version B: Nested structure
schema_b = {
    "vendor": {
        "type": "object",
        "named_entities": {
            "name": {"type": "string"},
            "address": {"type": "string"},
        },
    }
}

# Compare results on the same document.
# `extract` and `document_id` are placeholders for your extraction call
# and the ID of a sample document.
results_a = extract(document_id, schema_a)
results_b = extract(document_id, schema_b)
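One way to compare the two versions is to score each result against hand-labeled expected values. The sketch below assumes results_a and results_b are plain dictionaries keyed by field name, which may not match the real response shape; the helper names and the sample values are made up for illustration and are not actual extraction output.

```python
def flatten_result(result: dict, prefix: str = "") -> dict:
    """Turn nested results into flat keys so both schema versions are comparable."""
    flat = {}
    for name, value in result.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten_result(value, key + "_"))
        else:
            flat[key] = value
    return flat


def field_accuracy(result: dict, expected: dict) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    flat = flatten_result(result)
    correct = sum(1 for name, value in expected.items() if flat.get(name) == value)
    return correct / len(expected)


# Hand-labeled ground truth for one sample document (illustrative values).
expected = {"vendor_name": "Acme Corp", "vendor_address": "1 Main St"}

# Made-up results standing in for the output of each schema version.
results_a = {"vendor_name": "Acme Corp", "vendor_address": "1 Main Street"}
results_b = {"vendor": {"name": "Acme Corp", "address": "1 Main St"}}

print(field_accuracy(results_a, expected))  # 0.5
print(field_accuracy(results_b, expected))  # 1.0
```

A single document is not enough to pick a winner; averaging the score over a set of sample documents gives a more reliable comparison.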
Next Steps
Prompt Design: Optimize extraction prompts for better results
Invoice Tutorial: Apply schema design to invoice processing
Core Concepts: Understand schemas in the context of Documind
API Reference: See the extraction API documentation