Introduction

A well-designed schema is crucial for accurate data extraction. This guide covers best practices, patterns, and anti-patterns for creating schemas that get the best results from Documind.

Schema Structure Basics

Minimum Viable Schema

Start simple and iterate:
{
  "type": "object",
  "named_entities": {
    "field_name": {
      "type": "string",
      "description": "Clear description of what this field contains"
    }
  },
  "required": ["field_name"]
}

Complete Schema Template

A production-ready schema with all recommended fields:
{
  "type": "object",
  "title": "Invoice Schema",
  "description": "Schema for extracting invoice data",
  "named_entities": {
    "invoice_number": {
      "type": "string",
      "description": "The unique invoice identifier (e.g., INV-2024-001)"
    },
    "invoice_date": {
      "type": "string",
      "description": "Invoice date in YYYY-MM-DD format"
    },
    "total_amount": {
      "type": "number",
      "description": "Total invoice amount in USD"
    }
  },
  "required": ["invoice_number", "invoice_date", "total_amount"]
}

Best Practices

1. Write Descriptive Field Names

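Avoid abbreviated names like these:
{
  "num": {"type": "string"},
  "amt": {"type": "number"},
  "dt": {"type": "string"}
}
Prefer full names such as invoice_number, total_amount, and invoice_date.
Why: Descriptive names help the AI understand what to extract and make your code more maintainable.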

2. Always Include Descriptions

Avoid leaving a field without a description, as in this example:
{
  "vendor_name": {
    "type": "string"
  }
}
Why: Descriptions provide crucial context that improves extraction accuracy, especially for ambiguous fields.
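
One way to enforce this rule before submitting a schema is a small lint pass. The sketch below assumes the schema layout used throughout this guide (fields under "named_entities", array element definitions under "items"). The helper name find_missing_descriptions is hypothetical and not part of any Documind SDK.

```python
def find_missing_descriptions(schema, path=""):
    """Return dotted paths of fields that have no non-empty description."""
    missing = []
    for name, field in schema.get("named_entities", {}).items():
        field_path = f"{path}.{name}" if path else name
        if not field.get("description", "").strip():
            missing.append(field_path)
        # Recurse into nested objects and into object-typed array items.
        if field.get("type") == "object":
            missing.extend(find_missing_descriptions(field, field_path))
        elif field.get("type") == "array":
            items = field.get("items", {})
            if items.get("type") == "object":
                missing.extend(find_missing_descriptions(items, field_path + "[]"))
    return missing


schema = {
    "type": "object",
    "named_entities": {
        "vendor_name": {"type": "string"},
        "invoice_date": {
            "type": "string",
            "description": "Invoice date in YYYY-MM-DD format",
        },
    },
}

print(find_missing_descriptions(schema))  # ['vendor_name']
```

An empty list means every field, including nested ones, carries a description.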

3. Use Specific Field Descriptions

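Avoid vague descriptions like this one:
{
  "date": {
    "type": "string",
    "description": "A date"
  }
}
Prefer a description that says which date is meant and in what format, for example "Invoice date in YYYY-MM-DD format".
Why: Specific descriptions reduce ambiguity when documents contain multiple dates.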

4. Specify Data Types Correctly

{
  "quantity": {
    "type": "number",           // Not "string"
    "description": "Quantity ordered"
  },
  "is_paid": {
    "type": "boolean",          // Not "string"
    "description": "Payment status"
  },
  "due_date": {
    "type": "string",
    "description": "Payment due date in YYYY-MM-DD format"
  }
}
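
To confirm that extracted values actually arrive in the declared types, a check along these lines may help. It is a minimal sketch that assumes the extraction result is a plain dictionary keyed by field name, which this guide does not show. The type_mismatches helper and the sample result are hypothetical.

```python
# Map schema type names to the Python types they should deserialize to.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def type_mismatches(named_entities, result):
    """Return names of extracted fields whose value does not match the schema type."""
    mismatches = []
    for name, field in named_entities.items():
        value = result.get(name)
        if value is None:
            continue  # missing fields are a separate concern
        expected = PYTHON_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        if field["type"] == "number" and isinstance(value, bool):
            mismatches.append(name)
        elif not isinstance(value, expected):
            mismatches.append(name)
    return mismatches


fields = {
    "quantity": {"type": "number", "description": "Quantity ordered"},
    "is_paid": {"type": "boolean", "description": "Payment status"},
    "due_date": {"type": "string", "description": "Payment due date in YYYY-MM-DD format"},
}
sample_result = {"quantity": "5", "is_paid": True, "due_date": "2024-03-15"}

print(type_mismatches(fields, sample_result))  # ['quantity']
```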

5. Mark Critical Fields as Required

{
  "type": "object",
  "named_entities": {
    "invoice_number": {"type": "string"},
    "total_amount": {"type": "number"},
    "notes": {"type": "string"}        // Optional
  },
  "required": ["invoice_number", "total_amount"]  // Only critical fields
}
Why: Required fields are flagged for review if confidence is low, ensuring accuracy where it matters most.
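
A "required" list only helps if every name in it matches a declared field. The sketch below checks that for a top-level schema, assuming the layout shown above. The helper name undeclared_required is hypothetical.

```python
def undeclared_required(schema):
    """Return required names that do not appear under named_entities."""
    declared = set(schema.get("named_entities", {}))
    return [name for name in schema.get("required", []) if name not in declared]


schema = {
    "type": "object",
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "notes": {"type": "string"},
    },
    # "total" is a deliberate typo for "total_amount".
    "required": ["invoice_number", "total"],
}

print(undeclared_required(schema))  # ['total']
```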

Working with Arrays

Simple Arrays

For lists of primitive values:
{
  "product_names": {
    "type": "array",
    "description": "List of product names mentioned",
    "items": {
      "type": "string"
    }
  }
}

Array of Objects (Tables)

For structured lists like line items:
{
  "line_items": {
    "type": "array",
    "description": "Invoice line items",
    "items": {
      "type": "object",
      "named_entities": {
        "description": {
          "type": "string",
          "description": "Product or service description"
        },
        "quantity": {
          "type": "number",
          "description": "Quantity ordered"
        },
        "unit_price": {
          "type": "number",
          "description": "Price per unit"
        },
        "total": {
          "type": "number",
          "description": "Line total (quantity × unit_price)"
        }
      },
      "required": ["description", "quantity", "unit_price"]
    }
  }
}
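
Because each line total is described as quantity × unit_price, extracted tables can be sanity-checked after the fact. This is a sketch under the assumption that line items come back as a list of dictionaries. The helper name and the 0.01 rounding tolerance are illustrative choices, not Documind behavior.

```python
import math


def inconsistent_line_items(line_items, tolerance=0.01):
    """Return indexes of rows whose total differs from quantity * unit_price."""
    bad = []
    for index, item in enumerate(line_items):
        total = item.get("total")
        if total is None:
            continue  # "total" is not a required field in the schema above
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(total, expected, abs_tol=tolerance):
            bad.append(index)
    return bad


line_items = [
    {"description": "Widget", "quantity": 2, "unit_price": 9.99, "total": 19.98},
    {"description": "Gadget", "quantity": 1, "unit_price": 5.0, "total": 50.0},
]

print(inconsistent_line_items(line_items))  # [1]
```

Rows flagged this way are good candidates for manual review, since the mismatch may come from either the extraction or the source document.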

Working with Nested Objects

Simple Nesting

Group related fields:
{
  "vendor": {
    "type": "object",
    "description": "Vendor information",
    "named_entities": {
      "name": {
        "type": "string",
        "description": "Vendor company name"
      },
      "address": {
        "type": "string",
        "description": "Vendor mailing address"
      },
      "tax_id": {
        "type": "string",
        "description": "Vendor tax ID number"
      }
    },
    "required": ["name"]
  }
}

Deep Nesting

For complex structures:
{
  "customer": {
    "type": "object",
    "description": "Customer information",
    "named_entities": {
      "company": {
        "type": "object",
        "description": "Company details",
        "named_entities": {
          "name": {"type": "string"},
          "address": {
            "type": "object",
            "named_entities": {
              "street": {"type": "string"},
              "city": {"type": "string"},
              "state": {"type": "string"},
              "zip": {"type": "string"}
            }
          }
        }
      },
      "contact": {
        "type": "object",
        "description": "Contact person",
        "named_entities": {
          "name": {"type": "string"},
          "email": {"type": "string"},
          "phone": {"type": "string"}
        }
      }
    }
  }
}
Avoid nesting deeper than 3-4 levels. Consider flattening or splitting into multiple extractions for very complex schemas.
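
Nesting depth can be measured automatically. The sketch below counts object levels, treating the root object as level 1 and looking through arrays to their item definitions. That counting convention is an assumption, since this guide does not define how levels are counted, and nesting_depth is a hypothetical helper.

```python
def nesting_depth(field):
    """Count levels of object nesting, with the given object as level 1."""
    if field.get("type") == "array":
        # Arrays do not add a level themselves; measure their items.
        return nesting_depth(field.get("items", {}))
    if field.get("type") != "object":
        return 0
    children = field.get("named_entities", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


schema = {
    "type": "object",
    "named_entities": {
        "customer": {
            "type": "object",
            "named_entities": {
                "company": {
                    "type": "object",
                    "named_entities": {
                        "name": {"type": "string"},
                        "address": {
                            "type": "object",
                            "named_entities": {"street": {"type": "string"}},
                        },
                    },
                }
            },
        }
    },
}

print(nesting_depth(schema))  # 4
```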

Common Patterns

Invoice Schema

{
  "type": "object",
  "named_entities": {
    "invoice_number": {"type": "string", "description": "Invoice number"},
    "invoice_date": {"type": "string", "description": "Invoice date"},
    "due_date": {"type": "string", "description": "Payment due date"},
    "vendor": {
      "type": "object",
      "named_entities": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "customer": {
      "type": "object",
      "named_entities": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "named_entities": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    },
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "total": {"type": "number"}
  },
  "required": ["invoice_number", "total"]
}

Receipt Schema

{
  "type": "object",
  "named_entities": {
    "merchant_name": {"type": "string", "description": "Store or restaurant name"},
    "transaction_date": {"type": "string", "description": "Transaction date"},
    "transaction_time": {"type": "string", "description": "Transaction time"},
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "named_entities": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      }
    },
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "tip": {"type": "number"},
    "total": {"type": "number"},
    "payment_method": {"type": "string", "description": "Payment method used"}
  },
  "required": ["merchant_name", "total"]
}

Form Schema

{
  "type": "object",
  "named_entities": {
    "applicant": {
      "type": "object",
      "named_entities": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "date_of_birth": {"type": "string", "description": "Date of birth in YYYY-MM-DD format"},
        "ssn": {"type": "string", "description": "Social Security Number in format XXX-XX-XXXX"}
      }
    },
    "address": {
      "type": "object",
      "named_entities": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "zip": {"type": "string"}
      }
    },
    "employment": {
      "type": "object",
      "named_entities": {
        "employer_name": {"type": "string"},
        "position": {"type": "string"},
        "annual_income": {"type": "number"}
      }
    },
    "signature_date": {"type": "string"},
    "agreed_to_terms": {"type": "boolean"}
  },
  "required": ["applicant", "signature_date"]
}

Examples in Descriptions

Include examples to guide extraction:
{
  "invoice_number": {
    "type": "string",
    "description": "Invoice number (e.g., INV-2024-001, INV-2024-002)"
  },
  "phone": {
    "type": "string",
    "description": "Phone number in format (123) 456-7890 or 123-456-7890"
  }
}
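
A format stated in a description can also be verified after extraction. The regular expression below is one possible translation of the two phone formats named above. It is an illustrative check written for this guide, not something Documind applies on its own.

```python
import re

# Accepts "(123) 456-7890" or "123-456-7890", the two formats in the description.
PHONE_PATTERN = re.compile(r"(\(\d{3}\) |\d{3}-)\d{3}-\d{4}")


def matches_described_format(value):
    """Return True if the extracted phone number uses one of the described formats."""
    return PHONE_PATTERN.fullmatch(value) is not None


print(matches_described_format("(123) 456-7890"))  # True
print(matches_described_format("123-456-7890"))    # True
print(matches_described_format("1234567890"))      # False
```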

Common Mistakes

❌ Too Many Optional Fields

{
  "field1": {"type": "string"},
  "field2": {"type": "string"},
  // ... 50 more optional fields
  "required": []  // Nothing required!
}
Problem: Review workflow won’t trigger even for poor extractions.
Solution: Mark at least 2-3 critical fields as required.

❌ Ambiguous Field Names

{
  "date": {"type": "string"},  // Which date?
  "amount": {"type": "number"},  // Amount of what?
  "number": {"type": "string"}  // What number?
}
Problem: AI may extract the wrong data.
Solution: Use specific names: invoice_date, total_amount, invoice_number.
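
A simple name check can catch this mistake early. The set of generic names below is taken from the examples in this guide and is only a starting point. The helper flag_vague_names is hypothetical.

```python
# Generic or abbreviated names used as negative examples in this guide.
VAGUE_NAMES = {"date", "amount", "number", "num", "amt", "dt"}


def flag_vague_names(named_entities):
    """Return field names that are too generic to identify what should be extracted."""
    return [name for name in named_entities if name.lower() in VAGUE_NAMES]


fields = {
    "date": {"type": "string"},
    "amount": {"type": "number"},
    "invoice_number": {"type": "string"},
}

print(flag_vague_names(fields))  # ['date', 'amount']
```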

❌ Missing Descriptions

{
  "tax_id": {"type": "string"}
  // No description!
}
Problem: AI may confuse similar fields.
Solution: Always include descriptions.

❌ Wrong Data Types

{
  "quantity": {"type": "string"},  // Should be number
  "is_paid": {"type": "string"}    // Should be boolean
}
Problem: You’ll get strings like “5” instead of numbers, making calculations fail.
Solution: Use correct types.
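
The failure mode is easy to reproduce in Python. In this short illustration with made-up values, a string quantity either raises an error or silently produces the wrong result.

```python
quantity_as_string = "5"
quantity_as_number = 5

# Multiplying a string by an integer repeats it instead of doing arithmetic.
repeated = quantity_as_string * 2   # "55"
doubled = quantity_as_number * 2    # 10

# Adding a number to a string raises TypeError.
try:
    quantity_as_string + 1
    addition_failed = False
except TypeError:
    addition_failed = True

print(repeated, doubled, addition_failed)  # 55 10 True
```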

❌ Overly Complex Schemas

{
  "deeply": {
    "nested": {
      "structure": {
        "with": {
          "many": {
            "levels": {...}  // 6+ levels deep
          }
        }
      }
    }
  }
}
Problem: Harder to extract accurately, slower processing.
Solution: Flatten or split into multiple extractions.

Schema Testing

Iterative Development

  1. Start simple: Extract only 2-3 fields
  2. Test: Run on sample documents
  3. Validate: Check accuracy
  4. Expand: Add more fields
  5. Repeat: Until all needed data is extracted
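
The validation step can be made concrete by scoring each field against hand-labeled values for the sample documents. The sketch below assumes extraction results and labels are both plain dictionaries keyed by field name. The field_accuracy helper and the sample data are hypothetical.

```python
def field_accuracy(predictions, expected):
    """Return, per field, the fraction of documents where the extracted value matches the label."""
    accuracy = {}
    for name in expected[0]:
        correct = sum(
            1 for pred, truth in zip(predictions, expected)
            if pred.get(name) == truth[name]
        )
        accuracy[name] = correct / len(expected)
    return accuracy


predictions = [
    {"invoice_number": "INV-2024-001", "total_amount": 100.0},
    {"invoice_number": "INV-2024-002", "total_amount": 250.0},
]
expected = [
    {"invoice_number": "INV-2024-001", "total_amount": 100.0},
    {"invoice_number": "INV-2024-002", "total_amount": 205.0},
]

print(field_accuracy(predictions, expected))
# {'invoice_number': 1.0, 'total_amount': 0.5}
```

Fields with low scores are the ones whose names or descriptions are worth revising before more fields are added.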

A/B Testing

Test different schema approaches:
# Version A: Flat structure
schema_a = {
  "vendor_name": {"type": "string"},
  "vendor_address": {"type": "string"}
}

# Version B: Nested structure
schema_b = {
  "vendor": {
    "type": "object",
    "named_entities": {
      "name": {"type": "string"},
      "address": {"type": "string"}
    }
  }
}

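# Compare results on the same document.
# extract() and document_id are placeholders for your extraction call and a sample document ID.
results_a = extract(document_id, schema_a)
results_b = extract(document_id, schema_b)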
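
Flat and nested results have different shapes, so one way to compare them is to flatten the nested result first. This sketch assumes both results are plain dictionaries, since the response format is not shown in this guide. The flatten helper and the sample values are hypothetical.

```python
def flatten(result, prefix=""):
    """Collapse nested dictionaries into one level, joining keys with underscores."""
    flat = {}
    for key, value in result.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Stand-ins for what the two schema versions might return.
results_a = {"vendor_name": "Acme Corp", "vendor_address": "1 Main St"}
results_b = {"vendor": {"name": "Acme Corp", "address": "1 Main Street"}}

flat_b = flatten(results_b)
differences = {
    key: (results_a.get(key), flat_b.get(key))
    for key in results_a.keys() | flat_b.keys()
    if results_a.get(key) != flat_b.get(key)
}

print(differences)  # {'vendor_address': ('1 Main St', '1 Main Street')}
```

Each entry in differences pairs the flat-schema value with the nested-schema value, which shows where the two approaches disagree on the same document.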

Next Steps

Prompt Design

Optimize extraction prompts for better results

Invoice Tutorial

Apply schema design to invoice processing

Core Concepts

Understand schemas in the context of Documind

API Reference

See the extraction API documentation