Introduction

A well-designed schema is crucial for accurate data extraction. This guide covers best practices, patterns, and anti-patterns for creating schemas that get the best results from Documind.

Schema Structure Basics

Minimum Viable Schema

Start simple and iterate:
{
  "type": "object",
  "named_entities": {
    "field_name": {
      "type": "string",
      "description": "Clear description of what this field contains"
    }
  },
  "required": ["field_name"]
}
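
As a schema grows, it is easy to list a name under required that is not defined under named_entities. The following is a minimal Python sketch of a check for that; the helper name undefined_required_fields is hypothetical and not part of Documind.

```python
def undefined_required_fields(schema: dict) -> list[str]:
    """Return names listed in "required" that are missing from "named_entities"."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


# Illustrative schema: "critical_field" is required but never defined.
schema = {
    "type": "object",
    "named_entities": {
        "field_name": {
            "type": "string",
            "description": "Clear description of what this field contains",
        }
    },
    "required": ["field_name", "critical_field"],
}

print(undefined_required_fields(schema))  # ['critical_field']
```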

Complete Schema Template

A production-ready schema with all recommended fields:
{
  "type": "object",
  "title": "Invoice Schema",
  "description": "Schema for extracting invoice data",
  "named_entities": {
    "invoice_number": {
      "type": "string",
      "description": "The unique invoice identifier (e.g., INV-2024-001)"
    },
    "invoice_date": {
      "type": "string",
      "description": "Invoice date in YYYY-MM-DD format"
    },
    "total_amount": {
      "type": "number",
      "description": "Total invoice amount in USD"
    }
  },
  "required": ["invoice_number", "invoice_date", "total_amount"]
}

Best Practices

1. Write Descriptive Field Names

{
  "num": {"type": "string"},
  "amt": {"type": "number"},
  "dt": {"type": "string"}
}
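Why: Descriptive names help the AI understand what to extract and make your code more maintainable. Abbreviations like num, amt, and dt above are what to avoid; prefer names such as invoice_number, total_amount, and invoice_date.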
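
A rough way to catch abbreviated names before they reach production is to flag very short field names. This Python sketch assumes a cutoff of four characters, which is an arbitrary choice; the helper short_field_names is hypothetical.

```python
def short_field_names(fields: dict, min_length: int = 4) -> list[str]:
    """Return field names shorter than min_length characters (a rough heuristic)."""
    return [name for name in fields if len(name) < min_length]


abbreviated = {
    "num": {"type": "string"},
    "amt": {"type": "number"},
    "dt": {"type": "string"},
}
descriptive = {
    "invoice_number": {"type": "string"},
    "total_amount": {"type": "number"},
    "invoice_date": {"type": "string"},
}

print(short_field_names(abbreviated))  # ['num', 'amt', 'dt']
print(short_field_names(descriptive))  # []
```

A length check will also flag legitimate short names such as tax or zip, so treat its output as a prompt for review rather than a rule.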

2. Always Include Descriptions

{
  "vendor_name": {
    "type": "string"
  }
}
Why: Descriptions provide crucial context that improves extraction accuracy, especially for ambiguous fields. The vendor_name field above has no description, so the AI has only the field name to go on.

3. Use Specific Field Descriptions

{
  "date": {
    "type": "string",
    "description": "A date"
  }
}
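Why: Specific descriptions reduce ambiguity when documents contain multiple dates. A description like "A date" above adds nothing beyond the field name; say which date is meant and in what format, for example "Invoice date in YYYY-MM-DD format".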
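
Missing and vague descriptions can both be caught with a simple word count. The sketch below is one possible Python check, assuming that a description with fewer than three words is too vague; that threshold and the helper name weak_descriptions are assumptions.

```python
def weak_descriptions(fields: dict, min_words: int = 3) -> list[str]:
    """Return fields whose description is missing or shorter than min_words words."""
    weak = []
    for name, spec in fields.items():
        description = spec.get("description", "")
        if len(description.split()) < min_words:
            weak.append(name)
    return weak


fields = {
    "vendor_name": {"type": "string"},
    "date": {"type": "string", "description": "A date"},
    "invoice_date": {
        "type": "string",
        "description": "Invoice date in YYYY-MM-DD format",
    },
}

print(weak_descriptions(fields))  # ['vendor_name', 'date']
```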

4. Specify Data Types Correctly

{
  "quantity": {
    "type": "number",           // Not "string"
    "description": "Quantity ordered"
  },
  "is_paid": {
    "type": "boolean",          // Not "string"
    "description": "Payment status"
  },
  "due_date": {
    "type": "string",
    "description": "Payment due date in YYYY-MM-DD format"
  }
}
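
After extraction, it can help to confirm that each value has the declared type and that date strings follow the requested pattern. The following Python sketch assumes the extracted result is a plain dict keyed by field name, which this guide does not confirm; type_mismatches and is_iso_date are hypothetical helpers.

```python
from datetime import datetime

# Map schema type names to the Python types an extracted value should have.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def type_mismatches(fields: dict, extracted: dict) -> list[str]:
    """Return names of extracted fields whose value does not match the declared type."""
    mismatches = []
    for name, spec in fields.items():
        if name not in extracted:
            continue  # missing fields are a separate concern
        value = extracted[name]
        # bool is a subclass of int in Python, so exclude it from "number" explicitly.
        if spec["type"] == "number" and isinstance(value, bool):
            mismatches.append(name)
        elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
            mismatches.append(name)
    return mismatches


def is_iso_date(value: str) -> bool:
    """Return True if value parses with the %Y-%m-%d pattern."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


fields = {
    "quantity": {"type": "number"},
    "is_paid": {"type": "boolean"},
    "due_date": {"type": "string"},
}
# Made-up sample result: quantity came back as a string.
extracted = {"quantity": "5", "is_paid": True, "due_date": "2024-03-15"}

print(type_mismatches(fields, extracted))  # ['quantity']
print(is_iso_date(extracted["due_date"]))  # True
```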

5. Mark Critical Fields as Required

{
  "type": "object",
  "named_entities": {
    "invoice_number": {"type": "string"},
    "total_amount": {"type": "number"},
    "notes": {"type": "string"}        // Optional
  },
  "required": ["invoice_number", "total_amount"]  // Only critical fields
}
Why: Required fields are flagged for review if confidence is low, ensuring accuracy where it matters most.
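
This guide does not show how confidence is reported or what threshold Documind uses, so the following Python sketch is only a client-side illustration of the idea. It assumes a hypothetical mapping from field name to a confidence score between 0 and 1 and an arbitrary threshold of 0.8.

```python
def fields_needing_review(
    required: list[str],
    confidences: dict[str, float],
    threshold: float = 0.8,
) -> list[str]:
    """Return required fields whose confidence is below threshold or missing."""
    return [name for name in required if confidences.get(name, 0.0) < threshold]


required = ["invoice_number", "total_amount"]
# Made-up confidence scores for illustration only.
confidences = {"invoice_number": 0.97, "total_amount": 0.62, "notes": 0.40}

print(fields_needing_review(required, confidences))  # ['total_amount']
```

Note that notes has the lowest score but is not flagged, because it is not listed as required.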

Working with Arrays

Simple Arrays

For lists of primitive values:
{
  "product_names": {
    "type": "array",
    "description": "List of product names mentioned",
    "items": {
      "type": "string"
    }
  }
}

Array of Objects (Tables)

For structured lists like line items:
{
  "line_items": {
    "type": "array",
    "description": "Invoice line items",
    "items": {
      "type": "object",
      "named_entities": {
        "description": {
          "type": "string",
          "description": "Product or service description"
        },
        "quantity": {
          "type": "number",
          "description": "Quantity ordered"
        },
        "unit_price": {
          "type": "number",
          "description": "Price per unit"
        },
        "total": {
          "type": "number",
          "description": "Line total (quantity × unit_price)"
        }
      },
      "required": ["description", "quantity", "unit_price"]
    }
  }
}
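
Because each line total should equal quantity × unit_price, extracted tables can be sanity-checked with simple arithmetic. This is a minimal Python sketch, assuming line items arrive as a list of dicts and that a tolerance of 0.01 is acceptable for rounding; the helper name and sample values are made up.

```python
import math


def inconsistent_line_items(line_items: list[dict], tolerance: float = 0.01) -> list[int]:
    """Return indexes of line items whose total differs from quantity * unit_price."""
    bad = []
    for index, item in enumerate(line_items):
        if "total" not in item:
            continue  # "total" is not required in the schema above
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(item["total"], expected, abs_tol=tolerance):
            bad.append(index)
    return bad


line_items = [
    {"description": "Widget", "quantity": 2, "unit_price": 9.99, "total": 19.98},
    {"description": "Gadget", "quantity": 3, "unit_price": 5.00, "total": 16.00},
]

print(inconsistent_line_items(line_items))  # [1]
```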

Working with Nested Objects

Simple Nesting

Group related fields:
{
  "vendor": {
    "type": "object",
    "description": "Vendor information",
    "named_entities": {
      "name": {
        "type": "string",
        "description": "Vendor company name"
      },
      "address": {
        "type": "string",
        "description": "Vendor mailing address"
      },
      "tax_id": {
        "type": "string",
        "description": "Vendor tax ID number"
      }
    },
    "required": ["name"]
  }
}

Deep Nesting

For complex structures:
{
  "customer": {
    "type": "object",
    "description": "Customer information",
    "named_entities": {
      "company": {
        "type": "object",
        "description": "Company details",
        "named_entities": {
          "name": {"type": "string"},
          "address": {
            "type": "object",
            "named_entities": {
              "street": {"type": "string"},
              "city": {"type": "string"},
              "state": {"type": "string"},
              "zip": {"type": "string"}
            }
          }
        }
      },
      "contact": {
        "type": "object",
        "description": "Contact person",
        "named_entities": {
          "name": {"type": "string"},
          "email": {"type": "string"},
          "phone": {"type": "string"}
        }
      }
    }
  }
}
Avoid nesting deeper than 3-4 levels. Consider flattening or splitting into multiple extractions for very complex schemas.
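
To keep an eye on the depth guideline, nesting can be measured programmatically. The Python sketch below counts levels of named_entities under a schema and treats arrays as transparent wrappers around their items; both the counting convention and the helper name nesting_depth are assumptions.

```python
def nesting_depth(spec: dict) -> int:
    """Count the levels of nested fields below spec."""
    if spec.get("type") == "array":
        # An array adds no level of its own; look at what it contains.
        return nesting_depth(spec.get("items", {}))
    children = spec.get("named_entities", {})
    if not children:
        return 0
    return 1 + max(nesting_depth(child) for child in children.values())


# Trimmed version of the customer example: customer > company > address > street.
schema = {
    "type": "object",
    "named_entities": {
        "customer": {
            "type": "object",
            "named_entities": {
                "company": {
                    "type": "object",
                    "named_entities": {
                        "name": {"type": "string"},
                        "address": {
                            "type": "object",
                            "named_entities": {"street": {"type": "string"}},
                        },
                    },
                }
            },
        }
    },
}

print(nesting_depth(schema))  # 4
```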

Common Patterns

Invoice Schema

{
  "type": "object",
  "named_entities": {
    "invoice_number": {"type": "string", "description": "Invoice number"},
    "invoice_date": {"type": "string", "description": "Invoice date"},
    "due_date": {"type": "string", "description": "Payment due date"},
    "vendor": {
      "type": "object",
      "named_entities": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "customer": {
      "type": "object",
      "named_entities": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "named_entities": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    },
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "total": {"type": "number"}
  },
  "required": ["invoice_number", "total"]
}

Receipt Schema

{
  "type": "object",
  "named_entities": {
    "merchant_name": {"type": "string", "description": "Store or restaurant name"},
    "transaction_date": {"type": "string", "description": "Transaction date"},
    "transaction_time": {"type": "string", "description": "Transaction time"},
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "named_entities": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      }
    },
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "tip": {"type": "number"},
    "total": {"type": "number"},
    "payment_method": {"type": "string", "description": "Payment method used"}
  },
  "required": ["merchant_name", "total"]
}

Form Schema

{
  "type": "object",
  "named_entities": {
    "applicant": {
      "type": "object",
      "named_entities": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "date_of_birth": {"type": "string", "description": "Date of birth in YYYY-MM-DD format"},
        "ssn": {"type": "string", "description": "Social Security Number in format XXX-XX-XXXX"}
      }
    },
    "address": {
      "type": "object",
      "named_entities": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "zip": {"type": "string"}
      }
    },
    "employment": {
      "type": "object",
      "named_entities": {
        "employer_name": {"type": "string"},
        "position": {"type": "string"},
        "annual_income": {"type": "number"}
      }
    },
    "signature_date": {"type": "string"},
    "agreed_to_terms": {"type": "boolean"}
  },
  "required": ["applicant", "signature_date"]
}

Examples in Descriptions

Include examples to guide extraction:
{
  "invoice_number": {
    "type": "string",
    "description": "Invoice number (e.g., INV-2024-001, INV-2024-002)"
  },
  "phone": {
    "type": "string",
    "description": "Phone number in format (123) 456-7890 or 123-456-7890"
  }
}
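
The same example formats can double as post-extraction checks. The Python sketch below uses regular expressions derived from the examples above; the patterns are assumptions, since real invoice numbers and phone numbers may follow other conventions.

```python
import re

# Patterns inferred from the example formats in the descriptions above.
INVOICE_NUMBER = re.compile(r"INV-\d{4}-\d{3}")
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}")


def matches_format(pattern: re.Pattern, value: str) -> bool:
    """Return True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


print(matches_format(INVOICE_NUMBER, "INV-2024-001"))  # True
print(matches_format(PHONE, "(123) 456-7890"))         # True
print(matches_format(PHONE, "1234567890"))             # False
```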

Common Mistakes

❌ Too Many Optional Fields

{
  "field1": {"type": "string"},
  "field2": {"type": "string"},
  // ... 50 more optional fields
  "required": []  // Nothing required!
}
Problem: With nothing marked as required, the review workflow won’t trigger even for poor extractions.
Solution: Mark at least 2-3 critical fields as required.

❌ Ambiguous Field Names

{
  "date": {"type": "string"},  // Which date?
  "amount": {"type": "number"},  // Amount of what?
  "number": {"type": "string"}  // What number?
}
Problem: AI may extract the wrong data.
Solution: Use specific names: invoice_date, total_amount, invoice_number.

❌ Missing Descriptions

{
  "tax_id": {"type": "string"}
  // No description!
}
Problem: AI may confuse similar fields.
Solution: Always include descriptions.

❌ Wrong Data Types

{
  "quantity": {"type": "string"},  // Should be number
  "is_paid": {"type": "string"}    // Should be boolean
}
Problem: You’ll get strings like “5” instead of numbers, making calculations fail.
Solution: Declare quantity as number and is_paid as boolean.
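
The effect of a string-typed quantity is easy to reproduce in Python. This short sketch uses made-up values to show that arithmetic on a string either gives a wrong result silently or raises an error.

```python
quantity_as_string = "5"  # value typed as "string"
quantity_as_number = 5    # value typed as "number"
unit_price = 2.5

# With a real number the arithmetic works as expected.
line_total = quantity_as_number * unit_price

# With a string, multiplying by an int silently repeats the text...
repeated = quantity_as_string * 2

# ...and multiplying by a float raises TypeError.
try:
    quantity_as_string * unit_price
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(line_total, repeated, raised_type_error)  # 12.5 55 True
```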

❌ Overly Complex Schemas

{
  "deeply": {
    "nested": {
      "structure": {
        "with": {
          "many": {
            "levels": {...}  // 6+ levels deep
          }
        }
      }
    }
  }
}
Problem: Harder to extract accurately, slower processing.
Solution: Flatten or split into multiple extractions.
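
Flattening can be done mechanically by joining parent and child names. The Python sketch below is one possible approach, assuming an underscore separator; the helper flatten_fields is hypothetical.

```python
def flatten_fields(fields: dict, prefix: str = "") -> dict:
    """Collapse nested object fields into one level, joining names with "_"."""
    flat = {}
    for name, spec in fields.items():
        full_name = f"{prefix}_{name}" if prefix else name
        if spec.get("type") == "object" and "named_entities" in spec:
            flat.update(flatten_fields(spec["named_entities"], full_name))
        else:
            flat[full_name] = spec
    return flat


nested = {
    "vendor": {
        "type": "object",
        "named_entities": {
            "name": {"type": "string"},
            "address": {"type": "string"},
        },
    }
}

print(flatten_fields(nested))
# {'vendor_name': {'type': 'string'}, 'vendor_address': {'type': 'string'}}
```

The parent object's own description is dropped in the process, so it is worth folding that context into the descriptions of the flattened fields.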

Schema Testing

Iterative Development

  1. Start simple: Extract only 2-3 fields
  2. Test: Run on sample documents
  3. Validate: Check accuracy
  4. Expand: Add more fields
  5. Repeat: Until all needed data is extracted
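
For the validation step, a simple field-level accuracy score against hand-labeled values is often enough to track progress between iterations. This Python sketch assumes flat result dicts and exact-match comparison; the helper name and sample values are made up.

```python
def field_accuracy(extracted: dict, expected: dict) -> float:
    """Return the fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 1.0
    correct = sum(
        1 for name, value in expected.items() if extracted.get(name) == value
    )
    return correct / len(expected)


# Hand-labeled values for one sample document (illustrative).
expected = {"invoice_number": "INV-2024-001", "total_amount": 150.0}
extracted = {"invoice_number": "INV-2024-001", "total_amount": 105.0}

print(field_accuracy(extracted, expected))  # 0.5
```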

A/B Testing

Test different schema approaches:
# Version A: Flat structure
schema_a = {
  "vendor_name": {"type": "string"},
  "vendor_address": {"type": "string"}
}

# Version B: Nested structure
schema_b = {
  "vendor": {
    "type": "object",
    "named_entities": {
      "name": {"type": "string"},
      "address": {"type": "string"}
    }
  }
}

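# Compare results
# extract() and document_id are placeholders for your own extraction call
# and a sample document; they are not defined in this snippet.
results_a = extract(document_id, schema_a)
results_b = extract(document_id, schema_b)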
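
Flat and nested schemas return differently shaped results, so they need to be normalized before comparison. The Python sketch below flattens nested results with an underscore separator and lists the fields that differ; it assumes results are plain dicts, and the sample values stand in for real extraction output.

```python
def flatten_result(result: dict, prefix: str = "") -> dict:
    """Collapse nested result dicts into one level, joining keys with "_"."""
    flat = {}
    for key, value in result.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_result(value, name))
        else:
            flat[name] = value
    return flat


def differing_fields(result_a: dict, result_b: dict) -> list[str]:
    """Return sorted field names whose values differ after flattening."""
    flat_a, flat_b = flatten_result(result_a), flatten_result(result_b)
    names = flat_a.keys() | flat_b.keys()
    return sorted(name for name in names if flat_a.get(name) != flat_b.get(name))


# Made-up results standing in for the output of the two schema versions.
sample_a = {"vendor_name": "Acme Corp", "vendor_address": "1 Main St"}
sample_b = {"vendor": {"name": "Acme Corp", "address": "1 Main Street"}}

print(differing_fields(sample_a, sample_b))  # ['vendor_address']
```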

Next Steps