Schema Design Guide

Introduction

A well-designed schema is crucial for accurate data extraction. This guide covers best practices, patterns, and anti-patterns for creating schemas that get the best results from Documind.

Schema Structure Basics

Minimum Viable Schema

Start simple and iterate:

{
  "type": "object",
  "named_entities": {
    "field_name": {
      "type": "string",
      "description": "Clear description of what this field contains"
    }
  },
  "required": ["field_name"]
}
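An easy slip in a schema like this is listing a name under "required" that is never defined as a field. A minimal sketch of a check for that, assuming the schema is held as a Python dict with the "named_entities" and "required" keys used in this guide. The helper name `undefined_required_fields` is hypothetical and not part of Documind:

```python
def undefined_required_fields(schema: dict) -> list[str]:
    """Return names listed under "required" that have no field definition."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


schema = {
    "type": "object",
    "named_entities": {
        "field_name": {
            "type": "string",
            "description": "Clear description of what this field contains",
        }
    },
    # "missing_field" is deliberately undefined to show what the check reports
    "required": ["field_name", "missing_field"],
}

print(undefined_required_fields(schema))  # ['missing_field']
```

Running a check like this before submitting a schema catches typos in the "required" list early.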

Complete Schema Template

A production-ready schema with all recommended fields:

{
  "type": "object",
  "title": "Invoice Schema",
  "description": "Schema for extracting invoice data",
  "named_entities": {
    "invoice_number": {
      "type": "string",
      "description": "The unique invoice identifier (e.g., INV-2024-001)"
    },
    "invoice_date": {
      "type": "string",
      "description": "Invoice date in YYYY-MM-DD format"
    },
    "total_amount": {
      "type": "number",
      "description": "Total invoice amount in USD"
    }
  },
  "required": ["invoice_number", "invoice_date", "total_amount"]
}

Best Practices

1. Write Descriptive Field Names

❌ Bad

{
  "num": {"type": "string"},
  "amt": {"type": "number"},
  "dt": {"type": "string"}
}

✅ Good

{
  "invoice_number": {"type": "string"},
  "total_amount": {"type": "number"},
  "invoice_date": {"type": "string"}
}

Why: Descriptive names help the AI understand what to extract and make your code more maintainable.

2. Always Include Descriptions

❌ Bad

{
  "vendor_name": {
    "type": "string"
  }
}

✅ Good

{
  "vendor_name": {
    "type": "string",
    "description": "The name of the vendor or supplier who issued the invoice"
  }
}

Why: Descriptions provide crucial context that improves extraction accuracy, especially for ambiguous fields.
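Missing descriptions are easy to overlook in larger schemas, so it can help to check for them automatically. The sketch below walks a field mapping, including nested objects and arrays of objects, and reports every field without a description. It assumes the fields are held as a Python dict in the shape shown in this guide. The function name is hypothetical:

```python
def fields_missing_description(fields: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of all fields that have no (or an empty) description."""
    missing = []
    for name, spec in fields.items():
        path = f"{prefix}{name}"
        if not spec.get("description", "").strip():
            missing.append(path)
        # Recurse into nested objects
        if spec.get("type") == "object":
            nested = spec.get("named_entities", {})
            missing += fields_missing_description(nested, path + ".")
        # Recurse into arrays whose items are objects
        items = spec.get("items", {})
        if spec.get("type") == "array" and items.get("type") == "object":
            nested = items.get("named_entities", {})
            missing += fields_missing_description(nested, path + "[].")
    return missing


fields = {
    "vendor_name": {"type": "string"},
    "vendor": {
        "type": "object",
        "description": "Vendor information",
        "named_entities": {"tax_id": {"type": "string"}},
    },
}

print(fields_missing_description(fields))  # ['vendor_name', 'vendor.tax_id']
```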

3. Use Specific Field Descriptions

❌ Vague

{
  "date": {
    "type": "string",
    "description": "A date"
  }
}

✅ Specific

{
  "invoice_date": {
    "type": "string",
    "description": "The date the invoice was issued, in YYYY-MM-DD format"
  }
}

Why: Specific descriptions reduce ambiguity when documents contain multiple dates.

4. Specify Data Types Correctly

{
  "quantity": {
    "type": "number",           // Not "string"
    "description": "Quantity ordered"
  },
  "is_paid": {
    "type": "boolean",          // Not "string"
    "description": "Whether the invoice has been paid"
  },
  "due_date": {
    "type": "string",
    "description": "Payment due date in YYYY-MM-DD format"
  }
}
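After extraction, it can be worth confirming that each value actually has the type the schema declares. The following is a rough sketch, assuming the extraction result has already been parsed into a plain Python dict keyed by field name. The real response shape may differ, and `type_mismatches` is a hypothetical helper:

```python
# Map schema type names to the Python types a parsed JSON value would have.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def type_mismatches(fields: dict, result: dict) -> dict:
    """Return {field: (declared_type, actual_type_name)} for wrongly typed values."""
    problems = {}
    for name, spec in fields.items():
        value = result.get(name)
        expected = PYTHON_TYPES.get(spec.get("type"))
        if value is None or expected is None:
            continue  # missing values and unknown types are out of scope here
        ok = isinstance(value, expected)
        # bool is a subclass of int in Python, so reject True/False for "number"
        if spec["type"] == "number" and isinstance(value, bool):
            ok = False
        if not ok:
            problems[name] = (spec["type"], type(value).__name__)
    return problems


fields = {
    "quantity": {"type": "number"},
    "is_paid": {"type": "boolean"},
    "due_date": {"type": "string"},
}
result = {"quantity": "5", "is_paid": True, "due_date": "2024-03-01"}

print(type_mismatches(fields, result))  # {'quantity': ('number', 'str')}
```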

5. Mark Critical Fields as Required

{
  "type": "object",
  "named_entities": {
    "invoice_number": {"type": "string"},
    "total_amount": {"type": "number"},
    "notes": {"type": "string"}        // Optional
  },
  "required": ["invoice_number", "total_amount"]  // Only critical fields
}

Why: Required fields are flagged for review if confidence is low, ensuring accuracy where it matters most.
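To make the review behavior concrete, here is a small illustration of the idea, not Documind's actual implementation. It assumes each extracted field comes with a confidence score between 0 and 1 and uses an arbitrary threshold of 0.8. The function name, the `confidences` dict, and the threshold are all assumptions made for this sketch:

```python
def fields_needing_review(schema: dict, confidences: dict, threshold: float = 0.8) -> list[str]:
    """Return required fields whose confidence is missing or below the threshold."""
    flagged = []
    for name in schema.get("required", []):
        score = confidences.get(name)  # None when no score was reported
        if score is None or score < threshold:
            flagged.append(name)
    return flagged


schema = {
    "type": "object",
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "notes": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}
# Hypothetical scores: "notes" is low too, but it is optional and so not flagged.
confidences = {"invoice_number": 0.97, "total_amount": 0.55, "notes": 0.30}

print(fields_needing_review(schema, confidences))  # ['total_amount']
```

The point of the example is that only fields listed under "required" take part in the check, which is why the choice of required fields matters.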

Working with Arrays

Simple Arrays

For lists of primitive values:

{
  "product_names": {
    "type": "array",
    "description": "List of product names mentioned",
    "items": {
      "type": "string"
    }
  }
}

Array of Objects (Tables)

For structured lists like line items:

{
  "line_items": {
    "type": "array",
    "description": "Invoice line items",
    "items": {
      "type": "object",
      "named_entities": {
        "description": {
          "type": "string",
          "description": "Product or service description"
        },
        "quantity": {
          "type": "number",
          "description": "Quantity ordered"
        },
        "unit_price": {
          "type": "number",
          "description": "Price per unit"
        },
        "total": {
          "type": "number",
          "description": "Line total (quantity × unit_price)"
        }
      },
      "required": ["description", "quantity", "unit_price"]
    }
  }
}
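Because the line total is described as quantity × unit_price, extracted line items can be sanity-checked with simple arithmetic. A minimal sketch, assuming the extracted line items arrive as a list of Python dicts with the field names from the schema above. The helper name and the 0.01 tolerance are illustrative choices:

```python
import math


def inconsistent_line_items(line_items: list[dict], tolerance: float = 0.01) -> list[int]:
    """Return indexes of line items whose total differs from quantity * unit_price."""
    bad = []
    for index, item in enumerate(line_items):
        total = item.get("total")
        if total is None:
            continue  # "total" is not required by the schema, so it may be absent
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(total, expected, abs_tol=tolerance):
            bad.append(index)
    return bad


line_items = [
    {"description": "Widget", "quantity": 3, "unit_price": 4.50, "total": 13.50},
    {"description": "Gadget", "quantity": 2, "unit_price": 10.00, "total": 25.00},
    {"description": "Setup fee", "quantity": 1, "unit_price": 99.00},
]

print(inconsistent_line_items(line_items))  # [1]
```

A mismatch does not say which of the three numbers is wrong, only that the row deserves a second look.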

Working with Arrays

Define arrays to extract repeating data:

{
  "line_items": {
    "type": "array",
    "description": "Line items in the invoice",
    "items": {...}
  }
}

Working with Nested Objects

Simple Nesting

Group related fields:

{
  "vendor": {
    "type": "object",
    "description": "Vendor information",
    "named_entities": {
      "name": {
        "type": "string",
        "description": "Vendor company name"
      },
      "address": {
        "type": "string",
        "description": "Vendor mailing address"
      },
      "tax_id": {
        "type": "string",
        "description": "Vendor tax ID number"
      }
    },
    "required": ["name"]
  }
}

Deep Nesting

For complex structures:

{
  "customer": {
    "type": "object",
    "description": "Customer information",
    "named_entities": {
      "company": {
        "type": "object",
        "description": "Company details",
        "named_entities": {
          "name": {"type": "string"},
          "address": {
            "type": "object",
            "named_entities": {
              "street": {"type": "string"},
              "city": {"type": "string"},
              "state": {"type": "string"},
              "zip": {"type": "string"}
            }
          }
        }
      },
      "contact": {
        "type": "object",
        "description": "Contact person",
        "named_entities": {
          "name": {"type": "string"},
          "email": {"type": "string"},
          "phone": {"type": "string"}
        }
      }
    }
  }
}

Avoid nesting deeper than 3-4 levels. Consider flattening or splitting into multiple extractions for very complex schemas.
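One way to keep an eye on this guideline is to measure nesting depth programmatically. The sketch below counts a flat set of fields as depth 1 and adds one for each level of nested object, including objects inside arrays. It assumes the same dict layout as the examples in this guide, and `nesting_depth` is a hypothetical helper:

```python
def nesting_depth(fields: dict) -> int:
    """Depth of a field mapping: 1 for flat fields, +1 per level of nested object."""
    deepest_child = 0
    for spec in fields.values():
        if spec.get("type") == "object":
            child = spec.get("named_entities", {})
        elif spec.get("type") == "array":
            child = spec.get("items", {}).get("named_entities", {})
        else:
            child = {}
        if child:
            deepest_child = max(deepest_child, nesting_depth(child))
    return 1 + deepest_child


# Trimmed version of the customer example: customer > company > address > street
fields = {
    "customer": {
        "type": "object",
        "named_entities": {
            "company": {
                "type": "object",
                "named_entities": {
                    "name": {"type": "string"},
                    "address": {
                        "type": "object",
                        "named_entities": {"street": {"type": "string"}},
                    },
                },
            }
        },
    }
}

print(nesting_depth(fields))  # 4
```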

Common Patterns

Invoice Schema

{
  "type": "object",
  "named_entities": {
    "invoice_number": {"type": "string", "description": "Invoice number"},
    "invoice_date": {"type": "string", "description": "Invoice date"},
    "due_date": {"type": "string", "description": "Payment due date"},
    "vendor": {
      "type": "object",
      "named_entities": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "customer": {
      "type": "object",
      "named_entities": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "named_entities": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    },
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "total": {"type": "number"}
  },
  "required": ["invoice_number", "total"]
}

Receipt Schema

{
  "type": "object",
  "named_entities": {
    "merchant_name": {"type": "string", "description": "Store or restaurant name"},
    "transaction_date": {"type": "string", "description": "Transaction date"},
    "transaction_time": {"type": "string", "description": "Transaction time"},
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "named_entities": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      }
    },
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "tip": {"type": "number"},
    "total": {"type": "number"},
    "payment_method": {"type": "string", "description": "Payment method used"}
  },
  "required": ["merchant_name", "total"]
}

Form Schema

{
  "type": "object",
  "named_entities": {
    "applicant": {
      "type": "object",
      "named_entities": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "date_of_birth": {"type": "string", "description": "Date of birth in YYYY-MM-DD format"},
        "ssn": {"type": "string", "description": "Social Security Number in format XXX-XX-XXXX"}
      }
    },
    "address": {
      "type": "object",
      "named_entities": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "zip": {"type": "string"}
      }
    },
    "employment": {
      "type": "object",
      "named_entities": {
        "employer_name": {"type": "string"},
        "position": {"type": "string"},
        "annual_income": {"type": "number"}
      }
    },
    "signature_date": {"type": "string"},
    "agreed_to_terms": {"type": "boolean"}
  },
  "required": ["applicant", "signature_date"]
}

Examples in Descriptions

Include examples to guide extraction:

{
  "invoice_number": {
    "type": "string",
    "description": "Invoice number (e.g., INV-2024-001, INV-2024-002)"
  },
  "phone": {
    "type": "string",
    "description": "Phone number in format (123) 456-7890 or 123-456-7890"
  }
}
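When a description names a format, the same format can be checked on the extracted values. The following sketch uses regular expressions that mirror the two descriptions above. The invoice pattern assumes the INV-YYYY-NNN shape suggested by the examples, which real invoice numbers may not follow, and `format_violations` is a hypothetical helper:

```python
import re

# Patterns mirroring the formats named in the field descriptions (assumed shapes).
FORMATS = {
    "invoice_number": re.compile(r"INV-\d{4}-\d{3}"),
    "phone": re.compile(r"\(\d{3}\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}"),
}


def format_violations(result: dict) -> list[str]:
    """Return names of extracted fields whose value does not match its expected format."""
    return [
        name
        for name, pattern in FORMATS.items()
        if name in result and not pattern.fullmatch(str(result[name]))
    ]


result = {"invoice_number": "INV-2024-001", "phone": "123.456.7890"}

print(format_violations(result))  # ['phone']
```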

Common Mistakes

❌ Too Many Optional Fields

{
  "field1": {"type": "string"},
  "field2": {"type": "string"},
  // ... 50 more optional fields
  "required": []  // Nothing required!
}

Problem: Review workflow won’t trigger even for poor extractions.
Solution: Mark at least 2-3 critical fields as required.

❌ Ambiguous Field Names

{
  "date": {"type": "string"},  // Which date?
  "amount": {"type": "number"},  // Amount of what?
  "number": {"type": "string"}  // What number?
}

Problem: AI may extract the wrong data.
Solution: Use specific names: invoice_date, total_amount, invoice_number.

❌ Missing Descriptions

{
  "tax_id": {"type": "string"}
  // No description!
}

Problem: AI may confuse similar fields.
Solution: Always include descriptions.

❌ Wrong Data Types

{
  "quantity": {"type": "string"},  // Should be number
  "is_paid": {"type": "string"}    // Should be boolean
}

Problem: You’ll get strings like “5” instead of numbers, making calculations fail.
Solution: Use correct types.
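A short Python illustration of how a string-typed quantity can go wrong downstream. Depending on the operation, the result is either a silently wrong value or an exception. The `line_total` function is just a stand-in for any calculation on extracted values:

```python
def line_total(quantity, unit_price):
    """Multiply quantity by unit price, as downstream code typically would."""
    return quantity * unit_price


print(line_total(5, 2))    # 10
print(line_total("5", 2))  # 55  (the string "5" repeated twice, not 10)

try:
    line_total("5", 2.5)
except TypeError as error:
    # Multiplying a string by a float is not allowed at all
    print(f"TypeError: {error}")
```

The silent case is the more dangerous one, since nothing signals that the total is wrong.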

❌ Overly Complex Schemas

{
  "deeply": {
    "nested": {
      "structure": {
        "with": {
          "many": {
            "levels": {...}  // 6+ levels deep
          }
        }
      }
    }
  }
}

Problem: Harder to extract accurately, slower processing.
Solution: Flatten or split into multiple extractions.

Schema Testing

Iterative Development

1. Start simple: Extract only 2-3 fields
2. Test: Run on sample documents
3. Validate: Check accuracy
4. Expand: Add more fields
5. Repeat: Until all needed data is extracted
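The "Validate" step is easier to repeat if accuracy is computed the same way each round. A minimal sketch, assuming you have hand-labeled expected values for a few sample documents and the corresponding extraction results as plain dicts. The function name and the sample data are made up for illustration:

```python
def field_accuracy(expected_docs: list[dict], extracted_docs: list[dict]) -> dict:
    """Fraction of documents in which each expected field was extracted exactly."""
    totals, correct = {}, {}
    for expected, extracted in zip(expected_docs, extracted_docs):
        for name, value in expected.items():
            totals[name] = totals.get(name, 0) + 1
            if extracted.get(name) == value:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / totals[name] for name in totals}


# Hand-labeled values for two sample documents (illustrative data)
expected_docs = [
    {"invoice_number": "INV-2024-001", "total_amount": 120.0},
    {"invoice_number": "INV-2024-002", "total_amount": 80.5},
]
# What the extraction returned for the same two documents (illustrative data)
extracted_docs = [
    {"invoice_number": "INV-2024-001", "total_amount": 120.0},
    {"invoice_number": "INV-2024-002", "total_amount": 85.0},
]

print(field_accuracy(expected_docs, extracted_docs))
# {'invoice_number': 1.0, 'total_amount': 0.5}
```

Tracking these per-field numbers across iterations shows whether a newly added field or a reworded description helped or hurt.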

A/B Testing

Test different schema approaches:

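# Version A: Flat structure
schema_a = {
  "type": "object",
  "named_entities": {
    "vendor_name": {"type": "string", "description": "Vendor company name"},
    "vendor_address": {"type": "string", "description": "Vendor mailing address"}
  }
}

# Version B: Nested structure
schema_b = {
  "type": "object",
  "named_entities": {
    "vendor": {
      "type": "object",
      "description": "Vendor information",
      "named_entities": {
        "name": {"type": "string", "description": "Vendor company name"},
        "address": {"type": "string", "description": "Vendor mailing address"}
      }
    }
  }
}

# Compare results on the same document.
# extract() and document_id are placeholders for your own extraction call
# and document; they are not defined in this snippet.
results_a = extract(document_id, schema_a)
results_b = extract(document_id, schema_b)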
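A flat schema and a nested schema return results in different shapes, so comparing them directly takes a small normalization step. The sketch below flattens nested results by joining keys with "_" (so vendor > name becomes vendor_name) and then lists the fields on which the two versions disagree. The helper names and the sample results are hypothetical and only illustrate the comparison:

```python
def flatten(result: dict, prefix: str = "") -> dict:
    """Flatten nested dicts, joining keys with "_" (vendor > name -> vendor_name)."""
    flat = {}
    for key, value in result.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        else:
            flat[name] = value
    return flat


def differing_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields where the results differ."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key), flat_b.get(key))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key) != flat_b.get(key)
    }


# Illustrative results for the flat and nested schema versions
sample_a = {"vendor_name": "Acme Corp", "vendor_address": "1 Main St"}
sample_b = {"vendor": {"name": "Acme Corp", "address": "1 Main Street"}}

print(differing_fields(sample_a, sample_b))
# {'vendor_address': ('1 Main St', '1 Main Street')}
```

Fields that differ are the ones worth checking against the source document to decide which schema version extracts more accurately.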

Next Steps

Prompt Design

Optimize extraction prompts for better results

Invoice Tutorial

Apply schema design to invoice processing

Core Concepts

Understand schemas in the context of Documind

API Reference

See the extraction API documentation
