
Endpoint

POST /extract/{document_id}
Extract structured information from an uploaded document using a defined schema. Choose between Basic, VLM, or Advanced extraction modes based on your accuracy and speed requirements.

Authentication

X-API-Key
string
required
Your unique API key, sent in the X-API-Key request header.

Path Parameters

document_id
string
required
UUID of the uploaded document. Obtained from the /upload endpoint.

Request Body

schema
object
required
JSON Schema defining the structure of data to extract. Uses named_entities format.
{
  "named_entities": {
    "field_name": {
      "type": "string|number|boolean|array|object",
      "description": "Field description for AI context"
    }
  },
  "required": ["field1", "field2"]
}
prompt
string
Additional instructions for extraction. Optional but recommended for complex documents. Default: "No additional instructions provided."
model
string
For Basic Extraction only. Specify the AI model to use:
  • openai-gpt-4o (6 credits/page) - Most accurate
  • openai-gpt-4.1 (4 credits/page) - Balanced
  • google-gemini-2.0-flash (2 credits/page) - Fastest
If provided, uses Basic extraction mode (single model, no confidence scores).
extraction_mode
string
For VLM Extraction only. Set to:
  • vlm (10 credits/page) - Vision-based extraction for scanned docs
For Advanced extraction (15 credits/page): omit both extraction_mode and model.
For Basic extraction: set the model parameter instead.
review_threshold
number
default:"80"
Confidence threshold (0-100) for automatic review flagging. Only applies to Advanced/VLM modes. A field is flagged for review when its confidence is below this threshold and it is marked as required in the schema.
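
The mode-selection rules above can be sketched as a small request-body builder. This is a minimal illustration, not part of the API: the helper name build_extract_body and its mode argument are hypothetical, and dropping review_threshold in Basic mode is an assumption based on the note that the threshold only applies to Advanced/VLM modes.

```python
from typing import Any, Dict, Optional

# Model identifiers listed for Basic extraction in this reference.
BASIC_MODELS = {"openai-gpt-4o", "openai-gpt-4.1", "google-gemini-2.0-flash"}


def build_extract_body(
    schema: Dict[str, Any],
    mode: str = "advanced",
    model: Optional[str] = None,
    prompt: Optional[str] = None,
    review_threshold: Optional[float] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a request body for the chosen mode."""
    body: Dict[str, Any] = {"schema": schema}
    if prompt is not None:
        body["prompt"] = prompt

    if mode == "basic":
        # Basic mode is selected by sending "model".
        if model not in BASIC_MODELS:
            raise ValueError(f"Basic mode needs one of {sorted(BASIC_MODELS)}")
        body["model"] = model
    elif mode == "vlm":
        # VLM mode is selected by sending extraction_mode="vlm".
        body["extraction_mode"] = "vlm"
    elif mode != "advanced":
        raise ValueError(f"unknown mode: {mode!r}")
    # Advanced mode: neither "model" nor "extraction_mode" is sent.

    # Assumption: the threshold is left out in Basic mode, where it has no effect.
    if review_threshold is not None and mode != "basic":
        if not 0 <= review_threshold <= 100:
            raise ValueError("review_threshold must be between 0 and 100")
        body["review_threshold"] = review_threshold
    return body
```

For example, build_extract_body(schema, mode="vlm", review_threshold=85) yields a body containing extraction_mode and review_threshold but no model key.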

Response

document_id
string
UUID of the processed document.
results
object
Extracted data matching your schema structure. Fields are ordered according to schema definition.
needs_review
boolean
Whether this extraction requires human review. true if any required fields have confidence below the review threshold.
needs_review_metadata
object
Metadata about fields needing review, including per-field confidence_scores and review_flags. Populated only in Advanced/VLM modes; in Basic mode it is an empty object.
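
If you want to apply a different threshold on the client without re-running extraction, the nested confidence scores can be flattened and compared locally. The sketch below is a hypothetical helper, not the server's logic: it ignores whether a field is marked required, and the dotted path format is only an illustration.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_scores(scores: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, float]]:
    """Yield (dotted_path, score) pairs from a nested confidence_scores object."""
    for key, value in scores.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Array and object fields nest their scores one level down.
            yield from iter_scores(value, prefix=f"{path}.")
        else:
            yield path, float(value)


def below_threshold(scores: Dict[str, Any], threshold: float = 80.0) -> List[Tuple[str, float]]:
    """Return every leaf score strictly below the threshold, in document order."""
    return [(path, score) for path, score in iter_scores(scores) if score < threshold]
```

Called on the confidence_scores object from an Advanced or VLM response, below_threshold returns the leaf fields worth a second look at the threshold you choose.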

Examples

Basic Extraction

Fast, single-model extraction for simple documents:
curl -X POST https://api.documind.com/api/v1/extract/550e8400-e29b-41d4-a716-446655440000 \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "named_entities": {
        "invoice_number": {
          "type": "string",
          "description": "The invoice number"
        },
        "total_amount": {
          "type": "number",
          "description": "Total invoice amount"
        }
      },
      "required": ["invoice_number"]
    },
    "prompt": "Extract invoice details",
    "model": "openai-gpt-4.1"
  }'
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00
  },
  "needs_review": false,
  "needs_review_metadata": {}
}
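
The same Basic request can be built from Python with only the standard library. This is a sketch: the helper name make_extract_request is hypothetical, and the base URL is taken from the curl example above. The request is constructed but not sent, so it can be inspected first.

```python
import json
import urllib.request

# Base URL as it appears in the curl example.
BASE_URL = "https://api.documind.com/api/v1"


def make_extract_request(document_id: str, api_key: str, body: dict) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a POST /extract request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/extract/{document_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


basic_body = {
    "schema": {
        "named_entities": {
            "invoice_number": {"type": "string", "description": "The invoice number"},
            "total_amount": {"type": "number", "description": "Total invoice amount"},
        },
        "required": ["invoice_number"],
    },
    "prompt": "Extract invoice details",
    "model": "openai-gpt-4.1",  # presence of "model" selects Basic mode
}

request = make_extract_request(
    "550e8400-e29b-41d4-a716-446655440000", "YOUR_API_KEY", basic_body
)

# To send it:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```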

Advanced Extraction

Multi-model validation with confidence scores:
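import requests

# Placeholders: substitute your own key and the ID returned by /upload.
API_KEY = "YOUR_API_KEY"
document_id = "550e8400-e29b-41d4-a716-446655440000"

advanced_extract = {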
    "schema": {
        "named_entities": {
            "invoice_number": {
                "type": "string",
                "description": "The invoice number"
            },
            "line_items": {
                "type": "array",
                "description": "Invoice line items",
                "items": {
                    "type": "object",
                    "named_entities": {
                        "description": {
                            "type": "string",
                            "description": "Item description"
                        },
                        "amount": {
                            "type": "number",
                            "description": "Line total"
                        }
                    }
                }
            }
        },
        "required": ["invoice_number", "line_items"]
    },
    "prompt": "Extract all invoice details with high accuracy",
    # Advanced mode: don't set 'model' or 'extraction_mode' - 15 credits per page
    "review_threshold": 85
}

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={
        "X-API-Key": API_KEY,
        "Content-Type": "application/json"
    },
    json=advanced_extract
)

result = response.json()

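# Flags and scores are nested for array and object fields,
# so walk both structures together down to the leaf values.
def flagged_fields(flags, scores, prefix=""):
    for key, flag in flags.items():
        path = f"{prefix}{key}"
        if isinstance(flag, dict):
            yield from flagged_fields(flag, scores[key], prefix=f"{path}.")
        elif flag:
            yield path, scores[key]

# Check if review is needed
if result["needs_review"]:
    print("⚠️  Some fields need review:")
    metadata = result["needs_review_metadata"]
    for path, confidence in flagged_fields(
        metadata["review_flags"], metadata["confidence_scores"]
    ):
        print(f"  - {path}: {confidence:.1f}% confidence")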
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "line_items": [
      {
        "description": "Professional Services",
        "amount": 1000.00
      },
      {
        "description": "Software License",
        "amount": 250.00
      }
    ]
  },
  "needs_review": true,
  "needs_review_metadata": {
    "confidence_scores": {
      "invoice_number": 95.2,
      "line_items": {
        "0": {
          "description": 88.5,
          "amount": 92.1
        },
        "1": {
          "description": 72.3,
          "amount": 95.8
        }
      }
    },
    "review_flags": {
      "invoice_number": false,
      "line_items": {
        "0": {
          "description": false,
          "amount": false
        },
        "1": {
          "description": true,
          "amount": false
        }
      }
    }
  }
}

Extraction Mode Comparison

| Feature | Basic | VLM | Advanced |
| --- | --- | --- | --- |
| Credits/Page | 2-6 | 10 | 15 |
| Speed | Fastest | Fast | Moderate |
| Accuracy | Good | Very Good | Highest |
| Confidence Scores | No | Yes | Yes |
| Review Flagging | No | Yes | Yes |
| Best For | Simple docs | Scanned images | Critical data |
| How to use | Set model param | Set extraction_mode="vlm" | Don’t set model or extraction_mode |
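
The per-page prices in the comparison can be turned into a rough cost estimate before running a batch. This sketch assumes billing is simply pages times the per-page rate, which is how the table reads but is not stated explicitly. The function names are hypothetical, and rejecting a request that sets both parameters is a choice made here, since the reference does not say which one takes precedence.

```python
from typing import Optional

# Per-page credit prices as listed in this reference.
BASIC_MODEL_CREDITS = {
    "openai-gpt-4o": 6,
    "openai-gpt-4.1": 4,
    "google-gemini-2.0-flash": 2,
}
VLM_CREDITS = 10
ADVANCED_CREDITS = 15


def credits_per_page(model: Optional[str] = None, extraction_mode: Optional[str] = None) -> int:
    """Resolve the per-page price from the same parameters the request would carry."""
    if model is not None and extraction_mode is not None:
        raise ValueError("set either model (Basic) or extraction_mode (VLM), not both")
    if model is not None:
        return BASIC_MODEL_CREDITS[model]  # KeyError for an unlisted model
    if extraction_mode == "vlm":
        return VLM_CREDITS
    if extraction_mode is None:
        return ADVANCED_CREDITS  # neither parameter set means Advanced
    raise ValueError(f"unknown extraction_mode: {extraction_mode!r}")


def estimate_credits(pages: int, model: Optional[str] = None,
                     extraction_mode: Optional[str] = None) -> int:
    """Assumed linear pricing: pages multiplied by the per-page rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * credits_per_page(model, extraction_mode)
```

Under that assumption, a 12-page document would cost 48 credits with openai-gpt-4.1, 120 in VLM mode, and 180 in Advanced mode.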

Schema Guidelines

Field Types

"customer_name": {
  "type": "string",
  "description": "Full name of the customer"
}
Use for text data: names, addresses, identifiers.
"total_amount": {
  "type": "number",
  "description": "Total invoice amount in USD"
}
For numeric values: amounts, quantities, percentages.
"line_items": {
  "type": "array",
  "description": "List of invoice line items",
  "items": {
    "type": "object",
    "named_entities": {
      "description": {"type": "string"},
      "quantity": {"type": "number"}
    }
  }
}
For repeating data: tables, lists, multiple entries.
"billing_address": {
  "type": "object",
  "description": "Customer billing address",
  "named_entities": {
    "street": {"type": "string"},
    "city": {"type": "string"},
    "zip": {"type": "string"}
  }
}
For structured data groups.

Best Practices

  1. Descriptive Field Names: Use clear, meaningful names (invoice_date not date1)
  2. Detailed Descriptions: Help the AI understand context and format
  3. Mark Critical Fields: Add to required array for automatic review
  4. Consistent Naming: Use snake_case throughout your schema
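
These practices can be checked mechanically before a schema is sent. The linter below is a hypothetical client-side sketch, not a validator provided by the API. It reports names that are not snake_case, types outside the five listed above, missing descriptions, and required entries that are not defined, and it recurses into object fields and arrays of objects.

```python
import re
from typing import Any, Dict, List

ALLOWED_TYPES = {"string", "number", "boolean", "array", "object"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def lint_schema(schema: Dict[str, Any], path: str = "") -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: List[str] = []
    fields = schema.get("named_entities", {})

    # Every name in "required" should be defined at the same level.
    for name in schema.get("required", []):
        if name not in fields:
            problems.append(f"{path}required field '{name}' is not defined")

    for name, spec in fields.items():
        where = f"{path}{name}"
        if not SNAKE_CASE.match(name):
            problems.append(f"{where}: name is not snake_case")
        field_type = spec.get("type")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"{where}: unknown type {field_type!r}")
        if not spec.get("description"):
            problems.append(f"{where}: missing description")

        # Recurse into nested structures that carry their own named_entities.
        if field_type == "object":
            problems.extend(lint_schema(spec, path=f"{where}."))
        elif field_type == "array":
            items = spec.get("items", {})
            if items.get("type") == "object":
                problems.extend(lint_schema(items, path=f"{where}[]."))
    return problems
```

The findings are advisory: a nested field without a description is still a valid schema according to the examples above, but it gives the model less context.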

Error Responses

402 Payment Required

{
  "detail": "Insufficient credits. Please upgrade your plan or wait for your daily credits to refresh."
}
Check your credit balance before processing large batches.

403 Forbidden

{
  "detail": "You don't have access to this document"
}
Document belongs to another user or organization.

500 Internal Server Error

{
  "detail": "Failed to extract information. Please contact support."
}
Extraction processing failed. Retry or contact support if it persists.
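
One way to act on these responses is to retry only the 500 case and surface 402 and 403 immediately, since retrying will not fix missing credits or missing access. The sketch below is an assumption-laden illustration: the names ExtractError and extract_with_retry are hypothetical, the backoff schedule is arbitrary, and send stands for any callable that performs the request and returns (status_code, parsed_json).

```python
import time
from typing import Any, Callable, Dict, Tuple


class ExtractError(Exception):
    """Raised for error responses that should not (or can no longer) be retried."""

    def __init__(self, status: int, detail: str):
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail


def extract_with_retry(
    send: Callable[[], Tuple[int, Dict[str, Any]]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call send(); retry 500 responses with exponential backoff, fail fast otherwise."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, payload = send()
        if 200 <= status < 300:
            return payload
        if status == 500 and attempt < max_attempts:
            # Delays of base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
            continue
        # 402, 403, other errors, or a 500 on the final attempt.
        raise ExtractError(status, payload.get("detail", ""))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the function easy to test without real delays.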
