Endpoint

POST /extract/{document_id}
Extract structured information from an uploaded document using a defined schema. Choose between Basic, VLM, or Advanced extraction modes based on your accuracy and speed requirements.

Authentication

X-API-Key
string
required
Your unique API key, sent in the X-API-Key request header.

Path Parameters

document_id
string
required
UUID of the uploaded document. Obtained from the /upload endpoint.

Request Body

schema
object
required
JSON Schema defining the structure of data to extract. Uses named_entities format.
{
  "named_entities": {
    "field_name": {
      "type": "string|number|boolean|array|object",
      "description": "Field description for AI context"
    }
  },
  "required": ["field_name"]
}
prompt
string
Additional instructions for extraction. Optional but recommended for complex documents. Default: "No additional instructions provided."
model
string
For Basic Extraction only. Specify the AI model to use:
  • openai-gpt-4o (6 credits/page) - Most accurate
  • openai-gpt-4.1 (4 credits/page) - Balanced
  • google-gemini-2.0-flash (2 credits/page) - Fastest
If provided, uses Basic extraction mode (single model, no confidence scores).
extraction_mode
string
For VLM Extraction only. Set to:
  • vlm (10 credits/page) - Vision-based extraction for scanned docs
For Advanced extraction (15 credits/page): omit both this parameter and model.
For Basic extraction: Set model parameter instead.
review_threshold
number
default: 80
Confidence threshold (0-100) for automatic review flagging. Only applies to Advanced/VLM modes. Fields with confidence below this threshold are flagged for review if they’re marked as required in the schema.
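
The mode-selection rules above (model for Basic, extraction_mode="vlm" for VLM, neither for Advanced) can be captured in a small request-body builder. This is a minimal sketch, not part of the API: build_extract_body and its mode argument are hypothetical names, and rejecting review_threshold in Basic mode is a local design choice based on the note that the threshold only applies to Advanced/VLM modes.

```python
from typing import Optional

# Model names as listed under the "model" parameter.
BASIC_MODELS = {"openai-gpt-4o", "openai-gpt-4.1", "google-gemini-2.0-flash"}


def build_extract_body(
    schema: dict,
    mode: str = "advanced",
    model: Optional[str] = None,
    prompt: Optional[str] = None,
    review_threshold: Optional[float] = None,
) -> dict:
    """Assemble a request body for one extraction mode (hypothetical helper)."""
    body: dict = {"schema": schema}
    if prompt is not None:
        body["prompt"] = prompt

    if mode == "basic":
        if model not in BASIC_MODELS:
            raise ValueError(f"Basic mode needs a model from {sorted(BASIC_MODELS)}")
        if review_threshold is not None:
            raise ValueError("review_threshold only applies to Advanced/VLM modes")
        body["model"] = model
    elif mode in ("vlm", "advanced"):
        if model is not None:
            raise ValueError("model is for Basic extraction only")
        if mode == "vlm":
            body["extraction_mode"] = "vlm"
        # Advanced: neither model nor extraction_mode is set.
    else:
        raise ValueError(f"Unknown mode: {mode!r}")

    if review_threshold is not None:
        if not 0 <= review_threshold <= 100:
            raise ValueError("review_threshold must be between 0 and 100")
        body["review_threshold"] = review_threshold
    return body
```

Calling build_extract_body(schema, mode="vlm", review_threshold=85) yields a body with extraction_mode set and no model key, while the default call returns only the schema, which selects Advanced extraction.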

Response

document_id
string
UUID of the processed document.
results
object
Extracted data matching your schema structure. Fields are ordered according to schema definition.
needs_review
boolean
Whether this extraction requires human review. true if any required fields have confidence below the review threshold.
needs_review_metadata
object
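Metadata about fields needing review. Populated only in Advanced/VLM modes, where it contains confidence_scores and review_flags objects that mirror the structure of results; Basic mode returns an empty object.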

Examples

Basic Extraction

Fast, single-model extraction for simple documents:
curl -X POST https://api.documind.com/api/v1/extract/550e8400-e29b-41d4-a716-446655440000 \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "named_entities": {
        "invoice_number": {
          "type": "string",
          "description": "The invoice number"
        },
        "total_amount": {
          "type": "number",
          "description": "Total invoice amount"
        }
      },
      "required": ["invoice_number"]
    },
    "prompt": "Extract invoice details",
    "model": "openai-gpt-4.1"
  }'
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00
  },
  "needs_review": false,
  "needs_review_metadata": {}
}
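
For readers working in Python rather than curl, the same Basic request could be assembled with the standard library. This is a sketch under assumptions: make_extract_request is a hypothetical helper, the base URL is copied from the curl example, and the request is only built here, not sent.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.documind.com/api/v1"  # taken from the curl example


def make_extract_request(document_id: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /extract/{document_id} request."""
    # Reject malformed IDs early; uuid.UUID raises ValueError for non-UUID input.
    doc_uuid = uuid.UUID(document_id)
    return urllib.request.Request(
        url=f"{BASE_URL}/extract/{doc_uuid}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


body = {
    "schema": {
        "named_entities": {
            "invoice_number": {"type": "string", "description": "The invoice number"},
            "total_amount": {"type": "number", "description": "Total invoice amount"},
        },
        "required": ["invoice_number"],
    },
    "prompt": "Extract invoice details",
    "model": "openai-gpt-4.1",  # setting model selects Basic extraction
}
req = make_extract_request("550e8400-e29b-41d4-a716-446655440000", body, "YOUR_API_KEY")

# To send it:
#     with urllib.request.urlopen(req) as resp:
#         result = json.load(resp)
```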

Advanced Extraction

Multi-model validation with confidence scores:
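import requests

API_KEY = "YOUR_API_KEY"  # placeholder
document_id = "550e8400-e29b-41d4-a716-446655440000"  # UUID returned by /upload

advanced_extract = {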
    "schema": {
        "named_entities": {
            "invoice_number": {
                "type": "string",
                "description": "The invoice number"
            },
            "line_items": {
                "type": "array",
                "description": "Invoice line items",
                "items": {
                    "type": "object",
                    "named_entities": {
                        "description": {
                            "type": "string",
                            "description": "Item description"
                        },
                        "amount": {
                            "type": "number",
                            "description": "Line total"
                        }
                    }
                }
            }
        },
        "required": ["invoice_number", "line_items"]
    },
    "prompt": "Extract all invoice details with high accuracy",
    # Advanced mode: don't set 'model' or 'extraction_mode' - 15 credits per page
    "review_threshold": 85
}

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={
        "X-API-Key": API_KEY,
        "Content-Type": "application/json"
    },
    json=advanced_extract
)

result = response.json()

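# Check if review is needed
if result["needs_review"]:
    print("⚠️  Some fields need review:")
    metadata = result["needs_review_metadata"]

    def report(flags, scores, path=""):
        # review_flags and confidence_scores share the same nested shape,
        # so walk both together down to the individual fields.
        for key, flag in flags.items():
            name = f"{path}.{key}" if path else key
            if isinstance(flag, dict):
                report(flag, scores[key], name)
            elif flag:
                print(f"  - {name}: {scores[key]:.1f}% confidence")

    report(metadata["review_flags"], metadata["confidence_scores"])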
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "line_items": [
      {
        "description": "Professional Services",
        "amount": 1000.00
      },
      {
        "description": "Software License",
        "amount": 250.00
      }
    ]
  },
  "needs_review": true,
  "needs_review_metadata": {
    "confidence_scores": {
      "invoice_number": 95.2,
      "line_items": {
        "0": {
          "description": 88.5,
          "amount": 92.1
        },
        "1": {
          "description": 72.3,
          "amount": 95.8
        }
      }
    },
    "review_flags": {
      "invoice_number": false,
      "line_items": {
        "0": {
          "description": false,
          "amount": false
        },
        "1": {
          "description": true,
          "amount": false
        }
      }
    }
  }
}
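
The flags in this response are consistent with the review_threshold of 85 sent in the request. The sketch below reproduces them locally from the confidence scores. It illustrates the documented rule and is not the service's implementation: below_threshold and any_flagged are hypothetical helpers, and it assumes that nested values inherit "required" from their top-level field.

```python
from typing import Any


def below_threshold(scores: Any, threshold: float) -> Any:
    """Mirror the nested score structure, replacing each score with score < threshold."""
    if isinstance(scores, dict):
        return {key: below_threshold(value, threshold) for key, value in scores.items()}
    return scores < threshold


def any_flagged(flags: Any) -> bool:
    """True if any leaf in a nested flag structure is True."""
    if isinstance(flags, dict):
        return any(any_flagged(value) for value in flags.values())
    return bool(flags)


confidence_scores = {
    "invoice_number": 95.2,
    "line_items": {
        "0": {"description": 88.5, "amount": 92.1},
        "1": {"description": 72.3, "amount": 95.8},
    },
}
required = ["invoice_number", "line_items"]  # from the request schema
review_threshold = 85

# Assumption: only fields listed in "required" are considered for flagging.
review_flags = {
    name: below_threshold(confidence_scores[name], review_threshold)
    for name in required
}
needs_review = any_flagged(review_flags)
```

Only the second line item's description (72.3) falls below 85, so it is the single flagged value and needs_review comes out true, matching the response.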

Extraction Mode Comparison

Feature            Basic            VLM                        Advanced
Credits/Page       2-6              10                         15
Speed              Fastest          Fast                       Moderate
Accuracy           Good             Very Good                  Highest
Confidence Scores  No               Yes                        Yes
Review Flagging    No               Yes                        Yes
Best For           Simple docs      Scanned images             Critical data
How to use         Set model param  Set extraction_mode="vlm"  Don’t set model or extraction_mode
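
To budget a batch before running it, the per-page rates above can be turned into a quick estimate. This is a rough sketch: estimate_credits is a hypothetical helper, and it assumes cost is simply pages times the listed rate, with no minimums, rounding, or discounts.

```python
from typing import Optional

# Per-page rates as listed in this reference.
BASIC_RATES = {
    "openai-gpt-4o": 6,
    "openai-gpt-4.1": 4,
    "google-gemini-2.0-flash": 2,
}
VLM_RATE = 10
ADVANCED_RATE = 15


def estimate_credits(
    pages: int,
    model: Optional[str] = None,
    extraction_mode: Optional[str] = None,
) -> int:
    """Estimate credits using the same parameters the request body takes."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if model is not None and extraction_mode is not None:
        raise ValueError("set model (Basic) or extraction_mode (VLM), not both")
    if model is not None:
        if model not in BASIC_RATES:
            raise ValueError(f"Unknown Basic model: {model!r}")
        rate = BASIC_RATES[model]
    elif extraction_mode == "vlm":
        rate = VLM_RATE
    elif extraction_mode is None:
        rate = ADVANCED_RATE  # neither parameter set selects Advanced
    else:
        raise ValueError(f"Unknown extraction_mode: {extraction_mode!r}")
    return pages * rate
```

Under these assumptions a 12-page document costs 24 credits with google-gemini-2.0-flash, 120 in VLM mode, and 180 in Advanced mode.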

Schema Guidelines

Field Types

"customer_name": {
  "type": "string",
  "description": "Full name of the customer"
}
String: use for text data such as names, addresses, and identifiers.
"total_amount": {
  "type": "number",
  "description": "Total invoice amount in USD"
}
Number: use for numeric values such as amounts, quantities, and percentages.
"line_items": {
  "type": "array",
  "description": "List of invoice line items",
  "items": {
    "type": "object",
    "named_entities": {
      "description": {"type": "string"},
      "quantity": {"type": "number"}
    }
  }
}
Array: use for repeating data such as tables, lists, and multiple entries.
"billing_address": {
  "type": "object",
  "description": "Customer billing address",
  "named_entities": {
    "street": {"type": "string"},
    "city": {"type": "string"},
    "zip": {"type": "string"}
  }
}
Object: use for groups of related fields, such as an address.
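
A schema can be sanity-checked locally before it is sent. The sketch below walks nested named_entities blocks, checks each type against the five listed in the Request Body section, and confirms that every name in required is actually defined. schema_problems is a hypothetical helper; the API's own validation rules are not documented here and may differ.

```python
from typing import List

ALLOWED_TYPES = {"string", "number", "boolean", "array", "object"}


def schema_problems(schema: dict, path: str = "") -> List[str]:
    """Return human-readable problems found in a named_entities schema."""
    problems: List[str] = []
    entities = schema.get("named_entities", {})
    for name, spec in entities.items():
        where = f"{path}{name}"
        field_type = spec.get("type")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"{where}: unknown type {field_type!r}")
        elif field_type == "object":
            # Objects carry their own named_entities block.
            problems += schema_problems(spec, where + ".")
        elif field_type == "array":
            items = spec.get("items")
            if isinstance(items, dict) and items.get("type") == "object":
                problems += schema_problems(items, where + "[].")
    for name in schema.get("required", []):
        if name not in entities:
            problems.append(f"{path}required: {name!r} is not defined in named_entities")
    return problems


schema = {
    "named_entities": {
        "billing_address": {
            "type": "object",
            "description": "Customer billing address",
            "named_entities": {
                "city": {"type": "string"},
                "zip": {"type": "text"},  # deliberate mistake for the demo
            },
        }
    },
    "required": ["invoice_no"],  # deliberate mistake: not defined above
}
problems = schema_problems(schema)
```

An empty list means no problems were found by these particular checks.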

Best Practices

  1. Descriptive Field Names: Use clear, meaningful names (invoice_date not date1)
  2. Detailed Descriptions: Help the AI understand context and format
  3. Mark Critical Fields: Add them to the required array so low-confidence values are flagged for review
  4. Consistent Naming: Use snake_case throughout your schema
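
Two of these practices, snake_case names and a description on every field, are easy to check mechanically. Below is a minimal lint sketch; lint_fields is a hypothetical helper, and whether a name is meaningful (invoice_date versus date1) still needs a human eye.

```python
import re
from typing import List

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def lint_fields(named_entities: dict, path: str = "") -> List[str]:
    """Warn about field names that are not snake_case or lack a description."""
    warnings: List[str] = []
    for name, spec in named_entities.items():
        where = path + name
        if not SNAKE_CASE.match(name):
            warnings.append(f"{where}: not snake_case")
        if not spec.get("description", "").strip():
            warnings.append(f"{where}: missing description")
        # Recurse into object fields and into arrays of objects.
        nested = spec.get("named_entities") or spec.get("items", {}).get("named_entities")
        if nested:
            warnings += lint_fields(nested, where + ".")
    return warnings


warnings = lint_fields({
    "invoiceDate": {"type": "string", "description": "Issue date, YYYY-MM-DD"},
    "line_items": {
        "type": "array",
        "description": "Invoice rows",
        "items": {
            "type": "object",
            "named_entities": {"quantity": {"type": "number"}},
        },
    },
})
```

Here the camelCase name and the undescribed nested quantity field are the two entries reported.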

Error Responses

402 Payment Required

{
  "detail": "Insufficient credits. Please upgrade your plan or wait for your daily credits to refresh."
}
Check your credit balance before processing large batches.

403 Forbidden

{
  "detail": "You don't have access to this document"
}
The document belongs to another user or organization.

500 Internal Server Error

{
  "detail": "Failed to extract information. Please contact support."
}
Extraction processing failed. Retry or contact support if it persists.
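
One way to act on these responses is to retry only the 500 case and surface 402 and 403 immediately, since repeating the call will not add credits or grant access. The sketch below is an illustration under assumptions: extract_with_retry and ExtractionError are hypothetical names, any 2xx status is treated as success, and the backoff schedule is arbitrary. The send callable stands in for whatever HTTP client issues the request and returns a (status_code, parsed_json) pair.

```python
import time
from typing import Callable, Tuple


class ExtractionError(Exception):
    """Raised with (status_code, detail) when extraction cannot complete."""


def extract_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, retrying 500 responses with backoff."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if 200 <= status < 300:
            return body
        if status == 500 and attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
            continue
        # 402, 403, other statuses, or a 500 on the final attempt.
        raise ExtractionError(status, body.get("detail", ""))
    raise ExtractionError(0, "max_attempts must be at least 1")
```

Passing sleep as a parameter keeps the helper easy to exercise in tests without real delays.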

Next Steps

Review Workflow

Handle documents that need review

Polling Pattern

Implement review polling for automation

List Extractions

Query extraction results