Prerequisites
Before you begin, ensure you have:
A Documind account with available credits
An API key (see Authentication)
A document to process (PDF, DOCX, JPG, or PNG)
Complete Example
This guide walks through a complete extraction workflow: upload → extract → handle results.
Upload Document
Upload your document and receive a document ID.
curl -X POST https://api.documind.com/api/v1/upload \
  -H 'X-API-Key: YOUR_API_KEY' \
  -F 'files=@invoice.pdf'
Response:
[
  "550e8400-e29b-41d4-a716-446655440000"
]
Define Extraction Schema
Create or generate a JSON schema defining what data to extract.
Manual Schema
{
  "named_entities": {
    "invoice_number": {
      "type": "string",
      "description": "The invoice number"
    },
    "invoice_date": {
      "type": "string",
      "description": "Date of invoice"
    },
    "vendor_name": {
      "type": "string",
      "description": "Name of the vendor"
    },
    "total_amount": {
      "type": "number",
      "description": "Total invoice amount"
    },
    "line_items": {
      "type": "array",
      "description": "Individual line items",
      "items": {
        "type": "object",
        "named_entities": {
          "description": {
            "type": "string",
            "description": "Item description"
          },
          "quantity": {
            "type": "number",
            "description": "Quantity ordered"
          },
          "unit_price": {
            "type": "number",
            "description": "Price per unit"
          },
          "amount": {
            "type": "number",
            "description": "Line total"
          }
        }
      }
    }
  },
  "required": ["invoice_number", "total_amount"]
}
Generate from Sample
import requests

# Upload a sample invoice
with open("sample_invoice.pdf", "rb") as f:
    response = requests.post(
        "https://api.documind.com/api/v1/upload",
        headers={"X-API-Key": "your_api_key"},
        files={"files": f},
    )
sample_id = response.json()[0]

# Generate schema from the sample
response = requests.post(
    f"https://api.documind.com/api/v1/schema/{sample_id}",
    headers={"X-API-Key": "your_api_key"},
)
schema = response.json()["schema"]
Also available in UI: Dashboard → Schemas → Generate from Sample
Mark critical fields as required to enable automatic review flagging if confidence is low.
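A quick local check can catch a `required` entry that has no matching field definition before any credits are spent. The sketch below is a hypothetical helper (`missing_required_fields` is not part of the Documind API) and assumes the schema layout shown above, with fields under `named_entities` and a top-level `required` list.

```python
def missing_required_fields(schema):
    """Return names listed in 'required' that are not defined under 'named_entities'."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


# Abbreviated version of the manual schema, used only to exercise the helper.
schema = {
    "named_entities": {
        "invoice_number": {"type": "string", "description": "The invoice number"},
        "total_amount": {"type": "number", "description": "Total invoice amount"},
    },
    "required": ["invoice_number", "total_amount"],
}

problems = missing_required_fields(schema)
if problems:
    raise ValueError(f"Required fields missing from schema: {problems}")
```

Running the check before each extraction request keeps schema typos from surfacing later as missing results.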
Extract Data
Process the document with your schema.
curl -X POST https://api.documind.com/api/v1/extract/550e8400-e29b-41d4-a716-446655440000 \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"}
      },
      "required": ["invoice_number", "total_amount"]
    },
    "prompt": "Extract invoice details",
    "model": "openai-gpt-4.1",
    "review_threshold": 80
  }'
Response:
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "vendor_name": "Acme Corp",
    "total_amount": 1250.00,
    "line_items": [
      {
        "description": "Widget A",
        "quantity": 10,
        "unit_price": 50.00,
        "amount": 500.00
      },
      {
        "description": "Widget B",
        "quantity": 15,
        "unit_price": 50.00,
        "amount": 750.00
      }
    ]
  },
  "needs_review": false,
  "needs_review_metadata": {}
}
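The sample response is internally consistent: each line's quantity times unit price equals its amount, and the amounts sum to `total_amount` (500.00 + 750.00 = 1250.00). A client-side check along these lines can flag suspicious extractions before they reach downstream systems. This is a minimal sketch; `line_items_consistent` and the tolerance value are illustrative choices, not part of the API.

```python
import math


def line_items_consistent(results, tolerance=0.01):
    """Check each line total and the invoice total against the line items."""
    items = results.get("line_items", [])
    for item in items:
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["amount"], abs_tol=tolerance):
            return False
    total = sum(item["amount"] for item in items)
    return math.isclose(total, results["total_amount"], abs_tol=tolerance)


# Values copied from the sample response above.
results = {
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget A", "quantity": 10, "unit_price": 50.00, "amount": 500.00},
        {"description": "Widget B", "quantity": 15, "unit_price": 50.00, "amount": 750.00},
    ],
}
print(line_items_consistent(results))
```

Real invoices may include tax, shipping, or discounts, so a failed check is better treated as a reason to review than as proof of an extraction error.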
Handle Review Workflow
If needs_review is true, poll until the human review is complete.
import time
import requests

API_KEY = "your_api_key"

def wait_for_review(document_id, timeout=300, poll_interval=10):
    """
    Poll extraction status until reviewed or timeout.
    Returns the reviewed results.
    """
    start_time = time.time()
    while (time.time() - start_time) < timeout:
        # Get extraction by document_id
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": API_KEY},
            params={"document_id": document_id, "limit": 1},
        )
        data = response.json()
        if data["items"]:
            extraction = data["items"][0]
            if extraction["is_reviewed"]:
                print("✓ Review completed!")
                return extraction["reviewed_results"]
        print(f"⏳ Waiting for review... ({poll_interval}s)")
        time.sleep(poll_interval)
    raise TimeoutError("Review timeout exceeded")

# Usage: result and document_id come from the extract step;
# process_invoice is your own handler.
if result["needs_review"]:
    print("⚠️ Document needs review")
    reviewed_data = wait_for_review(document_id)
    process_invoice(reviewed_data)
else:
    process_invoice(result["results"])
Your automation now handles both immediate results and reviewed data seamlessly!
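Polling logic is easier to test when the HTTP call is passed in as a function. The variant below is a sketch under that assumption: `poll_until_reviewed`, `fetch`, `sleep`, and `clock` are hypothetical names, and `fetch` is expected to return a payload shaped like the extractions listing, with an `items` list whose entries carry `is_reviewed` and `reviewed_results`.

```python
import time


def poll_until_reviewed(fetch, timeout=300, poll_interval=10,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until the first item is reviewed, then return its reviewed_results.

    fetch is any zero-argument callable returning a dict with an "items" list;
    it is an assumption of this sketch, not an API-provided helper.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        items = fetch().get("items", [])
        if items and items[0].get("is_reviewed"):
            return items[0]["reviewed_results"]
        sleep(poll_interval)
    raise TimeoutError("Review timeout exceeded")


# Simulated responses: not reviewed on the first poll, reviewed on the second.
responses = iter([
    {"items": [{"is_reviewed": False}]},
    {"items": [{"is_reviewed": True,
                "reviewed_results": {"invoice_number": "INV-2024-001"}}]},
])
reviewed = poll_until_reviewed(lambda: next(responses), sleep=lambda seconds: None)
print(reviewed)
```

In production, `fetch` would wrap the `requests.get` call to the extractions endpoint; in tests it can be replaced with canned responses as shown.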
Choose the right mode for your use case:
Basic (Fastest)
Best for: Simple documents, high-volume processing
{
  "schema": { ... },
  "model": "google-gemini-2.0-flash",  // 2 credits/page
  "prompt": "Extract invoice data"
}
Fastest processing
Single model
No confidence scores
No automatic review flagging
VLM (Balanced)
Best for: Scanned documents, forms with complex layouts
{
  "schema": { ... },
  "extraction_mode": "vlm",  // 10 credits/page
  "review_threshold": 80,
  "prompt": "Extract form fields"
}
Visual document processing
Multiple VLM models
Includes confidence scores
Automatic review flagging
Advanced (Most Accurate)
Best for: Critical documents, invoices, structured forms
{
  "schema": { ... },
  // Advanced mode: don't set 'model' or 'extraction_mode' - 15 credits/page
  "review_threshold": 85,
  "prompt": "Extract all fields with high accuracy"
}
Highest accuracy
Multi-model ensemble extraction
Detailed confidence scores
Automatic review flagging
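The three modes differ only in which keys the request body carries. The hypothetical helper below (`build_extract_body` is not an API function, and the mode names "basic", "vlm", and "advanced" are labels chosen for this sketch) encodes the rules from the examples above: Basic sets `model`, VLM sets `extraction_mode`, and Advanced sets neither.

```python
def build_extract_body(mode, schema, prompt, review_threshold=None):
    """Build the JSON body for an extract request, following the mode examples."""
    body = {"schema": schema, "prompt": prompt}
    if mode == "basic":
        # Basic: a single named model (the one used in the Basic example).
        body["model"] = "google-gemini-2.0-flash"
    elif mode == "vlm":
        body["extraction_mode"] = "vlm"
    elif mode != "advanced":
        raise ValueError(f"Unknown mode: {mode!r}")
    # Advanced: neither 'model' nor 'extraction_mode' is set.
    if review_threshold is not None:
        body["review_threshold"] = review_threshold
    return body


print(build_extract_body("vlm", {"named_entities": {}}, "Extract form fields", 80))
```

Centralizing the body construction makes it harder to send `model` and `extraction_mode` together by accident.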
Common Patterns
Batch Processing
Process multiple documents in parallel:
import concurrent.futures
import requests
import time

API_KEY = "your_api_key_here"
BASE_URL = "https://api.documind.com/api/v1"
# Assumed to be defined elsewhere:
#   schema          - the extraction schema from the previous steps
#   document_files  - a list of file paths to process

def process_document(file_path):
    # Upload
    with open(file_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/upload",
            headers={"X-API-Key": API_KEY},
            files={"files": f},
        )
    document_id = response.json()[0]

    # Extract
    result = requests.post(
        f"{BASE_URL}/extract/{document_id}",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"schema": schema, "prompt": "Extract data"},
    ).json()

    # Handle review if needed
    if result["needs_review"]:
        # Poll for review completion (no timeout here; wait_for_review
        # above shows a bounded version)
        while True:
            extractions = requests.get(
                f"{BASE_URL}/data/extractions",
                headers={"X-API-Key": API_KEY},
                params={"document_id": document_id, "limit": 1},
            ).json()
            if extractions["items"][0]["is_reviewed"]:
                return extractions["items"][0]["reviewed_results"]
            time.sleep(10)
    return result["results"]

# Process the documents concurrently, up to 5 at a time
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(process_document, document_files))
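`executor.map` re-raises the first worker exception when its results are consumed, so `list(...)` never returns the results of the other documents. One alternative, sketched here, is to submit each file separately and collect successes and failures side by side. `run_batch` and the stand-in `fake_process` are illustrative names; in practice `worker` would be `process_document`.

```python
import concurrent.futures


def run_batch(worker, file_paths, max_workers=5):
    """Run worker(path) for every path; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, path): path for path in file_paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors


def fake_process(path):
    """Stand-in for process_document so the sketch runs without network access."""
    if path.endswith(".txt"):
        raise ValueError(f"Unsupported file type: {path}")
    return {"source": path}


results, errors = run_batch(fake_process, ["a.pdf", "b.png", "notes.txt"])
print(sorted(results), sorted(errors))
```

Keying both dictionaries by file path makes it straightforward to retry only the failed documents.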
Error Handling
Handle common error scenarios:
try:
    response = requests.post(
        f"https://api.documind.com/api/v1/extract/{document_id}",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"schema": schema, "prompt": "Extract data"},
    )
    response.raise_for_status()
    result = response.json()
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 402:
        print("Insufficient credits - please upgrade")
    elif e.response.status_code == 403:
        print("Document access denied")
    elif e.response.status_code == 500:
        print("Extraction failed - retry or contact support")
    else:
        print(f"Error: {e}")
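Of the status codes above, only 500 is described as worth retrying. A retry wrapper with exponential backoff might look like the following sketch; `call_with_retry`, the attempt count, and the delays are assumptions, and `send` stands for any function that returns an object with a `status_code` attribute (such as a `requests` response).

```python
import time


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on HTTP 500, doubling the delay after each attempt."""
    for attempt in range(1, max_attempts + 1):
        response = send()
        # Return on any non-500 status, or when attempts are exhausted.
        if response.status_code != 500 or attempt == max_attempts:
            return response
        sleep(base_delay * 2 ** (attempt - 1))


class FakeResponse:
    """Minimal stand-in for a requests.Response, used only for this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Two failures followed by a success; delays are recorded instead of slept.
outcomes = iter([FakeResponse(500), FakeResponse(500), FakeResponse(200)])
delays = []
final = call_with_retry(lambda: next(outcomes), sleep=delays.append)
print(final.status_code, delays)
```

The caller still inspects the returned response, so 402 and 403 are handled exactly as before without being retried.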
Check Credits Before Processing
Avoid failures by checking credits first:
response = requests.get(
    "https://api.documind.com/usage/credits",
    headers={"X-API-Key": API_KEY},
)
credits = response.json()
if credits["available_credits"] < 100:
    print("⚠️ Low credits - consider waiting for daily refresh")
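The fixed threshold of 100 can be replaced with an estimate based on the per-page costs quoted in the mode examples (2, 10, and 15 credits per page). The helper names below are hypothetical, the page counts must be determined by the caller, and the Basic figure is the one quoted for `google-gemini-2.0-flash`, so other models may differ.

```python
# Per-page costs taken from the mode examples; the mode keys are labels
# chosen for this sketch, not API values.
CREDITS_PER_PAGE = {"basic": 2, "vlm": 10, "advanced": 15}


def credits_needed(page_counts, mode):
    """Estimate total credits for documents with the given page counts."""
    return sum(page_counts) * CREDITS_PER_PAGE[mode]


def can_process(available_credits, page_counts, mode):
    """Return True when the balance covers the estimated cost."""
    return available_credits >= credits_needed(page_counts, mode)


# Example: three documents of 2, 5, and 1 pages in Advanced mode.
print(credits_needed([2, 5, 1], "advanced"))    # 8 pages * 15 = 120
print(can_process(100, [2, 5, 1], "advanced"))  # False: 100 < 120
```

Passing `credits["available_credits"]` as the first argument ties the estimate to the balance returned by the credits endpoint.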
Testing Your Integration
Use these test scenarios:
Simple Document: Single-page invoice with clear text
Complex Layout: Multi-column form or table
Poor Quality: Scanned or low-resolution image
Edge Cases: Missing fields, unusual formats
Start with Basic extraction for testing, then upgrade to Advanced for production.
Next Steps
Extraction Flow: Deep dive into the complete extraction workflow
Review Polling: Advanced patterns for handling reviews in automation
Data Endpoints: Query and filter extraction results
Error Handling: Robust error handling strategies