Prerequisites
Before you begin, ensure you have:
- A Documind account with available credits
- An API key (see Authentication)
- A document to process (PDF, DOCX, JPG, or PNG)
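A quick local check can catch an unsupported file before an upload call is spent on it. The sketch below assumes the four formats listed above map to their usual file extensions, and it treats `.jpeg` as an alias for JPG (an assumption). `is_supported_document` is a hypothetical helper, not part of the Documind API.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats listed above.
# ".jpeg" is included as a common alias for JPG (an assumption).
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".jpg", ".jpeg", ".png"}


def is_supported_document(file_path: str) -> bool:
    """Return True if the file extension matches a listed format (case-insensitive)."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

This only inspects the file name; the API may still reject a file whose content does not match its extension.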
Complete Example
This guide walks through a complete extraction workflow: upload → extract → handle results.
Upload Document
Upload your document and receive a document ID.
curl -X POST https://api.documind.com/api/v1/upload \
  -H 'X-API-Key: YOUR_API_KEY' \
  -F 'files=@invoice.pdf'
Response:
[
  "550e8400-e29b-41d4-a716-446655440000"
]
Define Extraction Schema
Create or generate a JSON schema defining what data to extract. You can write the schema manually or generate it from a sample document.
Manual Schema
{
  "named_entities": {
    "invoice_number": {
      "type": "string",
      "description": "The invoice number"
    },
    "invoice_date": {
      "type": "string",
      "description": "Date of invoice"
    },
    "vendor_name": {
      "type": "string",
      "description": "Name of the vendor"
    },
    "total_amount": {
      "type": "number",
      "description": "Total invoice amount"
    },
    "line_items": {
      "type": "array",
      "description": "Individual line items",
      "items": {
        "type": "object",
        "named_entities": {
          "description": {
            "type": "string",
            "description": "Item description"
          },
          "quantity": {
            "type": "number",
            "description": "Quantity ordered"
          },
          "unit_price": {
            "type": "number",
            "description": "Price per unit"
          },
          "amount": {
            "type": "number",
            "description": "Line total"
          }
        }
      }
    }
  },
  "required": ["invoice_number", "total_amount"]
}
Generate from Sample
import requests

# Upload a sample invoice
with open("sample_invoice.pdf", "rb") as f:
    response = requests.post(
        "https://api.documind.com/api/v1/upload",
        headers={"X-API-Key": "your_api_key"},
        files={"files": f}
    )
sample_id = response.json()[0]

# Generate schema from the sample
response = requests.post(
    f"https://api.documind.com/api/v1/schema/{sample_id}",
    headers={"X-API-Key": "your_api_key"}
)
schema = response.json()["schema"]
Also available in UI: Dashboard → Schemas → Generate from Sample
Mark critical fields as required so that they are automatically flagged for review when extraction confidence is low.
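Because review flagging depends on the `required` list, a typo in a field name there is easy to miss. The following is a minimal sketch of a local check that every required name is actually defined under `named_entities`. `find_undefined_required` is a hypothetical helper; whether the API performs the same validation server-side is not stated here.

```python
def find_undefined_required(schema: dict) -> list:
    """Return names listed under "required" that are missing from "named_entities"."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


schema = {
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

print(find_undefined_required(schema))  # prints [] because both names are defined
```

An empty list means the schema is internally consistent; any returned names should be added to `named_entities` or removed from `required`.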
Extract Data
Process the document with your schema.
curl -X POST https://api.documind.com/api/v1/extract/550e8400-e29b-41d4-a716-446655440000 \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"}
      },
      "required": ["invoice_number", "total_amount"]
    },
    "prompt": "Extract invoice details",
    "model": "openai-gpt-4.1",
    "review_threshold": 80
  }'
Response:
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "vendor_name": "Acme Corp",
    "total_amount": 1250.00,
    "line_items": [
      {
        "description": "Widget A",
        "quantity": 10,
        "unit_price": 50.00,
        "amount": 500.00
      },
      {
        "description": "Widget B",
        "quantity": 15,
        "unit_price": 50.00,
        "amount": 750.00
      }
    ]
  },
  "needs_review": false,
  "needs_review_metadata": {}
}
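Extracted numbers can be sanity-checked before they reach downstream systems. As a sketch, the function below verifies that the line item amounts in a `results` object add up to `total_amount`, using the values from the sample response. It assumes the invoice has no tax, shipping, or discount outside the line items; `line_items_match_total` is a hypothetical helper.

```python
import math


def line_items_match_total(results: dict, tolerance: float = 0.01) -> bool:
    """Check that the line item amounts add up to total_amount."""
    items = results.get("line_items", [])
    items_sum = sum(item["amount"] for item in items)
    return math.isclose(items_sum, results["total_amount"], abs_tol=tolerance)


# Values copied from the sample response
results = {
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget A", "quantity": 10, "unit_price": 50.00, "amount": 500.00},
        {"description": "Widget B", "quantity": 15, "unit_price": 50.00, "amount": 750.00},
    ],
}

print(line_items_match_total(results))  # True: 500.00 + 750.00 == 1250.00
```

A mismatch does not necessarily mean the extraction is wrong, but it is a reasonable signal to route the document to manual review.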
Handle Review Workflow
If needs_review is true, implement polling to wait for human review.
import time

import requests

API_KEY = "your_api_key"

def wait_for_review(document_id, timeout=300, poll_interval=10):
    """
    Poll extraction status until reviewed or timeout.
    Returns the reviewed results.
    """
    start_time = time.time()
    while (time.time() - start_time) < timeout:
        # Get extraction by document_id
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": API_KEY},
            params={"document_id": document_id, "limit": 1}
        )
        data = response.json()
        if data["items"]:
            extraction = data["items"][0]
            if extraction["is_reviewed"]:
                print("✓ Review completed!")
                return extraction["reviewed_results"]
        print(f"⏳ Waiting for review... ({poll_interval}s)")
        time.sleep(poll_interval)
    raise TimeoutError("Review timeout exceeded")

# Usage: `result` is the JSON body returned by the extract call,
# `document_id` is the ID returned by the upload call, and
# `process_invoice` is your own handler for the extracted data.
if result["needs_review"]:
    print("⚠️ Document needs review")
    reviewed_data = wait_for_review(document_id)
    process_invoice(reviewed_data)
else:
    process_invoice(result["results"])
Your automation now handles both immediate results and reviewed data seamlessly!
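Human review can take much longer than a few polling cycles, so a fixed interval may produce many unnecessary requests. One common alternative is to grow the delay between polls up to a cap. The generator below is a sketch of that idea; the starting delay, growth factor, and cap are illustrative values, not recommendations from Documind.

```python
import itertools


def backoff_intervals(initial=5.0, factor=2.0, maximum=60.0):
    """Yield poll delays in seconds that grow geometrically up to a cap."""
    delay = min(initial, maximum)
    while True:
        yield delay
        delay = min(delay * factor, maximum)


# First six delays with the default settings
print(list(itertools.islice(backoff_intervals(), 6)))
# [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

To use it, iterate over `backoff_intervals()` inside the polling loop and pass each value to `time.sleep` in place of a constant interval, while still enforcing an overall timeout.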
Choose the right mode for your use case:
Basic (Fastest)
Best for: Simple documents, high-volume processing
{
  "schema": {...},
  "model": "google-gemini-2.0-flash", // 2 credits/page
  "prompt": "Extract invoice data"
}
- Fastest processing
- Single model
- No confidence scores
- No automatic review flagging
VLM (Balanced)
Best for: Scanned documents, forms with complex layouts
{
  "schema": {...},
  "extraction_mode": "vlm", // 10 credits/page
  "review_threshold": 80,
  "prompt": "Extract form fields"
}
- Visual document processing
- Multiple VLM models
- Includes confidence scores
- Automatic review flagging
Advanced (Most Accurate)
Best for: Critical documents, invoices, structured forms
{
  "schema": {...},
  // Advanced mode: don't set 'model' or 'extraction_mode' - 15 credits/page
  "review_threshold": 85,
  "prompt": "Extract all fields with high accuracy"
}
- Highest accuracy
- Multi-model ensemble extraction
- Detailed confidence scores
- Automatic review flagging
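The per-page prices quoted in the three request examples make it straightforward to estimate what a job will cost before running it. The sketch below uses those figures (2, 10, and 15 credits per page). The keys `"basic"` and `"advanced"` are labels invented for this example, and the Basic price is the one shown for `google-gemini-2.0-flash`; other models may be priced differently.

```python
# Credits per page as quoted in the request examples above.
# "basic" and "advanced" are hypothetical labels for this sketch.
CREDITS_PER_PAGE = {"basic": 2, "vlm": 10, "advanced": 15}


def estimate_credits(page_count: int, mode: str) -> int:
    """Estimate the credits needed to process page_count pages in the given mode."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * CREDITS_PER_PAGE[mode]


# A 12-page document in each mode
for mode in ("basic", "vlm", "advanced"):
    print(mode, estimate_credits(12, mode))
# basic 24
# vlm 120
# advanced 180
```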
Common Patterns
Batch Processing
Process multiple documents in parallel:
import concurrent.futures
import time

import requests

API_KEY = "your_api_key_here"
BASE_URL = "https://api.documind.com/api/v1"

# `schema` is your extraction schema (see Define Extraction Schema);
# `document_files` is a list of local file paths to process.

def process_document(file_path):
    # Upload
    with open(file_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/upload",
            headers={"X-API-Key": API_KEY},
            files={"files": f}
        )
    document_id = response.json()[0]
    # Extract
    result = requests.post(
        f"{BASE_URL}/extract/{document_id}",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"schema": schema, "prompt": "Extract data"}
    ).json()
    # Handle review if needed
    if result["needs_review"]:
        # Poll for review completion (this loop has no timeout;
        # see wait_for_review for a bounded version)
        while True:
            extractions = requests.get(
                f"{BASE_URL}/data/extractions",
                headers={"X-API-Key": API_KEY},
                params={"document_id": document_id, "limit": 1}
            ).json()
            items = extractions["items"]
            # Guard against an empty list before indexing
            if items and items[0]["is_reviewed"]:
                return items[0]["reviewed_results"]
            time.sleep(10)
    return result["results"]

# Process the documents concurrently, up to 5 at a time
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(process_document, document_files))
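With `executor.map`, an exception raised for one document is re-raised while the results are being collected, so the remaining results are lost. A possible variation, sketched below, submits each file separately and records failures per document. `process_all` is a hypothetical helper; it accepts any processing function, such as `process_document` above.

```python
import concurrent.futures


def process_all(paths, process_fn, max_workers=5):
    """Run process_fn on every path; return (results, errors), both keyed by path."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for
        futures = {executor.submit(process_fn, path): path for path in paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going when one document fails
                errors[path] = exc
    return results, errors
```

A call such as `results, errors = process_all(document_files, process_document)` then leaves the failed files in `errors` for retry or inspection.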
Error Handling
Handle common error scenarios:
try:
    response = requests.post(
        f"https://api.documind.com/api/v1/extract/{document_id}",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"schema": schema, "prompt": "Extract data"}
    )
    response.raise_for_status()
    result = response.json()
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 402:
        print("Insufficient credits - please upgrade")
    elif e.response.status_code == 403:
        print("Document access denied")
    elif e.response.status_code == 500:
        print("Extraction failed - retry or contact support")
    else:
        print(f"Error: {e}")
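Since a 500 response is described as worth retrying, the retry can be automated. The helper below is a sketch that retries only on status 500, with exponentially growing pauses, and returns any other response immediately. `call_with_retry` is hypothetical, and treating 500 errors as transient is an assumption rather than documented API behavior.

```python
import time


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-500 status or attempts run out.

    send is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        # Return on success, on a non-retryable error, or on the last attempt
        if response.status_code != 500 or attempt == max_attempts:
            return response
        sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

In practice `send` would be a small function or lambda wrapping the `requests.post` call, and the returned response can still be passed through `raise_for_status()` so that 402 and 403 are handled as before.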
Check Credits Before Processing
Avoid failures by checking credits first:
response = requests.get(
    "https://api.documind.com/usage/credits",
    headers={"X-API-Key": API_KEY}
)
credits = response.json()
if credits["available_credits"] < 100:
    print("⚠️ Low credits - consider waiting for daily refresh")
Testing Your Integration
Use these test scenarios:
- Simple Document: Single-page invoice with clear text
- Complex Layout: Multi-column form or table
- Poor Quality: Scanned or low-resolution image
- Edge Cases: Missing fields, unusual formats
Start with Basic extraction for testing, then upgrade to Advanced for production.
Next Steps