Complete Extraction Workflow

Understanding the full extraction lifecycle helps you build robust automation pipelines. Here’s how documents flow through the system:

Phase 1: Document Upload

Step 1: Upload Files

Submit documents via POST /upload. The endpoint supports batches of up to 100 files.
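import requests

# Open both files in a context manager so the handles are closed after the upload
with open("invoice1.pdf", "rb") as f1, open("invoice2.pdf", "rb") as f2:
    response = requests.post(
        "https://api.documind.com/api/v1/upload",
        headers={"X-API-Key": "your_api_key"},
        files=[("files", f1), ("files", f2)],
    )
response.raise_for_status()  # fail early on HTTP errors
document_ids = response.json()
# Returns: ["uuid1", "uuid2"]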
Credit Impact: No credits charged for upload
Storage: Documents stored for 30 days
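Because a batch is limited to 100 files, larger sets have to be split before uploading. The following is a minimal sketch of that splitting step only; `chunk_paths` and `MAX_BATCH_SIZE` are illustrative names, not part of the Documind API, and each resulting batch would still need its own POST /upload call.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 100  # batch limit stated in the upload step above


def chunk_paths(paths: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


# 250 hypothetical files split into upload-sized batches
batches = list(chunk_paths([f"invoice{i}.pdf" for i in range(250)]))
print([len(batch) for batch in batches])  # [100, 100, 50]
```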
Step 2: Receive Document IDs

Store the returned UUIDs for extraction requests.
# Map filenames to IDs for tracking
doc_mapping = {
    "invoice1.pdf": document_ids[0],
    "invoice2.pdf": document_ids[1]
}
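For more than a couple of files, the same mapping can be built without hard-coded indices. This sketch assumes, as the manual mapping above does, that the IDs come back in the same order the files were submitted; `map_filenames_to_ids` is a hypothetical helper name.

```python
from typing import Dict, List


def map_filenames_to_ids(filenames: List[str], document_ids: List[str]) -> Dict[str, str]:
    """Pair each filename with the ID at the same position in the upload response."""
    # Assumption: the response preserves upload order. A length mismatch means
    # the pairing cannot be trusted, so fail loudly instead of guessing.
    if len(filenames) != len(document_ids):
        raise ValueError(
            f"expected {len(filenames)} IDs, got {len(document_ids)}"
        )
    return dict(zip(filenames, document_ids))


doc_mapping = map_filenames_to_ids(
    ["invoice1.pdf", "invoice2.pdf"],
    ["uuid1", "uuid2"],
)
```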

Phase 2: Schema Definition

Choose one of three approaches:
Use the Documind UI to access built-in templates for common document types like invoices, receipts, forms, etc.
Navigate to: Dashboard → Schemas → Templates
✓ Fastest setup
✓ Proven accuracy
✗ Not available via API (use UI or define custom schema)
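When templates are not an option, a custom schema can be passed with the extraction request. The sketch below reuses the `named_entities` / `required` layout from the Complete Example later on this page; the `check_schema` helper is an illustrative local sanity check, not something the Documind API provides.

```python
from typing import Dict, List

# Schema layout mirrors the Complete Example on this page
schema = {
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "vendor_name": {"type": "string"},  # declared but not required
    },
    "required": ["invoice_number", "total_amount"],
}


def check_schema(schema: Dict) -> List[str]:
    """Return required field names that are not declared in named_entities."""
    declared = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in declared]


undeclared = check_schema(schema)
if undeclared:
    raise ValueError(f"required fields not declared: {undeclared}")
```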

Phase 3: Data Extraction

Configure extraction mode based on requirements:

Mode Selection Decision Tree

Start
  ├─ Need confidence scores? ──No──> Basic Extraction
  │                                   (2-6 credits/page)

  └─ Yes
      ├─ Scanned/Image documents? ──Yes──> VLM Extraction
      │                                     (10 credits/page)

      └─ No (Native PDF/Text)
          └─ Critical accuracy needed? ──Yes──> Advanced Extraction
                                                  (15 credits/page)
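The decision tree can be expressed as a small function, shown here as a rough sketch. The tree does not say what to choose for native PDFs when accuracy is not critical; falling back to "advanced" in that case is an assumption made here because Basic mode returns no confidence scores. `choose_mode` is a hypothetical helper name.

```python
def choose_mode(needs_confidence: bool, scanned: bool, critical_accuracy: bool) -> str:
    """Return 'basic', 'vlm', or 'advanced' following the decision tree above."""
    if not needs_confidence:
        return "basic"      # 2-6 credits/page, no confidence scores
    if scanned:
        return "vlm"        # 10 credits/page, scanned or image documents
    if critical_accuracy:
        return "advanced"   # 15 credits/page, native PDF or text
    # ASSUMPTION: this branch is not shown in the tree. Confidence scores were
    # requested, so Basic is ruled out; Advanced is used as the fallback.
    return "advanced"


print(choose_mode(needs_confidence=True, scanned=True, critical_accuracy=False))  # vlm
```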

Basic Extraction Example

import requests

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema": schema,
        "model": "openai-gpt-4.1",  # or "google-gemini-2.0-flash"
        "prompt": "Extract all invoice fields accurately"
    }
)
result = response.json()

# Basic mode returns no confidence scores and never flags for review,
# so needs_review is always False here
if not result["needs_review"]:
    process_data(result["results"])

Advanced Extraction Example

import requests

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema": schema,
        # Advanced mode - no model or extraction_mode specified
        "review_threshold": 85,
        "prompt": "Extract invoice with high accuracy"
    }
)
result = response.json()

# Includes confidence scores
if result["needs_review"]:
    # Some required fields below threshold
    print("⚠️  Needs human review")
    # Proceed to Phase 4
else:
    # All required fields above threshold
    process_data(result["results"])
Credit Usage: Credits deducted per page × model cost
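To budget a run in advance, the per-page rates from the decision tree above can be multiplied by the page count. This is only an estimate under those listed rates; the actual charge depends on the model cost applied by the service, and `estimate_credits` is an illustrative helper name.

```python
from typing import Dict, Tuple

# (min, max) credits per page, taken from the decision tree on this page
CREDITS_PER_PAGE: Dict[str, Tuple[int, int]] = {
    "basic": (2, 6),
    "vlm": (10, 10),
    "advanced": (15, 15),
}


def estimate_credits(mode: str, pages: int) -> Tuple[int, int]:
    """Return the (lowest, highest) expected credit cost for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    low, high = CREDITS_PER_PAGE[mode]
    return pages * low, pages * high


print(estimate_credits("basic", 12))     # (24, 72)
print(estimate_credits("advanced", 12))  # (180, 180)
```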

Phase 4: Review Workflow

Only triggered when needs_review = true:
Step 1: Identify Flagged Fields

Parse the metadata to find low-confidence fields:
for field, flag in result["needs_review_metadata"]["review_flags"].items():
    if flag:
        confidence = result["needs_review_metadata"]["confidence_scores"][field]
        print(f"⚠️  {field}: {confidence}% confidence")
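For routing or reporting, it can help to collect the flagged fields into a list rather than printing them. The sketch below assumes the response shape used in the loop above (`review_flags` and `confidence_scores` keyed by field name); treating a missing score as 0 and the sample payload are illustrative choices, not documented behavior.

```python
from typing import Dict, List, Tuple


def flagged_fields(result: Dict) -> List[Tuple[str, float]]:
    """Return (field, confidence) pairs for flagged fields, lowest confidence first."""
    meta = result.get("needs_review_metadata") or {}
    flags = meta.get("review_flags", {})
    scores = meta.get("confidence_scores", {})
    # ASSUMPTION: a flagged field with no recorded score is treated as 0
    flagged = [(field, scores.get(field, 0)) for field, flag in flags.items() if flag]
    return sorted(flagged, key=lambda pair: pair[1])


# Hypothetical payload in the shape used by the loop above
sample = {
    "needs_review_metadata": {
        "review_flags": {"invoice_number": False, "total_amount": True, "vendor_name": True},
        "confidence_scores": {"invoice_number": 97, "total_amount": 72, "vendor_name": 64},
    }
}
print(flagged_fields(sample))  # [('vendor_name', 64), ('total_amount', 72)]
```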
Step 2: Notify Review Team

Human review happens in the Documind UI. Direct your review team to: Dashboard → Review Queue
They can see all pending reviews, view extraction confidence scores, and correct or approve results.
Optionally, send notifications via your own system:
# Your custom notification logic
send_email(
    to="reviewers@company.com",
    subject=f"Review Needed: {filename}",
    body=f"Document {document_id} needs review at https://app.documind.com/review"
)
Step 3: Poll for Completion

Implement polling to detect when is_reviewed = true:
import requests
import time

def poll_for_review(document_id, poll_interval=10, timeout=600):
    start_time = time.time()
    
    while (time.time() - start_time) < timeout:
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": "your_api_key"},
            params={"document_id": document_id, "limit": 1}
        )
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        data = response.json()
        
        if data["items"] and data["items"][0]["is_reviewed"]:
            return data["items"][0]["reviewed_results"]
        
        time.sleep(poll_interval)
    
    return None

reviewed_data = poll_for_review(document_id)
if reviewed_data:
    process_data(reviewed_data)
else:
    handle_timeout(document_id)
See Polling Pattern for details.
Step 4: Use Reviewed Results

Once is_reviewed = true, use reviewed_results instead of results:
import requests

response = requests.get(
    "https://api.documind.com/api/v1/data/extractions",
    headers={"X-API-Key": "your_api_key"},
    params={"document_id": document_id, "limit": 1}
)
extraction = response.json()["items"][0]

if extraction["is_reviewed"]:
    # Use human-corrected data
    data = extraction["reviewed_results"]
else:
    # Use original AI extraction
    data = extraction["results"]

Phase 5: Data Processing

Process the final data in your automation:
def process_invoice_data(data):
    """Process extracted/reviewed invoice data."""

    # Validate required fields explicitly (assert statements are skipped under python -O)
    missing = [key for key in ("invoice_number", "total_amount") if key not in data]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")

    # Update your system
    create_accounting_record(
        invoice_number=data["invoice_number"],
        amount=data["total_amount"],
        vendor=data.get("vendor_name"),
        line_items=data.get("line_items", [])
    )

    # Archive original document
    archive_document(data["document_id"])

    return True

Complete Example

Here’s a full workflow implementation with real API calls:
import requests
import time
from typing import Dict

class DocumindWorkflow:
    """Complete extraction workflow manager using Documind API."""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.documind.com/api/v1"
        self.headers = {"X-API-Key": api_key}
    
    def process_document(
        self,
        file_path: str,
        schema: Dict,
        mode: str = "advanced"
    ) -> Dict:
        """
        Complete workflow: upload → extract → review → process
        
        Args:
            file_path: Path to document file
            schema: Extraction schema
            mode: 'basic', 'vlm', or 'advanced'
        
        Returns:
            Final extracted data (original or reviewed)
        """
        # Phase 1: Upload via API
        print(f"📤 Uploading {file_path}...")
        with open(file_path, "rb") as f:
            response = requests.post(
                f"{self.base_url}/upload",
                headers=self.headers,
                files={"files": f}
            )
        response.raise_for_status()
        doc_id = response.json()[0]
        print(f"✓ Uploaded: {doc_id}")
        
        # Phase 2 & 3: Extract via API
        print(f"🔍 Extracting data ({mode} mode)...")
        config = {"schema": schema, "prompt": "Extract all data accurately"}
        
        if mode == "basic":
            config["model"] = "openai-gpt-4.1"
        elif mode == "vlm":
            config["extraction_mode"] = "vlm"
            config["review_threshold"] = 80
        else:  # advanced
            config["review_threshold"] = 85
        
        response = requests.post(
            f"{self.base_url}/extract/{doc_id}",
            headers={**self.headers, "Content-Type": "application/json"},
            json=config
        )
        response.raise_for_status()
        result = response.json()
        print(f"✓ Extraction complete")
        
        # Phase 4: Handle review if needed
        if result["needs_review"]:
            print(f"⚠️  Document needs review - direct team to UI: https://app.documind.com/review")
            
            # Poll for review completion
            print(f"⏳ Waiting for human review...")
            start = time.time()
            
            while (time.time() - start) < 600:  # 10 min timeout
                response = requests.get(
                    f"{self.base_url}/data/extractions",
                    headers=self.headers,
                    params={"document_id": doc_id, "limit": 1}
                )
                response.raise_for_status()
                data = response.json()
                
                if data["items"] and data["items"][0]["is_reviewed"]:
                    print(f"✓ Review completed")
                    return data["items"][0]["reviewed_results"]
                
                time.sleep(10)
            
            raise TimeoutError("Review not completed in time")
        else:
            print(f"✓ No review needed")
            return result["results"]

# Usage
workflow = DocumindWorkflow(api_key="your_api_key")

schema = {
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"}
    },
    "required": ["invoice_number", "total_amount"]
}

# Process document through complete workflow
try:
    data = workflow.process_document(
        file_path="invoice.pdf",
        schema=schema,
        mode="advanced"
    )
    
    print(f"✓ Final data: {data}")
    # Process your data here
    
except TimeoutError:
    print("Review took too long, escalating...")
except Exception as e:
    print(f"Error: {e}")

Troubleshooting

Problem: 500 Internal Server Error on upload
Solutions:
  • Verify file is not corrupted
  • Check file size < 50MB
  • Ensure file format is supported
  • Retry with exponential backoff
Problem: Extraction takes too long or times out
Solutions:
  • Switch to Basic mode for faster processing
  • Reduce document page count
  • Simplify schema (fewer fields)
  • Contact support if issue persists
Problem: Review threshold too strict
Solutions:
  • Lower review_threshold from 85 to 75
  • Mark fewer fields as required
  • Improve schema descriptions
  • Use Basic mode if reviews aren’t needed
Problem: Polling times out waiting for review
Solutions:
  • Increase timeout to match your review SLA
  • Implement email notifications to reviewers
  • Check review queue isn’t backlogged
  • Consider async processing instead of blocking
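The "retry with exponential backoff" advice for upload errors can be sketched as a generic wrapper. This is one possible shape, not an official client feature: `retry_with_backoff` and its parameters are hypothetical, and in practice the retry should be limited to transient failures such as 5xx responses rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Give up and re-raise once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            # Wait base, 2*base, 4*base, ... seconds, never more than `cap`
            sleep(min(cap, base * 2 ** attempt))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

A call such as `retry_with_backoff(lambda: upload(paths))` would wrap an upload function (here a hypothetical `upload` that calls `response.raise_for_status()`) so that failed requests are retried after 1, 2, 4, and 8 seconds.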

Next Steps

  • Upload Documents: Detailed upload endpoint documentation
  • Extract Data: Complete extraction API reference
  • Polling Pattern: Robust polling implementation guide
  • Automation Patterns: Production-ready automation examples