
Complete Extraction Workflow

Understanding the full extraction lifecycle helps you build robust automation pipelines. Here’s how documents flow through the system:

Phase 1: Document Upload

1. Upload Files

Submit documents via POST /upload. Supports batching up to 100 files.
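import requests

# Open the files in a context manager so the handles are closed after the upload
with open("invoice1.pdf", "rb") as f1, open("invoice2.pdf", "rb") as f2:
    response = requests.post(
        "https://api.documind.com/api/v1/upload",
        headers={"X-API-Key": "your_api_key"},
        files=[("files", f1), ("files", f2)],
    )
response.raise_for_status()  # Stop here on an HTTP error instead of parsing an error body
document_ids = response.json()
# Returns: ["uuid1", "uuid2"]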
Credit Impact: No credits charged for upload
Storage: Documents stored for 30 days
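Because an upload accepts at most 100 files, a larger set has to be split into batches before it is submitted. The sketch below shows only that splitting step. It assumes the 100-file limit applies per request, and `chunk_paths` is a hypothetical helper name, not part of the Documind API.

```python
from typing import Iterator, List, Sequence

MAX_FILES_PER_UPLOAD = 100  # batch limit stated above; assumed to apply per request


def chunk_paths(paths: Sequence[str], size: int = MAX_FILES_PER_UPLOAD) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


# 250 example filenames split into batches of 100, 100, and 50
paths = [f"invoice{i}.pdf" for i in range(1, 251)]
batches = list(chunk_paths(paths))
print([len(batch) for batch in batches])
```

Each batch can then be sent with its own POST /upload call, and the returned ID lists concatenated in order.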
2. Receive Document IDs

Store the returned UUIDs for extraction requests.
# Map filenames to IDs for tracking
doc_mapping = {
    "invoice1.pdf": document_ids[0],
    "invoice2.pdf": document_ids[1]
}
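For more than a handful of files, the same mapping can be built programmatically. This is a minimal sketch that assumes, as the index-based mapping above implies, that IDs come back in the same order the files were submitted. `map_files_to_ids` is a hypothetical helper name.

```python
from typing import Dict, List


def map_files_to_ids(filenames: List[str], document_ids: List[str]) -> Dict[str, str]:
    """Pair each filename with the ID at the same position in the response."""
    if len(filenames) != len(document_ids):
        # A length mismatch means the positional pairing cannot be trusted
        raise ValueError(
            f"Got {len(document_ids)} IDs for {len(filenames)} files"
        )
    return dict(zip(filenames, document_ids))


mapping = map_files_to_ids(["invoice1.pdf", "invoice2.pdf"], ["uuid1", "uuid2"])
print(mapping["invoice2.pdf"])
```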

Phase 2: Schema Definition

Choose one of three approaches:
Use the Documind UI to access built-in templates for common document types such as invoices, receipts, and forms.
Navigate to: Dashboard → Schemas → Templates
✓ Fastest setup
✓ Proven accuracy
✗ Not available via API (use UI or define custom schema)
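If you define a custom schema instead, a quick local sanity check can catch typos before any credits are spent. The sketch below uses the `named_entities` / `required` shape from the Complete Example in this guide. The rule that every required field must also appear under `named_entities` is an assumption, not documented API behavior, and `missing_required_definitions` is a hypothetical helper.

```python
from typing import Dict, List


def missing_required_definitions(schema: Dict) -> List[str]:
    """Return required field names that have no entry under named_entities."""
    defined = schema.get("named_entities", {})
    return [name for name in schema.get("required", []) if name not in defined]


schema = {
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    # "vendor_name" is listed as required but never defined above
    "required": ["invoice_number", "total_amount", "vendor_name"],
}
print(missing_required_definitions(schema))
```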

Phase 3: Data Extraction

Configure extraction mode based on requirements:

Mode Selection Decision Tree

Start
  ├─ Need confidence scores? ──No──> Basic Extraction
  │                                   (2-6 credits/page)

  └─ Yes
      ├─ Scanned/Image documents? ──Yes──> VLM Extraction
      │                                     (10 credits/page)

      └─ No (Native PDF/Text)
          └─ Critical accuracy needed? ──Yes──> Advanced Extraction
                                                  (15 credits/page)
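The tree above can be expressed as a small function, which is convenient when the mode is chosen per document inside a pipeline. This is only a sketch: the tree does not say what to do for native documents that need confidence scores but are not accuracy-critical, so falling back to Advanced in that case is an assumption. `choose_mode` and its parameter names are hypothetical.

```python
def choose_mode(need_confidence: bool, scanned: bool, critical_accuracy: bool) -> str:
    """Return 'basic', 'vlm', or 'advanced' following the decision tree."""
    if not need_confidence:
        return "basic"      # 2-6 credits/page, no confidence scores
    if scanned:
        return "vlm"        # 10 credits/page, for scanned or image documents
    if critical_accuracy:
        return "advanced"   # 15 credits/page
    # Assumption: this branch is not covered by the tree; Advanced is the only
    # remaining mode shown that returns confidence scores for native documents.
    return "advanced"


print(choose_mode(need_confidence=True, scanned=True, critical_accuracy=False))
```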

Basic Extraction Example

import requests

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema": schema,
        "model": "openai-gpt-4.1",  # or "google-gemini-2.0-flash"
        "prompt": "Extract all invoice fields accurately"
    }
)
response.raise_for_status()
result = response.json()

# Basic mode returns no confidence scores and never flags a document for review,
# so result["needs_review"] is always False and the results can be used directly
process_data(result["results"])

Advanced Extraction Example

import requests

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema": schema,
        # Advanced mode - no model or extraction_mode specified
        "review_threshold": 85,
        "prompt": "Extract invoice with high accuracy"
    }
)
result = response.json()

# Includes confidence scores
if result["needs_review"]:
    # Some required fields below threshold
    print("⚠️  Needs human review")
    # Proceed to Phase 4
else:
    # All required fields above threshold
    process_data(result["results"])
Credit Usage: Credits are deducted per page, at the per-page rate of the mode or model used
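To budget a job before running it, multiply the page count by the per-page rate of the chosen mode. The sketch below uses the rates shown in the decision tree (Basic 2-6, VLM 10, Advanced 15 credits per page). It returns a range because the Basic rate varies, presumably by model. `estimate_credits` is a hypothetical helper, not an API call.

```python
from typing import Dict, Tuple

# (min, max) credits per page, taken from the decision tree above
CREDITS_PER_PAGE: Dict[str, Tuple[int, int]] = {
    "basic": (2, 6),
    "vlm": (10, 10),
    "advanced": (15, 15),
}


def estimate_credits(pages: int, mode: str) -> Tuple[int, int]:
    """Return the (lowest, highest) credit cost for `pages` pages in `mode`."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    low, high = CREDITS_PER_PAGE[mode]
    return pages * low, pages * high


print(estimate_credits(12, "basic"))     # (24, 72)
print(estimate_credits(12, "advanced"))  # (180, 180)
```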

Phase 4: Review Workflow

Only triggered when needs_review = true:
1. Identify Flagged Fields

Parse the metadata to find low-confidence fields:
for field, flag in result["needs_review_metadata"]["review_flags"].items():
    if flag:
        confidence = result["needs_review_metadata"]["confidence_scores"][field]
        print(f"⚠️  {field}: {confidence}% confidence")
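The same lookup can be wrapped in a function that returns the flagged fields sorted by confidence, so reviewers see the weakest fields first. The sample payload below is made up for illustration. Only the `needs_review_metadata`, `review_flags`, and `confidence_scores` keys come from the snippet above, and `flagged_fields` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def flagged_fields(result: Dict) -> List[Tuple[str, float]]:
    """Return (field, confidence) pairs for flagged fields, lowest confidence first."""
    meta = result.get("needs_review_metadata") or {}
    flags = meta.get("review_flags", {})
    scores = meta.get("confidence_scores", {})
    # A flagged field without a score is treated as 0 so it sorts to the front
    flagged = [(field, scores.get(field, 0)) for field, flag in flags.items() if flag]
    return sorted(flagged, key=lambda item: item[1])


# Illustrative payload, not real API output
sample_result = {
    "needs_review": True,
    "needs_review_metadata": {
        "review_flags": {"invoice_number": False, "total_amount": True, "vendor_name": True},
        "confidence_scores": {"invoice_number": 97, "total_amount": 62, "vendor_name": 78},
    },
}
for field, confidence in flagged_fields(sample_result):
    print(f"{field}: {confidence}% confidence")
```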
2. Notify Review Team

Human review happens in the Documind UI. Direct your review team to: Dashboard → Review Queue
They can see all pending reviews, view extraction confidence scores, and correct or approve results.
Optionally, send notifications via your own system:
# Your custom notification logic
send_email(
    to="review-team@example.com",  # placeholder address
    subject=f"Review Needed: {filename}",
    body=f"Document {document_id} needs review at https://app.documind.com/review"
)
3. Poll for Completion

Implement polling to detect when is_reviewed = true:
import requests
import time

def poll_for_review(document_id, poll_interval=10, timeout=600):
    start_time = time.time()
    
    while (time.time() - start_time) < timeout:
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": "your_api_key"},
            params={"document_id": document_id, "limit": 1}
        )
        data = response.json()
        
        if data["items"] and data["items"][0]["is_reviewed"]:
            return data["items"][0]["reviewed_results"]
        
        time.sleep(poll_interval)
    
    return None

reviewed_data = poll_for_review(document_id)
if reviewed_data:
    process_data(reviewed_data)
else:
    handle_timeout(document_id)
See Polling Pattern for details.
4. Use Reviewed Results

Once is_reviewed = true, use reviewed_results instead of results:
import requests

response = requests.get(
    "https://api.documind.com/api/v1/data/extractions",
    headers={"X-API-Key": "your_api_key"},
    params={"document_id": document_id, "limit": 1}
)
extraction = response.json()["items"][0]

if extraction["is_reviewed"]:
    # Use human-corrected data
    data = extraction["reviewed_results"]
else:
    # Use original AI extraction
    data = extraction["results"]

Phase 5: Data Processing

Process the final data in your automation:
def process_invoice_data(data):
    """Process extracted/reviewed invoice data."""
    
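    # Validate required fields explicitly (assert statements are skipped under python -O)
    missing = [key for key in ("invoice_number", "total_amount") if key not in data]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")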
    
    # Update your system
    create_accounting_record(
        invoice_number=data["invoice_number"],
        amount=data["total_amount"],
        vendor=data.get("vendor_name"),
        line_items=data.get("line_items", [])
    )
    
    # Archive original document
    archive_document(data["document_id"])
    
    return True

Complete Example

Here’s a full workflow implementation with real API calls:
import requests
import time
from typing import Dict, Optional

class DocumindWorkflow:
    """Complete extraction workflow manager using Documind API."""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.documind.com/api/v1"
        self.headers = {"X-API-Key": api_key}
    
    def process_document(
        self,
        file_path: str,
        schema: Dict,
        mode: str = "advanced"
    ) -> Dict:
        """
        Complete workflow: upload → extract → review → process
        
        Args:
            file_path: Path to document file
            schema: Extraction schema
            mode: 'basic', 'vlm', or 'advanced'
        
        Returns:
            Final extracted data (original or reviewed)
        """
        # Phase 1: Upload via API
        print(f"📤 Uploading {file_path}...")
        with open(file_path, "rb") as f:
            response = requests.post(
                f"{self.base_url}/upload",
                headers=self.headers,
                files={"files": f}
            )
        response.raise_for_status()
        doc_id = response.json()[0]
        print(f"✓ Uploaded: {doc_id}")
        
        # Phase 2 & 3: Extract via API
        print(f"🔍 Extracting data ({mode} mode)...")
        config = {"schema": schema, "prompt": "Extract all data accurately"}
        
        if mode == "basic":
            config["model"] = "openai-gpt-4.1"
        elif mode == "vlm":
            config["extraction_mode"] = "vlm"
            config["review_threshold"] = 80
        else:  # advanced
            config["review_threshold"] = 85
        
        response = requests.post(
            f"{self.base_url}/extract/{doc_id}",
            headers={**self.headers, "Content-Type": "application/json"},
            json=config
        )
        response.raise_for_status()
        result = response.json()
        print("✓ Extraction complete")
        
        # Phase 4: Handle review if needed
        if result["needs_review"]:
            print("⚠️  Document needs review - direct team to UI: https://app.documind.com/review")

            # Poll for review completion
            print("⏳ Waiting for human review...")
            start = time.time()
            
            while (time.time() - start) < 600:  # 10 min timeout
                response = requests.get(
                    f"{self.base_url}/data/extractions",
                    headers=self.headers,
                    params={"document_id": doc_id, "limit": 1}
                )
                data = response.json()
                
                if data["items"] and data["items"][0]["is_reviewed"]:
                    print("✓ Review completed")
                    return data["items"][0]["reviewed_results"]
                
                time.sleep(10)
            
            raise TimeoutError("Review not completed in time")
        else:
            print("✓ No review needed")
            return result["results"]

# Usage
workflow = DocumindWorkflow(api_key="your_api_key")

schema = {
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"}
    },
    "required": ["invoice_number", "total_amount"]
}

# Process document through complete workflow
try:
    data = workflow.process_document(
        file_path="invoice.pdf",
        schema=schema,
        mode="advanced"
    )
    
    print(f"✓ Final data: {data}")
    # Process your data here
    
except TimeoutError:
    print("Review took too long, escalating...")
except Exception as e:
    print(f"Error: {e}")

Troubleshooting

Problem: 500 Internal Server Error on upload
Solutions:
  • Verify file is not corrupted
  • Check file size < 50MB
  • Ensure file format is supported
  • Retry with exponential backoff
Problem: Extraction takes too long or times out
Solutions:
  • Switch to Basic mode for faster processing
  • Reduce document page count
  • Simplify schema (fewer fields)
  • Contact support if issue persists
Problem: Review threshold too strict
Solutions:
  • Lower review_threshold from 85 to 75
  • Mark fewer fields as required
  • Improve schema descriptions
  • Use Basic mode if reviews aren’t needed
Problem: Polling times out waiting for review
Solutions:
  • Increase timeout to match your review SLA
  • Implement email notifications to reviewers
  • Check review queue isn’t backlogged
  • Consider async processing instead of blocking
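The "retry with exponential backoff" suggestion for upload errors can be sketched as follows. The base delay, the cap, and the attempt count are illustrative values, not Documind recommendations, and both helper names are hypothetical. For brevity the sketch retries on any exception; in practice you would likely retry only on 5xx responses and connection errors.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delay before each retry: base, 2*base, 4*base, ... never above `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` up to `attempts` times, sleeping with backoff between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            sleep(delays[attempt])
    raise AssertionError("unreachable")


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

To use it with an upload, wrap the request in a zero-argument function that calls `response.raise_for_status()` and pass that function to `call_with_retry`.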

Next Steps