
Endpoint

GET /data/extractions
Retrieve a list of extractions with flexible filtering, sorting, and pagination. Essential for polling review status and managing extraction history.

Authentication

X-API-Key
string
required
Your unique API key, used to authenticate every request.

Query Parameters

Filters

document_id
string
Filter by specific document UUID. Most efficient for single-document queries.
?document_id=550e8400-e29b-41d4-a716-446655440000
status
string
Filter by extraction status. Options: completed, processing, failed, pending
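?status=completed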
needs_review
boolean
Filter by review requirement.
?needs_review=true   # Only extractions needing review
?needs_review=false  # Only extractions not needing review
is_reviewed
boolean
Filter by review completion status.
?is_reviewed=true    # Only reviewed extractions
?is_reviewed=false   # Not yet reviewed
original_filename
string
Filter by exact filename match.
?original_filename=invoice-2024-001.pdf
created_after
string
Filter by creation timestamp (ISO 8601 format).
?created_after=2024-01-15T00:00:00Z
created_before
string
Filter by creation timestamp (ISO 8601 format).
?created_before=2024-01-31T23:59:59Z
organization_id
string
Filter by organization UUID. Admin-only parameter.
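Filters can be combined in a single request, assuming the usual AND semantics across parameters. For example, completed extractions created during January 2024:
cURL
curl "https://api.documind.com/api/v1/data/extractions?status=completed&created_after=2024-01-01T00:00:00Z&created_before=2024-01-31T23:59:59Z" \
  -H 'X-API-Key: YOUR_API_KEY'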

Sorting

sort_by
string
default:"created_at"
Field to sort by. Options: created_at, updated_at, status, original_filename
sort_order
string
default:"desc"
Sort direction. Options: asc (ascending), desc (descending)
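?sort_by=created_at&sort_order=asc   # Oldest first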

Pagination

skip
integer
default:"0"
Number of results to skip. Use for pagination.
?skip=20  # Skip first 20 results
limit
integer
default:"100"
Maximum results to return. Range: 1-100.
?limit=50  # Return max 50 results

Response

items
array
Array of extraction objects matching the query.
total
integer
Total number of extractions matching the filters (before pagination).
skip
integer
Number of results skipped.
limit
integer
Maximum results returned.

Extraction Object

id
string
Unique extraction ID (UUID).
document_id
string
UUID of the source document.
original_filename
string
Name of the uploaded file.
status
string
Processing status: completed, processing, failed, pending.
created_at
string
ISO 8601 timestamp of extraction creation.
updated_at
string
ISO 8601 timestamp of last update.
needs_review
boolean
Whether extraction requires human review.
is_reviewed
boolean
Whether extraction has been reviewed by a human.
reviewed_at
string | null
ISO 8601 timestamp of review completion. null if not reviewed.
reviewed_by
string | null
UUID of user who performed review. null if not reviewed.
results
object
Extracted data matching the schema.
reviewed_results
object | null
Corrected data after human review. null if not reviewed. When is_reviewed is true, use this field for automation instead of results.
needs_review_metadata
object
Confidence scores and review flags. Only present in Advanced/VLM extractions.
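Because reviewed_results supersedes results once a review is complete, a small helper keeps automation code simple. A minimal sketch over the object shape described above:
Python
def final_results(extraction):
    """Prefer human-corrected data; fall back to the raw extraction."""
    if extraction["is_reviewed"] and extraction.get("reviewed_results") is not None:
        return extraction["reviewed_results"]
    return extraction["results"]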

Examples

Poll for Review Completion

Check if a specific document has been reviewed:
curl "https://api.documind.com/api/v1/data/extractions?document_id=550e8400-e29b-41d4-a716-446655440000&limit=1" \
  -H 'X-API-Key: YOUR_API_KEY'
{
  "items": [
    {
      "id": "extr_abc123",
      "document_id": "550e8400-e29b-41d4-a716-446655440000",
      "original_filename": "invoice-2024-001.pdf",
      "status": "completed",
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:35:00Z",
      "needs_review": true,
      "is_reviewed": true,
      "reviewed_at": "2024-01-15T10:35:00Z",
      "reviewed_by": "user_xyz789",
      "results": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1250.00
      },
      "reviewed_results": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1275.00
      },
      "needs_review_metadata": {
        "confidence_scores": {
          "invoice_number": 95.2,
          "total_amount": 78.5
        },
        "review_flags": {
          "invoice_number": false,
          "total_amount": true
        }
      }
    }
  ],
  "total": 1,
  "skip": 0,
  "limit": 1
}
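To poll until the review is done, repeat this query on an interval. A minimal sketch, assuming requests and an API_KEY variable; the 5-second interval and 5-minute timeout are arbitrary choices:
Python
import time
import requests

API_URL = "https://api.documind.com/api/v1/data/extractions"

def wait_for_review(api_key, document_id, interval=5, timeout=300):
    """Poll the extractions list until the document's extraction is reviewed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        response = requests.get(
            API_URL,
            headers={"X-API-Key": api_key},
            params={"document_id": document_id, "limit": 1},
        )
        items = response.json()["items"]
        if items and items[0]["is_reviewed"]:
            return items[0]
        time.sleep(interval)  # wait before the next check
    raise TimeoutError(f"Review not completed within {timeout}s")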

List Pending Reviews

Get all extractions waiting for review:
cURL
curl "https://api.documind.com/api/v1/data/extractions?needs_review=true&is_reviewed=false&sort_by=created_at&sort_order=desc&limit=50" \
  -H 'X-API-Key: YOUR_API_KEY'
Python
import requests

response = requests.get(
    "https://api.documind.com/api/v1/data/extractions",
    headers={"X-API-Key": API_KEY},
    params={
        "needs_review": True,
        "is_reviewed": False,
        "sort_by": "created_at",
        "sort_order": "desc",
        "limit": 50
    }
)

pending = response.json()
print(f"Pending reviews: {pending['total']}")

for extraction in pending["items"]:
    print(f"- {extraction['original_filename']} ({extraction['created_at']})")

Filter by Date Range

Get extractions from last 24 hours:
Python
import requests
from datetime import datetime, timedelta, timezone

yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")

response = requests.get(
    "https://api.documind.com/api/v1/data/extractions",
    headers={"X-API-Key": API_KEY},
    params={
        "created_after": yesterday,
        "status": "completed",
        "limit": 100
    }
)

recent = response.json()
print(f"Extractions in last 24h: {recent['total']}")

Pagination Example

Iterate through all extractions:
Python
import requests

def get_all_extractions(api_key, filters=None):
    """
    Fetch all extractions matching filters, handling pagination.
    """
    all_extractions = []
    skip = 0
    limit = 100
    
    while True:
        params = {
            "skip": skip,
            "limit": limit,
            **(filters or {})
        }
        
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": api_key},
            params=params
        )
        
        data = response.json()
        all_extractions.extend(data["items"])
        
        # Check if we've fetched everything
        if len(data["items"]) < limit:
            break
        
        skip += limit
    
    return all_extractions

# Usage
filters = {
    "status": "completed",
    "created_after": "2024-01-01T00:00:00Z"
}

all_completed = get_all_extractions(API_KEY, filters)
print(f"Total completed extractions: {len(all_completed)}")

Common Query Patterns

Pattern 1: Polling for Review

# Query by document_id to check specific extraction
params = {
    "document_id": document_id,
    "limit": 1
}

Pattern 2: List All Pending Reviews

# Get extractions waiting for human review
params = {
    "needs_review": True,
    "is_reviewed": False,
    "sort_by": "created_at",
    "sort_order": "asc"  # Oldest first
}

Pattern 3: Get Completed Reviews

# Get extractions reviewed today
from datetime import datetime, timezone

midnight_utc = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)

params = {
    "is_reviewed": True,
    "created_after": midnight_utc.strftime("%Y-%m-%dT%H:%M:%SZ")
}

Pattern 4: Failed Extractions

# Find failed extractions for retry
# ("yesterday" computed as in the date-range example above)
params = {
    "status": "failed",
    "created_after": yesterday,
    "sort_by": "created_at",
    "sort_order": "desc"
}

Pattern 5: Organization-Wide Query (Admin)

# Get all extractions for organization
params = {
    "organization_id": "org_uuid",
    "created_after": "2024-01-01T00:00:00Z",
    "limit": 100
}

Response Codes

200 OK

Successful query, returns paginated results.

400 Bad Request

Invalid query parameters:
{
  "detail": "Invalid sort field: invalid_field"
}

403 Forbidden

Insufficient permissions:
{
  "detail": "You don't have permission to access these extractions"
}

500 Internal Server Error

Server-side error:
{
  "detail": "Failed to retrieve extractions. Please try again later."
}
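Transient 5xx errors are usually safe to retry. A simple client-side backoff sketch (not part of the API itself; retry count and delays are arbitrary choices):
Python
import time
import requests

def get_with_retry(url, headers, params, retries=3):
    """Retry 5xx responses with exponential backoff (1s, 2s, 4s)."""
    response = None
    for attempt in range(retries):
        response = requests.get(url, headers=headers, params=params)
        if response.status_code < 500:
            break  # success or a client error worth surfacing as-is
        time.sleep(2 ** attempt)
    return response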

Best Practices

Filter by document_id when possible for fastest queries:
# ✓ Fast: Direct document lookup
params = {"document_id": doc_id, "limit": 1}

# ✗ Slower: Scan all extractions
params = {"limit": 100}  # Then filter in code
Handle large result sets with pagination:
import requests

API_URL = "https://api.documind.com/api/v1/data/extractions"
HEADERS = {"X-API-Key": API_KEY}

def fetch_page(skip=0, limit=100):
    response = requests.get(API_URL, headers=HEADERS, params={"skip": skip, "limit": limit})
    return response.json()

# Process in batches (process_batch is your own handler)
skip = 0
limit = 100
while True:
    page = fetch_page(skip=skip, limit=limit)
    process_batch(page["items"])

    if len(page["items"]) < limit:
        break
    skip += limit
For dashboard views, cache results briefly:
import time

cache = {}
CACHE_TTL = 30  # seconds

def get_pending_reviews_cached(api_key):
    now = time.time()
    
    if "pending" in cache:
        cached_data, timestamp = cache["pending"]
        if (now - timestamp) < CACHE_TTL:
            return cached_data
    
    # Fetch fresh data; fetch_pending_reviews wraps the GET request shown earlier
    data = fetch_pending_reviews(api_key)
    cache["pending"] = (data, now)
    return data
Choose limits based on use case:
# Polling: Just need one result
params = {"document_id": doc_id, "limit": 1}

# Dashboard: Show recent items
params = {"sort_by": "created_at", "limit": 20}

# Batch export: Process all
params = {"limit": 100}  # Max per page
