What are Reviews?
When using Advanced or VLM extraction modes, Documind automatically analyzes the confidence of extracted data. If the confidence of any required field falls below your specified threshold, the extraction is flagged for human review.
This creates a human-in-the-loop workflow where:
1. AI extracts data with confidence scoring
2. Low-confidence fields are automatically flagged
3. A human reviewer corrects the flagged fields
4. Automation continues with the corrected data
Why Use Reviews?
Accuracy Assurance: Catch AI errors before they propagate through your automation pipeline.
Cost Optimization: Only review documents that need it, not every extraction.
Audit Trail: Track who reviewed what and when for compliance.
Continuous Improvement: Reviewed data helps improve future extractions.
How Flagging Works
Confidence Calculation
For each extracted field, confidence is calculated as:
confidence = (0.4 × lexical_similarity) + (0.6 × semantic_similarity)
Lexical similarity: How well the extracted text matches across models
Semantic similarity: How similar the meaning is across model outputs
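As a minimal sketch of the blend (assuming both similarity scores are already expressed on a 0-100 scale; the helper name is illustrative, not part of the API):

def blended_confidence(lexical_similarity, semantic_similarity):
    # Weights mirror the formula above: 40% lexical, 60% semantic
    return (0.4 * lexical_similarity) + (0.6 * semantic_similarity)

# Strong semantic agreement can offset a weaker lexical match
print(blended_confidence(78.0, 93.0))  # 87.0 -> above an 85 threshold, so not flagged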
Review Threshold
Set your threshold based on risk tolerance:
extract_config = {
    "schema": { ... },
    # Advanced mode - don't set model or extraction_mode
    "review_threshold": 85  # Flag fields below 85% confidence
}
| Threshold | Use Case | Review Rate |
| --- | --- | --- |
| 90-100 | Critical financial data | ~30-40% |
| 80-89 | Standard business documents | ~15-25% |
| 70-79 | Non-critical extraction | ~5-15% |
| < 70 | Not recommended | High |
Start with an 80% threshold and adjust based on your accuracy requirements and review capacity.
Required Fields Only
Only required fields trigger review flags:
{
  "named_entities": {
    "invoice_number": { "type": "string" },
    "optional_notes": { "type": "string" }
  },
  "required": ["invoice_number"]  // Only this field can trigger review
}
If invoice_number has low confidence → needs_review = true
If optional_notes has low confidence → No review needed
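As a rough illustration of this rule (should_flag is a hypothetical helper, not part of the API), the flagging decision works like this:

def should_flag(field_name, confidence, required_fields, review_threshold):
    # Only required fields below the threshold trigger review
    return field_name in required_fields and confidence < review_threshold

print(should_flag("invoice_number", 72.0, ["invoice_number"], 85))  # True  -> needs_review
print(should_flag("optional_notes", 60.0, ["invoice_number"], 85))  # False -> no review needed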
Response Structure
Without Review
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00
  },
  "needs_review": false,
  "needs_review_metadata": {}
}
✓ All required fields above threshold → Use results immediately
With Review
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "results": {
    "invoice_number": "INV-2024-001",
    "vendor_name": "Acme Corp",
    "line_items": [
      { "description": "Service A", "amount": 500 },
      { "description": "Service B", "amount": 750 }
    ]
  },
  "needs_review": true,
  "needs_review_metadata": {
    "confidence_scores": {
      "invoice_number": 95.2,
      "vendor_name": 88.5,
      "line_items": {
        "0": { "description": 92.1, "amount": 95.8 },
        "1": { "description": 72.3, "amount": 89.5 }
      }
    },
    "review_flags": {
      "invoice_number": false,
      "vendor_name": false,
      "line_items": {
        "0": { "description": false, "amount": false },
        "1": { "description": true, "amount": false }
      }
    }
  }
}
⚠️ One field below threshold → Wait for human review before using data
Handling Reviews in Automation
Decision Flow
import requests

def process_extraction(document_id, schema, api_key):
    # Extract with Advanced mode
    response = requests.post(
        f"https://api.documind.com/api/v1/extract/{document_id}",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={
            "schema": schema,
            # Advanced mode - don't set model or extraction_mode
            "review_threshold": 85,
            "prompt": "Extract data accurately"
        }
    )
    result = response.json()

    if result["needs_review"]:
        # Option 1: Wait for review (polling) - direct team to UI
        print("⚠️ Review needed - direct team to: https://app.documind.com/review")
        reviewed_data = wait_for_review(document_id, api_key)
        return reviewed_data
    else:
        # Option 2: Use immediate results
        return result["results"]
Three Approaches
Polling (Recommended)
Best for: Automation pipelines, background jobs. Poll until is_reviewed = true:
import requests
import time

def wait_for_review(document_id, api_key, timeout=300):
    start = time.time()
    while (time.time() - start) < timeout:
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": api_key},
            params={"document_id": document_id, "limit": 1}
        )
        data = response.json()
        if data["items"] and data["items"][0]["is_reviewed"]:
            return data["items"][0]["reviewed_results"]
        time.sleep(10)  # Poll every 10 seconds
    raise TimeoutError("Review timeout")
See Polling Pattern for details.
Webhook (Future)
Best for: Event-driven architectures. Receive a notification when the review is complete:
# Configure webhook endpoint
webhook_config = {
    "url": "https://your-app.com/webhooks/review-complete",
    "events": ["review.completed"]
}
Webhooks are planned for a future release.
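Once webhooks ship, a receiver could look roughly like the sketch below. This is purely illustrative: the endpoint path reuses the config above, and the event name and payload fields (document_id, reviewed_results) are assumptions, not a documented contract.

from flask import Flask, request, jsonify

app = Flask(__name__)

def process_data(reviewed_results):
    # Placeholder for your downstream handler
    print("Reviewed results received:", reviewed_results)

@app.route("/webhooks/review-complete", methods=["POST"])
def review_complete():
    # Hypothetical payload shape - adjust once the webhook contract is published
    event = request.get_json()
    if event.get("event") == "review.completed":
        process_data(event.get("reviewed_results"))
    return jsonify({"status": "ok"}), 200

if __name__ == "__main__":
    app.run(port=8000)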
Manual Check
Best for: Batch processing, scheduled jobs. Check review status at specific times:
import requests

API_KEY = "your_api_key"

# Morning: Extract documents
for doc in daily_documents:
    requests.post(
        f"https://api.documind.com/api/v1/extract/{doc.id}",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"schema": schema, "prompt": "Extract data"}
    )

# Afternoon: Process reviewed documents
response = requests.get(
    "https://api.documind.com/api/v1/data/extractions",
    headers={"X-API-Key": API_KEY},
    params={
        "is_reviewed": True,
        "created_after": today_start
    }
)
extractions = response.json()["items"]
for extraction in extractions:
    process_data(extraction["reviewed_results"])
Identifying Fields Needing Review
Parse the needs_review_metadata to identify problematic fields:
def find_low_confidence_fields(metadata, threshold=85, path=""):
    """Recursively find fields below confidence threshold."""
    low_confidence = []
    scores = metadata.get("confidence_scores", {})
    flags = metadata.get("review_flags", {})

    for field, flag_value in flags.items():
        current_path = f"{path}.{field}" if path else field
        if isinstance(flag_value, bool) and flag_value:
            confidence = scores.get(field, 0)
            low_confidence.append({
                "field": current_path,
                "confidence": confidence
            })
        elif isinstance(flag_value, dict):
            # Recurse into nested structure
            nested = find_low_confidence_fields(
                {"confidence_scores": scores.get(field, {}),
                 "review_flags": flag_value},
                threshold,
                current_path
            )
            low_confidence.extend(nested)
    return low_confidence

# Usage
if result["needs_review"]:
    flagged = find_low_confidence_fields(result["needs_review_metadata"])
    print(f"Fields needing review: {len(flagged)}")
    for field in flagged:
        print(f"  - {field['field']}: {field['confidence']:.1f}%")
Best Practices
Set Appropriate Thresholds
Match threshold to business risk:
# Financial documents - strict
financial_config = {
    "review_threshold": 90,
    "required": ["amount", "account_number", "date"]
}

# General documents - balanced
general_config = {
    "review_threshold": 80,
    "required": ["document_type", "reference_id"]
}
Mark Critical Fields as Required
Only flag fields that truly need verification:
{
  "named_entities": {
    "invoice_number": { ... },  // Critical
    "total_amount": { ... },    // Critical
    "notes": { ... }            // Not critical
  },
  "required": ["invoice_number", "total_amount"]
}
Implement Timeout Handling
Don’t wait indefinitely for reviews:
try:
    reviewed = wait_for_review(doc_id, api_key, timeout=600)
    process_data(reviewed)
except TimeoutError:
    # Escalate or use original extraction
    log_for_manual_processing(doc_id)
Provide Context to Reviewers
Include the original document and extraction prompt:
review_request = {
    "extraction_id": extraction_id,
    "document_url": get_document_download_url(doc_id),
    "prompt": extraction_config["prompt"],
    "schema": extraction_config["schema"],
    "flagged_fields": find_low_confidence_fields(metadata)
}
Monitoring Review Metrics
Track these metrics to optimize your review workflow:
def calculate_review_metrics(extractions):
    total = len(extractions)
    needs_review = sum(1 for e in extractions if e["needs_review"])
    reviewed = sum(1 for e in extractions if e["is_reviewed"])
    return {
        "review_rate": needs_review / total * 100,
        "completion_rate": reviewed / needs_review * 100 if needs_review > 0 else 0,
        "avg_confidence": sum(
            get_avg_confidence(e["needs_review_metadata"])
            for e in extractions
        ) / total
    }
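The get_avg_confidence helper used above isn't defined in this guide; a minimal sketch, assuming the nested confidence_scores shape shown in the response examples, could be:

def get_avg_confidence(metadata):
    # Average every numeric score in a (possibly nested) confidence_scores dict
    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, (int, float)):
            yield node

    scores = list(walk(metadata.get("confidence_scores", {})))
    return sum(scores) / len(scores) if scores else 0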
Aim for a 15-25% review rate for most business documents. If higher, consider lowering your threshold or improving your schema descriptions.
Common Scenarios
Scenario 1: All Fields High Confidence
{
  "needs_review": false,
  "needs_review_metadata": {}
}
Action: Use results immediately in your automation
Scenario 2: Optional Field Low Confidence
{
  "needs_review": false,  // No review needed
  "needs_review_metadata": {
    "confidence_scores": { "notes": 65.2 },  // Low but not required
    "review_flags": { "notes": false }
  }
}
Action: Use results immediately. Optional field doesn’t trigger review.
Scenario 3: Required Field Low Confidence
{
  "needs_review": true,
  "needs_review_metadata": {
    "confidence_scores": { "invoice_number": 72.1 },
    "review_flags": { "invoice_number": true }
  }
}
Action: Poll for review completion, then use reviewed_results
Next Steps