While schemas define what to extract, prompts guide how to extract it. A well-crafted prompt can significantly improve extraction accuracy, especially for edge cases and ambiguous documents.
{ "schema": {...}, "prompt": "Extract invoice information. Use MM/DD/YYYY format for dates. If a field is not found, return null rather than guessing."}
Extract invoice information from this document. The document is a standard business invoice. Focus on the header section for invoice number and date. Use MM/DD/YYYY format for all dates. If any field is unclear or missing, return null.
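One way to keep a structured prompt like this maintainable is to store each instruction as its own sentence and join them when building the request. The sketch below assumes the request body uses the `schema` and `prompt` keys shown earlier; `build_request_body` is a hypothetical helper, not part of any SDK.

```python
import json

# Each instruction is a separate entry so it can be edited or reordered on its own.
PROMPT_PARTS = [
    "Extract invoice information from this document.",
    "The document is a standard business invoice.",
    "Focus on the header section for invoice number and date.",
    "Use MM/DD/YYYY format for all dates.",
    "If any field is unclear or missing, return null.",
]


def build_request_body(schema, prompt_parts):
    """Join non-empty prompt sentences and pair them with the schema as a JSON string."""
    prompt = " ".join(part.strip() for part in prompt_parts if part and part.strip())
    return json.dumps({"schema": schema, "prompt": prompt})
```

Pass your own schema object as the first argument. Changing a single instruction then only touches one list entry.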
Documents often have multiple similar values. Guide the extraction:
{ "prompt": "Extract the TOTAL amount (not subtotal, not tax). This is usually the last amount in the document, often labeled 'Total', 'Amount Due', or 'Total Amount'."}
Tell the AI how to handle missing or unusual data:
{ "prompt": "Extract customer information. If customer name is not found, check for 'Bill To' or 'Client' labels. If still not found, return null rather than using vendor name."}
{ "prompt": "Extract all line items from the invoice table. Include ALL rows, even if some fields are empty. Do not skip rows or merge similar items."}
Extract invoice information accurately. Invoice number is typically in the header, labeled 'Invoice #' or 'Invoice Number'. Dates should be in YYYY-MM-DD format. For line items, extract ALL rows from the items table including item description, quantity, unit price, and total. The total amount is the final amount including tax, usually at the bottom right. If a field is not clearly visible, return null.
Extract receipt information from this transaction. Merchant name is usually at the top in large text. Date and time are typically near the top or bottom. Extract all individual items purchased with their prices. The total is the final amount paid, including tax and tip if applicable. Look for payment method near the bottom (cash, card, etc.).
Extract form data carefully, preserving exact field names from the form. For checkboxes, return true if checked, false if unchecked. For dropdowns/selections, return the selected option exactly as written. Leave optional fields null if not filled in. Pay attention to multi-part fields like full name (first/last) or address (street/city/state/zip).
Extract key contract information. Contract number and date are typically in the header. Party names are usually labeled 'Party A', 'Party B', 'Client', 'Vendor', etc. Effective date and end date are in the terms section. Extract key obligations and terms exactly as written. For monetary amounts, include the full amount with any payment schedule details.
"Invoice number is in the top-right corner of the first page.""Line items are in the table in the middle of the page.""Total amount is at the bottom-right, usually in bold."
"Extract state name. Convert to 2-letter abbreviation (e.g., California → CA, New York → NY).""Extract phone number. Remove all formatting, return only digits."
{ "prompt": "This may be an invoice, receipt, or purchase order. First identify the document type, then extract relevant information based on that type. For invoices, focus on invoice number and due date. For receipts, focus on transaction time. For POs, focus on PO number and delivery date."}
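When the document type is already known from upstream metadata, a shorter type-specific prompt may work better than one that covers every case. The sketch below is one hypothetical way to choose between the two; `select_prompt` and the type keys are assumptions made for illustration.

```python
# Type-specific hints taken from the combined prompt above.
TYPE_HINTS = {
    "invoice": "Focus on invoice number and due date.",
    "receipt": "Focus on transaction time.",
    "purchase_order": "Focus on PO number and delivery date.",
}

# Fallback used when the document type is not known in advance.
GENERIC_PROMPT = (
    "This may be an invoice, receipt, or purchase order. "
    "First identify the document type, then extract relevant information "
    "based on that type."
)


def select_prompt(document_type=None):
    """Return a short type-specific prompt, or the generic one if the type is unknown."""
    hint = TYPE_HINTS.get(document_type)
    if hint is None:
        return GENERIC_PROMPT
    label = document_type.replace("_", " ")
    return f"Document type: {label}. Extract relevant information. {hint}"
```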
{ "prompt": "Document may be in English, Spanish, or French. Extract information regardless of language, but return all values in English. Translate field values where appropriate."}
{ "prompt": "Extract invoice information. Verify that line item totals sum to the subtotal. Invoice number should not contain spaces. Date should be in the past, not future."}
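Rules stated in a prompt are instructions, not guarantees, so it can help to repeat the same checks on the returned data. The sketch below assumes the result is a dict with hypothetical field names (`invoice_number`, `invoice_date`, `subtotal`, `line_items` with a `total` per row), numeric amounts without currency symbols, and dates in YYYY-MM-DD format; adjust these to match your schema.

```python
from datetime import date, datetime
from decimal import Decimal


def validate_invoice(invoice, today=None):
    """Return a list of problems found; an empty list means every check passed."""
    today = today or date.today()
    problems = []

    # Rule 1: line item totals should sum to the subtotal.
    items = invoice.get("line_items") or []
    totals = [Decimal(str(item["total"])) for item in items if item.get("total") is not None]
    subtotal = invoice.get("subtotal")
    if subtotal is not None and totals:
        items_sum = sum(totals)
        if items_sum != Decimal(str(subtotal)):
            problems.append(f"line items sum to {items_sum}, subtotal is {subtotal}")

    # Rule 2: invoice number should not contain spaces.
    number = invoice.get("invoice_number")
    if number is not None and " " in number:
        problems.append("invoice number contains spaces")

    # Rule 3: invoice date should not be in the future.
    raw_date = invoice.get("invoice_date")
    if raw_date is not None:
        try:
            parsed = datetime.strptime(raw_date, "%Y-%m-%d").date()
        except ValueError:
            problems.append(f"invoice date {raw_date!r} is not in YYYY-MM-DD format")
        else:
            if parsed > today:
                problems.append("invoice date is in the future")

    return problems
```

Fields that came back as null are skipped rather than reported, which matches the "return null rather than guessing" convention used in the prompts.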
{ "prompt": "This is a medical invoice. 'Procedure codes' are 5-digit numbers. 'Diagnosis codes' start with a letter. Extract both separately. Insurance information is typically in a box on the right side."}
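The two code patterns described in this prompt are simple enough to check after extraction, which can catch values placed in the wrong field. The sketch below encodes only the two rules as stated (five digits, or starting with a letter); it is not a validator for any real medical coding standard, and `classify_code` is a hypothetical helper.

```python
import re

PROCEDURE_CODE = re.compile(r"\d{5}")        # exactly five digits
DIAGNOSIS_CODE = re.compile(r"[A-Za-z]\S*")  # starts with a letter, no whitespace


def classify_code(code):
    """Return 'procedure', 'diagnosis', or None according to the prompt's two rules."""
    code = code.strip()
    if PROCEDURE_CODE.fullmatch(code):
        return "procedure"
    if DIAGNOSIS_CODE.fullmatch(code):
        return "diagnosis"
    return None
```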
{
  "schema": {
    "type": "object",
    "named_entities": {
      "invoice_date": {
        "type": "string",
        "description": "Invoice issue date in YYYY-MM-DD format"  // Schema describes the field
      }
    }
  },
  "prompt": "Extract the invoice date from the header section. It's usually labeled 'Invoice Date' or just 'Date'. Use YYYY-MM-DD format."  // Prompt provides extraction hints
}
prompts = [
    "Extract invoice information",
    "Extract invoice information. Look for invoice number in the top-right corner.",
    "Extract invoice information accurately. Invoice number is in the header, typically top-right. Use YYYY-MM-DD for dates.",
]

# extract(), document_id, and schema are assumed to be defined by your client code
results = []
for prompt in prompts:
    result = extract(document_id, schema, prompt)
    results.append(result)

# Compare accuracy
for i, result in enumerate(results):
    print(f"Prompt {i+1}: {result}")
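Printing raw results makes comparison a manual job. If you have a hand-verified set of expected values for the test document, the comparison can be scored instead. The sketch below assumes each result is a flat dict of field names to values; `field_accuracy` and `rank_prompts` are hypothetical helpers, and exact-match scoring is only one possible metric.

```python
def field_accuracy(result, expected):
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items() if result.get(key) == value)
    return correct / len(expected)


def rank_prompts(results, expected):
    """Return (prompt_index, accuracy) pairs sorted from best to worst."""
    scores = [(i, field_accuracy(result, expected)) for i, result in enumerate(results)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Scoring against several documents rather than one gives a more reliable picture, since a prompt can overfit to a single layout.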
{ "prompt": "Please kindly extract the invoice information from this document. We need you to carefully look at the document and find the invoice number, which might be in the header or somewhere else, and then also find the date, making sure to format it correctly, and also find the vendor name, being careful not to confuse it with the customer name, and also..."}
Problem: Too much information confuses the model. Solution: Be concise and direct.
Extract {document_type} information from this document. Focus on accuracy over speed. For dates, use {date_format} format. For amounts, include only numeric values without currency symbols. If a field is not found or unclear, return null. Verify that extracted values match the document exactly.
Extract {fields} from this {document_type}. Follow these rules strictly:
1. {Rule 1}
2. {Rule 2}
3. {Rule 3}
If any rule cannot be followed, return null for that field. Double-check all numeric values for accuracy.
Extract available information from this document. The document may be a {type1}, {type2}, or {type3}. Adapt extraction based on document type. Extract as many fields as possible, leaving missing fields as null. Prioritize accuracy over completeness.
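Templates like these can be filled in with ordinary string formatting. The sketch below uses the first template and Python's built-in `str.format`; `render_prompt` is a hypothetical helper. A missing placeholder value raises `KeyError`, which prevents sending a prompt that still contains an unfilled `{placeholder}`.

```python
GENERAL_TEMPLATE = (
    "Extract {document_type} information from this document. "
    "Focus on accuracy over speed. "
    "For dates, use {date_format} format. "
    "For amounts, include only numeric values without currency symbols. "
    "If a field is not found or unclear, return null. "
    "Verify that extracted values match the document exactly."
)


def render_prompt(template, **values):
    """Fill every {placeholder} in the template; raises KeyError if one is missing."""
    return template.format(**values)


invoice_prompt = render_prompt(
    GENERAL_TEMPLATE,
    document_type="invoice",
    date_format="YYYY-MM-DD",
)
```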