We have released support for confidence scores on the structured endpoint of the Box Extract APIs. A confidence score is a numerical value between 0 and 1 that estimates the likelihood that an extracted field value is correct. The API response returns the score as a decimal. For example, a value of 0.875 means the agent is 87.5% confident in the extracted value. The API response also includes a confidence level (Low, Medium, or High) that describes the confidence in the extracted results in qualitative terms. The scores are calibrated to approximate real-world correctness probabilities and are produced by aggregating multiple LLM responses and measuring their consistency. This lets you build automated decision-making and human-in-the-loop workflows based on estimated extraction accuracy.
Confidence scores, confidence levels, and recommended actions should be interpreted based on your risk tolerance, the criticality of the use case, and the degree to which you have tested and validated the extraction results. We recommend validating confidence score thresholds against your specific document types and accuracy requirements. Developers can use confidence scores to flag certain extracted values for human review if they fall below a specific numeric threshold.
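As a rough illustration of flagging values for human review, the sketch below filters a set of scores against a threshold you choose. The function name `fields_needing_review`, the field names, and the flat field-to-score dictionary are all hypothetical; they are not the shape of the Box API response, which you would need to map into this form yourself.

```python
from typing import Dict, List


def fields_needing_review(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of fields whose confidence score is below the threshold.

    `scores` is a hypothetical field-name -> score mapping, not the raw API response.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Keep the original field order so reviewers see fields as they were extracted.
    return [name for name, score in scores.items() if score < threshold]


# Made-up scores for illustration only.
example_scores = {"invoice_number": 0.97, "total_amount": 0.875, "due_date": 0.62}
flagged = fields_needing_review(example_scores, threshold=0.90)
print(flagged)  # ['total_amount', 'due_date']
```

The threshold is a parameter rather than a constant so that it can be tuned per document type after validation.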
Box provides suggested thresholds as a starting point:
- Scores of 0.90 and above indicate high confidence, meaning these extractions typically need minimal review.
- Scores from 0.70 up to, but not including, 0.90 suggest medium confidence, meaning these extractions typically need light review.
- Scores below 0.70 signal low confidence, and in these situations, manual review is recommended.
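The suggested thresholds can be expressed as a small lookup, sketched below. The names `suggested_level` and `SUGGESTED_REVIEW` are hypothetical, and the function only mirrors the bands listed here; it is not how the API computes the confidence level it returns. This sketch treats every score from 0.70 up to, but not including, 0.90 as Medium, which is one reading of the bands above.

```python
def suggested_level(score: float) -> str:
    """Map a confidence score to the suggested band: High, Medium, or Low."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.90:
        return "High"
    if score >= 0.70:
        # Assumption: anything from 0.70 up to (not including) 0.90 is Medium.
        return "Medium"
    return "Low"


# Review effort suggested for each band, paraphrasing the list above.
SUGGESTED_REVIEW = {
    "High": "minimal review",
    "Medium": "light review",
    "Low": "manual review recommended",
}

level = suggested_level(0.875)
print(level, "->", SUGGESTED_REVIEW[level])  # Medium -> light review
```

Treat these cutoffs as defaults to adjust once you have validated them against your own document types and accuracy requirements.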
Confidence scores are supported for the following Google Gemini LLMs:
- gemini-2.5-flash
- gemini-2.5-pro
To learn more about this release, please see Metadata Confidence Level.