Box Extract automates data extraction from PDF files at scale by letting selected users or groups create and configure Custom Extract Agents and apply them in Box. Once an agent is active, any PDF uploaded to one of its source folders triggers an extraction process that pulls data from the file and applies it as metadata natively in Box. Use it to capture structured data such as invoice numbers, dates, and supplier IDs.
- Files supported: PDF only
- Enablement: Enterprise admin must enable Box AI before using Box Extract
- Agent types: Standard Extract Agent and Enhanced Extract Agent
- Sources per agent: Up to 10 folders
- Box Extract currently only supports file extraction at the folder root. Subfolders are not processed.
- Agent limit per user: 100
- Box Extract does not currently support sharing Custom Extract Agents with other users. Only the user who created a Custom Extract Agent can access it and run it at scale.
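The file and folder constraints in the list above can be sketched as a small local check. This Python is illustrative only: it models folders and files as plain path strings (Box itself identifies them by ID), and `is_eligible` and `eligible_files` are hypothetical helpers, not part of any Box SDK. Matching the `.pdf` extension case-insensitively is also an assumption.

```python
from pathlib import PurePosixPath
from typing import List

MAX_SOURCE_FOLDERS = 10    # "Sources per agent" limit from the list above
MAX_AGENTS_PER_USER = 100  # "Agent limit per user" from the list above


def is_eligible(source_folder: str, file_path: str) -> bool:
    """Return True if file_path is a PDF sitting directly in source_folder."""
    path = PurePosixPath(file_path)
    # Only the folder root counts: the file's parent must be the source folder itself.
    in_root = path.parent == PurePosixPath(source_folder)
    # Assumption: the extension check ignores case.
    return in_root and path.suffix.lower() == ".pdf"


def eligible_files(source_folders: List[str], file_paths: List[str]) -> List[str]:
    """Keep only the files that at least one source folder would process."""
    if len(source_folders) > MAX_SOURCE_FOLDERS:
        raise ValueError(f"An agent supports at most {MAX_SOURCE_FOLDERS} source folders")
    return [
        p for p in file_paths
        if any(is_eligible(folder, p) for folder in source_folders)
    ]
```

For example, with `/Invoices` as the only source folder, `/Invoices/a.pdf` passes the check, while `/Invoices/2024/b.pdf` (in a subfolder) and `/Invoices/c.docx` (not a PDF) do not.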
Confirm that Box AI and Box Extract are enabled for your organization. For admin enablement, see Configuring Box AI.
- Metadata templates are configured by Admins or Co-Admins via the Admin Console. See Customizing Metadata Templates to learn more.
Create a Custom Extract Agent
- Navigate to Relay and select the Extract tab within Relay.
- Click New+ and select Custom Extract Agent.
- A new Custom Extract Agent is created and saved with a default name (Untitled Extract Agent plus the date and timestamp).
- On the configuration page of the Custom Extract Agent, choose a Metadata Template from the list of available templates. By default, all fields are selected, but you can select or deselect the individual template fields you want to extract data into.
- Click Add selected to continue configuring the Custom Extract Agent.
Configure the agent
- Once you've selected the metadata template and the fields you want to extract data into, you can rename the Custom Extract Agent by clicking the ellipsis (...) and selecting Rename. The name can be up to 255 characters long.
- Select the AI Agent:
  - Box AI Standard Extract Agent: Recommended for high-volume, structured or semi-structured documents with 50 pages or fewer and fewer than 20 extraction fields.
  - Box AI Enhanced Extract Agent: Recommended for complex, large, or unstructured documents with 50 or more pages and more than 20 extraction fields, for advanced use cases that require deep insight and precision.
- Enable the metadata fields you want to extract. At least one field must be enabled before the Custom Extract Agent can be saved. You can also toggle all metadata fields on at once.
- Add AI Instructions (recommended) for more accurate and precise extraction results. Attach short, structured prompts per field including details such as where the field is located, expected format, validation rules, edge cases, and more. The current prompt limit is 1,500 characters.
- Select Extraction Policy by choosing whether the extraction process should preserve existing metadata or overwrite it entirely.
| Note: Replacing a metadata template in a Custom Extract Agent will reset all existing settings, including AI instructions and prompts. |
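To make the Extraction Policy choice concrete, here is a minimal Python sketch of how the two options could behave on a single file's metadata. The semantics are an assumption, not documented Box behavior: "preserve" is modeled as keeping any existing non-empty value and filling only the gaps, and "overwrite" as replacing the stored values with the extracted ones. `ExtractionPolicy` and `apply_policy` are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Dict


class ExtractionPolicy(Enum):
    PRESERVE = "preserve"    # keep existing metadata values
    OVERWRITE = "overwrite"  # replace existing metadata entirely


def apply_policy(
    existing: Dict[str, str],
    extracted: Dict[str, str],
    policy: ExtractionPolicy,
) -> Dict[str, str]:
    """Combine existing and newly extracted metadata under the chosen policy."""
    if policy is ExtractionPolicy.OVERWRITE:
        # Assumed meaning of "overwrite it entirely": extracted values win outright.
        return dict(extracted)
    # Assumed meaning of "preserve": start from the extracted values, then let
    # every existing non-empty value take precedence over its extracted counterpart.
    merged = dict(extracted)
    merged.update({k: v for k, v in existing.items() if v not in (None, "")})
    return merged
```

Under these assumptions, a file that already carries `invoice_number = "INV-001"` keeps that value with the preserve policy even if the agent extracts `"INV-999"`, while the overwrite policy stores `"INV-999"`.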
Activate and attach source folders
- Once the Custom Extract Agent has been saved, you can activate or deactivate it and begin applying it to up to 10 source folders by clicking the ellipsis (...) and selecting Add Source Folder.
  - You can add folders to inactive Custom Extract Agents.
  - Only PDF files added to the root of a source folder are processed. To cover files in a subfolder, navigate to that subfolder and select it as a source folder as well.
- Once the Custom Extract Agent is activated and a source folder is selected, Box Extract automatically extracts data from files uploaded to the source folders and applies that data as metadata to those files.
  - The process monitors the source folders for files newly uploaded or created by active end users with folder access.
  - Deactivating the agent stops new extraction processes. Previously extracted data remains as metadata on the associated files.
Troubleshooting
- If the template has been deleted or is inaccessible, you should select another template or contact your administrator for assistance.
- If enabled template fields have been removed, you should update and save the agent configuration.
- If the source folder or file has been deleted or permission has been lost, you should restore the content or source manually.
- If activation fails due to missing configuration, ensure that at least one field is enabled and that the template is valid.
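The checks described in this article can be gathered into a preflight routine to run before activating an agent. The sketch below is a local model only: `AgentConfig` and `preflight` are hypothetical names, not Box APIs, and applying the 1,500-character limit to each field's AI instructions (rather than to all prompts combined) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

MAX_NAME_LENGTH = 255          # rename limit stated in this article
MAX_INSTRUCTION_LENGTH = 1500  # prompt limit; assumed to apply per field
MAX_SOURCE_FOLDERS = 10        # source folders per agent


@dataclass
class AgentConfig:
    name: str
    enabled_fields: List[str]
    instructions: Dict[str, str] = field(default_factory=dict)
    source_folders: List[str] = field(default_factory=list)


def preflight(config: AgentConfig, template_fields: Optional[Set[str]]) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid.

    Pass template_fields=None when the template is deleted or inaccessible.
    """
    if template_fields is None:
        return ["Template is deleted or inaccessible: select another template "
                "or contact your administrator."]
    problems: List[str] = []
    if not config.enabled_fields:
        problems.append("Enable at least one field.")
    removed = sorted(set(config.enabled_fields) - template_fields)
    if removed:
        problems.append("Fields no longer in the template: " + ", ".join(removed)
                        + ". Update and save the agent configuration.")
    if len(config.name) > MAX_NAME_LENGTH:
        problems.append(f"Name exceeds {MAX_NAME_LENGTH} characters.")
    for field_name, text in config.instructions.items():
        if len(text) > MAX_INSTRUCTION_LENGTH:
            problems.append(f"AI instructions for '{field_name}' exceed "
                            f"{MAX_INSTRUCTION_LENGTH} characters.")
    if len(config.source_folders) > MAX_SOURCE_FOLDERS:
        problems.append(f"More than {MAX_SOURCE_FOLDERS} source folders attached.")
    return problems
```

Returning a list of messages instead of raising on the first failure lets a user fix every issue in one pass, which matches how the troubleshooting items above are independent of one another.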
Best practices
- Use consistent, descriptive names for Custom Extract Agents to simplify large-scale management.
- In the AI instructions, describe the document being extracted in as much detail as possible to get more accurate and precise results.
- If you are not receiving the extraction results you want, try adjusting the AI instructions or revising your prompts until you achieve desired results.
- Select the Box AI Standard Extract Agent for consistent, high-volume extraction processes or the Box AI Enhanced Extract Agent for more complex documents requiring higher accuracy.
- Confirm folder permissions and template visibility with Admins or Co-Admins before activating agents.
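The agent-selection guidance can be expressed as a simple rule of thumb. The Python below is a sketch built only on the thresholds stated in this article (50 pages, 20 extraction fields). `recommend_agent` is a hypothetical helper, and sending mixed cases (for example, a short document with many fields) to the Enhanced agent is an assumption, since the article does not cover them.

```python
STANDARD = "Box AI Standard Extract Agent"
ENHANCED = "Box AI Enhanced Extract Agent"

PAGE_THRESHOLD = 50   # pages, from the agent descriptions in this article
FIELD_THRESHOLD = 20  # extraction fields, from the same descriptions


def recommend_agent(page_count: int, field_count: int) -> str:
    """Suggest an agent type from document length and number of fields."""
    if page_count <= PAGE_THRESHOLD and field_count < FIELD_THRESHOLD:
        return STANDARD
    # Assumption: anything outside the Standard profile goes to Enhanced.
    return ENHANCED
```

A 12-page invoice with 8 fields maps to the Standard agent under this rule, while a 120-page contract with 35 fields maps to the Enhanced agent. Document structure and required accuracy still matter, so treat the result as a starting point rather than a decision.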