Ability to query ALL entities using enterprise-wide search
Completed
In our application, we are trying to use enterprise-wide search to find and crawl all entities stored under all the enterprise accounts. We are aware that we could traverse every user's folders, but that does not seem efficient enough for a large number of objects.
However, the Search API requires either the “query” or the “mdfilters” parameter (or both).
Is there a way to query the Box API to get ALL entities, without any specific filter? We tried a query like “A OR NOT A”, but it doesn’t work for some reason.
Thank you.
-
Official comment
Hello,
You have to provide some sort of search parameter or filter. Why are you looking to crawl through the entire enterprise? If I can understand a bit more about the use case, I may be able to help a little further.
Thanks,
Alex, Box Developer Advocate
-
Hello Alex,
Thanks for your reply.
We are building an enterprise index intended to let users search objects (files, folders, and web links in Box) across different repositories, including but not limited to Box. To populate the index, our crawler application has to scrape all objects. The best option would be to query ALL objects sorted by update timestamp and iterate over the result set.
Thanks and looking forward to hearing from you.
-
Ah! I see. This is an interesting use case.
This could be potentially problematic... because how would the solution know that new content exists? I wouldn't think iterating through the entire enterprise content every time is the most efficient use of resources. Unfortunately, I'm not sure there would be another way to index all the content. The Search API does have a created/modified date parameter, but you would still need something to search by. If there isn't anything, you would need to traverse the folder trees.
Does one user own all content in your enterprise? Or does every user own their own content?
Alex, Box Developer Advocate
-
We consider two phases in this process: initial indexing of the existing data and subsequent indexing of changes.
For the first phase, aside from the fact that we can't pass empty or wildcard queries to get "everything", it would be enough to use the following parameters to sort the result set and define a window:
- https://developer.box.com/reference/get-search/#param-updated_at_range - from the latest crawl point to 'now'
- https://developer.box.com/reference/get-search/#param-sort - set it to 'modified_at'
- https://developer.box.com/reference/get-search/#param-direction - set 'asc'
- https://developer.box.com/reference/get-search/#param-limit - set it to the configured page size
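The parameter set listed above could be assembled roughly like this. It is only a sketch: `build_search_params` is a hypothetical helper, the `query` value is a placeholder (the Search API still needs one), and the upper-case `ASC` and the `offset` paging parameter are assumptions to check against the search reference.

```python
from datetime import datetime, timezone

# Search endpoint the parameter links above refer to.
SEARCH_URL = "https://api.box.com/2.0/search"


def _rfc3339(dt):
    """Format an aware datetime as an RFC 3339 timestamp in UTC."""
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


def build_search_params(query, last_crawl, now, page_size, offset=0):
    """Build the query parameters for one page of the windowed search."""
    return {
        "query": query,  # placeholder: some query or mdfilters is still required
        "updated_at_range": f"{_rfc3339(last_crawl)},{_rfc3339(now)}",
        "sort": "modified_at",
        "direction": "ASC",  # assumption: upper-case value
        "limit": page_size,
        "offset": offset,  # assumption: offset-based paging
    }


params = build_search_params(
    query="report",
    last_crawl=datetime(2021, 1, 1, tzinfo=timezone.utc),
    now=datetime(2021, 1, 2, tzinfo=timezone.utc),
    page_size=100,
)
```

The resulting dictionary would be sent as the query string of a GET request to `SEARCH_URL` with a valid access token, advancing `offset` by `page_size` for each page.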
For the second phase, we were planning to switch over to processing the enterprise event feed to reflect all the changes: https://developer.box.com/guides/events/for-enterprise/
As Box partners, we build software for different customers, so we must consider the most common use case. Unfortunately, that means each user owns their own content, and the number of users might be quite high. That's why I would avoid traversing user folders.
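A rough sketch of how the event-feed phase might look. `drain_events` and `fetch_page` are hypothetical names: `fetch_page` stands in for an authenticated GET on the enterprise events endpoint, and the `stream_type`, `stream_position`, `entries`, and `next_stream_position` fields are assumed to behave as described in the enterprise events guide.

```python
def drain_events(fetch_page, stream_position, limit=100):
    """Read the enterprise event feed until an empty chunk is returned.

    fetch_page is a caller-supplied function (hypothetical) that takes a
    dict of query parameters and returns the decoded JSON response.
    Returns the collected events and the position to resume from.
    """
    events = []
    while True:
        page = fetch_page({
            "stream_type": "admin_logs",
            "stream_position": stream_position,
            "limit": limit,
        })
        entries = page.get("entries", [])
        events.extend(entries)
        # Remember where the feed left off so the next run can resume there.
        stream_position = page.get("next_stream_position", stream_position)
        if not entries:
            return events, stream_position
```

Persisting the returned position between runs would let the crawler pick up only the changes that arrived since the previous poll.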
-
I reached out to some internal folks to get some more insight/recommendations.
Unfortunately, I don't really have great news. The Search API is not going to be useful here because there is no way to search by date alone...
The only way to do it would be to crawl through every user's owned objects, building your index as you go... followed by using the events stream to add/remove from the index as time goes on.
We just don't have an endpoint or easy way to get the information you are wanting all at once today.
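For illustration only, the per-user crawl could be structured something like this. `list_users` and `list_items` are hypothetical callables: the first would wrap the enterprise user listing, the second would list one folder's items on behalf of a given user, with paging and authentication left out. Folder ID "0" is assumed to be each user's root folder.

```python
def crawl_user(list_items, user_id, root_id="0"):
    """Walk one user's folder tree depth-first, yielding every item."""
    stack = [root_id]
    while stack:
        folder_id = stack.pop()
        for item in list_items(user_id, folder_id):
            yield item
            if item["type"] == "folder":
                # Descend into subfolders later.
                stack.append(item["id"])


def crawl_enterprise(list_users, list_items):
    """Yield (user_id, item) pairs for every user in the enterprise."""
    for user in list_users():
        for item in crawl_user(list_items, user["id"]):
            yield user["id"], item
```

Shared folders may show up under more than one user, so a real crawler would probably deduplicate by item type and ID before indexing.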
Alex
-
Alex,
Thank you for your attention to our challenge. We came up with the idea of performing a search request with query = "NOT <VERY_UNIQUE_VALUE>", where <VERY_UNIQUE_VALUE> might be a UUID or any other string that, with very high probability, cannot appear in real user objects. It returns a lot of objects, so we hope this is the solution we were looking for.
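A minimal sketch of generating such a query. `match_all_query` is a hypothetical helper, and using the hex form of a random UUID as the unique value is just one possible choice; whether the result set is truly complete is not something this snippet can confirm.

```python
import uuid


def match_all_query():
    """Build a "NOT <VERY_UNIQUE_VALUE>" query from a fresh random UUID."""
    # A random 32-character hex string is very unlikely to occur in real content.
    return f"NOT {uuid.uuid4().hex}"


query = match_all_query()
```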
Thank you.
-- Oleg.