Avoiding parallel folders created by concurrent uploads via the SDK
We use the Box Node SDK to upload files whenever a file arrives in one of our S3 buckets. As part of each upload, we first build the desired folder structure in Box by creating any folders (matched by name) that don't yet exist along the desired folderpath. I recently discovered that when concurrent uploads share part of their folderpaths, there's a race condition, and as a consequence sibling folders with the same name are created.
I'd prefer not to serialize my requests to Box, unless there's some support in the SDK for queuing. I searched these forums for advice on post-processing to merge such folders, or on avoiding the duplicates in the first place, but found none.
Can you recommend a best practice?
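One in-process mitigation that stops short of serializing every request is to coalesce concurrent get-or-create calls for the same parent/name pair, so only one of them reaches Box. The following is a minimal sketch under assumptions: `FolderStore`, `findChild`, `createChild`, and `FolderCoalescer` are hypothetical names, not Box SDK APIs, and the SDK's folder listing and creation calls would have to be wrapped behind them.

```typescript
// Hypothetical minimal interface; wrap the real Box SDK calls behind it.
interface FolderStore {
  findChild(parentId: string, name: string): Promise<string | undefined>;
  createChild(parentId: string, name: string): Promise<string>;
}

class FolderCoalescer {
  private readonly store: FolderStore;
  // In-flight get-or-create promises, keyed by "parentId/name".
  private readonly inFlight = new Map<string, Promise<string>>();

  constructor(store: FolderStore) {
    this.store = store;
  }

  getOrCreate(parentId: string, name: string): Promise<string> {
    const key = `${parentId}/${name}`;
    const pending = this.inFlight.get(key);
    if (pending) return pending; // join the call that is already in progress

    const p = (async () => {
      const existing = await this.store.findChild(parentId, name);
      return existing ?? (await this.store.createChild(parentId, name));
    })();

    this.inFlight.set(key, p);
    // Forget the entry once the call settles, whether it succeeded or failed.
    const cleanup = () => {
      this.inFlight.delete(key);
    };
    p.then(cleanup, cleanup);
    return p;
  }
}
```

This only helps when all uploads run in a single Node process; separate processes or hosts would still race with each other.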
-
Yes, some names high in our folderpaths refer to entities that persist over
months. We deliver updates about those entities over months via files.
Sometimes the updates occur within a few seconds of each other, leading to
a race condition about whether a folder has been created yet or not.
I believe AWS SQS will solve this by linearizing the concurrent updates.
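As a rough sketch of what that could look like, assuming an SQS FIFO queue: messages that share a `MessageGroupId` are delivered in order, so grouping by the top-level folder name keeps updates for one entity from racing while other entities still proceed in parallel. `UploadEvent`, `toFifoMessage`, and the field names on the event are hypothetical; the returned object mirrors the fields of an SQS `SendMessage` request and would be handed to whatever SQS client is in use.

```typescript
// Hypothetical event shape: where the file sits in S3 and where it should land in Box.
interface UploadEvent {
  bucket: string;
  key: string;
  boxFolderPath: string; // e.g. "EntityA/2020/reports"
}

// Fields mirror an SQS SendMessage request for a FIFO queue.
interface FifoMessage {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

function toFifoMessage(queueUrl: string, event: UploadEvent): FifoMessage {
  // The first non-empty path segment is taken to be the long-lived entity folder.
  const segments = event.boxFolderPath.split("/").filter((s) => s.length > 0);
  const topLevel = segments.length > 0 ? segments[0] : "root";
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(event),
    // Same group => processed in order, so folder creation cannot interleave.
    MessageGroupId: topLevel,
    // Assumption: bucket/key identifies an event. Add a version or ETag if
    // repeated updates to the same key must each be delivered.
    MessageDeduplicationId: `${event.bucket}/${event.key}`,
  };
}
```

Both IDs are limited to 128 characters, so long folder names or S3 keys would likely need hashing before use.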
I'm no longer looking for a solution in the Box SDKs.
-
My team's been running into the same issue.
In short, when files are uploaded to one of our systems, we publish events to a queue. Items on this queue are consumed by multiple consumers. Each consumer checks whether or not a required folder structure is in place in a predetermined project folder. If the project folder does not contain the desired folder structure, it is created. The consumer then uploads the file into a target child folder.
Our Box folder structure looks something like this:
+-- projects/
    +-- Project1/
    +-- Project2/
    +-- ...etc
A consumer creates the following folder structure within a project folder when it detects that it doesn't exist:
+-- projects/
    +-- Project1/
        +-- Data/
        +-- Foo/
        +-- Bar/
        +-- ...etc
We see duplicate sibling folders being created when the following occurs:
- Multiple files are uploaded to our system around the same time.
- Each uploaded file is related and therefore belongs somewhere within the same project folder.
- The target project folder does not contain the desired folder structure, so each consumer attempts to create the folder structure.
The result is a project folder that looks like this, for example:
+-- projects/
    +-- Project1/
        +-- Data/
        +-- Foo/
        +-- Data/
        +-- Data/
        +-- Bar/
We are also seeing a similar issue when the same file is uploaded multiple times in rapid succession. In that case, there are sometimes multiple files with the same name uploaded to the same parent folder.
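To gauge how widespread the duplicates are, one option is to detect them after the fact by grouping a folder's children by type and exact name. This is only a sketch over a plain array: `BoxItem` and `findDuplicates` are hypothetical names, and the array would come from whatever folder-listing call is already in use.

```typescript
// Hypothetical, trimmed-down view of an entry in a folder listing.
interface BoxItem {
  id: string;
  type: "file" | "folder";
  name: string;
}

// Returns one group per (type, name) pair that occurs more than once.
function findDuplicates(children: BoxItem[]): BoxItem[][] {
  const groups = new Map<string, BoxItem[]>();
  for (const item of children) {
    const key = `${item.type}:${item.name}`;
    const group = groups.get(key) ?? [];
    group.push(item);
    groups.set(key, group);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}
```

Applied to the Project1 listing above, this would report a single group holding the three Data folders; deciding which one to keep and how to merge the others is left open.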
This appears to be unexpected behavior from the Box API. From what I could find, the Box API docs don't mention anything about race conditions, limiting concurrency, or atomic folder/file creation.
Can you provide us with any recommendations? Is there a way to atomically create a folder?
We are investigating how we should refactor our consumers right now. Any tips would be greatly appreciated. Thanks!
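One refactoring we are considering is to skip the check-then-create step and instead attempt the create directly, treating a name-conflict error as "someone else already made it". The sketch below assumes that a conflict surfaces as an error with `statusCode` 409 and the existing folder listed under `response.body.context_info.conflicts`; that error shape, and the names `ConflictError`, `CreateFolder`, and `createOrReuse`, are assumptions to verify against the SDK in use.

```typescript
// Assumed shape of a name-conflict error; verify against the SDK you use.
interface ConflictError {
  statusCode?: number;
  response?: { body?: { context_info?: { conflicts?: Array<{ id: string }> } } };
}

// Hypothetical wrapper type around the SDK's folder-creation call.
type CreateFolder = (parentId: string, name: string) => Promise<{ id: string }>;

async function createOrReuse(
  create: CreateFolder,
  parentId: string,
  name: string
): Promise<string> {
  try {
    return (await create(parentId, name)).id;
  } catch (err) {
    const e = (err ?? {}) as ConflictError;
    const existing =
      e.statusCode === 409 ? e.response?.body?.context_info?.conflicts?.[0] : undefined;
    if (existing) return existing.id; // another consumer won the race; reuse its folder
    throw err; // anything else is a real failure
  }
}
```

Given the duplicates we have already observed, this may narrow the race window rather than close it, since it relies on Box rejecting the second create.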
-
Since I posted the original question 7 months ago, the only workaround we've found is to linearize our requests to the Box SDK. This is...not great.
Ideally, the SDK's method for uploading a file would allow specifying the full target filepath and would use-or-create every foldername on that path, and provide an option to either overwrite any conflicting file in the leaf folder or update one of the two filenames to avoid collision.
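For illustration, the use-or-create walk described above can be sketched as a small helper on the caller's side. `ensurePath` and `GetOrCreate` are hypothetical names; `getOrCreate` stands in for whatever lookup-then-create logic is already in place, and `rootId` would be "0" for the Box root folder.

```typescript
// Hypothetical callback: returns the id of the child folder `name` under
// `parentId`, creating it first if it does not exist.
type GetOrCreate = (parentId: string, name: string) => Promise<string>;

// Walks a slash-separated folderpath and returns the id of the leaf folder.
async function ensurePath(
  getOrCreate: GetOrCreate,
  rootId: string,
  path: string
): Promise<string> {
  let currentId = rootId;
  const segments = path.split("/").filter((s) => s.length > 0);
  for (const segment of segments) {
    // Each level depends on the id of the one above it, so the walk is sequential.
    currentId = await getOrCreate(currentId, segment);
  }
  return currentId;
}
```

The helper does not remove the race by itself; it is only as safe as the `getOrCreate` passed in, which is why SDK-level support would still be preferable.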