Box Enterprise Events API missing events

Answered

June 15, 2021 03:43

I have an application that polls the Box Enterprise Events API to monitor events for a customer. It makes a request every 10 minutes; each request extracts the next_stream_position cursor and passes it as the stream_position parameter of the subsequent request. The goal is to retrieve events in 10-minute intervals.
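
For concreteness, here is a minimal Python sketch of the polling cycle described above. The `fetch_page` callable is a hypothetical stand-in for an authenticated GET against the events endpoint (it is not part of any Box SDK), and stopping on an empty `entries` list is an assumption about how the end of the currently available stream is signalled.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical helper type: takes query parameters, returns the decoded JSON body.
FetchPage = Callable[[Dict[str, str]], dict]


def poll_once(fetch_page: FetchPage, stream_position: str) -> Tuple[List[dict], str]:
    """Collect every event currently available after stream_position.

    Returns the events plus the cursor to store for the next 10-minute run.
    """
    events: List[dict] = []
    while True:
        page = fetch_page({
            "stream_type": "admin_logs",
            "stream_position": stream_position,
        })
        entries = page.get("entries", [])
        events.extend(entries)
        # Carry the cursor forward exactly as the API returned it.
        stream_position = str(page["next_stream_position"])
        if not entries:  # assumed end-of-stream signal
            return events, stream_position
```

The scheduler would call `poll_once` every 10 minutes and persist the returned cursor between runs.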

One thing I noticed is that the Box Enterprise Events API is unreliable when used this way: a number of events were missed, i.e. never returned by the API. We confirmed this by doing the following:

- Look at all of the events our request cycle retrieved that have a timestamp dated yesterday (say Monday)

- Start a secondary API request cycle, using yesterday's date (Monday) as the starting point

- Compare the events retrieved by the secondary cycle that have a timestamp of Monday to the events from the primary cycle

- The result was that the secondary cycle retrieved events dated Monday that the primary cycle did not. It retrieved MORE events, meaning that some were missed by the primary cycle
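
The comparison in the steps above can be sketched as a set difference on event IDs. This is only an illustration: `missed_events` is a hypothetical helper, and it assumes each event is a dict carrying the `event_id` and `created_at` fields, with `created_at` as an ISO 8601 string whose date prefix is compared as-is (no time zone normalization).

```python
from typing import Iterable, List


def missed_events(primary: Iterable[dict], secondary: Iterable[dict], day: str) -> List[dict]:
    """Return events dated `day` that the secondary cycle saw but the primary did not.

    `day` is a date prefix such as "2021-06-14".
    """
    seen = {event["event_id"] for event in primary}
    return [
        event
        for event in secondary
        if event["created_at"].startswith(day) and event["event_id"] not in seen
    ]
```

A non-empty result for a day the primary cycle has already covered indicates events that arrived after the primary cursor had moved past them.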

The primary difference here is that the secondary request cycle looked for events a day after they occurred, while the primary request cycle looks for events that occurred in the previous 10 minutes. That said, the Box API uses stream_position, so I am confused about how some events could be missed even if there is some latency between when events occur and when they become available in the API. Should the stream_position not guarantee that the events are complete, even if they might come out of order?

Comments

3 comments

Official comment
Chase Lyall

June 15, 2021 20:20
Hi Doug,

The stream_position is a bit of a misnomer when stream_type = admin_logs (i.e. Enterprise Events). Specifically for Enterprise Events, stream_position is a cursor position in chronological event time (not ingestion or processing time), so it does not guarantee at-least-once delivery of late-arriving events to subscribers polling this API. Events are not dropped, but if Box ingests them after your polling window has passed, your request will not pull them.

Therefore, for Enterprise Events we recommend using the created_after and created_before parameters instead, and following one of two patterns: (1) when near-real-time latency is not necessary, wait to pull events so that late-arriving events have time to be ingested and made available; (2) when near-real-time latency is necessary, poll in near real time (every 1 to 10 minutes), and additionally poll every hour or every 24 hours to catch and backfill late-arriving events. Pattern (2) appears to be what you are doing with your primary and secondary polling cycles.

Events can arrive late because the user is working offline, because the user has a poor network connection, or because Box is experiencing peak load and a backlog in event ingestion.
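
As a rough illustration of requesting a fixed time window with created_after and created_before, here is a minimal Python sketch. `fetch_page` is a hypothetical stand-in for an authenticated GET against the events endpoint, and the paging logic (feeding next_stream_position back in until `entries` comes back empty) is an assumption rather than a statement of how Box requires it to be done.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List

# Hypothetical helper type: takes query parameters, returns the decoded JSON body.
FetchPage = Callable[[Dict[str, str]], dict]


def fetch_window(fetch_page: FetchPage, created_after: datetime,
                 created_before: datetime) -> List[dict]:
    """Fetch all admin_logs events whose event time falls inside the window."""
    params = {
        "stream_type": "admin_logs",
        "created_after": created_after.isoformat(),
        "created_before": created_before.isoformat(),
    }
    events: List[dict] = []
    while True:
        page = fetch_page(dict(params))  # copy so each call sees a fixed snapshot
        entries = page.get("entries", [])
        if not entries:  # assumed end-of-window signal
            return events
        events.extend(entries)
        params["stream_position"] = str(page["next_stream_position"])
```

Under pattern (1), the window would simply end some time in the past; under pattern (2), the same function would serve both the frequent cycle and the later re-poll of an already fetched window.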

We understand that this re-polling strategy to catch late events is painful, so we are currently developing a long-term fix that will ensure at-least-once delivery to downstream subscribers and greatly reduce the likelihood of Box experiencing an ingestion backlog. Look forward to more news about this improvement towards the end of this year.

Best regards,

Chase
Doug Braam

June 15, 2021 21:21

Edited
Hey Chase

Thank you for the quick and detailed response - I believe you confirmed our assumptions after seeing this behaviour.

Does Box provide any "expected" latency range for late arriving events? With your first pattern suggestion, how long of a delay off of "real-time" would you recommend? Any information on this would be much appreciated.

Chase Lyall

June 15, 2021 23:01
Sorry, but we do not have an SLA for the event API. Our aim under peak load is for ingestion latency to spike no higher than 5 minutes. However, in a few rare incidents that we have since worked to resolve and prevent, we have seen ingestion latency spike as high as 36 hours. In addition, events can be late naturally due to a user working offline. Therefore, we recommend your secondary API request cycle be 24 hours, but you can certainly adjust this based on your needs and tolerances.
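
One way to arrange a 24-hour secondary cycle is sketched below in Python. Both helpers are hypothetical, and the choice to re-request the 24-hour window that ended 24 hours ago is an assumption made for illustration, not a Box recommendation; the lag and width should be tuned to your own tolerances. Events are de-duplicated on `event_id`, so anything the primary cycle already stored is skipped.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple


def backfill_window(now: datetime,
                    lag: timedelta = timedelta(hours=24),
                    width: timedelta = timedelta(hours=24)) -> Tuple[datetime, datetime]:
    """Return (created_after, created_before) for a window that ended `lag` ago."""
    end = now - lag
    return end - width, end


def merge_new(store: Dict[str, dict], events: Iterable[dict]) -> List[dict]:
    """Add unseen events to `store` (keyed by event_id) and return only the new ones."""
    added: List[dict] = []
    for event in events:
        if event["event_id"] not in store:
            store[event["event_id"]] = event
            added.append(event)
    return added
```

The events returned by `merge_new` on a backfill run are the late arrivals that the primary cycle did not deliver.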
