You've put your trust in Box as a valued service provider and partner. You should also trust us to let you know when something is impacting the customer experience. With this in mind, we want to be certain to inform you about what's happening with and within the Box Services, whether planned maintenance or unexpected service degradations and outages.
Responding to Events Impacting the Customer Experience
When our Network Operations Center (NOC) detects a potential service issue, we promptly begin an investigation to identify the root cause and resolve the issue. One of our first steps is to determine the estimated or known customer impact, which lets us update the Box Status Site quickly.
Our goal is to update the Status Site within 60 minutes from the start of impact, though some event notifications may stretch beyond this goal. To be as efficient as possible, we follow a guideline* to decide what we post and when:
[Chart not reproduced: notification guideline matrix. Surviving row labels: Login/All Files Page; Admin Console: User Management; Admin Console: Automations.]
* This chart represents a guideline. Our response times and decisions to provide notification on the Status Site may not always exactly follow this matrix.
- Outage = Component/Service is unavailable and no workaround is available.
- Critical = Component/Service is severely degraded with monitoring indicating >50% failure rate or impact to successful throughput.
- Major = Component/Service is severely degraded with monitoring indicating 25-50% failure rate or impact to successful throughput.
- Minor = Component/Service is severely degraded with monitoring indicating <25% failure rate or impact to successful throughput.
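As an illustration only, the severity definitions above could be expressed in code as follows. This is a minimal sketch, not a description of Box's actual tooling: the names `Severity` and `classify_severity` and the parameters are hypothetical, and treating exactly 25% and exactly 50% as Major is an assumption based on the "25-50%" wording.

```python
from enum import Enum


class Severity(Enum):
    OUTAGE = "Outage"
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"


def classify_severity(failure_rate_pct: float,
                      unavailable: bool = False,
                      workaround_available: bool = False) -> Severity:
    """Map an observed failure rate (0-100) to a severity level.

    Assumes the function is only called once a degradation has been
    detected, so any rate below 25% is reported as Minor.
    """
    if not 0 <= failure_rate_pct <= 100:
        raise ValueError("failure_rate_pct must be between 0 and 100")
    # Outage: the component is unavailable and no workaround exists.
    if unavailable and not workaround_available:
        return Severity.OUTAGE
    if failure_rate_pct > 50:
        return Severity.CRITICAL
    # Assumption: the 25-50% band is inclusive at both ends.
    if failure_rate_pct >= 25:
        return Severity.MAJOR
    return Severity.MINOR
```

For example, `classify_severity(60)` falls in the Critical band, while `classify_severity(10, unavailable=True)` is an Outage because no workaround is available.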
As we investigate, we continually evaluate the customer impact against the above guidelines and adjust the impact/severity as we discover new information.
On the Box Status Site, our objective is to provide regular updates -- at least every 30-60 minutes or at the next significant change in status. These updates include any known impact to the customer experience, including the affected service(s) and approximate times of impact. Where feasible, we share additional information about potential workarounds, remediation progress/actions, and estimated time to recovery.
Stages of an Incident
With a few exceptions, most events on the Box Status Site follow a four-stage incident process:
- Investigating - Most events begin when we receive the initial notice of a disruption to the customer experience. The status may stay as "Investigating" while we determine what led to the current state and identify an action plan to restore availability/stability of the affected service(s).
- Identified - As soon as the proximate cause is understood, we move quickly to address the problem and to ensure that the fix prevents a recurrence.
- Monitoring - We move into this stage when we've completed remediation and analysis confirms that the affected service(s) have returned to expected levels. We may extend this stage for certain events to ensure the impact of the remediation is observed across multiple timelines and other criteria.
- Resolved - A designation that the customer experience has returned to expected levels based on the results observed during the monitoring period.
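The four stages above can be pictured as a simple linear progression. The sketch below is illustrative only: `Stage` and `advance` are hypothetical names, and the strictly linear order is an assumption, since some events are exceptions to the four-stage process.

```python
from enum import Enum


class Stage(Enum):
    INVESTIGATING = 1
    IDENTIFIED = 2
    MONITORING = 3
    RESOLVED = 4


def advance(stage: Stage) -> Stage:
    """Return the stage that follows `stage` in the usual progression.

    Assumption: stages advance one step at a time and never move
    backward; real incidents may not always follow this order.
    """
    order = list(Stage)  # Enum members iterate in definition order
    index = order.index(stage)
    if index == len(order) - 1:
        raise ValueError("a Resolved incident has no further stage")
    return order[index + 1]
```

Starting from `Stage.INVESTIGATING`, three calls to `advance` walk an incident through Identified and Monitoring to Resolved.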
Status of our Service(s) and Subcomponents
In concert with the above incident stages, we also do our best to identify the impact to a specific service or subcomponent. These states include:
- Operational (green dot) - Our services are online and functioning within expected norms. A service may be shown as operational while we are still actively investigating or nearing the final stages of an incident.
- Degraded (orange/yellow dot) - Our services are not performing at the levels we expect. Examples include higher than expected latency loading the Box Web Application or extended intervals to receive new events/changes in our desktop clients (Sync and Drive).
- Outage (red dot) - Our services may be experiencing a full or partial outage, preventing customers from completing their tasks or accessing critical components within Box.
Monitoring the Customer Experience
As a core part of our operations, we continuously measure ourselves against two primary metrics: the availability of our service and a more holistic customer experience measurement. Our Premier customers are most familiar with the former, also known as site uptime. We hold ourselves accountable to the latter to ensure we meet the most critical end-to-end needs of all our customers. Simply put, our tracking extends beyond SLA commitments.
To achieve these results, we monitor and protect the customer experience by using a variety of continuous monitoring and alerting tools. We have multiple checks on each individual Box server, a system of synthetic monitoring* agents, and a collection mechanism that together examine real-time user transactions. These checks give us leading indicators of issues or degradation that may be occurring but have yet to impact our customers. These checks measure the health and performance of the subcomponents on top of which Box runs its services, and enable us to prevent events from becoming incidents. The output from each check also feeds into a time-series database which enables us to see trends over time.
We also collect time-series data on real user transactions. This data captures actual Box users' interactions with all of our services on all of our hosts, and indicates what customers really see when they interact with Box. It tells us the total number of errors encountered with our services at any given point in time and is a more accurate measure of how many users are completing their tasks successfully.
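To make the idea of an error rate derived from real user transactions concrete, here is a minimal sketch of a rolling-window calculation. It is not Box's collection mechanism: the class `RollingErrorRate`, its methods, and the window length are all hypothetical, and it assumes transactions are recorded in non-decreasing timestamp order.

```python
from collections import deque


class RollingErrorRate:
    """Track the share of failed transactions over a recent time window."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        # Assumption: timestamps arrive in non-decreasing order.
        self._events.append((timestamp, is_error))

    def rate_pct(self, now: float) -> float:
        # Drop transactions that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return 100.0 * errors / len(self._events)
```

A percentage like this could then be compared against severity thresholds such as the 25% and 50% failure rates described earlier.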
*Synthetic monitoring is a form of website monitoring that uses scripted actions in an emulated web browser to imitate key customer journeys such as logins or shared link previews. Our synthetic clients execute a wide variety of checks at one-minute intervals from various internal and external points. This enables us to identify whether the Box service is responding appropriately to a given set of inputs, from a variety of geographic locations.
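The shape of a synthetic check can be sketched in a few lines. The example below is illustrative and does not drive a real browser: `CheckResult`, `run_check`, and `failure_rate_pct` are hypothetical names, and each "probe" is simply a callable that raises an exception when the scripted journey fails. A scheduler (not shown) would be assumed to invoke the checks at a fixed interval, such as once a minute.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_s: float
    error: str = ""


def run_check(name: str, probe: Callable[[], None]) -> CheckResult:
    """Run one scripted probe and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        probe()  # hypothetical scripted journey, e.g. a login flow
    except Exception as exc:
        return CheckResult(name, False, time.monotonic() - start, str(exc))
    return CheckResult(name, True, time.monotonic() - start)


def failure_rate_pct(results: Sequence[CheckResult]) -> float:
    """Percentage of checks in `results` that failed (0.0 if there are none)."""
    if not results:
        return 0.0
    failed = sum(1 for result in results if not result.ok)
    return 100.0 * failed / len(results)
```

Passing the probe in as a callable keeps the timing and error handling separate from the journey itself, so the same wrapper could be reused for logins, shared link previews, or any other scripted check.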
On occasion, we need to execute changes in our data center(s) or to specific services. While we never anticipate adverse user impact or downtime from these activities, we want to be transparent that this is taking place. Throughout the maintenance window and immediately afterward, we closely monitor the status of the Box Services. We share updates on the maintenance as well as any changes in status of our service(s) via the Box Status Site.
Root Cause and Defect Remediation
Immediately following remediation/stabilization of a customer-impacting event, our Engineering team begins a postmortem review. This exhaustive process examines all details, including the full event timeline, validation of the root cause, and the identified defect remediations and their owners. An output of this process is posted back to the related Status Site post as soon as it is available.