Over the last two days we experienced unplanned API downtime. We have identified the root causes and expect to have a resolution in place by next Monday. To prevent additional downtime until then, we need to disable our import process, which means no data will be added or updated until Monday.
The downtime was caused by a series of related issues. First, we imported a large amount of new data earlier this week, which significantly increased the size of our data and of the search indexes built on it. This in turn caused one of our disk drives to fill up, which corrupted some data and brought down one of our servers. When we tried to restart the server, it got stuck in a very slow disk check enforced by the RAID setup on the drives.
To prevent this from happening in the future, we are doing the following:
- Replacing the current disk drives with new, larger drives
- Removing the RAID setup on the server, which is unnecessary given our database setup
- Removing some corrupted data in the database and cleaning up the search indexes
These steps should be completed by Monday. In the meantime, the API is available to query and retrieve existing data. We are also exploring some longer-term solutions for even more reliability.
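For readers running similar import pipelines, the failure described above (a large import growing the data and its indexes until a drive fills) can be guarded against with a free-space check before each import. The following is a minimal sketch for illustration only, not our actual tooling: the function names, the `growth_factor` allowance for index growth, and the 20% free-space threshold are all assumptions.

```python
import shutil
from collections import namedtuple

# Same fields as the tuple returned by shutil.disk_usage(); handy for dry runs.
DiskUsage = namedtuple("DiskUsage", ["total", "used", "free"])


def has_headroom(usage, incoming_bytes, growth_factor=2.0, min_free_fraction=0.2):
    """Return True if an import of `incoming_bytes` should leave enough free space.

    growth_factor is an assumed multiplier covering index growth on top of
    the raw data; min_free_fraction is the assumed share of the drive that
    must remain free after the import.
    """
    projected = incoming_bytes * growth_factor
    free_after = usage.free - projected
    return free_after >= usage.total * min_free_fraction


def safe_to_import(path, incoming_bytes):
    """Check the drive that holds `path` before starting an import."""
    return has_headroom(shutil.disk_usage(path), incoming_bytes)
```

An import job could call `safe_to_import` with its data directory and the size of the incoming batch, and skip the batch with an alert when it returns False, rather than writing until the drive is full.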
We appreciate our customers’ patience as we address these issues. We are putting significant work into improving back-end reliability and performance throughout our entire stack.