Incident Management Performance Measures, Decentraland Analytics – Complete Report of Performance & Stability Issues


On September 9-10, many users experienced performance and stability issues in Decentraland. The incident affected the visibility of some avatars, messaging and voice chat functionality, and overall navigation of the platform. As transparency is a core value of the Decentraland Foundation, we want to explain why these issues happened and what will be done to avoid similar scenarios in the future.

What happened?

The Decentraland platform relies on a set of P2P communication services hosted on the Catalyst network. These services create links between players to share their avatars’ positions, profile changes, nearby chat messages, and even voice chat. A new version of these communication services was released earlier than planned, which made the services unstable and disrupted users’ experience. In response, the technical team reverted the services to the previous version, which resolved the issues users were experiencing.

Why did it happen?

A new version of Decentraland’s communication service (v3) was developed to decrease latency (the amount of time it takes to transfer data) and level up users’ in-world experience, clearing the way for new data-heavy interactions like multiplayer games or high-fidelity voice chat. If you want to dive deeper into these upcoming changes, check out the architecture decision record: ADR-70.

In preparation for the release of v3, the tech team had already performed a stress test with the new version, during which no issues were reported or identified. The plan was then to roll out the new version gradually in order to detect unknown issues early and mitigate them before they could affect the whole platform.

However, an unexpected change in a third-party vendor’s credentials forced the team to decide whether to delay the release of the new version by weeks or release early. Because the previous stress test had yielded no issues, and because the team wanted to deliver an improved experience to the Decentraland community as soon as possible, the team chose to release early. Unfortunately, the new version was not yet ready to carry the whole load of Decentraland’s communications, and users experienced performance and stability issues.
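To illustrate what a gradual rollout of this kind can look like, here is a minimal sketch in Python. It is not the Decentraland team's actual tooling: the function names, the 1% error threshold, and the 10-point step are all hypothetical, chosen only to show the idea of exposing a new version to a growing share of users and backing off when problems appear.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in [0, buckets).

    Hashing keeps the assignment deterministic, so the same user
    stays on the same version between sessions.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def use_new_version(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user should be served the new version."""
    return bucket(user_id) < rollout_percent


def next_rollout_percent(
    current: int,
    error_rate: float,
    max_error_rate: float = 0.01,  # hypothetical threshold
    step: int = 10,                # hypothetical step size
) -> int:
    """Widen the rollout when healthy, roll back to 0% otherwise."""
    if error_rate > max_error_rate:
        return 0
    return min(100, current + step)
```

Because a user's bucket never changes, raising the percentage only adds users to the new version and never moves anyone back and forth. A single unhealthy measurement sends everyone back to the previous version, which limits how many users an unknown issue can reach.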

Next steps

We apologize that this decision left the community unable to interact with friends and interfered with your plans and activities in Decentraland on those days. As Decentraland community members ourselves, we understand the frustration you went through and regret the choice that was made. We strive to strike the right balance between improving Decentraland as quickly as possible, using the innovative Web3 technologies being developed every day, and honoring the platform’s stability and scalability in a responsible way. It is a challenge, to be sure, but we are always looking for ways to improve our processes.

Some of the project’s contributors are taking a deep look at how the P2P communication networks are orchestrated, which may require rewriting some older parts of the code. The rollout of v3 will be planned again and better communicated, and we will ask for the community’s feedback and support to test and detect more scenarios in a controlled environment.

Finally, we would like to remind the community that every time an incident occurs on the Decentraland platform, a detailed analysis is performed and documented. If you want more insight into this incident, please visit the postmortem on GitHub.
