摘要:
Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics. This thesis presents an in-depth study of 123 real-world upgrade failures in 8 widely used distributed systems. Our study shows that upgrade failures are more critical and catastrophic than non-upgrade failures, with most of them caught after code release. Guided by our study, we have designed a testing framework DUPTester to expose upgrade failures. Testing 4 distributed systems with DUPTester revealed 20 previously unknown upgrade failures, many of which already fixed or confirmed by developers based on our reports.