Abstract:
Businesses today produce millions of data items every day. This is especially true in cloud environments, where much of this data takes the form of logs and metrics about the performance and status of the components in a cloud configuration. Keeping storage and retrieval efficient while customer data volumes grow is challenging, for two reasons. First, newer data tends to be accessed more frequently, while older data must be archived for future analysis. Second, keeping large amounts of data on fast storage is very costly. One approach to this problem is a tiered storage system, in which new data is allocated to faster storage tiers and older data is pushed to lower tiers with slower retrieval times. This thesis presents a fully online and configurable design and implementation of such a tiered storage system in a database management system (DBMS) [1, 2]. This has been difficult in the past because of two key constraints of the DBMS: the immutability of its columns and its lack of atomicity for sub-partition-level operations. Without atomicity, no mechanism guarantees that a tenant's data within a partition is moved or deleted completely, which can lead to indeterminate states that are difficult to identify and resolve. Because columns are immutable, data must be copied and inserted into other tiers, which raises the problem of duplicate data across tiers while a tenant is issuing queries. Although these constraints are the very optimizations that make this particular DBMS so performant for large analytical workloads, they are also the key features that must be redesigned to build this system. The proof of concept developed here meets these requirements with an ingestion rate of 1 TB per day, minimal overhead, and projected savings of about 70% per instance, which could amount to hundreds of thousands of dollars saved per month in large production installations.