DTC: A Dynamic Transaction Chopping Technique for Geo-Replicated Storage Services

Abstract

Replicating data across geo-distributed datacenters is usually necessary for large-scale cloud services to achieve high locality, durability, and availability. One of the major challenges in such geo-replicated data services lies in consistency maintenance, which usually suffers from long latency due to costly coordination across datacenters. Among existing approaches, transaction chopping is an effective and efficient way to address this challenge. However, existing chopping is conducted statically during programming, which is inflexible and burdensome for developers. In this article, we propose Dynamic Transaction Chopping (DTC), a novel technique that performs transaction chopping and determines piecewise execution in a dynamic and automatic way. DTC mainly consists of two parts: a dynamic chopper that divides transactions into pieces according to the data partition scheme, and a conflict detection algorithm that checks the safety of the dynamic chopping. Compared with existing techniques, DTC has several advantages: transparency to programmers, flexibility in conflict analysis, a high degree of piecewise execution, and adaptability to data partition schemes. A prototype of DTC is implemented to verify its correctness and evaluate its performance. The experimental results show that our DTC technique achieves much better performance than similar work.
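To make the mechanism concrete, the sketch below illustrates the two parts named above, assuming the classic chopping theory of Shasha et al.: a chopping is safe when the graph of sibling edges (between pieces of one transaction) and conflict edges (between pieces of different transactions) contains no cycle mixing both edge kinds. All names here (chop_by_partition, partition_of, the operation tuples) are hypothetical, and the safety test is a simplified, conservative pairwise check rather than a full SC-cycle search.

```python
from collections import defaultdict

def chop_by_partition(ops, partition_of):
    """Dynamic chopper: one piece per data partition touched by the transaction."""
    pieces = defaultdict(list)
    for kind, key in ops:                     # kind is "r" (read) or "w" (write)
        pieces[partition_of(key)].append((kind, key))
    return list(pieces.values())

def pieces_conflict(p, q):
    """Two pieces conflict if they access a common key and at least one writes it."""
    return any(k1 == k2 and "w" in (a1, a2)
               for a1, k1 in p for a2, k2 in q)

def chopping_is_safe(txns):
    """Conservative pairwise test: if two sibling pieces of one transaction both
    conflict with another transaction, the sibling edge plus the two conflict
    edges close a cycle mixing both edge kinds, so the chopping is rejected."""
    for t, pieces in enumerate(txns):
        for u, other in enumerate(txns):
            if u == t:
                continue
            conflicting = [i for i, p in enumerate(pieces)
                           if any(pieces_conflict(p, q) for q in other)]
            if len(conflicting) > 1:
                return False
    return True

# Example: keys below 100 live in partition 0, the rest in partition 1,
# so a cross-partition transfer chops into two pieces.
part = lambda key: 0 if key < 100 else 1
transfer = [("r", 1), ("w", 1), ("r", 200), ("w", 200)]
audit = [("r", 1), ("r", 200)]
txns = [chop_by_partition(transfer, part), chop_by_partition(audit, part)]
print(chopping_is_safe(txns))  # False: both transfer pieces conflict with audit
```

When the check fails, a conservative fallback is to execute the offending transaction as a single piece under cross-datacenter coordination, trading latency for safety.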

Similar Papers
  • Conference Article
  • 10.1109/srds.2016.026
DTC: A Dynamic Transaction Chopping Technique for Geo-replicated Storage Systems
  • Sep 1, 2016
  • Ning Huang + 2 more

Large Web applications usually require replicating data across geo-distributed datacenters to achieve high locality, durability, and availability. However, maintaining strong consistency in geo-replicated systems usually suffers from long latency due to costly coordination across datacenters. Among existing approaches, transaction chopping is an effective and efficient way to cope with this challenge. In this paper, we propose DTC (Dynamic Transaction Chopping), a novel technique that chops transactions and checks their conflicts in a dynamic and automatic way during application execution. DTC mainly consists of two parts: a dynamic chopper that chops transactions dynamically according to the data partition scheme, and a conflict detection algorithm for determining the safety of the dynamic chopping. Compared with the existing transaction chopping technique for geo-replicated systems, DTC has several advantages, including transparency to programmers, flexibility in conflict analysis, a high degree of piecewise execution, and adaptability to dynamic partition schemes. We implement our DTC technique and conduct experiments to examine its correctness and evaluate its performance. The experimental results show that DTC achieves considerably more piecewise execution than the existing chopping approach does, and clearly reduces execution time.

  • Research Article
  • Cited by 6
  • 10.1145/3014431
Cost-Optimized Microblog Distribution over Geo-Distributed Data Centers
  • Apr 20, 2017
  • ACM Transactions on Intelligent Systems and Technology
  • Han Hu + 3 more

The unprecedented growth of microblog services poses significant challenges in network traffic and service latency for the underlying infrastructure (i.e., geo-distributed data centers). Furthermore, the dynamic evolution of microblog status generates a huge workload for consistency maintenance. In this article, motivated by insights from cross-media analysis of propagation patterns, we propose a novel cache strategy for microblog service systems that reduces inter-data-center traffic and consistency maintenance cost while achieving low service latency. Specifically, we first present a microblog classification method, which utilizes external knowledge from correlated domains, to categorize microblogs. Then we conduct a large-scale measurement on a representative online social network system to study category-based propagation diversity on region and time scales. These insights illustrate common social habits in creating and consuming microblogs and further motivate our architecture design. Finally, we formulate the content cache problem as a constrained optimization problem. By jointly using the Lyapunov optimization framework and the simplex gradient method, we find the optimal online control strategy. Extensive trace-driven experiments further demonstrate that our algorithm reduces the system cost by 24.5% against traditional approaches with the same service latency.
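For readers unfamiliar with the framework invoked here, the generic drift-plus-penalty form of Lyapunov optimization is sketched below; the symbols are the framework's standard ones, not the paper's notation.

```latex
% Standard drift-plus-penalty form of Lyapunov optimization (generic
% symbols, not the paper's notation): Q_i(t) are virtual queues encoding
% the constraints, cost(t) the per-slot cost, and V > 0 trades cost
% optimality against queue stability.
\begin{aligned}
  L(t) &= \tfrac{1}{2} \sum_i Q_i(t)^2
      && \text{(Lyapunov function)} \\
  \Delta(t) &= \mathbb{E}\bigl[L(t+1) - L(t) \mid Q(t)\bigr]
      && \text{(conditional drift)} \\
  \text{each slot: } & \min \; \Delta(t) + V\, \mathbb{E}\bigl[\mathrm{cost}(t) \mid Q(t)\bigr]
      && \text{(drift-plus-penalty)}
\end{aligned}
```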

  • Conference Article
  • Cited by 3
  • 10.1109/hpcc.and.euc.2013.21
QoS-Aware Task Placement in Geo-distributed Data Centers with Low OPEX Using Dynamic Frequency Scaling
  • Nov 1, 2013
  • Lin Gu + 2 more

With the rising demand for cloud services, electricity consumption, the main operational expenditure (OPEX) for data center providers, has been increasing drastically. The geographical heterogeneity of electricity prices motivates us to study the task placement problem over geo-distributed data centers. We exploit the dynamic frequency scaling technique and formulate an optimization problem that minimizes OPEX while guaranteeing quality-of-service, i.e., the expected response time of tasks. The experimental results show that our proposal achieves much higher cost-efficiency than the traditional resizing scheme, i.e., activating/deactivating certain servers in data centers.

  • Research Article
  • Cited by 70
  • 10.1109/tc.2014.2349510
Optimal Task Placement with QoS Constraints in Geo-Distributed Data Centers Using DVFS
  • Jul 1, 2015
  • IEEE Transactions on Computers
  • Lin Gu + 4 more

With the rising demand for cloud services, electricity consumption, the main operational expenditure (OPEX) for data center providers, has been increasing drastically. The geographical heterogeneity of electricity prices motivates us to study the task placement problem over geo-distributed data centers. We exploit the dynamic frequency scaling technique and formulate an optimization problem that minimizes OPEX while guaranteeing quality-of-service, i.e., the expected response time of tasks. Furthermore, an optimal solution is derived for the formulated problem. The experimental results show that our proposal achieves much higher cost-efficiency than the traditional resizing scheme, i.e., activating/deactivating certain servers in data centers.
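As a rough illustration of the kind of formulation this paper and its conference version describe, an OPEX-minimizing placement with a response-time constraint under DVFS might be written as follows; all symbols, the power model, and the M/M/1 queueing assumption are illustrative guesses, not the authors' exact model.

```latex
% Generic sketch (illustrative symbols, not the authors' exact model):
%   lambda_d : task arrival rate placed at data center d
%   f_d      : CPU frequency chosen at d (dynamic frequency scaling)
%   e_d      : electricity price at d;  T : response-time bound
\begin{aligned}
  \min_{\lambda_d,\, f_d} \quad
    & \sum_d e_d \, P(f_d), \qquad P(f) \approx c_0 + c_1 f^{3}
      && \text{(power grows steeply with frequency)} \\
  \text{s.t.} \quad
    & \sum_d \lambda_d = \Lambda
      && \text{(all tasks placed)} \\
    & \frac{1}{\mu(f_d) - \lambda_d} \le T
      && \text{(M/M/1 expected response time, QoS)} \\
    & \mu(f_d) > \lambda_d \ge 0
      && \text{(queue stability)}
\end{aligned}
```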

  • Research Article
  • Cited by 4
  • 10.1109/tcc.2023.3280983
Placement of High Availability Geo-Distributed Data Centers in Emerging Economies
  • Jul 1, 2023
  • IEEE Transactions on Cloud Computing
  • Ruiyun Liu + 2 more

The data center markets in emerging economies are being built at a furious pace. When high availability is required, as it always is in the modern digital economy, the placement of geo-distributed data centers may be influenced by factors such as technician shortages and under-developed infrastructure, both of which are typical in emerging economies. Although data center availability in general has been well studied, it remains unclear how rapid and unbalanced economic development in emerging economies may affect the availability of geo-distributed data centers and their cost of ownership. In this paper, we incorporate the unbalanced availability of infrastructure and technicians into data center placement. The problem is first formulated as a mixed integer nonlinear program (MINLP). To solve this potentially large-scale problem, we transform it into a quadratically constrained quadratic program (QCQP) capable of handling heterogeneous workloads. The resulting problem can then be efficiently solved by off-the-shelf optimization toolboxes. With real-life data from China, we show how unbalanced development of infrastructure and technician shortages may affect the placement of data centers, and we analyze the tradeoff between cost and availability. Our results indicate that technician shortages and unbalanced network infrastructure lead to increased cost and distinct data center placement strategies.
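For reference, a quadratically constrained quadratic program, the form the authors transform their placement problem into, has the standard shape below; the paper's specific matrices and data are not reproduced here.

```latex
% Standard QCQP shape (what "QCQP" in the abstract refers to);
% P_i, q_i, r_i are generic problem data.
\begin{aligned}
  \min_{x} \quad & \tfrac{1}{2}\, x^{\top} P_0\, x + q_0^{\top} x \\
  \text{s.t.} \quad & \tfrac{1}{2}\, x^{\top} P_i\, x + q_i^{\top} x + r_i \le 0,
      \qquad i = 1, \dots, m
\end{aligned}
```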

  • Conference Article
  • Cited by 1
  • 10.1109/infcomw.2017.8116466
Calantha: Content distribution across geo-distributed datacenters
  • May 1, 2017
  • Yangyang Li + 4 more

Large cloud service providers often replicate data to multiple geographically distributed datacenters for availability and service quality purposes. The enormous amounts of data that must be shuffled among datacenters call for efficient schemes to maximally exploit the capacity of inter-datacenter networks. In this paper, we propose Calantha, a new rate allocation scheme that improves the reliability and operability of content distribution across geo-distributed datacenters without sacrificing capacity utilization or max-min fairness among competing sessions. Calantha leverages hop-constrained spanning trees to enhance the reliability of inter-datacenter links. A novel approximation algorithm is proposed to solve the rate allocation problem in polynomial time and achieve an α-optimal approximation. Our simulation results show that Calantha reduces the number of spanning trees by 44.5%, achieves 2.7% higher average capacity utilization, and requires 1.0% fewer minimum spanning tree calculations.
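As a small, hypothetical illustration of the hop-constraint idea (not Calantha's approximation algorithm): a BFS tree rooted at the distribution source has the minimum possible depth among spanning trees, so it satisfies any hop bound at least as large as the source's eccentricity. All names below are illustrative.

```python
from collections import deque

def bfs_spanning_tree(adj, root, max_hops):
    """Return the edges of a BFS spanning tree rooted at `root`, or None if
    the graph is disconnected or the tree's depth exceeds `max_hops`."""
    parent, depth = {root: None}, {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                queue.append(v)
    if len(parent) < len(adj) or max(depth.values()) > max_hops:
        return None
    return [(parent[v], v) for v in parent if parent[v] is not None]

# Four datacenters in a ring; a 2-hop bound is satisfiable from node 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_spanning_tree(adj, 0, 2))  # [(0, 1), (0, 2), (1, 3)]
```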

  • Research Article
  • Cited by 3
  • 10.3233/fi-2013-913
Complex Decision Systems and Conflicts Analysis Problem
  • Jan 1, 2013
  • Fundamenta Informaticae
  • Alicja Wakulicz-Deja + 2 more

This paper discusses issues related to conflict analysis methods and rough set theory in the process of global decision-making based on knowledge stored in several local knowledge bases. The value of rough set theory and conflict analysis applied in practical decision support systems with complex domain knowledge is demonstrated. Furthermore, examples of decision support systems with complex domain knowledge are presented. The paper proposes a new approach to the organizational structure of a multi-agent decision-making system that operates on the basis of dispersed knowledge. In the presented system, the local knowledge bases are combined into groups in a dynamic way. We seek to designate groups of local bases on which the test object is classified to the decision classes in a similar manner. Then, a process of eliminating knowledge inconsistencies is applied to the created groups. Global decisions are made using one of the methods for conflict analysis.

  • Conference Article
  • Cited by 8
  • 10.3850/9783981537079_0143
Exploiting CPU-Load and Data Correlations in Multi-Objective VM Placement for Geo-Distributed Data Centers
  • Jan 1, 2016
  • Ali Pahlevan + 2 more

Cloud computing has been proposed as a new paradigm to deliver services over the Internet. The proliferation of cloud services and increasing user demand for computing resources have led to the appearance of geo-distributed data centers (DCs). These DCs host heterogeneous applications with changing characteristics, such as CPU-load correlation, which offers significant potential for energy savings when the utilization peaks of two virtual machines (VMs) do not occur at the same time, and the amount of data exchanged between VMs, which directly impacts performance, i.e., response time. This paper presents a two-phase multi-objective VM placement, clustering, and allocation algorithm, along with a dynamic migration technique, for geo-distributed DCs coupled with renewable and battery energy sources. It exploits holistic knowledge of VM characteristics, namely CPU-load and data correlations, to tackle the challenges of operational cost optimization and the energy-performance trade-off. Experimental results demonstrate that the proposed method provides up to 55% operational cost savings, 15% energy savings, and 12% performance (response time) improvement compared to state-of-the-art schemes.
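The load-correlation idea can be illustrated with a toy pairing score: two VMs whose utilization peaks rarely coincide share a host with a smaller combined peak. This is a hypothetical sketch with synthetic traces, not the paper's placement algorithm.

```python
import numpy as np

def colocation_score(load_a, load_b, capacity=100.0):
    """Lower is better: penalize correlated peaks and joint capacity overflow."""
    corr = float(np.corrcoef(load_a, load_b)[0, 1])   # CPU-load correlation
    combined_peak = float(np.max(load_a + load_b))    # worst-case joint load
    overflow = max(0.0, combined_peak - capacity)
    return corr + overflow / capacity

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 288)                         # one day of 5-min samples
day_vm = 50 + 30 * np.sin(t) + rng.normal(0, 3, t.size)    # peaks during the day
night_vm = 50 - 30 * np.sin(t) + rng.normal(0, 3, t.size)  # peaks at night
print(colocation_score(day_vm, night_vm))  # strongly negative: a good pairing
print(colocation_score(day_vm, day_vm))    # about 1.7: a poor pairing
```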

  • Research Article
  • Cited by 31
  • 10.1109/tnet.2020.3027814
A Low-Cost Multi-Failure Resilient Replication Scheme for High-Data Availability in Cloud Storage
  • Dec 1, 2020
  • IEEE/ACM Transactions on Networking
  • Jinwei Liu + 6 more

Data availability is one of the most important performance factors in cloud storage systems. To enhance data availability, replication is a common approach to handle machine failures. However, previously proposed replication schemes cannot effectively handle both correlated and non-correlated machine failures, especially when increasing data availability with limited resources. The schemes for correlated machine failures create a constant number of replicas for each data object, which neglects diverse data popularities and does not utilize resources to maximize expected data availability. The previous schemes also neglect the consistency maintenance and storage costs caused by replication. It is critical for cloud providers to maximize data availability, and hence minimize SLA violations, while minimizing the costs caused by replication in order to maximize revenue. In this paper, we build a nonlinear integer programming model to maximize data availability under both types of failures while minimizing the cost caused by replication. Based on the model's solution for the replication degree of each data object, we propose a low-cost multi-failure (correlated and non-correlated machine failures) resilient replication scheme (MRR). MRR can effectively handle both correlated and non-correlated machine failures, considers data popularities to enhance data availability, and also tries to minimize consistency maintenance and storage costs. Extensive numerical results from trace parameters and experiments on real-world Amazon S3 demonstrate that MRR achieves high data availability, low data loss probability, and low consistency maintenance and storage costs compared to previous replication schemes.
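A toy calculation (not MRR's actual model) shows why correlated failures change the choice of replication degree: adding replicas suppresses independent machine failures geometrically, but replicas confined to a single correlated-failure domain gain nothing from that domain's outages. All probabilities below are invented for illustration.

```python
def availability(replicas, p_machine=0.01, p_domain=0.001, domains=1):
    """Toy model: an object is lost if every replica's machine fails
    independently, or if all `domains` failure domains hosting replicas
    fail together (domain failures assumed independent of machine ones)."""
    independent_loss = p_machine ** replicas
    correlated_loss = p_domain ** domains
    return 1 - (independent_loss + correlated_loss
                - independent_loss * correlated_loss)

for r in (1, 2, 3):
    same = availability(r, domains=1)            # all replicas in one domain
    spread = availability(r, domains=min(r, 3))  # replicas spread across domains
    print(f"r={r}: same-domain {same:.6f}, spread {spread:.9f}")
```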

  • Conference Article
  • Cited by 37
  • 10.1109/quatic.2012.17
A Runtime Quality Measurement Framework for Cloud Database Service Systems
  • Sep 1, 2012
  • Markus Klems + 2 more

Cloud database services promise high performance, high availability, and elastic scalability. The system that provides cloud database services must hence be designed and managed in a way that achieves these quality objectives. Two technology trends facilitate the design and management of cloud database service systems. First, distributed replicated database software, optimally designed for highly available and scalable Web applications, is offered as open source software. Second, the system can be deployed on cloud computing infrastructure to facilitate availability and scalability via on-demand provisioning of geo-located servers. We argue that a runtime quality measurement and analysis framework is necessary for the successful runtime management of cloud database service systems. Our framework offers three contributions over the state of the art: (i) the analysis of scaling strategies, (ii) the analysis of conflicts between contradictory objectives, and (iii) the analysis of the effect of system configuration changes on runtime performance and availability.

  • Conference Article
  • 10.1109/iccons.2017.8250645
Managing the geographically distributed datacenters
  • Jun 1, 2017
  • D.S Amitha Kumari + 1 more

In the field of Big Data analytics, collecting and analyzing large amounts of data to extract useful information is a major use of cloud services today. Traditionally, large volumes of data were gathered, stored, and analyzed in a single data center. As the volume of data has started increasing at a tremendous rate, a single data center can no longer handle such huge datasets efficiently from a performance standpoint. For better performance and availability, large cloud service providers can resolve this challenge by deploying multiple data centers geographically around the world [9]. Among the many approaches, the most widely used one for analytics of geographically distributed data is the centralized approach, in which data is gathered and stored in a main data center; however, this may lead to worse performance because it can consume a significant amount of bandwidth. To achieve optimal performance, a number of mechanisms have been proposed in which data analytics is performed over geo-distributed data centers.

  • Conference Article
  • Cited by 4
  • 10.1109/icc.2016.7511593
Minimizing cost of provisioning in fault-tolerant distributed data centers with durability constraints
  • May 1, 2016
  • Rakesh Tripathi + 2 more

Many popular e-commerce applications run on geo-distributed data centers requiring high availability. Fault-tolerant distributed data centers are designed by provisioning spare compute capacity to support the load of a failed data center, apart from ensuring data durability. The main challenge during the planning phase is how to provision spare capacity such that the total cost of ownership (TCO) is minimized. While the literature has handled spare capacity provisioning by minimizing the number of servers, variations in electricity cost and PUE corroborate the need to minimize the operating cost of capacity provisioning. We develop an MILP model for spare capacity provisioning in geo-distributed data centers with durability requirements, with the objective of minimizing TCO. We model variation in demand, fluctuation in electricity prices across locations, the cost of state replication, carbon taxes across different countries, and delay constraints while formulating the optimization model. Solving the model shows that TCO is reduced by leveraging electricity price variation and demand multiplexing. The proposed model outperforms the CDN model by 50% and the minimum-server model by 34%. The results also demonstrate the effects of power usage effectiveness (PUE), latency, the number of data centers, and demand on the TCO.
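A generic sketch of such a TCO-minimizing provisioning model is given below; the symbols and constraint shapes are illustrative assumptions, not the paper's MILP.

```latex
% Illustrative sketch (not the paper's model): x_d servers provisioned at
% data center d, e_d electricity price, PUE_d power usage effectiveness,
% tau_d carbon tax, y_{df} load shifted to d when site f fails, D_f demand.
\begin{aligned}
  \min_{x,\,y} \quad
    & \sum_d \bigl( C_{\mathrm{srv}}\, x_d
      + e_d\, \mathrm{PUE}_d\, P(x_d) + \tau_d\, x_d \bigr)
      && \text{(TCO: capital + energy + carbon)} \\
  \text{s.t.} \quad
    & \sum_{d \ne f} y_{df} \ge D_f \quad \forall f
      && \text{(survivors absorb a failed site's load)} \\
    & D_d + y_{df} \le \mathrm{cap}(x_d) \quad \forall f,\, d \ne f
      && \text{(capacity after failover)}
\end{aligned}
% Delay, durability (state replication), and demand-variation constraints
% enter the full model but are omitted in this sketch.
```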

  • Conference Article
  • Cited by 34
  • 10.1145/3377813.3381353
DeCaf
  • Jun 27, 2020
  • Chetan Bansal + 4 more

Large-scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs) baked into customer agreements that are tied to these KPIs. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. It is critical to minimize the time and manual effort of diagnosing and triaging such issues to reduce customer impact. The large volume of logs and the mixed types of attributes (categorical, continuous) in the logs make diagnosis of regressions non-trivial. In this paper, we present the design, implementation, and experience from building and deploying DeCaf, a system for automated diagnosis and triaging of KPI issues using service logs. It uses machine learning along with pattern mining to help service owners automatically root-cause and triage performance issues. We present the learnings and results from case studies on two large-scale cloud services at Microsoft, where DeCaf successfully diagnosed 10 known and 31 unknown issues. DeCaf also automatically triages the identified issues by leveraging historical data. Our key insights are that for any such diagnosis tool to be effective in practice, it should (a) scale to large volumes of service logs and attributes, (b) support different types of KPIs and ranking functions, and (c) be integrated into the DevOps processes.
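A much-simplified illustration of the log-based diagnosis idea (not DeCaf itself): score each categorical attribute value by how much the KPI degrades among matching log records, then rank the suspects. Records and names below are synthetic.

```python
from collections import defaultdict
from statistics import mean

logs = [  # (attributes, latency_ms): synthetic service log records
    ({"region": "eu", "build": "1.2"}, 120),
    ({"region": "us", "build": "1.2"}, 110),
    ({"region": "eu", "build": "1.3"}, 480),  # regression arrives with build 1.3
    ({"region": "us", "build": "1.3"}, 470),
    ({"region": "eu", "build": "1.2"}, 130),
]

def rank_suspects(records):
    """Rank (attribute, value) pairs by mean KPI lift over the global baseline."""
    baseline = mean(kpi for _, kpi in records)
    groups = defaultdict(list)
    for attrs, kpi in records:
        for key_value in attrs.items():
            groups[key_value].append(kpi)
    lifts = {kv: mean(vals) - baseline for kv, vals in groups.items()}
    return sorted(lifts.items(), key=lambda item: -item[1])

for (attr, value), lift in rank_suspects(logs)[:3]:
    print(f"{attr}={value}: {lift:+.0f} ms vs baseline")  # build=1.3 ranks first
```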

  • Research Article
  • Cited by 12
  • 10.1109/tst.2016.7442496
Wide area analytics for geographically distributed datacenters
  • Apr 1, 2016
  • Tsinghua Science and Technology
  • Siqi Ji + 1 more

Big Data analytics, the process of organizing and analyzing data to get useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As the volume of data grows at a tremendous rate, it is less efficient for only one datacenter to handle such large volumes of data from a performance point of view. Large cloud service providers are deploying datacenters geographically around the world for better performance and availability. A widely used approach for analytics of geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters to a central datacenter. However, it has been observed that this approach consumes a significant amount of bandwidth, leading to worse performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we present a survey on the representative mechanisms proposed in the literature for wide area analytics. We discuss basic ideas, present proposed architectures and mechanisms, and discuss several examples to illustrate existing work. We point out the limitations of these mechanisms, give comparisons, and conclude with our thoughts on future research directions.

  • Conference Article
  • Cited by 36
  • 10.1109/bigdataservice.2016.10
A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters
  • Mar 1, 2016
  • Sandipan Ganguly + 5 more

Large-scale cloud platforms can benefit from a service that runs a machine learning model to predict disk drive failures. Unlike previous studies in this space, we have combined multiple data inputs for the model and obtained better model performance compared to earlier published models. In this paper, we explain how we developed and deployed the predictive model in a large-scale cloud service. To build the model, we used a combination of two open data sources: Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T. or SMART) data and Windows performance counters. The nature of both these data sources is different and complex. The paper provides unique ways of parsing and transforming the data to make it best suited for a classification problem. Trials with different machine learning (ML) and statistical modeling techniques led us to the best-performing two-stage ensemble model. We implemented this model to be configurable so that it could be deployed on large-scale distributed cloud management systems and iterated on with minimal code impact. We provide a glimpse of the complex cloud hardware ecosystem and how a predictive model would impact such an ecosystem. Although our study focused on hard disk drives, we believe a similar modeling approach can apply to other hardware components as well. A successfully executed hard disk failure prediction model can pre-empt negative impact on client workloads and improve the economics of running a large-scale cloud service. We provide the details of our model as a possible template for future extensions and improvements towards building more robust hardware fault prediction services. Finally, we give a staged approach to operationalizing the model in large-scale cloud systems.
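A minimal sketch of a two-stage ensemble in the spirit described (not the paper's model): two diverse base classifiers produce failure-probability scores over synthetic stand-ins for SMART and performance-counter features, and a meta-classifier combines them. All features, parameters, and the labeling rule are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 12))  # synthetic stand-ins for SMART + perf counters
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 1, n) > 2.2).astype(int)  # rare failures
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stage 1: two diverse base learners produce failure-probability scores.
base = [RandomForestClassifier(n_estimators=200, random_state=0),
        GradientBoostingClassifier(random_state=0)]
scores_tr = np.column_stack([m.fit(X_tr, y_tr).predict_proba(X_tr)[:, 1] for m in base])
scores_te = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base])

# Stage 2: a simple meta-classifier combines the stage-1 scores.
meta = LogisticRegression().fit(scores_tr, y_tr)
print("held-out accuracy:", meta.score(scores_te, y_te))
```

A production pipeline would fit the meta-classifier on out-of-fold stage-1 scores rather than in-sample predictions to avoid leakage, and would evaluate with precision/recall given the heavy class imbalance.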
