Efficient management of cloud data lakes is important for improving big data processing in cloud computing environments. Cloud data lakes serve as repositories for vast amounts of structured, semi-structured, and unstructured data from varied sources, offering scalability and flexibility in data storage and processing. However, the sheer volume and variety of data make it difficult to optimize storage, access, and analysis. This study proposes a heuristic approach that addresses these issues by improving the organization and retrieval of data within cloud data lakes. The approach automatically manages data partitioning, storage, and caching strategies based on data access patterns and workload characteristics. By applying heuristic principles such as adaptive partitioning and predictive caching, the proposed method aims to reduce latency and improve overall query performance. A key component is a set of adaptive partitioning methods that dynamically adjust partition sizes based on access frequency and data distribution patterns. This keeps frequently accessed data readily available, while rarely accessed data is stored compactly to improve storage utilization. In addition, predictive caching methods anticipate future data access patterns using machine learning models or historical access trends. By placing the relevant datasets in memory or high-speed storage ahead of time, the system reduces latency for repeated queries and thereby improves real-time data processing. The effectiveness of the proposed approach is evaluated through simulations and benchmarks using real-world big data workloads. A comparison against standard static partitioning and caching methods shows significant improvements in query response times and resource utilization efficiency.