Abstract

Scientific computing has advanced in the ways it handles massive amounts of data, as data production capacities have increased significantly over the last decades. Most large science experiments require vast computing and data storage resources in order to provide results or predictions based on the data obtained. For scientific distributed computing systems with hundreds of petabytes of data and thousands of users, it is important to keep track not just of how data is distributed in the system, but also of individual users’ interests in the distributed data (that is, to reveal implicit interconnections between users and data objects). This, however, requires the collection and use of specific statistics, such as correlations between data distribution, the mechanics of data distribution, and, above all, user preferences. This work focuses on user activities (specifically, data usage) and interests in one such distributed computing system, PanDA (Production ANd Distributed Analysis system). PanDA is a high-performance workload management system originally designed to meet production and analysis requirements for a data-driven workload at the Large Hadron Collider Computing Grid for the ATLAS Experiment hosted at CERN (the European Organization for Nuclear Research). In this work we investigate whether data gathered in the past in PanDA shows any trends indicating that users have mutual interests that persist in their subsequent data usage (i.e., data usage patterns), using data mining techniques such as association analysis, sequential pattern mining, and the basics of the recommender system approach. We show that such common interests between users do exist and could therefore be used to provide recommendations (in terms of collaborative filtering) to help users with their data selection process.
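As a rough illustration of the association analysis mentioned above, the sketch below treats each user's set of accessed datasets as one transaction and derives pairwise rules with their support and confidence. The user names, dataset names, and the `pair_rules` helper are hypothetical and invented for this example; this is not PanDA data and not necessarily the procedure used in the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical usage records (invented for illustration): user -> datasets accessed.
usage = {
    "user_a": {"ds1", "ds2", "ds3"},
    "user_b": {"ds1", "ds2"},
    "user_c": {"ds2", "ds3"},
    "user_d": {"ds1", "ds2", "ds4"},
}


def pair_rules(transactions, min_support=0.5):
    """Return rules (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for items in transactions:
        item_counts.update(items)
        # Sort so that each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(items), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of users who accessed both datasets
        if support < min_support:
            continue
        # confidence(a -> b) = P(b accessed | a accessed)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


rules = pair_rules(list(usage.values()))
for antecedent, consequent, support, confidence in sorted(rules):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence, such as "users who accessed ds1 also accessed ds2", is the kind of shared interest that a recommendation could later build on.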

Highlights

  • Recommender System: A recommender system uses a set of machine learning/data mining processes that aim to guide users in a personalized way to interesting or useful items in a large space of possible options [3]

  • The Production and Distributed Analysis system (PanDA) [1] is a high-performance, pilot-based workload management system. This means that the workload is assigned based on feedback from successfully activated and validated pilot jobs, which are lightweight processes that probe the environment and act as “smart wrappers” for the payload

  • In PanDA, an independent subsystem manages the delivery of pilot jobs to all worker nodes via a number of well-known cluster and grid scheduling systems (e.g., Condor-G)
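The pilot-based model described in the highlights above can be sketched as a toy dispatcher: a pilot first validates its worker node, and only a validated pilot pulls a matching payload from the queue. All class names, fields, and the memory-based matching rule here are assumptions made for illustration; none of this is PanDA's actual API or scheduling logic.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    min_memory_gb: int  # hypothetical resource requirement


@dataclass
class Pilot:
    node: str
    memory_gb: int
    healthy: bool

    def validate(self) -> bool:
        # Stand-in for probing the environment on the worker node.
        return self.healthy


def dispatch(pilots, queue):
    """Assign at most one queued job to each pilot that passes validation."""
    assignments = {}
    for pilot in pilots:
        if not pilot.validate():
            continue  # no payload is sent to a node that failed its checks
        for job in list(queue):  # iterate over a copy so removal is safe
            if job.min_memory_gb <= pilot.memory_gb:
                queue.remove(job)
                assignments[pilot.node] = job.job_id
                break
    return assignments


queue = deque([Job(1, 8), Job(2, 2), Job(3, 8)])
pilots = [
    Pilot("node1", 4, True),
    Pilot("node2", 16, False),
    Pilot("node3", 16, True),
]
assignments = dispatch(pilots, queue)
print(assignments)
```

The point of the design is late binding: a payload is matched to a node only after a pilot has confirmed that the node is usable, so broken nodes consume pilots rather than real jobs.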

Summary

Recommender System

A recommender system uses a set of machine learning/data mining processes that aim to guide users in a personalized way to interesting or useful items in a large space of possible options [3]. The three related approaches compare as follows:

  • Information Retrieval: assists users to locate data

  • Information Filtering: filters out irrelevant items from a user’s information stream

  • Recommender System: highlights valuable items in a user’s information stream
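A minimal sketch of a user-based collaborative filtering recommender follows, assuming that only binary "user accessed dataset" information is available. The Jaccard similarity, the scoring rule, and all user and dataset names are illustrative choices for this example, not the method or data reported in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, usage, top_n=3):
    """Rank items the target user has not accessed by summed neighbour similarity."""
    seen = usage[target]
    scores = {}
    for other, items in usage.items():
        if other == target:
            continue
        sim = jaccard(seen, items)
        if sim == 0.0:
            continue  # users with no overlap contribute nothing
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical usage history (invented for illustration).
history = {
    "alice": {"d1", "d2"},
    "bob": {"d1", "d2", "d3"},
    "carol": {"d2", "d4"},
    "dave": {"d5"},
}
suggestions = recommend("alice", history)
print(suggestions)
```

Here "bob" overlaps more with "alice" than "carol" does, so the dataset that only "bob" accessed is ranked ahead of the one that only "carol" accessed, while "dave", who shares nothing with "alice", has no influence.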

Sequential Pattern Mining
Recommendation Simulation
User Activities
Findings
Conclusions