This study examines smoking patterns among youth across the states and union territories of India using the Global Youth Tobacco Survey (GYTS) dataset. The analysis applies three clustering algorithms (K-Means, DBSCAN, and Hierarchical Clustering) within a federated learning framework, which keeps sensitive public health data decentralized and private. Federated learning enables collaborative analysis across regions by sharing only model parameters rather than raw data. Differential privacy adds a further layer of protection: controlled noise is added to the model parameters, so that individual-level data is not exposed during the learning process. The study compares the performance of the three clustering algorithms and uses the resulting clusters to characterize regional smoking behaviors and the effectiveness of government anti-tobacco campaigns. These findings can guide public health authorities in designing and implementing campaigns tailored to the needs of specific regions. By combining federated learning and differential privacy, the study demonstrates a privacy-preserving approach to analyzing large-scale public health data and offers a blueprint for future health interventions and tobacco control strategies in India and beyond.
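To make the parameter-sharing idea concrete, the following is a minimal sketch of federated K-Means with Laplace noise, not the study's actual implementation. It assumes that each region (client) holds a NumPy array of features scaled to [0, 1], that clients share only noisy per-cluster sums and counts, and that neighboring datasets differ by adding or removing one record. The function names `local_update` and `federated_kmeans` and the parameters `epsilon` and `rounds` are hypothetical and chosen only for illustration.

```python
import numpy as np


def local_update(X, centroids, epsilon, rng):
    """One client's contribution: noisy per-cluster sums and counts.

    Assumes every feature in X lies in [0, 1]. Raw rows never leave
    this function; only the aggregated, noised statistics are returned.
    """
    k, d = centroids.shape
    # Assign each local point to its nearest global centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        members = X[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)

    # Adding or removing one record changes one cluster's sum by at most 1
    # per feature and one count by 1, so the L1 sensitivity is d + 1.
    scale = (d + 1) / epsilon
    sums += rng.laplace(0.0, scale, size=sums.shape)
    counts += rng.laplace(0.0, scale, size=counts.shape)
    return sums, counts


def federated_kmeans(clients, k, rounds=10, epsilon=1.0, seed=0):
    """Server loop: aggregate noisy statistics and update global centroids."""
    rng = np.random.default_rng(seed)
    d = clients[0].shape[1]
    centroids = rng.random((k, d))  # random initial centroids in [0, 1)

    for _ in range(rounds):
        total_sums = np.zeros((k, d))
        total_counts = np.zeros(k)
        for X in clients:
            s, c = local_update(X, centroids, epsilon, rng)
            total_sums += s
            total_counts += c

        # Update only clusters whose noisy count is large enough to divide by;
        # clip so the noisy centroids stay inside the feature range.
        ok = total_counts > 1.0
        centroids[ok] = np.clip(
            total_sums[ok] / total_counts[ok, None], 0.0, 1.0
        )
    return centroids
```

In this sketch each round spends `epsilon` of privacy budget per client, so under basic composition the total cost over all rounds is `rounds * epsilon`. DBSCAN and hierarchical clustering do not reduce to sums and counts in the same way, so they would need a different federated formulation than the one shown here.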