Human-robot collaboration is a promising solution for relieving construction workers of repetitive and physically demanding tasks, thereby improving construction safety and productivity. Many studies have developed deep learning models for human intention prediction, which forms the basis for proactive and adaptive robot planning and control in intelligent human-robot collaboration. However, two challenges remain. First, most research focuses on only a single type of human intention, lacking a holistic understanding of multi-level intention that spans both high-level intended actions and objects of interest and low-level body movements. Second, conventional deep learning approaches train a centralized model on aggregated datasets, which requires sharing sensitive information (e.g., personal images and behavior data) and poses broad privacy concerns in practical implementation. This study proposes a vision-based multi-task federated learning (FL) framework, FedHIP, for multi-level human intention prediction in human-robot collaborative assembly tasks. Specifically, taking body movements and assembly components as inputs, a long short-term memory (LSTM)-based multi-task learning model was developed to simultaneously predict multi-level human intention in assembly tasks. FL was employed to train the model in a distributed, privacy-preserving manner on local clients without transmitting sensitive data. The results show that the proposed FedHIP, without and with pre-training, achieves an accuracy of 80.1% and 85.7% in action prediction, 97.6% and 97.8% in object prediction, and an average displacement error of 12.7 and 11.7 pixels in motion prediction, respectively. Models trained with FedHIP were also compared with those obtained from traditional centralized training and local training. FL achieved accuracy comparable to centralized training and substantially higher accuracy than local training while preserving data privacy.
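To make the two core ideas concrete, the sketch below illustrates (i) a shared LSTM encoder with three task heads, one each for action prediction, object prediction, and motion prediction, and (ii) weighted parameter averaging for server-side FL aggregation. The abstract does not specify the exact architecture, framework, or aggregation rule, so everything here is an assumption for illustration: PyTorch, a single-layer LSTM, all layer names and dimensions (`input_dim`, `hidden_dim`, `num_actions`, etc.), and FedAvg-style averaging are hypothetical choices, not the authors' implementation.

```python
# Minimal illustrative sketch, NOT the paper's implementation.
# Assumptions: PyTorch, single-layer LSTM encoder, FedAvg-style aggregation;
# all dimensions and names below are hypothetical.
import copy
import torch
import torch.nn as nn

class MultiTaskIntentionNet(nn.Module):
    """Shared temporal encoder with three heads for multi-level intention."""
    def __init__(self, input_dim=64, hidden_dim=128,
                 num_actions=10, num_objects=8, motion_dim=2):
        super().__init__()
        # Shared encoder over sequences of body-movement / component features
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Task-specific heads
        self.action_head = nn.Linear(hidden_dim, num_actions)  # high-level action
        self.object_head = nn.Linear(hidden_dim, num_objects)  # object of interest
        self.motion_head = nn.Linear(hidden_dim, motion_dim)   # next 2-D position (pixels)

    def forward(self, x):                 # x: (batch, time, input_dim)
        _, (h, _) = self.encoder(x)       # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.action_head(h), self.object_head(h), self.motion_head(h)

def fedavg(global_model, client_models, client_sizes):
    """Replace global parameters with a size-weighted average of client parameters."""
    total = sum(client_sizes)
    avg_state = copy.deepcopy(global_model.state_dict())
    for key in avg_state:
        avg_state[key] = sum(
            m.state_dict()[key].float() * (n / total)
            for m, n in zip(client_models, client_sizes)
        )
    global_model.load_state_dict(avg_state)
    return global_model
```

In this sketch, each client would train its local copy with a combined loss (e.g., cross-entropy for the two classification heads plus a regression loss for the motion head) and only model parameters, never raw images or behavior data, would be sent to the server for averaging, which is what enables the privacy-preserving training described in the abstract.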