Inferring object-wise human attention in 3D space from a third-person perspective (e.g., a camera) is crucial to many visual tasks and applications, including human-robot collaboration, autonomous driving, etc. Classical human attention estimation fails when the human's eyes are not visible to the camera, when the gaze point falls outside the camera's field of view, or when the gazed object is occluded by others in the 3D space. In such cases, blind 3D human attention inference brings a new paradigm to the community. In this paper, we address these challenges by proposing a scene-behavior associated mechanism, in which both the 3D scene and the temporal behavior of the human are exploited to infer object-wise attention and its transitions. Specifically, a point cloud is reconstructed and used as the spatial representation of the 3D scene, which helps handle the blindness inherent in a single camera viewpoint. On this basis, to infer attention without eye information, we propose a Sequential Skeleton Based Attention Network (S2BAN) for behavior-based attention modeling. Embedded in the scene-behavior associated mechanism, the proposed S2BAN is built on a Long Short-Term Memory (LSTM) temporal architecture. Our network employs the human skeleton as the behavior representation and maps it to an attention direction frame by frame, casting attention inference as a temporally correlated problem. With the help of S2BAN, the 3D gaze spot, and further the attended objects, can be obtained frame by frame via ray intersection and segmentation on the previously reconstructed point cloud. Finally, we conduct experiments from various aspects to verify the object-wise attention localization accuracy and the angular error of the estimated attention direction, as well as to collect subjective results. The experimental results show that the proposed method outperforms other competitors.
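
To make the skeleton-to-direction mapping concrete, the following is a minimal sketch of an LSTM-based regressor in the spirit of S2BAN. The joint count (25, as in Kinect-style skeletons), hidden size, and single-layer design are assumptions for illustration; the paper's exact architecture is not specified here.

```python
# Minimal sketch: per-frame skeleton -> unit attention direction via LSTM.
# Assumptions (not from the paper): 25 joints, hidden size 256, one LSTM layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonAttentionLSTM(nn.Module):
    def __init__(self, num_joints=25, hidden_size=256):
        super().__init__()
        # Each frame is a flattened vector of 3D joint coordinates.
        self.lstm = nn.LSTM(input_size=num_joints * 3,
                            hidden_size=hidden_size,
                            batch_first=True)
        # Regress a 3D attention direction for every frame.
        self.head = nn.Linear(hidden_size, 3)

    def forward(self, skeletons):
        # skeletons: (batch, time, num_joints * 3)
        features, _ = self.lstm(skeletons)
        directions = self.head(features)      # (batch, time, 3)
        # Normalize so each output is a unit gaze-direction vector.
        return F.normalize(directions, dim=-1)

model = SkeletonAttentionLSTM()
seq = torch.randn(2, 30, 25 * 3)              # 2 clips, 30 frames each
dirs = model(seq)                             # per-frame unit directions
```

Treating the sequence with an LSTM, rather than regressing each frame independently, is what makes attention inference temporally correlated: the hidden state carries behavioral context across frames.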
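
The gaze-spot step can likewise be sketched. The function below is a hypothetical approximation of the ray-point-cloud intersection: it selects the cloud point nearest to the predicted attention ray (within a threshold) and first hit along it. The function name, threshold, and this particular intersection criterion are illustrative assumptions, not the paper's stated procedure.

```python
# Hypothetical sketch: approximate the 3D gaze spot as the point-cloud point
# closest to the attention ray and first encountered along it.
import numpy as np

def gaze_spot(points, origin, direction, max_dist=0.05):
    """points: (N, 3) point cloud; origin: (3,) head/eye position;
    direction: (3,) unit attention direction; max_dist: ray-hit tolerance (m)."""
    rel = points - origin                     # vectors from origin to each point
    t = rel @ direction                       # signed distance along the ray
    in_front = t > 0                          # ignore points behind the person
    # Perpendicular distance of each point to the ray.
    perp = np.linalg.norm(rel - np.outer(t, direction), axis=1)
    candidates = in_front & (perp < max_dist)
    if not candidates.any():
        return None                           # ray misses the reconstructed scene
    # Among near-ray points, take the one hit first along the ray.
    idx = np.where(candidates)[0][np.argmin(t[candidates])]
    return points[idx]
```

Once the gaze spot is located, segmenting the surrounding points yields the attended object for that frame, and repeating this per frame exposes attention transitions over time.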