This paper proposes a framework for industrial and collaborative robot programming based on the integration of hand gestures and poses. The framework allows operators to control the robot via both End-Effector (EE) and joint movements and to transfer compound shapes to the robot accurately. Seventeen hand gestures, covering position and orientation control of the robotic EE as well as auxiliary operations, are designed based on principles of cognitive psychology. The gestures are classified by a deep neural network that is pre-trained for two-hand pose estimation and fine-tuned on a custom dataset, achieving a test accuracy of 99%. The index finger’s pointing direction and the hand’s orientation are extracted via 3D hand pose estimation to indicate the robotic EE’s moving direction and orientation, respectively. The number of stretched fingers is detected via two-hand pose estimation to represent decimal digits for selecting robot joints and inputting numbers. Finally, we seamlessly integrate these three interaction modes into a single programming framework. We conduct two interaction experiments. The reaction time of the proposed hand gestures in indicating randomly given instructions is significantly shorter than that of other gesture sets, such as American Sign Language (ASL). The accuracy of our method in compound shape reconstruction is substantially higher than that of methods based on hand movement trajectories, and the operating time is comparable to that of teach pendants.
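As a concrete illustration of the pose-based inputs summarized above, the following Python sketch shows one way the index finger’s pointing direction and a two-hand stretched-finger count could be derived from estimated 3D hand keypoints. The 21-keypoint layout, joint indices, straightness threshold, and function names are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

# Assumed 21-keypoint hand layout (wrist = 0, then 4 joints per finger),
# as used by common hand pose estimators; indices are illustrative only.
WRIST = 0
FINGER_JOINTS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "pinky":  [17, 18, 19, 20],
}

def index_pointing_direction(keypoints_3d: np.ndarray) -> np.ndarray:
    """Unit vector from the index-finger base joint to its tip,
    used here as a stand-in for the EE moving direction."""
    base, tip = keypoints_3d[5], keypoints_3d[8]
    direction = tip - base
    return direction / np.linalg.norm(direction)

def count_stretched_fingers(keypoints_3d: np.ndarray,
                            straightness_thresh: float = 0.9) -> int:
    """Count fingers whose bone segments are roughly collinear with the
    base-to-tip axis; the threshold is a guess and would need calibration."""
    count = 0
    for joints in FINGER_JOINTS.values():
        base, tip = keypoints_3d[joints[0]], keypoints_3d[joints[-1]]
        axis = tip - base
        axis /= np.linalg.norm(axis)
        # A finger is "stretched" if every segment aligns with the axis.
        segments = np.diff(keypoints_3d[joints], axis=0)
        cosines = segments @ axis / np.linalg.norm(segments, axis=1)
        if np.all(cosines > straightness_thresh):
            count += 1
    return count

def two_hand_digit(left_kp: np.ndarray, right_kp: np.ndarray) -> int:
    """Combine both hands' stretched-finger counts into a single number (0-10),
    one possible encoding for joint selection and numeric input."""
    return count_stretched_fingers(left_kp) + count_stretched_fingers(right_kp)

if __name__ == "__main__":
    # Random keypoints only demonstrate the call pattern; real input would
    # come from the two-hand pose estimation network.
    rng = np.random.default_rng(0)
    left, right = rng.normal(size=(21, 3)), rng.normal(size=(21, 3))
    print(index_pointing_direction(right))
    print(two_hand_digit(left, right))
```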