Pose estimation methods for robotics should return a distribution over poses rather than a single pose estimate. Motivated by this, in this work we investigate multi-modal pose representations for reliable 6-DoF object tracking. We propose a neural network architecture for simultaneous object segmentation and estimation of the object's fiducial points in RGB images. Given an a priori probability distribution of object poses, a particle filter is employed to estimate the posterior probability distribution. The samples representing this distribution are weighted by an observation model that matches the projected 3D model against the segmented object using a distance transform-based object representation. The object pose determined by the PnP algorithm is then injected into the distribution by replacing the particle with the smallest weight. Next, a k-means++ algorithm determines the modes of the multi-modal probability distribution, and a multi-swarm particle swarm optimization refines these modes. A subset of particles for final pose optimization is selected through a multi-criteria analysis with the TOPSIS algorithm, which evaluates the particles against conflicting criteria derived from the object keypoints, the segmented object, and the distance transform. On the challenging YCB-Video dataset, the proposed method outperforms recent algorithms for both object pose estimation and object pose tracking.
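To illustrate the distance transform-based observation model described above, the following is a minimal sketch, not the authors' implementation: the function name `particle_weights`, the contour-based matching cost, and the Gaussian weighting with bandwidth `sigma` are assumptions made for illustration. Each particle's projected model contour is scored by its mean distance-transform value over the segmentation mask.

```python
# Hedged sketch: weighting pose particles with a distance-transform
# observation model (illustrative only; names and weighting are assumed).
import numpy as np
from scipy.ndimage import distance_transform_edt

def particle_weights(particle_contours, segmentation_mask, sigma=5.0):
    """Weight pose particles by how well each projected 3D model contour
    matches the segmented object.

    particle_contours: list of (N_i, 2) integer (x, y) pixel coordinates,
        one array per particle, obtained by projecting the 3D model
        contour under that particle's pose hypothesis (projection not
        shown here).
    segmentation_mask: (H, W) boolean array from the segmentation network.
    """
    # Distance transform: every pixel stores the Euclidean distance to
    # the nearest foreground (object) pixel of the segmentation mask.
    dt = distance_transform_edt(~segmentation_mask)

    weights = np.empty(len(particle_contours))
    for i, contour in enumerate(particle_contours):
        rows = np.clip(contour[:, 1], 0, dt.shape[0] - 1)
        cols = np.clip(contour[:, 0], 0, dt.shape[1] - 1)
        # Mean distance of the projected contour to the segmented object;
        # a small cost indicates a good pose hypothesis.
        cost = dt[rows, cols].mean()
        weights[i] = np.exp(-cost**2 / (2.0 * sigma**2))

    s = weights.sum()
    return weights / s if s > 0 else np.full(len(weights), 1.0 / len(weights))
```

The lowest-weight particle returned by such a model is the natural candidate to be replaced by the PnP-based pose hypothesis mentioned in the abstract.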
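The TOPSIS-based selection step can likewise be sketched in a generic form. The sketch below follows the standard TOPSIS procedure (vector normalization, ideal and anti-ideal points, closeness coefficient); the specific criteria, their weights, and the function name `topsis_rank` are assumptions, since the abstract only states that the criteria derive from the keypoints, the segmented object, and the distance transform.

```python
# Hedged sketch of standard TOPSIS ranking for particle selection
# (illustrative; criteria and weights are placeholders).
import numpy as np

def topsis_rank(scores, weights, benefit):
    """Rank alternatives (particles) by TOPSIS closeness to the ideal.

    scores:  (n_particles, n_criteria) matrix, e.g. keypoint reprojection
             error, segmentation overlap, distance-transform cost.
    weights: (n_criteria,) importance weights, summing to 1.
    benefit: (n_criteria,) booleans; True if larger values are better.
    """
    # Vector-normalize each criterion column, then apply the weights.
    norm = scores / (np.linalg.norm(scores, axis=0, keepdims=True) + 1e-12)
    v = norm * weights

    # Ideal and anti-ideal points per criterion.
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))

    # Closeness coefficient: relative distance to the anti-ideal point.
    d_ideal = np.linalg.norm(v - ideal, axis=1)
    d_anti = np.linalg.norm(v - anti, axis=1)
    closeness = d_anti / (d_ideal + d_anti + 1e-12)

    # Higher closeness means a better particle; return indices best-first.
    return np.argsort(-closeness)
```

In this setting, the top-ranked particles would form the subset passed to the final pose optimization stage.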