Augmented reality supports remote collaboration by presenting the collaborator’s avatar in the user’s physical environment, which facilitates the experience of co-presence. While visual user representations are continuously researched and advanced, the audio configuration, especially in combination with different visualizations, is rarely considered. In a user study (n = 48, 24 dyads), we evaluate the combination of two visual (Simple vs. Rich Avatar) and two auditory (Mono vs. Spatial Audio) user representations to investigate their impact on users’ overall experience, performance, and perceived social presence during collaboration. Our results show a preference for rich auditory and visual user representations, as Spatial Audio supports the completion of parallel tasks and the Rich Avatar positively influences user experience. However, the Simple Avatar draws less attention, which potentially benefits task efficiency and argues for simpler visualizations in performance-oriented settings. Our findings contribute to a deeper understanding of how combinations of visual and auditory user representations impact remote collaboration in augmented reality.