Existing augmented reality (AR) assembly systems mainly provide visual instructions to operators from a first-person perspective, making it hard for co-located workers on the shop floor to share their individual working intents, especially in large-scale product assembly tasks that require multiple operators to work together. To bridge this gap in practical deployments, this paper proposes Co2iAR, a co-located, audio-visual-enabled mobile collaborative AR assembly system. First, a stereo visual-inertial fusion strategy achieves robust and accurate self-contained motion tracking on the resource-constrained mobile AR platform, followed by co-located alignment of multiple mobile AR clients on the shop floor. Then, a lightweight text-aware network for online wiring-harness character recognition is proposed, together with an audio-based confirmation strategy, enabling natural audio-visual interaction among co-located workers within a shared immersive workplace; the system also monitors the current wiring assembly status and activates step-by-step tutorials automatically. The novelty of this work lies in deploying audio-visual-aware interaction on the same device that delivers the co-located collaborative AR work instructions, thereby establishing shared operating intents among multiple co-located workers. Finally, comprehensive experiments on collaborative performance among multiple AR clients show that the proposed Co2iAR alleviates cognitive load and achieves superior performance in co-located AR assembly tasks, providing a more human-centric collaborative assembly experience.
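To give a rough sense of the co-located alignment step mentioned above, the following is a minimal sketch, not the paper's actual method: it assumes two clients each track their pose in a private world frame and both observe a shared physical anchor (e.g., a fiducial on the workbench), from which a frame-to-frame transform can be composed. All names (`T_Wa_M`, `T_Wb_M`, `align_frames`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def align_frames(T_Wa_M, T_Wb_M):
    """Map client B's world frame into client A's world frame.

    Assumed setup: both clients observe the same anchor M, giving the anchor's
    pose in each client's private world frame (T_Wa_M, T_Wb_M). Then
    T_Wa_Wb = T_Wa_M @ inv(T_Wb_M) expresses B's content in A's frame.
    """
    return T_Wa_M @ np.linalg.inv(T_Wb_M)

if __name__ == "__main__":
    # Toy example: anchor 1 m in front of client A, slightly offset from client B.
    T_Wa_M = make_pose(np.eye(3), np.array([0.0, 0.0, 1.0]))
    T_Wb_M = make_pose(np.eye(3), np.array([-0.5, 0.0, 0.8]))
    T_Wa_Wb = align_frames(T_Wa_M, T_Wb_M)

    # A virtual instruction placed at the origin of B's frame, re-expressed in A's frame.
    p_b = np.array([0.0, 0.0, 0.0, 1.0])
    print(T_Wa_Wb @ p_b)
```

In practice the shared frame in such systems is typically estimated jointly with tracking (e.g., via the visual-inertial pipeline) rather than from a single anchor observation; the sketch only illustrates the coordinate-frame composition that any such alignment ultimately provides.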