We investigated a novel AI-assisted automated 'immersive' audio panning system designed to track audio-related objects within a video clip. The system comprises four sequential steps: Object Tracking, Stage Dimension Estimation, XY-Coordinate Calculation, and Object Audio Rendering. To overcome the challenges posed by rapid and frequent movement of target objects, the system employs a pre-trained object-tracking model and integrates depth information, ensuring stability in the subsequent tasks. In addition, we introduce a stage-size-aware model that estimates stage dimensions, trained on our manually collected dataset of (Image, Width, Depth) tuples. The system then computes XY-coordinate pairs that serve as panning values for conventional audio mixers or decoders, enabling immersive audio reproduction. We anticipate that this video- and space-aware automatic panning system will be valuable for the rapid production of new media.
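The last two steps of such a pipeline can be illustrated with a minimal sketch: mapping a tracked object's normalized frame position onto estimated stage dimensions, then converting the resulting X coordinate into stereo gains. The function names, the constant-power pan law, and all numeric values here are illustrative assumptions, not the system's actual rendering method:

```python
import math

def to_stage_coords(cx_norm, depth_norm, stage_width, stage_depth):
    """Map a normalized object position (e.g., from a tracker) onto the stage plane.

    Hypothetical helper: cx_norm is the horizontal center of the object's
    bounding box in [0, 1]; depth_norm is a normalized depth estimate in [0, 1].
    """
    x = (cx_norm - 0.5) * stage_width   # centered left-right position (meters)
    y = depth_norm * stage_depth        # distance from the stage front (meters)
    return x, y

def constant_power_pan(x, stage_width):
    """Convert a stage X coordinate to (left, right) gains via a constant-power pan law."""
    # p in [0, 1]: 0 = hard left, 1 = hard right
    p = min(max(x / stage_width + 0.5, 0.0), 1.0)
    theta = p * math.pi / 2
    return math.cos(theta), math.sin(theta)

# Example: an object at the horizontal center of the frame, mid-depth,
# on an estimated 10 m x 6 m stage (assumed values).
x, y = to_stage_coords(0.5, 0.5, stage_width=10.0, stage_depth=6.0)
gl, gr = constant_power_pan(x, stage_width=10.0)  # equal gains at center
```

A conventional mixer or decoder would consume the XY pair (or the derived gains) per frame; the constant-power law keeps perceived loudness roughly constant as the object moves laterally.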