Despite the widespread availability of HDR (High Dynamic Range) display devices, the majority of video sources are still stored in SDR (Standard Dynamic Range) format and at low resolution, creating an urgent need for UHD (Ultra High Definition) video reconstruction. A straightforward solution is to divide the task into two sub-tasks: video super-resolution (SR) and video bit-depth enhancement (BDE). While SR-centered joint enhancement has been explored along several dimensions, the quantization (bit-depth) dimension has often been overlooked. In this paper, we address the joint task of BDE and SR for the first time and propose a compact network called MSTG. We conduct a systematic analysis to explain how joint enhancement can be achieved and to overcome the core challenge: distortions exacerbated by the differing optimization directions of BDE and SR, which we attribute to contours that are difficult to distinguish from detailed textures. To tackle this, we employ a dual-branch module that leverages both gradient and image information to separate the cues relevant to BDE from those relevant to SR. Furthermore, we propose a novel Cross-scale Transformer (CTF) embedded with Cross-scale Attention (CSA), which facilitates information aggregation while adaptively exploiting structural-similarity features. It jointly extracts, aligns, and fuses frame features at multiple scales, forming the MST module. Experiments demonstrate that the proposed algorithm outperforms cascaded video enhancement methods by large margins on both synthetic and realistic datasets.
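To make the Cross-scale Attention (CSA) idea concrete, the following is a minimal PyTorch sketch of one plausible form of cross-scale attention, in which queries come from a fine-scale feature map and keys/values come from a coarser scale of the same frame. The module name, parameters, and the exact aggregation scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a cross-scale attention block (not the paper's code).
# Each fine-scale position attends to tokens from a coarser view of the same
# frame, so structurally similar information can be aggregated across scales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (channels // num_heads) ** -0.5
        self.to_q = nn.Conv2d(channels, channels, 1)        # queries from the fine scale
        self.to_kv = nn.Conv2d(channels, channels * 2, 1)   # keys/values from the coarse scale
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fine.shape
        q = self.to_q(fine)                                  # (B, C, H, W)
        k, v = self.to_kv(coarse).chunk(2, dim=1)            # (B, C, h', w') each

        def to_heads(x):
            # (B, C, H, W) -> (B, heads, H*W, C/heads)
            return x.flatten(2).reshape(b, self.num_heads, c // self.num_heads, -1).transpose(-1, -2)

        q, k, v = to_heads(q), to_heads(k), to_heads(v)
        attn = (q @ k.transpose(-1, -2)) * self.scale        # fine tokens attend to coarse tokens
        out = attn.softmax(dim=-1) @ v                       # (B, heads, H*W, C/heads)
        out = out.transpose(-1, -2).reshape(b, c, h, w)
        return self.proj(out) + fine                         # residual back onto the fine scale


if __name__ == "__main__":
    csa = CrossScaleAttention(channels=32)
    fine = torch.randn(1, 32, 64, 64)
    coarse = F.interpolate(fine, scale_factor=0.5, mode="bilinear", align_corners=False)
    print(csa(fine, coarse).shape)  # torch.Size([1, 32, 64, 64])
```

In this sketch the coarse scale is obtained by simple bilinear downsampling; how MSTG actually builds and fuses its multi-scale features within the MST module is described in the method section.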