The explosive growth of users and applications in IoT environments has promoted the development of cloud computing. In the cloud computing environment, task scheduling plays a crucial role in optimizing resource utilization and improving overall performance. However, effective task scheduling remains a key challenge. Traditional task scheduling algorithms often rely on static heuristics or manual configuration, limiting their adaptability and efficiency. To overcome these limitations, there is increasing interest in applying reinforcement learning techniques for dynamic and intelligent task scheduling in cloud computing. How can reinforcement learning be applied to task scheduling in cloud computing? What are the benefits of using reinforcement learning-based methods compared to traditional scheduling mechanisms? How does reinforcement learning optimize resource allocation and improve overall efficiency? Addressing these questions, in this paper, we propose a Q-learning-based Multi-Task Scheduling Framework (QMTSF). This framework consists of two stages: First, tasks are dynamically allocated to suitable servers in the cloud environment based on the type of servers. Second, an improved Q-learning algorithm called UCB-based Q-Reinforcement Learning (UQRL) is used on each server to assign tasks to a Virtual Machine (VM). The agent makes intelligent decisions based on past experiences and interactions with the environment. In addition, the agent learns from rewards and punishments to formulate the optimal task allocation strategy and schedule tasks on different VMs. The goal is to minimize the total makespan and average processing time of tasks while ensuring task deadlines. We conducted simulation experiments to evaluate the performance of the proposed mechanism compared to traditional scheduling methods such as Particle Swarm Optimization (PSO), random, and Round-Robin (RR). The experimental results demonstrate that the proposed QMTSF scheduling framework outperforms other scheduling mechanisms in terms of the makespan and average task processing time.