Irregular and dynamic applications, such as graph problems and agent-based simulations, often require fine-grained parallelism to achieve good performance. However, current multicore processors only provide architectural support for coarse-grained parallelism, making it necessary to use software-based multithreading environments to effectively implement fine-grained parallelism. Although these software-based environments have demonstrated superior performance over heavyweight, OS-level threads, they are still limited by the significant overhead involved in thread management and synchronization. In order to address this, we propose a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an “unlimited” number of dynamically created lightweight threads with very low thread management and synchronization overhead. The LCMT architecture can be implemented atop a mainstream architecture with minimum extra hardware to leverage existing legacy software environments. We compare the LCMT architecture with a Niagara-like baseline architecture. Our results show up to 1.8X better scalability, 1.91X better performance, and more importantly, 1.74X better performance per watt, using the LCMT architecture for irregular and dynamic benchmarks, when compared to the baseline architecture. The LCMT architecture delivers similar performance to the baseline architecture for regular benchmarks.