Wireless sensing based on WiFi channel state information (CSI) now enables applications such as person identification, human activity recognition, occupancy detection, localization, and crowd estimation. In WiFi CSI-based methods, these tasks have mostly been treated as separate topics, whereas some camera- and vision-based crowd estimation systems estimate crowd size and location at the same time. Our work is inspired by the idea that WiFi CSI may be able to do what a camera does. In this paper, we construct <i>Wi-CaL</i>, a simultaneous crowd counting and localization system that uses ESP32 modules for its WiFi links. From the CSI bundles, we extract several features that reflect the dynamic state (a moving crowd) and the static state (the location of the crowd), then evaluate the system with both conventional machine learning (ML) and deep learning (DL). In the ML-based evaluation with leave-one-session-out cross-validation, we achieved a median absolute error (MAE) of 0.35 in counting and a localization accuracy of 91.4% with five people in a small room, and an MAE of 0.41 in counting and a localization accuracy of 98.1% with 10 people in a medium-sized room. We also compared our results with the percentage of non-zero elements metric (PEM), a state-of-the-art metric for crowd counting, and confirmed that our system performs better (0.41 MAE, 81.8% of estimates within a one-person error) than PEM (0.62 MAE, 66.5% of estimates within a one-person error).
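For readers unfamiliar with the evaluation terms used above, the following is a minimal Python sketch of how a median absolute error, a within-one-person rate, and leave-one-session-out splits could be computed. It is an illustration under our own assumptions, not the paper's implementation: the function names, the session labels, and the toy counts are all hypothetical.

```python
from statistics import median


def median_absolute_error(true_counts, predicted_counts):
    """Median of |true - predicted| over all samples."""
    return median(abs(t - p) for t, p in zip(true_counts, predicted_counts))


def within_one_person_rate(true_counts, predicted_counts):
    """Fraction of samples whose predicted count is off by at most one person."""
    hits = sum(1 for t, p in zip(true_counts, predicted_counts) if abs(t - p) <= 1)
    return hits / len(true_counts)


def leave_one_session_out(session_ids):
    """Yield (held_out_session, train_indices, test_indices) per distinct session.

    Each recording session is held out once as the test set, and every
    other session is used for training.
    """
    for held_out in sorted(set(session_ids)):
        train = [i for i, s in enumerate(session_ids) if s != held_out]
        test = [i for i, s in enumerate(session_ids) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Hypothetical toy data: true and predicted crowd sizes, plus session labels.
    true_counts = [3, 5, 2, 4, 1]
    predicted_counts = [3, 4, 2, 6, 1]
    sessions = ["a", "a", "b", "c", "b"]

    print(median_absolute_error(true_counts, predicted_counts))
    print(within_one_person_rate(true_counts, predicted_counts))
    for held_out, train, test in leave_one_session_out(sessions):
        print(held_out, train, test)
```

Splitting by session rather than by individual sample keeps data recorded in the same session from appearing in both the training and test sets, which is the usual motivation for this kind of cross-validation.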