Communication limitations at the edge are widely recognized as a major bottleneck for federated learning (FL). Multi-hop wireless networking provides a cost-effective way to extend service coverage and improve spectrum efficiency at the edge, which can facilitate large-scale and efficient machine learning (ML) model aggregation. However, FL over multi-hop wireless networks has rarely been investigated. In this paper, we optimize FL over wireless mesh networks by accounting for the heterogeneity in communication and computing resources at mesh routers and clients. We present a framework in which each intermediate router performs in-network model aggregation before forwarding data to the next hop, reducing the outgoing traffic and thereby allowing more models to be aggregated under limited communication resources. To accelerate model training, we formulate an optimization problem that jointly considers model aggregation, routing, and spectrum allocation. Although the problem is a non-convex mixed-integer nonlinear program, we transform it into a mixed-integer linear program (MILP) and develop a coarse-grained fixing procedure to solve it efficiently. Simulation results demonstrate the effectiveness of the solution approach and the superiority of the in-network aggregation scheme over its counterpart without in-network aggregation.
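To make the in-network aggregation idea concrete, the sketch below shows how an intermediate router might combine models from its attached clients and downstream routers into a single sample-weighted aggregate before forwarding it upstream. This is only a minimal illustration, assuming FedAvg-style weighted averaging; the function name `in_network_aggregate` and the `(parameters, num_samples)` representation are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of in-network model aggregation at a mesh router.
# Assumes FedAvg-style sample-weighted averaging; names are illustrative.
from typing import List, Tuple

import numpy as np


def in_network_aggregate(
    local_models: List[Tuple[np.ndarray, int]],
    child_aggregates: List[Tuple[np.ndarray, int]],
) -> Tuple[np.ndarray, int]:
    """Fuse client models and downstream aggregates into one model.

    Each entry is (parameter_vector, num_samples). The router forwards
    a single weighted average and the total sample count upstream,
    instead of relaying every individual model.
    """
    entries = local_models + child_aggregates
    total = sum(n for _, n in entries)
    if total == 0:
        raise ValueError("no models to aggregate")
    aggregate = sum(w * (n / total) for w, n in entries)
    return aggregate, total


# Example: a router with two local clients and one downstream aggregate
# sends one model upstream instead of three.
m1, m2 = np.ones(4), 3 * np.ones(4)
child = (2 * np.ones(4), 40)
agg, n = in_network_aggregate([(m1, 10), (m2, 10)], [child])
```

Because each hop emits a single model regardless of how many it receives, the outgoing traffic per router stays constant, which is what allows more models to be aggregated under the same communication budget.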