Fog computing promises to enable machine learning tasks to scale to large volumes of data by distributing processing across connected devices. Two key challenges to achieving this goal are (i) heterogeneity in devices' compute resources and (ii) topology constraints on which devices can communicate with each other. We address these challenges by developing a novel network-aware distributed learning methodology in which devices optimally share local data processing and send their learned parameters to a server for periodic aggregation. Unlike traditional federated learning, our method enables devices to offload their data processing tasks to each other, with these decisions optimized to trade off the costs associated with data processing, offloading, and discarding. We analytically characterize the optimal data transfer solution under different assumptions on the fog network scenario, showing, for example, that the value of offloading is approximately linear in the range of computing costs in the network when the cost of discarding is modeled as decreasing linearly in the amount of data processed at each node. Our experiments on real-world data traces from our testbed confirm that, for varying distributions of data across devices, our algorithms substantially improve network resource utilization without sacrificing the accuracy of the learned model. We also investigate the effect of network dynamics on model learning and resource costs.
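As a rough illustration of the processing/offloading/discarding trade-off described above, the sketch below formulates a toy three-device instance as a linear program. The topology (one-hop offloading links 1→2 and 2→3), the cost and capacity values, and the use of scipy.optimize.linprog are illustrative assumptions only, not the paper's actual formulation.

```python
# Minimal sketch of the data processing / offloading / discarding trade-off
# as a linear program, under assumed (hypothetical) costs and topology.
import numpy as np
from scipy.optimize import linprog

d   = np.array([10.0, 4.0, 2.0])   # data units generated at each device
cap = np.array([3.0, 8.0, 6.0])    # local processing capacities
p   = np.array([1.0, 0.2, 0.4])    # per-unit processing costs
delta = 2.0                         # per-unit penalty for discarded data
c12, c23 = 0.3, 0.1                 # per-unit offloading costs on links 1->2, 2->3

# Decision variables: [x1, x2, x3, r1, r2, r3, s12, s23]
#   x_i: data processed locally at device i, r_i: data discarded at i,
#   s_ij: data offloaded from i to j (and then processed at j).
cost = np.concatenate([p, [delta] * 3, [c12 + p[1], c23 + p[2]]])

# Conservation: each device's own data is processed, discarded, or offloaded.
A_eq = np.array([
    [1, 0, 0, 1, 0, 0, 1, 0],   # x1 + r1 + s12 = d1
    [0, 1, 0, 0, 1, 0, 0, 1],   # x2 + r2 + s23 = d2
    [0, 0, 1, 0, 0, 1, 0, 0],   # x3 + r3       = d3
])
# Capacity: local work plus incoming offloads must fit each device's budget.
A_ub = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0],   # x1       <= cap1
    [0, 1, 0, 0, 0, 0, 1, 0],   # x2 + s12 <= cap2
    [0, 0, 1, 0, 0, 0, 0, 1],   # x3 + s23 <= cap3
])

res = linprog(cost, A_ub=A_ub, b_ub=cap, A_eq=A_eq, b_eq=d,
              bounds=[(0, None)] * 8, method="highs")
x, r, s = res.x[:3], res.x[3:6], res.x[6:]
print(f"processed={x.round(2)}, discarded={r.round(2)}, offloaded={s.round(2)}")
print(f"total cost={res.fun:.2f}")
```

In this toy instance the solver shifts work away from the device with the highest processing cost toward neighbors with spare capacity, discarding data only when the combined offloading and remote processing cost would exceed the discard penalty; the paper's actual model and analysis are, of course, more general.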