This study aimed to develop a convolutional neural network (CNN) that can directly estimate material density distributions from multi-energy computed tomography (CT) images without performing conventional material decomposition. The proposed CNN (denoted Incept-net) followed the general framework of an encoder-decoder network, under the assumption that local image information is sufficient for modeling the nonlinear physical process of multi-energy CT. Incept-net was implemented with a customized loss function that included an in-house-designed image-gradient-correlation (IGC) regularizer to improve edge preservation. The network consisted of two types of customized multibranch modules that exploit multiscale feature representation to improve robustness to local image noise and artifacts. Inserts containing different materials [hydroxyapatite (HA), iodine, a blood-iodine mixture, and fat] at various densities were scanned using a research photon-counting detector (PCD) CT system with two energy thresholds and multiple radiation dose levels. The network was trained using phantom image patches only and tested on full field-of-view phantom images in different configurations and on in vivo porcine images. Furthermore, the nominal mass densities of the insert materials were used as labels in CNN training, which potentially provided an implicit mass conservation constraint. Incept-net performance was evaluated in terms of image noise, detail preservation, and quantitative accuracy, and was compared with common material decomposition algorithms: least-squares-based material decomposition (LS-MD), total-variation-regularized material decomposition (TV-MD), and a U-net-based method. Incept-net improved the accuracy of the predicted mass densities of basis materials compared with U-net, TV-MD, and LS-MD: the mean absolute error (MAE) for iodine was 0.66, 1.0, 1.33, and 1.57 mgI/cc for Incept-net, U-net, TV-MD, and LS-MD, respectively, across all iodine-containing inserts (2.0-24.0 mgI/cc).
With LS-MD as the baseline, Incept-net and U-net achieved comparable noise reduction (both around 95%), higher than that of TV-MD (85%). The proposed IGC regularizer helped both Incept-net and U-net reduce image artifacts. Incept-net closely conserved the total mass density (i.e., satisfied the mass conservation constraint) in porcine images, which heuristically validated the quantitative accuracy of its outputs in an anatomical background. In general, Incept-net performance was less dependent on radiation dose level than that of the two conventional methods. With approximately 40% fewer parameters, Incept-net achieved better performance than the comparator U-net, indicating that its performance gain was not achieved simply by increasing network learning capacity. Incept-net demonstrated better qualitative image appearance, higher quantitative accuracy, and lower noise than the conventional methods, and was less sensitive to dose changes. Incept-net generalized well to unseen image structures and different material mass densities. This study provides preliminary evidence that the proposed CNN may be used to improve material decomposition quality in multi-energy CT.
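For orientation, the LS-MD baseline and the two summary metrics quoted above (MAE and noise reduction relative to a baseline) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the two-material calibration matrix, and the exact metric definitions (MAE taken over per-insert mean densities, noise reduction taken as one minus the ratio of noise standard deviations) are hypothetical choices made here for illustration.

```python
import numpy as np


def ls_material_decomposition(images, basis_matrix):
    """Pixel-wise least-squares material decomposition (illustrative).

    images:       array (n_energies, H, W), CT values per energy threshold.
    basis_matrix: array (n_energies, n_materials), CT value per unit mass
                  density of each basis material at each energy. The values
                  used with this sketch are hypothetical, not calibration
                  data from the study.
    Returns an array (n_materials, H, W) of estimated mass densities.
    """
    n_energies, height, width = images.shape
    flat = images.reshape(n_energies, -1)  # one column per pixel
    densities, *_ = np.linalg.lstsq(basis_matrix, flat, rcond=None)
    return densities.reshape(basis_matrix.shape[1], height, width)


def mean_absolute_error(measured, nominal):
    """Assumed MAE: mean of |measured - nominal| over all inserts."""
    measured = np.asarray(measured, dtype=float)
    nominal = np.asarray(nominal, dtype=float)
    return float(np.mean(np.abs(measured - nominal)))


def noise_reduction_percent(noise_method, noise_baseline):
    """Assumed definition: relative drop in noise std versus the baseline."""
    return 100.0 * (1.0 - noise_method / noise_baseline)


if __name__ == "__main__":
    # Hypothetical 2-energy, 2-material calibration matrix.
    basis = np.array([[2.0, 1.0],
                      [1.0, 3.0]])
    true_density = np.random.default_rng(0).uniform(0.0, 5.0, size=(2, 4, 4))
    ct_images = np.einsum("em,mhw->ehw", basis, true_density)

    estimate = ls_material_decomposition(ct_images, basis)
    print("max abs error:", np.abs(estimate - true_density).max())
    print("MAE:", mean_absolute_error([2.4, 11.5], [2.0, 12.0]))
    print("noise reduction (%):", noise_reduction_percent(0.5, 10.0))
```

On noise-free synthetic data the least-squares estimate recovers the input densities. With real images, noise in the energy-threshold images is amplified by the inversion, which is the behavior that regularized (TV-MD) and learned (U-net, Incept-net) approaches aim to suppress.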