Atomistic modeling of nanostructures often leads to computationally challenging problems involving millions of atoms and tens of thousands of Coulomb matrix elements. In our previous work, we presented a practical solution to this problem, where quasi-linear efficiency, both in time and memory, was obtained by utilizing the fast Fourier transform. Here, we present an updated version of our highly-parallelized computer program, named Coulombo-Lattice, that eliminates the necessity of introducing an auxiliary basis set for the wave-function transfer to the computational grid. Here, we instead exploit the properties of the underlying crystal lattice and run calculations on a regular three-dimensional grid superimposed on the original, lower-symmetry lattice. Due to removal of spurious interactions from other supercells, the resulting Coulomb matrix elements are, up to numerical precision, identical to those obtained by the direct summation O(N2) method, yet our code maintains O(NlogN) scaling. We illustrate our approach by calculations involving up to 1.7 million integrals, and number of atoms reaching up to 2.8 million, for the problem of dopant charging energy for a single phosphorus dopant embedded in a silicon lattice. Next, to emphasize the broad applicability of our code, we show the results for mixed zinc-blend/wurtzite lattice systems, also known as crystal phase quantum dots. Program summaryProgram Title: Coulombo-LatticeCPC Library link to program files:https://doi.org/10.17632/jwkvh5ycbf.1Licensing provisions: CC BY 4.0Programming language: C++Nature of problem: Computing the Coulomb matrix elements (Coulomb and exchange integrals), while being a demanding computational task, is a necessary step in a range of quantum mechanical calculations. For example, in the field of nanostructures, such as quantum dots and nanowires or novel single dopant devices, this stage of calculation (even after a series of approximations) is at least an O(N2) operation (a summation over all pairs of N atoms or grid points in the analyzed system). Moreover, calculating the full Coulomb matrix usually requires the computation of thousands of such elements, thus presenting a formidable computational challenge.Solution method: We provide a ready-to-use implementation for calculating Coulomb matrix elements for a given set of input wavefunctions in LCAO (linear combination of atomic orbitals) form resulting from tight-binding calculation. This implementation is based on the approach introduced in [1,2], by using a fast Fourier transform. In this work, we further significantly improved the method by eliminating the need to introduce wave-function transfer via auxiliary basis set.Additional comments including Restrictions and Unusual features: The implementation is fully parallelized in a distributed-memory model, using MPI and parallel routines from FFTW [3].