Text-independent speaker verification (TI-SV), an active research topic in biometric authentication, aims to determine whether two unconstrained utterances come from the same speaker. State-of-the-art end-to-end approaches use deep neural networks to learn a highly discriminative speaker embedding space. In this paper, we propose a novel end-to-end approach to speaker embedding learning that focuses on two crucial factors: the speaker embedder architecture and the objective function. The proposed embedder module consists of an Efficient Multi-resolution feature Representation (EMR) block followed by a Multi-scale Channel Attention Fusion (MCAF) block. The EMR block addresses the limitation of the fixed-resolution convolutional kernels commonly used in embedder architectures, while the MCAF block improves on the simple summation-based feature fusion used in residual embedder networks. Regarding the objective function, we guide the speaker embedding space toward learning embedding-to-embedding relations, in addition to the embedding-to-training-class relations on which most previous methods rely exclusively. To this end, we place a dynamic graph attention network on top of the proposed embedder to learn the informative relations between embeddings, and we train the embedder and the graph-based network jointly in an end-to-end manner. We conduct extensive experiments on the large-scale benchmark dataset VoxCeleb1&2. An ablation study verifies the effectiveness of each proposed component. Compared with seven well-known embedding architectures and 32 SV systems, the proposed approach achieves superior or competitive performance in terms of two evaluation metrics, equal error rate (EER) and minimum detection cost function (minDCF), as well as the number of embedder parameters.
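To make the verification task and the EER metric concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes that trial pairs are scored by cosine similarity between two speaker embeddings (the abstract does not specify the scoring back-end) and approximates the EER by sweeping the decision threshold over the observed scores. The function names `cosine_score` and `equal_error_rate` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embeddings (assumed scoring rule)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(scores, labels):
    """Approximate EER for a list of trial scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1/True for same-speaker trials, 0/False for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)  # sorted unique candidate thresholds

    # False acceptance rate: different-speaker trials that are accepted.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: same-speaker trials that are rejected.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # The EER is the operating point where FAR and FRR are (closest to) equal.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)


# Toy trial list with made-up scores, for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A lower EER indicates that same-speaker and different-speaker score distributions are better separated. minDCF additionally weights the two error types by application-dependent costs and a target prior, which this sketch does not cover.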