Social networks have become an irreplaceable source of information: millions of people use them daily to follow the issues of the moment. This success has made the amount of content on social networks unmanageable and, in many cases, fake or not credible. Careful pre-processing is therefore necessary to obtain knowledge and value from these data sets. In this paper, we propose a new Big Data pre-processing technique that addresses two key dimensions of the Big Data paradigm: the validity and credibility of the data, and its volume. The system is a Spark-based filter that flexibly selects credible users related to a given topic under analysis, reducing the volume of data and keeping only the data that are valid for the problem under study. It combines word embeddings with other text mining and natural language processing techniques. The system has been validated on three real-world use cases.