Recent advances in technology and consumer devices have caused an explosion of data on the Internet and on personal computers. Much of this data spans multiple modalities (text, image, video, etc.) and is central to e-commerce websites, where products are described by both images and textual descriptions, making them inherently multimodal. Earlier categorization and information retrieval methods focused mostly on a single modality. This study classifies multimodal data for information retrieval tasks using neutrosophic fuzzy sets to manage uncertainty. Inspired by earlier techniques that embed text onto an image, we fuse the image and text data and classify the resulting images using neutrosophic classification algorithms. For the classification task, Neutrosophic Convolutional Neural Networks (NCNNs) are used to learn feature representations of the fused images, and we demonstrate how an NCNN-based pipeline can learn representations produced by this fusion method. Traditional convolutional neural networks are vulnerable to unseen noisy conditions at test time, so their performance on noisy data declines. Comparisons against single-modality baselines on two large-scale multimodal categorization datasets yielded favorable results. In addition, we compare our method with two well-known multimodal fusion strategies, namely early fusion and late fusion.
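To make the fusion step concrete, the following Python sketch renders a product's text description onto its image with Pillow, so that a single CNN-style model can consume both modalities as one input. This is a minimal illustration under our own assumptions; the function name, font, and layout choices are placeholders and not the paper's exact procedure.

```python
# Illustrative sketch of text-over-image fusion (assumed procedure, not the
# paper's exact method): overlay a product description on its image so a
# single convolutional model sees both modalities at once.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def fuse_text_onto_image(image_path: str, description: str,
                         out_size=(224, 224)) -> Image.Image:
    """Overlay `description` on the product image and return the fused image."""
    img = Image.open(image_path).convert("RGB").resize(out_size)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a TrueType font for real use

    # Wrap the description so each line fits within the image width,
    # and cap the number of lines to avoid covering the whole image.
    lines = textwrap.wrap(description, width=40)[:6]

    # Draw the wrapped lines near the top of the image.
    y = 4
    for line in lines:
        draw.text((4, y), line, fill=(255, 255, 0), font=font)
        y += 12

    return img

# Example usage (path and text are placeholders):
# fused = fuse_text_onto_image("product_0001.jpg",
#                              "Wireless optical mouse, 2.4 GHz")
# fused.save("fused_0001.png")
```

Feeding such fused images to a single network performs the fusion at the input level; the early- and late-fusion baselines mentioned above would instead combine modality features before classification or combine the predictions of separate image and text models, respectively.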