Fashion Meets Computer Vision
Fashion is the way we present ourselves to the world and has become one of the world’s largest industries. Because fashion is conveyed mainly through vision, it has attracted much attention from computer vision researchers in recent years. Given this rapid development, this article provides a comprehensive survey of more than 200 major fashion-related works covering four main aspects of intelligent fashion: (1) fashion detection, including landmark detection, fashion parsing, and item retrieval; (2) fashion analysis, covering attribute recognition, style learning, and popularity prediction; (3) fashion synthesis, involving style transfer, pose transformation, and physical simulation; and (4) fashion recommendation, comprising fashion compatibility, outfit matching, and hairstyle suggestion. For each task, the benchmark datasets and evaluation protocols are summarized. Furthermore, we highlight promising directions for future research.
- Research Article
1
- 10.1162/neco_a_01677
- Jul 19, 2024
- Neural computation
In computer vision research, convolutional neural networks (CNNs) have demonstrated remarkable capabilities at extracting patterns from raw pixel data, achieving state-of-the-art recognition accuracy. However, they differ significantly from human visual perception: they prioritize pixel-level correlations and statistical patterns, often overlooking object semantics. To explore this difference, we propose an approach that isolates the core visual features crucial for human perception and object recognition: color, texture, and shape. In experiments on three benchmarks (Fruits 360, CIFAR-10, and Fashion MNIST), each visual feature is individually input into a neural network. Results reveal data set-dependent variations in classification accuracy, highlighting that deep learning models tend to learn pixel-level correlations instead of fundamental visual features. To validate this observation, we used various combinations of concatenated visual features as input to a neural network on the CIFAR-10 data set. CNNs excel at learning statistical patterns in images, achieving exceptional performance when training and test data share similar distributions. To substantiate this point, we trained a CNN on the CIFAR-10 data set and evaluated its performance on the "dog" class from CIFAR-10 and on an equivalent number of examples from the Stanford Dogs data set. The CNN's poor performance on Stanford Dogs images underlines the disparity between deep learning and human visual perception, highlighting the need for models that learn object semantics. Specialized benchmark data sets with controlled variations hold promise for aligning learned representations with human cognition in computer vision research.
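The feature-isolation protocol described above is straightforward to prototype. Below is a minimal sketch, not the authors' code, that feeds only color information (per-channel histograms, discarding texture and shape) from CIFAR-10 into a small classifier; the histogram size and classifier choice are illustrative assumptions.

```python
# Hypothetical sketch: probe how much a single visual feature (here, color)
# supports classification, in the spirit of the feature-isolation experiments.
import numpy as np
from torchvision.datasets import CIFAR10
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def color_histogram(img, bins=8):
    """Concatenate per-channel histograms, discarding all spatial layout."""
    arr = np.asarray(img)  # H x W x 3, uint8
    feats = [np.histogram(arr[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(feats).astype(np.float32)
    return h / h.sum()  # normalize so image size does not matter

train = CIFAR10(root="data", train=True, download=True)
test = CIFAR10(root="data", train=False, download=True)

X_train = np.stack([color_histogram(img) for img, _ in train])
y_train = np.array([label for _, label in train])
X_test = np.stack([color_histogram(img) for img, _ in test])
y_test = np.array([label for _, label in test])

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200).fit(X_train, y_train)
print("color-only accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Swapping the histogram for an edge map or a binary silhouette would probe texture or shape in the same way.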
- Conference Article
28
- 10.1109/iciset.2016.7856498
- Oct 1, 2016
Popularity prediction of online news aims to predict the future popularity of a news article prior to its publication by estimating the number of shares, likes, and comments it will receive. Popularity prediction is a challenging task due to various issues, including the difficulty of measuring content quality and its relevance to users; the difficulty of predicting complex online interactions and information cascades; the inaccessibility of context outside the web; local and geographic conditions; and social network properties. This paper focuses on popularity prediction of online news by predicting whether and how widely users will share an article, adopting a before-publication approach. It proposes a gradient boosting machine for popularity prediction using features that are known before an article is published. The proposed model shows around a 1.8% improvement over previously applied techniques on a benchmark dataset. The model also indicates that features extracted from article keywords, the publication day, and the data channel are highly influential for popularity prediction.
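As a rough illustration of the before-publication setup, the sketch below trains a gradient boosting classifier on features available prior to release; the feature set and the synthetic data are placeholders, not the paper's benchmark.

```python
# Illustrative sketch: gradient boosting on pre-publication features only.
# Feature names and labels are assumptions for demonstration purposes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(0, 7, n),     # publication day of week
    rng.integers(0, 6, n),     # data channel (e.g., tech, lifestyle)
    rng.integers(1, 10, n),    # number of keywords
    rng.normal(500, 150, n),   # article length in words
])
y = rng.integers(0, 2, n)      # 1 = widely shared, 0 = not (placeholder labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
print("feature importances:", model.feature_importances_)
```

On real data, the feature importances would indicate which pre-publication signals (keywords, day, channel) drive the prediction, mirroring the paper's analysis.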
- Research Article
11
- 10.1142/s0219691323500662
- Feb 9, 2024
- International Journal of Wavelets, Multiresolution and Information Processing
Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in space and time. It has become a particularly active area of research in computer vision because of its rapidly emerging real-world applications, such as autonomous driving, visual surveillance, and entertainment. Many efforts have been devoted in recent years to building robust and effective frameworks for STAD. This paper provides a comprehensive review of the state-of-the-art deep learning-based methods for STAD. First, a taxonomy is developed to organize these methods. Next, the linking algorithms, which associate frame- or clip-level detection results to form action tubes, are reviewed. Then, the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. Finally, the paper concludes with a discussion of potential research directions for STAD.
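Linking is one of the few steps of an STAD pipeline that fits in a few lines. The sketch below is a simplified greedy IoU-based linker that chains per-frame boxes into tubes; real linkers also weigh classification scores, so treat this as an illustrative assumption rather than any specific published algorithm.

```python
# Simplified greedy tube linking: each existing tube grabs the unmatched box
# in the next frame with the highest IoU, provided it exceeds a threshold.
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tubes(frames, iou_thresh=0.3):
    """frames: list of per-frame box lists; returns tubes as lists of boxes."""
    tubes = [[box] for box in frames[0]]
    for boxes in frames[1:]:
        unused = list(boxes)
        for tube in tubes:
            if not unused:
                break
            best = max(unused, key=lambda b: iou(tube[-1], b))
            if iou(tube[-1], best) >= iou_thresh:
                tube.append(best)
                unused.remove(best)
        tubes.extend([b] for b in unused)  # unmatched boxes start new tubes
    return tubes
```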
- Research Article
45
- 10.1007/s11554-023-01286-8
- Mar 6, 2023
- Journal of Real-Time Image Processing
Automatic crowd counting using density estimation has gained significant attention in computer vision research. As a result, a large number of crowd counting and density estimation models using convolutional neural networks (CNNs) have been published in the last few years. These models have achieved good accuracy over benchmark datasets. However, attempts to improve accuracy often lead to higher model complexity. In real-time video surveillance applications using drones with limited computing resources, deep models incur intolerably high inference delays. In this paper, we propose (i) a Lightweight Crowd Density estimation model (LCDnet) for real-time video surveillance, and (ii) an improved training method using curriculum learning (CL). LCDnet is trained using CL and evaluated over two benchmark datasets, DroneRGBT and CARPK, and the results are compared with existing crowd models. Our evaluation shows that LCDnet achieves reasonably good accuracy while significantly reducing inference time and memory requirements, and can thus be deployed on edge devices with very limited computing resources.
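The curriculum-learning component can be sketched independently of the network. Below is a minimal, assumed implementation that orders training samples from easy to hard (using ground-truth crowd count as a stand-in difficulty score) and grows the training pool in stages; LCDnet's actual schedule may differ.

```python
# Curriculum learning sketch: train on progressively larger, harder subsets.
# The difficulty scores (e.g., crowd counts) are an assumed proxy.
import torch
from torch.utils.data import DataLoader, Subset

def curriculum_loaders(dataset, difficulty, stages=3, batch_size=16):
    """Yield one DataLoader per stage, each covering an easy-to-hard prefix."""
    order = sorted(range(len(dataset)), key=lambda i: difficulty[i])
    for s in range(1, stages + 1):
        cut = int(len(order) * s / stages)   # grow the training pool each stage
        yield DataLoader(Subset(dataset, order[:cut]),
                         batch_size=batch_size, shuffle=True)

# Usage (model, optimizer, dataset and counts are placeholders):
# for loader in curriculum_loaders(train_set, difficulty=crowd_counts):
#     for images, density_maps in loader:
#         loss = torch.nn.functional.mse_loss(model(images), density_maps)
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```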
- Research Article
28
- 10.1007/s11042-022-12790-7
- Mar 26, 2022
- Multimedia Tools and Applications
Emotion recognition from face images is a challenging task that has gained interest in recent years for its applications to business intelligence and social robotics. Researchers in computer vision and affective computing have focused on optimizing the classification error on benchmark data sets, which do not extensively cover the variations that face images may undergo in real environments. Following investigations carried out in the field of object recognition, we evaluated the robustness of existing methods for emotion recognition when their input is subjected to corruptions caused by factors present in real-world scenarios. We constructed two data sets on top of the RAF-DB test set, named RAF-DB-C and RAF-DB-P, that contain images modified with 18 types of corruption and 10 types of perturbation. We benchmarked existing networks (VGG, DenseNet, SENet and Xception) trained on the original images of RAF-DB and compared them with ARM, the current state-of-the-art method on the RAF-DB test set. We carried out an extensive study of the effects that modifications to the training data or network architecture have on the classification of corrupted and perturbed data. We observed a drop in the recognition performance of ARM, with the classification error rising to as much as 200% of that achieved on the original RAF-DB test set. We demonstrate that the use of AutoAugment data augmentation and an anti-aliasing filter within down-sampling layers provides existing networks with increased robustness to out-of-distribution variations, substantially reducing the error on corrupted inputs and outperforming ARM. We provide insights into the resilience of existing emotion recognition methods and an estimation of their performance in real scenarios. The processing time required by the modifications we investigated (35 ms in the worst case) supports their suitability for real-world applications. The RAF-DB-C and RAF-DB-P test sets, trained models, and evaluation framework are available at https://github.com/MiviaLab/emotion-robustness.
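Of the two fixes the study evaluates, the anti-aliasing filter is simple to express in code. The sketch below is a generic BlurPool-style layer (blur, then subsample) in PyTorch; the 3x3 binomial kernel is a common choice and an assumption here, not necessarily the exact filter used in the paper.

```python
# Anti-aliased down-sampling sketch: low-pass filter before subsampling,
# replacing plain strided pooling/convolution that introduces aliasing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1., 2., 1.])
        k = torch.outer(k, k)
        k = k / k.sum()                       # normalized 3x3 binomial filter
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3))
        self.stride = stride
        self.channels = channels

    def forward(self, x):
        # Depthwise blur (groups = channels), then stride-2 subsampling.
        return F.conv2d(x, self.kernel, stride=self.stride,
                        padding=1, groups=self.channels)

pool = BlurPool2d(channels=64)
print(pool(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 28, 28])
```

Dropping such a layer in place of each strided down-sampling step is the architectural change whose robustness effect the paper measures.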
- Conference Article
2
- 10.1145/3355166.3355172
- Aug 2, 2019
Rapid development has been achieved since the emergence of MOOCs in 2008, but there are still many defects in their popularization. Developing blended teaching by utilizing MOOCs is considered one of the effective means of overcoming these shortcomings. Existing studies on MOOC blended teaching focus on how to improve the teaching effect but lack in-depth analysis of MOOC pedagogical tools and learners' learning styles in blended teaching. Through a case study of MOOC blended teaching, this study aims to gain an in-depth understanding of learners' perception of MOOC pedagogical tools and of the learning styles learners choose in this diverse learning environment. The results show that learners' learning styles can be divided into five categories in MOOC blended teaching, and that these styles correlate differently with the perception of MOOC learning tools and with curriculum satisfaction, but show no correlation with curriculum achievement. This indicates that learners choose different learning styles according to their own characteristics and preferences without affecting their academic performance. The design of existing MOOC pedagogical tools cannot meet the needs of all learners; although MOOC blended teaching can make up for some deficiencies in this area, further research is still needed to perfect it.
- Research Article
19
- 10.14288/1.0051790
- Mar 1, 1978
- PubMed
This thesis is concerned with aspects of a theory of machine perception. It is shown that a comprehensive theory is emerging from research in computer vision, natural language understanding, cognitive psychology, and Artificial Intelligence programming language technology. A number of aspects of machine perception are characterized. Perception is a recognition process which composes new descriptions of sensory experience in terms of stored stereotypical knowledge of the world. Perception requires both a schema-based formalism for the representation of knowledge and a model of the processes necessary for performing search and deduction on that representation. As an approach towards the development of a theory of machine perception, a computational model of recognition is presented. The similarity of the model to formal mechanisms in parsing theory is discussed. The recognition model integrates top-down, hypothesis-driven search with bottom-up, data-driven search in hierarchical schemata representations. Heuristic procedural methods are associated with particular schemata as models to guide their recognition. Multiple methods may be applied concurrently in both top-down and bottom-up search modes. The implementation of the recognition model as an Artificial Intelligence programming language called MAYA is described. MAYA is a multiprocessing dialect of LISP that provides data structures for representing schemata networks and control structures for integrating top-down and bottom-up processing. A characteristic example from scene analysis, written in MAYA, is presented to illustrate the operation of the model and the utility of the programming language. A programming reference manual for MAYA is included. Finally, applications for both the recognition model and MAYA are discussed and some promising directions for future research proposed.
- Conference Article
3
- 10.1109/issnip.2014.6827622
- Apr 1, 2014
Large variations in human actions lead to major challenges in computer vision research, and several algorithms have been designed to address them; algorithms that stand apart not only solve the challenge but also perform faster and more efficiently. In this paper, we propose a human-cognition-inspired, projection-based learning method for person-independent human action recognition in the H.264/AVC compressed domain and demonstrate a PBL-McRBFN-based approach intended to take machine learning algorithms to the next level. We use a gradient-image-based feature extraction process in which motion vectors and quantization parameters are extracted and studied temporally to form Groups of Pictures (GoPs). The GoPs are then considered individually for two different benchmark datasets, and the results are classified using person-independent human action recognition. The functional relationship is studied using the Projection Based Learning algorithm of the Meta-cognitive Radial Basis Function Network (PBL-McRBFN), which has a cognitive and a meta-cognitive component. The cognitive component is a radial basis function network, while the Meta-Cognitive Component (MCC) employs self-regulation; the MCC emulates human-cognition-like learning to achieve better performance. The proposed approach can handle sparse information in the compressed video domain and provides higher accuracy than pixel-domain counterparts. The feature extraction process achieved more than 90% accuracy using the PBL-McRBFN, which catalyzes the speed of the proposed high-speed action recognition algorithm. We conducted twenty random trials to measure performance per GoP. The results are also compared with other well-known classifiers from the machine learning literature.
- Research Article
23
- 10.1109/tpami.1999.761260
- Apr 1, 1999
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Computer vision emerged as a subfield in computer science and in electrical engineering in the 1960s. Two main motivations for research in computer vision are to develop algorithms to solve vision problems and to understand and model the human visual system. It turns out that finding satisfactory answers to either motivation is significantly harder than common wisdom initially assumed. Research in computer vision has actively continued to the current time. Most of the research in the computer vision and pattern recognition community is focused on developing solutions to vision problems. With three decades of research behind current efforts and with the availability of powerful, inexpensive computers, there is a common belief that computer vision is poised to deliver reliable solutions. The area of empirical evaluation of computer vision algorithms is developing the methods and tools for measuring the ability of algorithms to meet requirements to be fielded, for determining the state-of-the-art, and for pointing out future research directions. The goal of this special theme section of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) is to highlight progress in empirical evaluation and identify it as a maturing area of computer vision. Out of 18 submissions, three were accepted for this special section. In addition, one submission was accepted to appear in a regular issue, and two others are being revised for consideration as regular papers. “Filtering for Supervised Texture Segmentation: A Comparative Study” by T. Randen and J.H. Husoy presents a comparative study of methods for texture classification. The emphasis of the study is filtering methods from signal processing. Most major filtering approaches are evaluated. For reference, a statistical algorithm and a model-based algorithm are also evaluated. The paper presents performance results on a number of mosaic texture images. In a first for PAMI, the raw image files for these images are being made available as part of the electronic version of the paper. (The electronic version of the paper is part of the Computer Society’s digital library, accessible online at www.computer.org.) It is hoped that future papers on texture segmentation will take advantage of this in order to present directly comparable experimental results. “Performance Evaluation and Analysis of Monocular Building Extraction From Aerial Imagery” by J.A. Shufelt evaluates end-to-end performance of four systems on their ability to extract buildings from 83 aerial images of 18 sites. The methodology allows for an examination of traditional assumptions made in designing algorithms that extract buildings from monocular imagery. “Evaluation of Methods for Ridge and Valley Detection” by A.M. Lopez, F. Lumbreras, and J. Serrat evaluates ridge and valley detectors. The authors discuss the desirable properties of ridge and valley detectors and the methods for measuring those properties, and then present an evaluation using these methods. We hope the papers in this special section are interesting and present challenges for future researchers.
- Research Article
- 10.65307/pe.v1i2.76
- Jan 31, 2026
- Pustaka Edukasi: Jurnal Pendidikan Indonesia
This study aims to analyze the influence of digital literacy and learning styles on the academic achievement of Gen Z students in Asahan Regency. Rapid developments in digital technology require students to have good digital literacy skills and appropriate learning styles in order to achieve optimal academic performance. This study uses a quantitative approach with a survey method. The research population consisted of all Gen Z students in Asahan Regency, sampled using a simple random sampling technique. Data were collected through questionnaires that had been tested for validity and reliability, and analyzed using multiple linear regression with the help of statistical software. The results showed that digital literacy had a positive but insignificant effect on student academic achievement, while learning style had a positive and significant effect. Simultaneously, digital literacy and learning style have a significant effect on the academic achievement of Gen Z students in Asahan Regency. This study is expected to inform educational institutions seeking to improve student academic achievement through strengthening digital literacy and developing effective learning styles.
Keywords: Digital Literacy, Learning Style, Academic Achievement, Gen Z Students
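For concreteness, the sketch below reproduces the kind of multiple linear regression analysis the study describes, with synthetic placeholder data standing in for the survey responses; variable names and coefficients are assumptions.

```python
# Illustrative multiple linear regression: academic achievement regressed on
# digital literacy and learning style. Data are synthetic placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 120
digital_literacy = rng.normal(3.5, 0.6, n)   # e.g., Likert-scale means
learning_style = rng.normal(3.8, 0.5, n)
achievement = (1.0 + 0.1 * digital_literacy + 0.6 * learning_style
               + rng.normal(0, 0.4, n))

X = sm.add_constant(np.column_stack([digital_literacy, learning_style]))
model = sm.OLS(achievement, X).fit()
print(model.summary())  # per-predictor t-tests and the joint (simultaneous) F-test
```

The per-coefficient t-tests correspond to the study's individual-effect findings, and the F-test to its simultaneous-effect finding.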
- Conference Article
2
- 10.1109/icip.2015.7351443
- Sep 1, 2015
Facial Expression Recognition is an active area of research in computer vision with a wide range of applications. Several approaches have been developed to solve this problem on different benchmark datasets. However, facial expression recognition in the wild remains an area where much work is still needed to serve real-world applications. To this end, we present a novel approach to facial expression recognition that fuses rich deep features with domain knowledge by encoding discriminant facial patches. We conduct experiments on two of the most popular benchmark datasets, CK and TFE. Moreover, we present a novel dataset that, unlike its precedents, consists of natural, not acted, expression images. Experimental results show that our approach achieves state-of-the-art results over standard benchmarks and our own dataset.
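The fusion idea can be sketched generically: concatenate a global deep feature with features from a few facial patches and classify the result. In the sketch below the patch locations are fixed guesses and the ResNet-18 backbone is an assumption, not the paper's encoding of discriminant patches.

```python
# Hypothetical feature-fusion sketch: global deep feature + patch features,
# concatenated and classified with a linear SVM.
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF
from sklearn.svm import LinearSVC

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()  # expose the 512-d pooled feature
backbone.eval()

def embed(img):  # img: 3 x 224 x 224 tensor in [0, 1]
    with torch.no_grad():
        return backbone(img.unsqueeze(0)).squeeze(0)

def fused_feature(face):
    patches = [
        TF.resized_crop(face, 40, 30, 70, 160, [224, 224]),   # eye region (assumed)
        TF.resized_crop(face, 140, 60, 60, 100, [224, 224]),  # mouth region (assumed)
    ]
    return torch.cat([embed(face)] + [embed(p) for p in patches])

# faces: list of 3x224x224 tensors; labels: expression ids (placeholders)
# X = torch.stack([fused_feature(f) for f in faces]).numpy()
# clf = LinearSVC().fit(X, labels)
```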
- Research Article
- 10.1145/3773697
- Oct 25, 2025
- ACM Computing Surveys
Salient Object Detection (SOD) focuses on identifying the most noticeable regions in images or videos, those that naturally draw human attention. It has become an active area of research in computer vision, with direct applications in tasks such as video summarization, intelligent cropping, image captioning, and visual tracking. Over the past two decades, many efforts have been made to simulate how the human visual system processes and prioritizes visual information. These approaches have evolved from conventional, handcrafted techniques to more recent deep learning-based models. This review aims to provide a clear and structured overview of the progress in deep learning methods for saliency detection. It also summarizes widely used benchmark datasets, evaluation metrics, and key application areas where saliency detection plays an important role.
- Research Article
146
- 10.1109/tip.2018.2845742
- Jun 8, 2018
- IEEE Transactions on Image Processing
While action recognition has become an important line of research in computer vision, the recognition of particular events such as aggressive behaviors, or fights, has been studied relatively little. This task may be extremely useful in several video surveillance scenarios such as psychiatric wards, prisons, or even personal smartphone cameras. Its potential usability has led to a surge of interest in developing fight or violence detectors. One of the key aspects in this case is efficiency: these methods should be computationally fast. "Handcrafted" spatiotemporal features that account for both motion and appearance information can achieve high accuracy rates, albeit the computational cost of extracting some of those features is still prohibitive for practical applications. The deep learning paradigm has recently been applied to this task for the first time, in the form of a 3D Convolutional Neural Network that processes the whole video sequence as input. However, results on human perception of others' actions suggest that, in this specific task, motion features are crucial, which means that using the whole video as input may add both redundancy and noise to the learning process. In this work, we propose a hybrid "handcrafted/learned" feature framework which provides better accuracy than the previous feature learning method, with similar computational efficiency. The proposed method is evaluated on three related benchmark datasets and outperforms the state-of-the-art methods on two of them.
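A hybrid pipeline of this kind can be approximated with off-the-shelf tools: summarize motion with dense optical-flow statistics (the handcrafted part) and feed the compact descriptor to a learned classifier. The sketch below is an assumed simplification, not the authors' framework; the frame-skip step and summary statistics are illustrative choices.

```python
# Hypothetical handcrafted/learned hybrid: Farneback optical-flow magnitude
# statistics as the motion descriptor, plus a learned classifier on top.
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def motion_descriptor(video_path, step=5):
    """Per-video descriptor from dense optical-flow magnitudes."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    mags = []
    while True:
        for _ in range(step):        # skip frames to keep extraction fast
            ok, frame = cap.read()
            if not ok:
                break
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).ravel())
        prev = gray
    cap.release()
    if not mags:                     # video too short to compute flow
        return np.zeros(4)
    m = np.concatenate(mags)
    return np.array([m.mean(), m.std(), np.percentile(m, 95), m.max()])

# video_paths, labels are placeholders (1 = fight, 0 = no fight):
# X = np.stack([motion_descriptor(p) for p in video_paths])
# clf = RandomForestClassifier().fit(X, labels)
```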
- Book Chapter
6
- 10.1007/978-3-030-05921-7_29
- Jan 1, 2019
Facial Expression Recognition is one of the most active areas of research in computer vision. However, existing approaches lack generalizability, and almost all studies ignore the effects of facial attributes, such as age, on expression recognition, even though research indicates that facial expression manifestation varies with age. Recently, much progress has been made in this topic, and great improvements in classification have been achieved with the emergence of deep learning methods. Such approaches make it possible to avoid classical hand-designed feature extraction methods, which generally rely on manual operations with labelled data. In the present work, a deep learning approach that utilizes Convolutional Neural Networks (CNNs) to automatically extract features from facial images is evaluated on a benchmark dataset (FACES), the only one in the literature that also contains labelled facial expressions performed by ageing adults. As baselines for comparison, two traditional machine learning approaches using handcrafted features are evaluated on the same dataset. Our experiments show that the CNN-based approach is very effective for expression recognition performed by ageing adults, significantly improving on the baseline approaches by a margin of at least 8%.
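As a sketch of the kind of CNN evaluated in such work, the model below extracts features automatically from grayscale face crops; the depth, channel widths, and six-class output (matching the FACES expression categories) are illustrative assumptions, not the chapter's architecture.

```python
# Minimal expression-recognition CNN sketch: convolutional feature extractor
# followed by a linear classifier over expression classes.
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),         # features learned, not handcrafted
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: batch of grayscale face crops
        return self.classifier(self.features(x).flatten(1))

model = ExpressionCNN()
print(model(torch.randn(4, 1, 96, 96)).shape)  # torch.Size([4, 6])
```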
- Research Article
85
- 10.1016/j.jvcir.2016.11.008
- Nov 17, 2016
- Journal of Visual Communication and Image Representation
Integration of wavelet transform, Local Binary Patterns and moments for content-based image retrieval