Recent text-to-image models have advanced substantially and can now generate new images from personalized datasets. However, for objects whose structures vary and whose patterns are not uniform even within a single category, such as furniture, the generated images still fall short of preserving the detailed information of the input images. This study introduces a novel method to enhance the quality of the results produced by text-to-image models. The method applies mask preprocessing with an image pyramid-based salient object detection (SOD) model, incorporates visual information into the input prompts using concept image embeddings and a CNN local feature extractor, and adds a filtering step based on similarity measures. With this approach, we observed both visual improvements and quantitative gains in CLIP text-alignment and DINO scores, indicating that the generated images follow the text prompts more closely and reflect the input images' details more accurately. The significance of this research lies in addressing one of the prevailing challenges in personalized image generation: consistently and accurately representing the detailed characteristics of the input images in the output. By enhancing textual prompts with visual information and additional local features, and by removing unnecessary regions with the SOD mask, the method enables more realistic visualizations and can benefit fields that prioritize the accuracy of visual data.
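The following is a minimal sketch of how the similarity-based filtering step could be realized, assuming generated candidates and a reference concept image are available as PIL images; the checkpoint names ("openai/clip-vit-base-patch32", "facebook/dino-vits16"), the helper functions, and the thresholds are illustrative assumptions, not the exact settings used in this work.

```python
# Illustrative sketch: filter generated candidates by CLIP text alignment and
# DINO similarity to the concept image. Thresholds and checkpoints are assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTModel, ViTImageProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino_model = ViTModel.from_pretrained("facebook/dino-vits16").eval()
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_text_alignment(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of a generated image and the prompt."""
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = clip_model(**inputs)
    return F.cosine_similarity(out.image_embeds, out.text_embeds).item()

@torch.no_grad()
def dino_image_similarity(image_a: Image.Image, image_b: Image.Image) -> float:
    """Cosine similarity between DINO [CLS] features of two images."""
    feats = []
    for img in (image_a, image_b):
        pixels = dino_proc(images=img, return_tensors="pt")
        feats.append(dino_model(**pixels).last_hidden_state[:, 0])
    return F.cosine_similarity(feats[0], feats[1]).item()

def filter_candidates(candidates, prompt, concept_image, clip_thr=0.25, dino_thr=0.6):
    """Keep generations that both follow the prompt and stay close to the concept image."""
    return [
        img for img in candidates
        if clip_text_alignment(img, prompt) >= clip_thr
        and dino_image_similarity(img, concept_image) >= dino_thr
    ]
```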