Abstract

Background
The precise assessment of the aortic valve via echocardiography is critical for the early detection and management of aortic valve diseases. Previous studies have examined machine learning models that estimate individual measurements and the severity of aortic stenosis (AS) from echocardiographic images. While precise within their narrow focus, these image-processing algorithms fall short of the holistic, interconnected clinical judgment that human echocardiographers apply when producing a qualitative report. Large language models (LLMs), particularly image-to-text multimodal LLMs, are a fundamental advance in deep learning with implications for a host of applications in medical imaging. They promise to capture not only the discrete data points typical of traditional machine learning but also the complex contextual interrelations of clinical diagnosis.

Methods
In this study, a large-scale, heterogeneous database of 90,681 echocardiographic studies with textual descriptors of the aortic valve was used to train a single image-to-text multimodal LLM, ValveVision AI. The ground-truth textual summaries were drafted by level III echocardiographers in a clinical setting between 2015 and 2020. The model was retrospectively assessed on a holdout dataset, and BLEU and ROUGE scores were calculated. Reviewing physicians compared each generated summary against the ground truth and made a binary decision to accept or reject it. Receiver operating characteristic (ROC) curves for distinct pathologies were also assessed (Figure I).

Results
ValveVision AI achieved a BLEU score of 0.45 and a ROUGE score of 0.49. In classifying moderate/severe versus none/mild AS under the validation protocol described above, the model achieved a specificity of 91.98% and a sensitivity of 83.89%, along with more precise qualitative descriptions. Qualitatively, the model exhibited zero-shot capability in certain instances; however, this result remains an area for further exploration.

Conclusion
To our knowledge, this study represents the first attempt to use an image-representation-to-text-tokenizer deep learning architecture to mimic the reasoning and subtlety of qualitative echocardiographic analysis of the aortic valve. The results suggest that this multimodal LLM is sufficiently accurate to generate a preliminary textual summary of the aortic valve that, if paired with a point-of-care ultrasound (POCUS) device in a primary care setting, may facilitate case triage, increase efficiency, and help determine a more precise care pathway for patients.
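For readers who want a concrete sense of the evaluation described in the Methods, the sketch below shows how the reported metrics could be computed. The abstract does not specify the BLEU/ROUGE configuration or tooling, so the n-gram weights, the ROUGE-L variant, and the library choices (NLTK, rouge-score, scikit-learn) are illustrative assumptions, not the study's actual pipeline; the data at the bottom is a placeholder, not study data.

    # Hypothetical evaluation sketch, not the authors' code. Assumes NLTK
    # sentence-level BLEU-4, ROUGE-L f-measure from the rouge-score package,
    # and sensitivity/specificity from a binary confusion matrix for the
    # moderate/severe (1) vs none/mild (0) AS classification task.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer
    from sklearn.metrics import confusion_matrix

    def text_metrics(references, hypotheses):
        """Average BLEU and ROUGE-L over paired (ground-truth, generated) summaries."""
        smooth = SmoothingFunction().method1  # avoids zero BLEU on short summaries
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        bleu = sum(
            sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
            for ref, hyp in zip(references, hypotheses)
        ) / len(references)
        rouge = sum(
            scorer.score(ref, hyp)["rougeL"].fmeasure
            for ref, hyp in zip(references, hypotheses)
        ) / len(references)
        return bleu, rouge

    def sens_spec(y_true, y_pred):
        """Sensitivity and specificity for binary AS severity labels."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return tp / (tp + fn), tn / (tn + fp)

    # Toy usage on placeholder data:
    refs = ["severe aortic stenosis with heavily calcified leaflets"]
    hyps = ["severe aortic stenosis with calcified leaflets"]
    print(text_metrics(refs, hyps))
    print(sens_spec([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))

Averaging per-summary scores is one reasonable reading of "BLEU and ROUGE scores were calculated"; a corpus-level BLEU (nltk's corpus_bleu) would be an equally plausible choice.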