Introduction: Echocardiography is one of the most frequently performed cardiac studies, yet its reports are usually stored in heterogeneous, non-standardized formats and are not readily available for analysis, creating barriers to phenotyping in prospective electronic medical record (EMR)-based cardiovascular research initiatives.

Methods: We extracted echocardiogram report data for 73,986 patients at Houston Methodist Hospital between January 1, 2016 and March 31, 2021, for merging with the ongoing prospective Houston Methodist CVD Learning Health System registry. Using Structured Query Language (SQL), we implemented an Extract, Transform, Load (ETL) data pipeline to extract and label data for 7 commonly used echocardiogram parameters: diastolic function (DF), ejection fraction (EF), severities of mitral regurgitation (MR), mitral stenosis (MS), aortic regurgitation (AR), and aortic stenosis (AS), and AS classification per gradient characteristics.

Results: The ETL pipeline successfully transformed and loaded data for the identified echocardiogram parameters into the data warehouse. A total of 37 labels for MR, MS, AR, AS, DF, and AS classification, along with the distribution of EF, were generated. Using a 2% random sample (n=1,352), the program's classification results were compared against the source study files, and the SQL program was debugged over 7 iterations until manual validation showed 100% accuracy. For the latest echo reports of all 73,986 patients, 3 machine learning models (support vector machine, XGBoost, and random forest classifiers) were trained on 67% of the data to test the accuracy of the rule-based SQL labels on the remaining 33%. All models reproduced the rule-based SQL labels with 100% accuracy.

Conclusion: A well-built, locally developed ETL data pipeline removes barriers to semantically annotating, integrating, and accurately curating large-scale, untapped echocardiographic data, with significant potential to scale and support clinical and research registries.
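The rule-based labelling step described in Methods can be illustrated with a minimal Python sketch. The report phrases, label names, and pattern ordering below are hypothetical assumptions for illustration; the study's actual SQL CASE logic and report vocabulary are not given in the abstract.

```python
import re

# Hypothetical severity phrases for mitral regurgitation (MR); the actual
# report vocabulary and SQL rules used in the study are not shown here.
# Patterns are checked most-severe first so a compound phrase is not
# shadowed by a weaker match.
MR_RULES = [
    (r"\bsevere mitral regurgitation\b", "MR_severe"),
    (r"\bmoderate mitral regurgitation\b", "MR_moderate"),
    (r"\bmild mitral regurgitation\b", "MR_mild"),
    (r"\btrace mitral regurgitation\b", "MR_trace"),
    (r"\bno mitral regurgitation\b", "MR_none"),
]

def label_mr(report_text: str) -> str:
    """Map free-text echo report narrative to a discrete MR severity label."""
    text = report_text.lower()
    for pattern, label in MR_RULES:
        if re.search(pattern, text):
            return label
    return "MR_not_reported"
```

In the study this mapping was expressed in SQL inside the ETL pipeline; a Python version of the same rules is convenient for unit-testing the label definitions before committing them to the warehouse.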
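The 2% random-sample quality check in Results could be sketched as follows, using only the standard library. The record structure and helper names (`draw_validation_sample`, `validation_accuracy`) are illustrative assumptions, not the study's actual code.

```python
import random

def draw_validation_sample(record_ids, fraction=0.02, seed=None):
    """Draw a random fraction of report IDs for manual chart review."""
    k = round(len(record_ids) * fraction)
    rng = random.Random(seed)
    return rng.sample(list(record_ids), k)

def validation_accuracy(auto_labels, manual_labels, sample_ids):
    """Fraction of sampled reports where the pipeline-derived label
    agrees with the manually reviewed label."""
    matches = sum(auto_labels[i] == manual_labels[i] for i in sample_ids)
    return matches / len(sample_ids)
```

In the study this comparison drove 7 debugging iterations of the SQL program; any sampled report whose automatic label disagreed with manual review would prompt a rule fix, after which a fresh sample could be drawn.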