Study Objectives
Point-of-care ultrasound (POCUS) can detect sonographic features of COVID-19 on lung ultrasound, including B-lines, thickened or irregular pleural lines, subpleural consolidations, and effusions. However, there is still a need to standardize the classification and severity rating of these diverse findings. The purpose of this study was to develop, through multicenter expert consensus, a severity rating scale for lung ultrasound images collected from patients with COVID-19, and to test its inter-rater reliability.

Methods
The severity rating scale was developed by a group of ten POCUS-trained emergency physicians from three academic institutions through literature review, expert opinion, pilot testing, and iterative refinement of the tool. The scale was refined over eight one-hour consensus-building discussions based on challenging cases from three smaller-sample rater studies. The final scale consisted of ordinal scores ranging from 0 to 4 for each of five sonographic findings: B-lines, pleural line abnormalities, consolidations, pleural effusions, and overall lung aeration. Lung POCUS clips from adult patients with COVID-19 were selected from a database of prospectively collected ultrasound exams curated at two academic hospitals. Ultrasound clips were acquired from 14 zones (two anterior, two lateral, and three posterior zones on each side of the chest) using a handheld C5-2 curvilinear transducer on a lung preset. Using the refined scale, ten blinded reviewers independently rated the selected clips in Web-based annotation software. We analyzed the ratings to determine inter-rater agreement based on the intraclass correlation coefficient (ICC) and the linear-weighted Krippendorff’s alpha statistic (α).

Results
We acquired 11,041 cine clips from 220 patients with lower respiratory tract symptoms who were suspected to have COVID-19. Sixty-two patients were excluded due to negative COVID-19 tests, and an additional 40 patients were excluded because their exams were incomplete or were performed with an incorrect preset or a different transducer. A research investigator independently completed pre-ratings for the remaining 4,115 clips using the refined scale. We then applied stratified random sampling to select one clip per patient, resulting in a dataset of 118 cine clips with high pathological burden. After severity ratings were completed on the first 30 clips of the dataset, we held a final discussion session with a case-by-case review. For the subsequent ratings of the remaining 88 clips, the average ICC was 0.80 across the five sonographic findings (0.85 for B-lines, 0.68 for pleural line abnormalities, 0.79 for consolidations, 0.88 for pleural effusions, and 0.81 for overall lung aeration). A similar trend in rater agreement was seen based on α (average 0.69 across the five sonographic findings). Strong improvements in rater agreement were observed with each successive review session and rater study (Figure 1).

Conclusion
We achieved good inter-rater agreement with our lung POCUS severity scoring system established by expert consensus. This severity scale will be used in future studies for training machine learning algorithms and could be utilized clinically for longitudinal assessment of COVID-19 severity.

Disclosure
The authors have interests to disclose.
Grant Support: Funding and technical support for this work is provided by the Biomedical Advanced Research and Development Authority (BARDA), under the Assistant Secretary for Preparedness and Response (ASPR), within the U.S. Department of Health and Human Services (HHS), under ongoing USG Contract No. 75A50120C00097. For more information about BARDA, refer to https://www.medicalcountermeasures.gov/.
Grant Support: All authors have current research partnerships with Philips Healthcare North America.
Employee: Alvin Chen is an employee of Philips Research North America.
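For readers who want to see how an ICC of the kind reported above can be computed from a clips-by-raters table, the following is a minimal sketch. The abstract does not state which ICC form was used; the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) in the Shrout and Fleiss formulation. The function name `icc2_1` and the toy ratings are hypothetical and are not study data. The last lines simply average the five per-finding ICCs quoted in the Results.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per clip and one column per rater.
    Assumption: this ICC form is chosen for illustration only; the
    abstract does not say which form the authors used.
    """
    n = len(ratings)       # number of clips (subjects)
    k = len(ratings[0])    # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares (no replication)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between clips
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical 0-4 severity scores: 5 clips rated by 3 raters (not study data)
toy_ratings = [
    [0, 0, 1],
    [2, 2, 2],
    [4, 3, 4],
    [1, 2, 1],
    [3, 3, 4],
]
toy_icc = icc2_1(toy_ratings)
print(f"toy ICC(2,1) = {toy_icc:.2f}")

# Per-finding ICCs as reported in the Results
reported_icc = {
    "B-lines": 0.85,
    "pleural line abnormalities": 0.68,
    "consolidations": 0.79,
    "pleural effusions": 0.88,
    "overall lung aeration": 0.81,
}
average_icc = mean(reported_icc.values())
print(f"average reported ICC = {average_icc:.3f}")
```

The unweighted mean of the five reported values is 0.802, which is consistent with the average of 0.80 quoted in the Results. The linear-weighted Krippendorff’s α is not sketched here because the abstract does not specify its weighting or its handling of missing ratings.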