Abstract
Background: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.
Objective: This paper empirically analyses whether text in medical discharge reports follows Zipf’s law, a commonly assumed statistical property of language where word frequency follows a discrete power-law distribution.
Method: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.
Result: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, in that the truncated power-law provides a superior fit over a pure power-law.
Conclusion: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
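The first steps of the method (tokenising, counting token frequency, and fitting a power-law exponent) can be illustrated with a minimal sketch. This is not the authors' pipeline: the regular-expression tokeniser, the function names `token_frequencies` and `approx_powerlaw_alpha`, and the choice of `xmin` are assumptions made for illustration. The exponent is estimated with a widely used closed-form approximation to the discrete power-law maximum-likelihood estimator, which is known to be less accurate for small `xmin`. Comparing against lognormal, exponential, stretched exponential, and truncated power-law alternatives would require a dedicated fitting library and is not shown.

```python
import math
import re
from collections import Counter


def token_frequencies(texts):
    """Count tokens across an iterable of report strings.

    Assumption: tokens are lowercase runs of letters or digits; the paper's
    actual tokenisation rules are not stated in the abstract.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def approx_powerlaw_alpha(values, xmin=1):
    """Approximate MLE of the exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over all x_i >= xmin.
    Assumption: xmin is chosen by the caller rather than estimated from data.
    """
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(v / (xmin - 0.5)) for v in tail)
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Toy stand-in for discharge reports; real MIMIC-III text is not included.
    reports = [
        "The patient was discharged. The patient is stable.",
        "Patient discharged home in stable condition.",
    ]
    counts = token_frequencies(reports)
    # Rank-frequency list: most frequent token first.
    for rank, (token, freq) in enumerate(counts.most_common(5), start=1):
        print(rank, token, freq)
    print("alpha estimate:", approx_powerlaw_alpha(list(counts.values())))
```

On a corpus this small the estimate is not meaningful; the sketch only shows the shape of the computation. A goodness-of-fit comparison between candidate distributions, as described in the abstract, would be the next step.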