Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/129742
Type: Journal article
Title: Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports
Author: Quiroz, J.C.
Laranjo, L.
Tufanaru, C.
Kocaballi, A.B.
Rezazadegan, D.
Berkovsky, S.
Coiera, E.
Citation: International Journal of Medical Informatics, 2020; 145:104324-1-104324-9
Publisher: Elsevier
Issue Date: 2020
ISSN: 1386-5056
1872-8243
Statement of Responsibility: Juan C Quiroz, Liliana Laranjo, Catalin Tufanaru, Ahmet Baki Kocaballi, Dana Rezazadegan, Shlomo Berkovsky, Enrico Coiera
Abstract: BACKGROUND: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions. OBJECTIVE: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language where word frequency follows a discrete power-law distribution. METHOD: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data. RESULT: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, with the truncated power-law providing a superior fit over a pure power-law. CONCLUSION: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
Keywords: Data mining
MIMIC-III dataset
Machine learning
Maximum likelihood estimation
Power-law with exponential cut-off
Statistical distributions
Rights: © 2020 Elsevier B.V. All rights reserved.
DOI: 10.1016/j.ijmedinf.2020.104324
Grant ID: http://purl.org/au-research/grants/nhmrc/1134919
Appears in Collections: Aurora harvest 4
Computer Science publications
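
The first three steps named in the abstract (splitting text into tokens, counting token frequency, and fitting a power law) can be illustrated with a minimal sketch. This is not the authors' code: the function names `token_frequencies` and `estimate_alpha`, the regular-expression tokenizer, and the example notes are all assumptions made for illustration. The exponent is estimated with the closed-form approximation to the discrete maximum likelihood estimator described by Clauset, Shalizi and Newman (2009), which is known to be rough for small `xmin`. The comparison against lognormal, exponential, stretched exponential, and truncated power-law alternatives is not shown.

```python
import math
import re
from collections import Counter


def token_frequencies(texts):
    """Lower-case each text, split on non-alphanumeric characters, count tokens.

    The tokenizer is an illustrative assumption, not the one used in the paper.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def estimate_alpha(frequencies, xmin=1):
    """Approximate MLE of the exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x / (xmin - 0.5))) over all x >= xmin.
    The approximation is coarse when xmin is small.
    """
    tail = [x for x in frequencies if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Made-up snippets for demonstration only; not MIMIC-III data.
    notes = [
        "Patient admitted with chest pain. Pain resolved after treatment.",
        "Patient discharged home. No chest pain on discharge.",
    ]
    counts = token_frequencies(notes)
    # Rank-frequency view: most common tokens first.
    for rank, (token, freq) in enumerate(counts.most_common(5), start=1):
        print(rank, token, freq)
    print("alpha estimate:", estimate_alpha(counts.values(), xmin=1))
```

On a real corpus, `xmin` would normally be chosen from the data rather than fixed at 1, and the fitted power law would then be compared against the alternative distributions, for example with likelihood-ratio tests.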

Files in This Item:
There are no files associated with this item.

