Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/129742
Full metadata record
DC Field | Value | Language
dc.contributor.author | Quiroz, J.C. | -
dc.contributor.author | Laranjo, L. | -
dc.contributor.author | Tufanaru, C. | -
dc.contributor.author | Kocaballi, A.B. | -
dc.contributor.author | Rezazadegan, D. | -
dc.contributor.author | Berkovsky, S. | -
dc.contributor.author | Coiera, E. | -
dc.date.issued | 2020 | -
dc.identifier.citation | International Journal of Medical Informatics, 2020; 145:104324-1-104324-9 | -
dc.identifier.issn | 1386-5056 | -
dc.identifier.issn | 1872-8243 | -
dc.identifier.uri | http://hdl.handle.net/2440/129742 | -
dc.description.abstract | BACKGROUND: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions. OBJECTIVE: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language where word frequency follows a discrete power-law distribution. METHOD: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data. RESULT: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, with the truncated power-law providing superior fits over a pure power-law. CONCLUSION: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior. | -
dc.description.statementofresponsibility | Juan C Quiroz, Liliana Laranjo, Catalin Tufanaru, Ahmet Baki Kocaballi, Dana Rezazadegan, Shlomo Berkovsky, Enrico Coiera | -
dc.language.iso | en | -
dc.publisher | Elsevier | -
dc.rights | © 2020 Elsevier B.V. All rights reserved. | -
dc.source.uri | http://dx.doi.org/10.1016/j.ijmedinf.2020.104324 | -
dc.subject | Data mining | -
dc.subject | MIMIC-III dataset | -
dc.subject | Machine learning | -
dc.subject | Maximum likelihood estimation | -
dc.subject | Power-law with exponential cut-off | -
dc.subject | Statistical distributions | -
dc.title | Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports | -
dc.type | Journal article | -
dc.identifier.doi | 10.1016/j.ijmedinf.2020.104324 | -
dc.relation.grant | http://purl.org/au-research/grants/nhmrc/1134919 | -
pubs.publication-status | Published | -
dc.identifier.orcid | Tufanaru, C. [0000-0002-3457-8770] | -
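
The first steps of the method summarised in the abstract above (splitting reports into tokens, counting token frequency, and fitting a power law) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the regex tokeniser, the toy reports, and the choice xmin=1 are all assumptions. The exponent is estimated with the commonly used closed-form approximation to the discrete power-law maximum likelihood estimate, which tends to be rough for small xmin. Comparison against the alternative distributions named in the abstract is not attempted here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def token_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def estimate_alpha(values, xmin=1):
    """Approximate discrete power-law MLE exponent for values >= xmin.

    Uses alpha ~= 1 + n / sum(ln(x / (xmin - 0.5))), a closed-form
    approximation rather than an exact numerical fit.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no values at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy stand-in for discharge report text (hypothetical, not MIMIC-III data).
reports = [
    "patient admitted with chest pain and discharged home",
    "patient discharged with follow up for chest pain",
    "patient stable and discharged",
]

freqs = token_frequencies(reports)
ranked = freqs.most_common()  # tokens sorted by descending frequency
alpha = estimate_alpha(list(freqs.values()), xmin=1)

print(ranked[:3])
print(round(alpha, 2))
```

On a real corpus, the fitted exponent would only be a starting point; judging whether a truncated power-law or lognormal fits better requires likelihood-based comparison of the candidate distributions, which this sketch leaves out.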
Appears in Collections: Aurora harvest 4
Computer Science publications

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.