Simple models are all you need: Ensembling stylometric, part-of-speech, and information-theoretic models for the ALTA 2024 Shared Task

Files

hdl_148276.pdf (1.39 MB)
  (Published version)

Date

2024

Authors

Thomas, J.
Hoang, G.B.
Mitchell, L.

Type

Conference paper

Citation

Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association (ALTA 2024), 2024, pp. 207-212

Statement of Responsibility

Joel Thomas, Gia Bao Hoang and Lewis Mitchell

Conference Name

22nd Annual Workshop of the Australasian Language Technology Association (ALTA) (2 Dec 2024 - 4 Dec 2024 : Canberra, Australia)

Abstract

The ALTA 2024 shared task concerned automated detection of AI-generated text. Large language models (LLMs) were used to generate hybrid documents, where individual sentences were authored by either humans or a state-of-the-art LLM. Rather than rely on similarly computationally expensive tools like transformer-based methods, we decided to approach this task using only an ensemble of lightweight “traditional” methods that could be trained on a standard desktop machine. Our approach used models based on word counts, stylometric features, readability metrics, part-of-speech tagging, and an information-theoretic entropy estimator to predict authorship. These models, combined with a simple weighting scheme, performed well on a held-out test set, achieving an accuracy of 0.855 and a kappa score of 0.695. Our results show that relatively simple, interpretable models can perform effectively at tasks like authorship prediction, even on short texts, which is important for democratisation of AI as well as future applications in edge computing.
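The paper itself details the exact models and weighting scheme; as a purely illustrative sketch (all function names, features, and weights below are hypothetical, not taken from the paper), a lightweight pipeline of the kind the abstract describes could combine hand-crafted stylometric features, a character-entropy estimate, and a fixed weighted vote:

```python
import math
from collections import Counter

def stylometric_features(sentence):
    """Toy stylometric features: word count, mean word length, type-token ratio."""
    words = sentence.split()
    n = len(words)
    mean_len = sum(len(w) for w in words) / n if n else 0.0
    ttr = len(set(w.lower() for w in words)) / n if n else 0.0
    return [n, mean_len, ttr]

def char_entropy(sentence):
    """Empirical Shannon entropy (bits) of the sentence's character distribution."""
    counts = Counter(sentence)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_ensemble(scores, weights):
    """Combine per-model 'probability AI-generated' scores with fixed weights.

    Returns 1 (AI-authored) if the weighted mean crosses 0.5, else 0 (human).
    """
    combined = sum(w * p for w, p in zip(weights, scores)) / sum(weights)
    return 1 if combined >= 0.5 else 0
```

Each component model would map its features (or entropy estimate) to a score in [0, 1]; `weighted_ensemble` then makes the final sentence-level call. Everything here runs on the standard library, in keeping with the desktop-trainable spirit of the approach.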

Rights

©2024 Association for Computational Linguistics. Materials published in or after 2016 are licensed under a Creative Commons Attribution 4.0 International License.
