Data Quality for Data-Driven Software Vulnerability Analysis
dc.contributor.advisor | Babar, M. Ali | |
dc.contributor.advisor | Jayatilaka, Asangi | |
dc.contributor.author | Croft, Roland Lloyd | |
dc.contributor.school | School of Computer and Mathematical Sciences | en |
dc.date.issued | 2023 | |
dc.description.abstract | Software vulnerabilities enable malicious actors to exploit security weaknesses of a software system, potentially causing enormous damages for organisations. Detection and prevention of software vulnerabilities is vital to achieve software security. However, software vulnerability discovery is a difficult task due to the high expertise and labor requirements. Consequently, many developers rely on tool support for achieving software security. Traditional tools that rely on rule-based techniques fail to make inroads in software security however, due to their high false-positive rates and lack of scalability. Data-driven methods that utilise artificial intelligence have promising capabilities for automated software vulnerability discovery. A properly trained model can effectively learn the underlying patterns of software vulnerabilities and provide classifications efficiently on incoming source code modules. However, such models require sufficiently large and high-quality datasets to learn from. Data preparation for software vulnerability datasets is not a trivial task, due to the scarcity, lacking documentation, and sensitivity of associated software vulnerability data. Consequently, we observe that data preparation challenges are currently ill-considered and overlooked for data-driven software vulnerability analysis, and current datasets are usually of poor quality. These data challenges prevent software vulnerability prediction models from satisfying industrial applications. We have made the following contributions towards improving data quality for data-driven software vulnerability analysis. Firstly, we have benchmarked data-driven software vulnerability analysis approaches in comparison to traditional rule-based tools. This investigation yielded insights into the relative strengths and weaknesses of each approach, but we particularly observed that the promising capabilities of learning-based models were inhibited by their data requirements. Secondly, we provided a systematized view of software vulnerability data preparation practices for software vulnerability prediction. Through a systematic literature review, we uncovered a taxonomy of 16 data preparation challenges which act as obstacles towards achieving practical software vulnerability prediction. Thirdly, we conducted formal assessment of the data quality for state-of-the-art vulnerability datasets using five inherent data quality attributes. This research provided measurement of data insufficiencies and demonstrated their impact to inspire the need for data improvement. Furthermore, we also investigated inconsistency stemming from original vulnerability data sources. Finally, we proposed a technique for training robust vulnerability prediction models that can leverage noisy training datasets to still provide effective predictions. Our proposed method can circumvent some potentially unsolvable issues of software vulnerability datasets. We expect our contributions to help unlock the potential of software vulnerability data by improving dataset quality and use. These efforts in turn enable effective and practical applications for data-driven software vulnerability analysis. | en |
dc.description.dissertation | Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2023 | en |
dc.identifier.uri | https://hdl.handle.net/2440/139386 | |
dc.language.iso | en | en |
dc.provenance | This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals. | en |
dc.subject | Cybersecurity; Machine Learning; Software Vulnerability; Data Quality | en |
dc.title | Data Quality for Data-Driven Software Vulnerability Analysis | en |
dc.type | Thesis | en |
Files
Original bundle
- Name: Croft2023_PhD.pdf
- Size: 10.52 MB
- Format: Adobe Portable Document Format
- Description: Thesis