Image captioning and visual question answering based on attributes and external knowledge

Wu, Q.; Shen, C.; Wang, P.; Dick, A.; van den Hengel, A.

doi:10.1109/TPAMI.2017.2708709

Files

hdl_115071.pdf (45.68 MB)

(Accepted version)

Date

2018

Authors

Wu, Q.

Shen, C.

Wang, P.

Dick, A.

van den Hengel, A.

Type

Journal article

Citation

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018; 40(6):1367-1381

Statement of Responsibility

Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel

DOI

10.1109/TPAMI.2017.2708709

Abstract

Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. It particularly allows questions to be asked where the image alone does not contain the information required to select the appropriate answer. Our final model achieves the best reported results for both image captioning and visual question answering on several of the major benchmark datasets.

Grant ID

http://purl.org/au-research/grants/arc/FT120100969

Published Version

https://doi.org/10.1109/tpami.2017.2708709

Persistent link to this record

http://hdl.handle.net/2440/115071
