Graph-structured representations for visual question answering

dc.contributor.author: Teney, D.
dc.contributor.author: Liu, L.
dc.contributor.author: van den Hengel, A.
dc.contributor.conference: 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) (21 Jul 2017 - 26 Jul 2017 : Honolulu, HI)
dc.date.issued: 2017
dc.description.abstract: This paper proposes to improve visual question answering (VQA) with structured representations of both scene contents and questions. A key challenge in VQA is that it requires joint reasoning over the visual and text domains. The predominant CNN/LSTM-based approach to VQA is limited by monolithic vector representations that largely ignore structure in the scene and in the question. CNN feature vectors cannot effectively capture situations as simple as multiple object instances, and LSTMs process questions as series of words, which does not reflect the true complexity of language structure. We instead propose to build graphs over the scene objects and over the question words, and we describe a deep neural network that exploits the structure in these representations. We show that this approach achieves significant improvements over the state of the art, increasing accuracy from 71.2% to 74.4% on the abstract scenes multiple-choice benchmark, and from 34.7% to 39.1% over pairs of balanced scenes, i.e. images with fine-grained differences and opposite yes/no answers to the same question.
dc.description.statementofresponsibility: Damien Teney, Lingqiao Liu, Anton van den Hengel
dc.identifier.citation: Proceedings / CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2017, vol.2017-January, pp.3233-3241
dc.identifier.doi: 10.1109/CVPR.2017.344
dc.identifier.isbn: 9781538604588
dc.identifier.issn: 1063-6919
dc.identifier.orcid: Teney, D. [0000-0003-2130-6650]
dc.identifier.orcid: van den Hengel, A. [0000-0003-3027-8364]
dc.identifier.uri: http://hdl.handle.net/2440/111388
dc.language.iso: en
dc.publisher: IEEE
dc.publisher.place: Online
dc.relation.ispartofseries: IEEE Conference on Computer Vision and Pattern Recognition
dc.rights: © 2017 IEEE
dc.source.uri: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=8097368
dc.title: Graph-structured representations for visual question answering
dc.type: Conference paper
pubs.publication-status: Published
