Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/108774
Full metadata record
dc.contributor.author: Wang, P.
dc.contributor.author: Cao, Y.
dc.contributor.author: Shen, C.
dc.contributor.author: Liu, L.
dc.contributor.author: Shen, H.T.
dc.date.issued: 2016
dc.identifier.citation: IEEE Transactions on Circuits and Systems for Video Technology, 2016; 27(99):1-8
dc.identifier.issn: 1051-8215
dc.identifier.issn: 1558-2205
dc.identifier.uri: http://hdl.handle.net/2440/108774
dc.description.abstract: Encouraged by the success of convolutional neural networks (CNNs) in image classification, much recent effort has been spent on applying CNNs to video-based action recognition. One challenge is that a video contains a varying number of frames, which is incompatible with the standard input format of CNNs. Existing methods handle this issue either by directly sampling a fixed number of frames or by introducing a 3D convolutional layer that conducts convolution in the spatial-temporal domain. In this paper we propose a novel network structure that accepts an arbitrary number of frames as input. The key to our solution is a module consisting of an encoding layer and a temporal pyramid pooling layer. The encoding layer maps the activations from previous layers to a feature vector suitable for pooling, while the temporal pyramid pooling layer converts multiple frame-level activations into a fixed-length video-level representation. In addition, we adopt a feature concatenation layer that combines appearance and motion information. Compared with the frame-sampling strategy, our method avoids the risk of missing important frames. Compared with the 3D convolutional method, which requires a huge video dataset for network training, our model can be learned on a small target dataset because we can leverage an off-the-shelf image-level CNN for model parameter initialization. Experiments on three challenging datasets, Hollywood2, HMDB51, and UCF101, demonstrate the effectiveness of the proposed network. (An illustrative pooling sketch follows this metadata record.)
dc.description.statementofresponsibility: Peng Wang, Yuanzhouhan Cao, Chunhua Shen, Lingqiao Liu, Heng Tao Shen
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers
dc.rights: © IEEE
dc.source.uri: http://dx.doi.org/10.1109/tcsvt.2016.2576761
dc.subject: Temporal pyramid pooling; action recognition; convolutional neural network
dc.title: Temporal pyramid pooling based convolutional neural network for action recognition
dc.type: Journal article
dc.identifier.doi: 10.1109/TCSVT.2016.2576761
pubs.publication-status: Published
dc.identifier.orcid: Shen, C. [0000-0002-8648-8718]
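
The abstract above describes a temporal pyramid pooling layer that converts an arbitrary number of frame-level activations into a fixed-length video-level representation. The following is a minimal PyTorch sketch of that idea only; it is not the authors' code, and the pooling operator (max) and pyramid levels (1, 2, 4) are assumptions made for illustration.

import torch

def temporal_pyramid_pool(frame_feats: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    """Illustrative sketch, not the paper's implementation.

    frame_feats: (T, D) tensor of T frame-level feature vectors.
    Returns a fixed-length vector of size D * sum(levels), independent of T.
    """
    T, D = frame_feats.shape
    pooled = []
    for n_bins in levels:
        # Split the T frames into n_bins contiguous temporal segments.
        bounds = torch.linspace(0, T, n_bins + 1).long()
        for b in range(n_bins):
            lo = bounds[b].item()
            hi = max(bounds[b + 1].item(), lo + 1)  # guard against empty segments
            # Max-pool (an assumed choice) within each temporal segment.
            pooled.append(frame_feats[lo:hi].max(dim=0).values)
    return torch.cat(pooled)

# Example: 37 frames of 256-D activations -> fixed 256 * (1 + 2 + 4) = 1792-D vector.
video_repr = temporal_pyramid_pool(torch.randn(37, 256))
print(video_repr.shape)  # torch.Size([1792])

Because the segment boundaries scale with the frame count T, the output dimensionality depends only on the feature size D and the pyramid levels, which is what makes a varying-length video compatible with a fixed-size classifier input.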
Appears in Collections:Aurora harvest 8
Electrical and Electronic Engineering publications

Files in This Item:
File: RA_hdl_108774.pdf
Description: Restricted Access
Size: 6.02 MB
Format: Adobe PDF

