Design and implement a neural network that recognizes actions within first-person cooking videos.
Figure 1. Recognition results for one exemplary input video. Throughout the video, which contains 1360 frames in total, a person is making a cheeseburger. Selected frames from the video are shown in the middle. Colors denote the predicted and ground-truth actions for each frame.
Action recognition is an important research problem that aims to classify human actions into different categories. It is critical for enhancing context-awareness in applications such as smart homes, the Internet of Things (IoT), and gaming. In this project, I build a model that extracts per-frame image features using ResNet-101 and captures temporal information using a bidirectional RNN. The RNN output is then fed into two fully connected layers followed by a softmax layer. The whole network architecture is illustrated in Figure 2 below.
Figure 2. The network architecture.
Figure 3. Training and validation history.