Recent advancements in Convolutional Neural Networks (CNNs), together with the abundance of image and video content across the web, have opened up opportunities to analyze and classify that content using deep learning and neural network architectures.

Significant advancements have been made in image analysis and classification algorithms, owing a great deal to publicly available datasets such as ImageNet. Production-level image analysis algorithms are becoming readily available in a suite of applications across e-commerce, interior design, healthcare, and other domains.

Following this success in image analysis, the next frontier is the classification and analysis of videos, and more specifically the recognition and classification of human activity within them. The primary challenge of video analysis remains understanding the spatiotemporal elements of a clip. Whereas an image is made up solely of spatial elements (the pixels and colors that occupy space within a frame), video poses the additional challenge of analyzing temporal elements: how the spatial content changes over time. For example, from a single image of a human and a door knob, image analysis algorithms can feasibly extract insights such as “Door”, “Human”, and “Human interacting with the door”. By analyzing subsequent frames and incorporating the motion of the door, however, temporal elements can indicate whether the action being performed is “closing of the door” or “opening of the door.”
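As a minimal sketch of this distinction (using PyTorch; the shapes and layer sizes are illustrative assumptions, not drawn from any particular model), a 2D convolution sees only the spatial layout of a single frame, while stacking frames along a time axis lets a 3D convolution respond to motion such as a door swinging open or closed:

```python
import torch
import torch.nn as nn

# A single frame carries only spatial information: (channels, height, width).
frame = torch.randn(1, 3, 224, 224)                  # batch of one RGB image
spatial = nn.Conv2d(3, 64, kernel_size=3, padding=1)(frame)

# Stacking 16 consecutive frames adds a temporal axis: (channels, time, H, W).
clip = torch.randn(1, 3, 16, 224, 224)               # a 16-frame RGB clip
spatiotemporal = nn.Conv3d(3, 64, kernel_size=3, padding=1)(clip)

print(spatial.shape)         # torch.Size([1, 64, 224, 224])      -- space only
print(spatiotemporal.shape)  # torch.Size([1, 64, 16, 224, 224])  -- space + time
```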


Three fundamental challenges exist in video classification today: annotated datasets, computational power, and efficient machine learning models.

Having a clean, accurate, and annotated dataset is a fundamental requirement of any supervised learning problem. Recent research in video classification has spun out multiple datasets in the past couple of years, most notably UCF-101, HMDB-51, Sports-1M, and Kinetics. These video datasets are the standard benchmarks for training video classification and activity recognition algorithms. Each presents a specific set of action classes with varying data quality. Most advancements in activity classification have been possible because of the existence of these datasets and the effort put in by their curators. Yet even with these comprehensive benchmarks, a large gap remains in collecting and annotating the full universe of human activity across environments and camera settings.
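Whichever benchmark is used, a supervised activity recognition dataset ultimately reduces to pairs of a clip and an action label. The following PyTorch-style sketch of that contract is purely illustrative (the class and its fields are hypothetical, and real loaders must also handle video decoding, frame sampling, and label vocabularies):

```python
import torch
from torch.utils.data import Dataset

class ActionClipDataset(Dataset):
    """Minimal (clip, label) contract for supervised activity recognition.

    `samples` is a list of (frames, label) tuples, where `frames` is a
    pre-decoded tensor of shape (channels, time, height, width) and `label`
    is an integer class index. Decoding and augmentation are omitted.
    """

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        frames, label = self.samples[idx]
        return frames, torch.tensor(label)
```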

The next challenge in developing state-of-the-art video analysis algorithms is the expensive computation required for training and inference. Whereas training an image analysis model on GPUs may take a couple of days, training a video classification model can take weeks or months given the amount of temporal data (video frames) in each clip, which increases both the cost and the time required to perform meaningful experiments on neural network architectures. With continual advancements in GPU technology and inference-optimization software such as NVIDIA's TensorRT, however, the barrier to training and deploying these models is lowering.
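A back-of-envelope calculation shows where that cost comes from. With purely illustrative numbers (assumptions, not measurements from any specific benchmark), treating every frame of every clip as input multiplies the data volume of an equally sized image dataset by a factor of hundreds:

```python
# Back-of-envelope comparison of image vs. video training data volume.
# All numbers below are illustrative assumptions, not measurements.
num_samples  = 1_000_000    # labeled examples in the dataset
fps          = 25           # frames per second in each clip
clip_seconds = 10           # average clip duration

image_frames = num_samples                       # one frame per example
video_frames = num_samples * fps * clip_seconds  # every frame is input

print(image_frames)  # 1000000
print(video_frames)  # 250000000 -- roughly 250x more frames to process
```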

Lastly, the design of efficient and effective neural network architectures is the holy grail of video analysis. The introduction of CNNs marked a great leap in image analysis and classification: they have empowered scientists to capture and analyze images at the per-pixel level, providing a comprehensive spatial analysis of an image. Current neural network architectures for video analysis rely heavily on CNNs because the spatial aspects of video resemble those of static images. The latest research in this field incorporates the temporal aspects of a video using techniques such as optical flow vectors, temporal information fusion, and recurrent models such as Long Short-Term Memory networks (LSTMs) and other Recurrent Neural Networks (RNNs). Coupled with CNNs, these approaches have made tremendous progress; the research by Karpathy et al. and by Karen Simonyan and Andrew Zisserman has paved the way for a new generation of spatiotemporal algorithms. Newer learning paradigms such as few-shot learning and meta-learning have also shown promise in dealing with data scarcity and model efficiency.
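As one concrete illustration of the CNN-plus-recurrence pattern described above (a deliberately small PyTorch sketch, not a reproduction of any published architecture; all layer sizes are arbitrary assumptions), per-frame CNN features can be pooled over space and then aggregated over time by an LSTM whose final hidden state drives the classifier:

```python
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    """Per-frame CNN features pooled over space, then an LSTM over time."""

    def __init__(self, num_classes, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.frame_encoder = nn.Sequential(           # spatial analysis per frame
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # -> (B*T, feat_dim, 1, 1)
        )
        self.temporal = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                          # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.frame_encoder(clip.flatten(0, 1))  # fold time into batch
        feats = feats.flatten(1).view(b, t, -1)         # -> (B, T, feat_dim)
        _, (h_n, _) = self.temporal(feats)              # final hidden state
        return self.head(h_n[-1])                       # per-clip class scores

# Example: 2 clips of 16 RGB frames at 112x112, 101 action classes.
logits = CnnLstmClassifier(num_classes=101)(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 101])
```

A two-stream variant would add a second CNN over stacked optical flow fields and fuse the two predictions, which is the direction taken by Simonyan and Zisserman.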


In recent years, progress in the field of video analysis and classification has been tremendous, driven by the curation of well-annotated datasets, the development of state-of-the-art GPUs, and a large body of high-quality papers published by industry thought leaders. Major technology companies have also developed their own proprietary video analysis APIs as a service, most notably Google AutoML Video Intelligence, Amazon Rekognition, and Microsoft Azure Cognitive Services. The race to enhance the quality of such algorithms continues through the curation of comprehensive datasets, advancements in processing technology, and state-of-the-art computer vision algorithms.