Object Detection Part 1: Basic Concepts

Let’s first take a look at what problem object detection is expected to solve.

Object detection deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos.

Mask R-CNN sample

Fig. 1. An example of object detection. (source: link)

As defined, object detection is expected to identify instances of semantic objects and localize them in digital images and videos. Naturally, we expect the computer to first detect whether an instance appears in the image, and then predict the location of the detected object, normally with a bounding box.

Object detection, as an essential branch of computer vision, has a variety of real-world applications. Facebook detects our faces when we or our friends upload a photo containing them, which is achieved by face detection. An autonomous car identifies pedestrians crossing the road, which is based on pedestrian detection. Object detection also has significant applications in video surveillance, drone scene analysis, and robotic vision tasks. Besides, object detection forms the basis of other computer vision tasks like image segmentation, image captioning, and object tracking.

Normally, we categorize object detection tasks into two kinds according to the application scenario: generic object detection and domain-specific detection. The former aims at developing a unified framework that can detect different types of objects, while the latter detects objects in a specific scenario, such as pose detection or text detection.

Fig. 2. Wind walk travel video (source: <https://towardsdatascience.com/object-detection-using-deep-learning-approaches-an-end-to-end-theoretical-perspective-4ca27eee8a9a>)

Image

To a computer, an image is a grid of numbers in the range [0, 255]. A grayscale image has a single channel (one grid), while a color image has three channels of grids corresponding to R, G, and B. Here are some concepts related to digital images.
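To make the grid-of-numbers view concrete, here is a minimal NumPy sketch (the 4×4 size is a toy value chosen for illustration): a grayscale image is a single 2-D grid of intensities, while an RGB image stacks three such grids.

```python
import numpy as np

# Toy sizes for illustration: a 4x4 grayscale image vs. a 4x4 RGB image.
# uint8 matches the usual [0, 255] intensity range of digital images.
gray = np.arange(16, dtype=np.uint8).reshape(4, 4)  # one channel: a single grid
color = np.zeros((4, 4, 3), dtype=np.uint8)         # three channels: R, G, B grids

print(gray.shape)           # (4, 4)
print(color.shape)          # (4, 4, 3)
print(color[..., 0].shape)  # (4, 4) -- the R channel alone is one grid
```

The channel axis is the only structural difference between the two: slicing one channel out of the color image recovers a plain 2-D grid just like the grayscale case.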

Due to the natural properties of digital images, computer vision tasks face challenges like viewpoint variation, scale variation, background clutter, illumination, deformation, occlusion, and intra-class variation. All of these challenges raise the difficulty for computers to recognize objects in digital images.

Fig. 3. Computer vision challenges (source: Stanford CS231n slides)

Features

Both traditional image processing and machine learning need an efficient numerical representation of an image for computation. We normally obtain this by extracting useful information and discarding extraneous information. Features refer to this extracted information, and good features are supposed to be discriminative. For example, an image of size width x height x channels can be directly converted to a vector/array of pixel intensities.

There are many algorithms designed to construct the feature vector, and constructed feature vectors can be fed to different applications, e.g., as training data for an image classifier. The simplest way to extract image features is to flatten the image pixels into an array. In the early years, people used hand-crafted, rule-based features like Haar wavelets, rectangle features, and shape context. Deep neural networks, however, can extract features automatically without explicitly specified rules.
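As a sketch of the simplest feature extractor mentioned above, the toy 2×2 RGB image below (hypothetical pixel values) is flattened into a single vector of width × height × channels = 12 intensities.

```python
import numpy as np

# Hypothetical 2x2 RGB image; values are arbitrary intensities in [0, 255].
img = np.array(
    [[[10, 20, 30], [40, 50, 60]],
     [[70, 80, 90], [100, 110, 120]]],
    dtype=np.uint8)

# Simplest feature extraction: flatten width x height x channels into one vector.
features = img.flatten()

print(features.shape)  # (12,)
print(features[:3])    # [10 20 30] -- the RGB values of the top-left pixel
```

A flattened vector like this is discriminative enough for toy problems, but it is sensitive to every challenge listed above (viewpoint, scale, illumination, ...), which is why hand-crafted and, later, learned features replaced it in practice.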

Reference

[1] Stanford CS231n

[2] Object detection using deep learning approaches - an end-to-end theoretical perspective


Page maintained by Hang Zhao