Introduction and Motivation: -
We are already familiar with the image classification task, where an algorithm looks at a picture and is responsible for saying that it contains, say, a car, as shown in fig.1.(a). This is what we mean by classification.
In fig.1.(b), classification with localization means not only do we have to label the image as, say, a car, but the algorithm is also responsible for putting a bounding box, or drawing a red rectangle, around the position of the car in the image. The term localization refers to figuring out where in the picture the detected car is. In the detection problem there may be multiple objects in the picture, as shown in fig.1.(c), and we have to detect them all and localize them all.
So, the ideas we have learned for image classification will be useful for classification with localization, and the ideas we learn for localization will in turn be useful for detection.
Mandatory face mask rules are becoming more common in public settings around the world. There is growing scientific evidence supporting the effectiveness of face mask wearing in reducing the spread of Covid-19.
Wearing a face mask
will help prevent the spread of infection and prevent the individual from
contracting airborne infectious germs. When someone coughs, talks, or sneezes, they can release germs into the air that may infect others nearby. Face masks
are part of an infection control strategy to eliminate cross-contamination.
The objective of this work is to develop a
deep learning model which detects whether a person is wearing a face mask, or
not.
For this we are going to implement several algorithms as a team, with our main focus being the YOLO algorithm.
Since YOLO runs on the Darknet architecture, which is written in C, implementing it involves little more than some basic imports and yaml files. We have therefore backed it up by implementing other algorithms that gave us more hands-on coding experience; they are: -
· Hard-coding a simple CNN.
· Implementing VGG16 and ResNet by importing them from the Keras library.
· Finally, implementing the YOLO algorithm.
All these algorithms are run on the same dataset, which we downloaded from Kaggle.
1. Dataset and Software Used: -
· Dataset: -
Kaggle Face Mask Detection dataset (License: CC0, public domain).
Number of images: 3833
Number of classes: 2 – with mask and without mask
Size: 399 MB
Fig. Sample images of dataset used
· Software Used: -
Most of the work was done in Jupyter Notebook (Anaconda), using the NumPy, Keras and PyTorch libraries.
· Data Augmentation: -
Fig. Snap of code used for data augmentation
The ImageDataGenerator function shown above augments our data using the listed parameters, increasing the effective size of our dataset.
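Since the augmentation code appears only as a screenshot, the following is a minimal sketch of a typical Keras ImageDataGenerator setup; the specific parameter values and the directory path here are assumptions, not necessarily the exact ones used.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed augmentation parameters -- the exact values used in the
# screenshot above may differ.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalize pixel values to [0, 1]
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    zoom_range=0.2,          # random zoom in/out
    horizontal_flip=True,    # random left-right flips
    validation_split=0.3,    # hypothetical 70-30 train/validation split
)

# Flow images from a directory with one sub-folder per class
# ("with_mask" / "without_mask"); the path is a placeholder.
train_gen = datagen.flow_from_directory(
    "data/face-mask-dataset",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
    subset="training",
)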
Fig. Snap of the augmented images for the original image
2. Simple CNN and its Results: -
Here we have hard-coded a simple CNN model which consists of two convolutional layers with 3×3 filters and two dense layers. We applied max pooling after each convolutional layer, taking ReLU as the activation function. We used categorical cross-entropy as the loss function along with Adam as the optimizer for gradient descent. Finally, we used a softmax layer to get our predicted output.
The validation accuracy came out to be 94%, calculated over 20 epochs.
Fig. Architecture of hardcoded simple CNN model
Fig. Snap of code used for building the hard-coded simple CNN model
Sequential() – implements our layers as a stack.
Conv2D() – this layer initializes our weights and filters; depending on the stride and padding it can also downsample our data.
Activation() – specifies the activation function used.
model.compile() – compiles the defined model and specifies the optimizer and loss function to use.
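Because the model-building code itself is shown only as a screenshot, here is a minimal sketch of a two-convolution-layer Keras model along the lines described above; the filter counts (32, 64), dense-layer size (128) and input size are assumed values.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Minimal sketch of the hard-coded CNN: two 3x3 convolution layers with
# ReLU and max pooling, two dense layers, softmax output for 2 classes.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(2, activation="softmax"),
])

# Categorical cross-entropy loss with the Adam optimizer, as in the report.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])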
Results: -
Fig. Accuracy and Loss graph
plotted for the basic CNN model
A training accuracy of 98.1% and a validation accuracy of 94.5% were obtained.
3. VGG16 and its Results: -
VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves over AlexNet by replacing the large kernel-sized filters (11×11 and 5×5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks using NVIDIA Titan Black GPUs.
Fig. Snap of code used for building VGG16 model
Fig. Snap of code used for compiling and fitting the model
VGG16() – initializes the pretrained weights and removes the final classification layer so that it can be replaced to suit our requirements.
Flatten() – flattens the last layer.
model.fit_generator() – fits our model according to the model variable defined earlier; the number of epochs etc. can be defined here.
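As the build and compile code appears only in the screenshots above, the following is a minimal sketch of how VGG16 transfer learning is commonly set up in Keras; the input size, the decision to freeze all convolutional layers, and the generator names are assumptions.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model

# Load VGG16 pretrained on ImageNet, without its final classification layers.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the pretrained convolutional layers so only the new head is trained.
for layer in base.layers:
    layer.trainable = False

# New classification head for the 2 classes (with mask / without mask).
x = Flatten()(base.output)
out = Dense(2, activation="softmax")(x)
model = Model(inputs=base.input, outputs=out)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# train_gen / val_gen are assumed to be ImageDataGenerator flows as sketched earlier:
# model.fit_generator(train_gen, validation_data=val_gen, epochs=20)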
4. ResNet152 and its Results: -
Here I have used the residual network with 152 layers. The key features that ResNet brought with it are: -
A. Skip connections
B. Heavy batch normalization
This allowed us to reduce the degradation of performance caused by vanishing gradients in deeper layers. Having many skip connections leads to less degradation: even if the expected output of a block is not learned, at least its input will still be present at the output. The architecture of ResNet152 is shown below. The dataset has been manually split in a 70-30 ratio, and both the training and testing sets consist of two classes, namely with mask and without mask.
Fig. Basic architecture of ResNet 152
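To illustrate the skip-connection idea (this is not the actual ResNet152 code), a minimal residual block in Keras might look like this:

from tensorflow.keras import layers

def residual_block(x, filters):
    """Minimal residual block sketch: two 3x3 convolutions with batch
    normalization, plus a skip connection that adds the block's input
    back to its output. Assumes x already has `filters` channels;
    the layout is illustrative only."""
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Skip connection: if the residual mapping is not useful, the block
    # can still pass its input through unchanged.
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)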
Results: -
I have used the Adam optimizer and binary cross-entropy as the loss function for backpropagation. We have also experimented with the learning rate and batch size to observe the changes that took place.
Code: -
Fig. Snap of code used for building ResNet152 model
Fig. Snap of code used for fitting the model and plotting the results
ResNet152() – initializes the pretrained weights and removes the final classification layer so that it can be replaced to suit our requirements.
Flatten() – flattens the last layer.
model.fit_generator() – fits our model according to the model variable defined earlier; the number of epochs etc. can be defined here.
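Analogously to the VGG16 sketch above, a minimal Keras ResNet152 transfer-learning setup might look like the following; binary cross-entropy matches the loss mentioned in the Results, while the input size and the single sigmoid output are assumptions.

from tensorflow.keras.applications import ResNet152
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model

# ResNet152 pretrained on ImageNet, without its final classification layer.
base = ResNet152(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

# Single sigmoid output for the two-class (mask / no-mask) problem,
# to match the binary cross-entropy loss used above.
x = Flatten()(base.output)
out = Dense(1, activation="sigmoid")(x)
model = Model(inputs=base.input, outputs=out)

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])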
Result: -
Fig. Accuracy graph plotted for the ResNet152 CNN model
We have obtained a training accuracy of 99.8% and a validation accuracy of 98.1%.
5. YOLO (You Only Look Once) Algorithm: -
All
of the YOLO models are object detection
models. Object detection models are trained to
look at an image and search for a subset of object classes. When found, these
object classes are enclosed in a bounding box and their class is identified.
Object detection models are typically trained and evaluated on the COCO dataset which contains a broad range of 80 object classes. From
there, it is assumed that object detection models will generalize to new object
detection tasks if they are exposed to new training data.
The original YOLO (You Only Look Once) was written by
Joseph Redmon (now retired from CV) in a custom framework called Darknet.
Darknet is a very flexible research framework written in low-level languages and has produced a series of the best real-time object detectors in computer vision: YOLO, YOLOv2, YOLOv3 and YOLOv4, with YOLOv5 continuing this line as a PyTorch implementation.
The Original YOLO - YOLO was the first object detection network to combine the problem of drawing bounding boxes with identifying class labels in a single network. For our project we have used YOLOv5, which is the latest state of the art in object detection.
Before YOLO came into existence, there were object detection algorithms which were not as efficient; the most popular approach was the sliding window. The sliding window algorithm has the flaw of being computationally expensive and very time-consuming: each window of the image is fed into a pretrained convolutional network to produce a class output, so the network has to process every window position one by one, which takes a lot of time.
YOLO (You Only Look Once), as the name suggests, takes in the entire image at once and performs what is called a convolutional implementation of sliding windows: the entire image is compressed into a small grid, and the depth of that grid encodes the class information, object probability, bounding-box parameters, etc.
Fig. Working of YOLO model
Also, in YOLO, the output for each grid cell must include a few other parameters, as shown: here, pc denotes whether an object is present or not (pc = 1 if an object is present, pc = 0 if not), bx, by, bh and bw denote the parameters of the bounding box, and c1, c2 and c3 denote the classes of interest (in our case we only need two classes, with mask and without mask). Also, the YOLO algorithm does not use any max pooling layers; it uses strided convolutions to scale down the dimensions, which avoids loss of information.
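As a concrete illustration, the label vector for one grid cell and the shape of the overall output tensor can be written out as follows; the 19×19 grid size is taken from the example further below, and the two classes are ours.

import numpy as np

# Label vector for one grid cell: [pc, bx, by, bh, bw, c1, c2]
# pc      - objectness (1 if an object's midpoint falls in this cell, else 0)
# bx, by  - midpoint of the box relative to the cell
# bh, bw  - height and width of the box
# c1, c2  - one-hot class scores: "with mask", "without mask"
cell_label = np.array([1.0, 0.4, 0.7, 0.3, 0.2, 1.0, 0.0])

# With a 19x19 grid and one box per cell, the full output tensor is:
grid = 19
num_classes = 2
output_shape = (grid, grid, 5 + num_classes)   # -> (19, 19, 7)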
Each grid cell of the image can propose one or more bounding boxes, and by the principles of non-max suppression and IoU (intersection over union), the irrelevant boxes are removed or suppressed.
Fig. Non-max suppression working with car example
Let's say we want to detect cars in fig.7.3.(a). We might place a 19 by 19 grid over the image, as shown in fig.7.3.(b). Technically the car shown in fig.7.3.(b) has just one midpoint, so it should be assigned to just one grid cell, and the car on the left also has just one midpoint, so technically only one grid cell should predict each car. In practice, however, we are running an object classification and localization algorithm for every one of these grid cells, so it is quite possible that several cells will each think that the center of a car lies inside them, for the car on the right as well as for the car on the left. Not only one box, but several neighbouring boxes may decide that they have found the car. Let's step through an example of how non-max suppression deals with this.
Because the image classification and localization algorithm is run on every grid cell, we may end up with multiple detections of each object. What non-max suppression does is clean up these detections, so that we end up with just one detection per car rather than several. Concretely, it first looks at the probabilities associated with each of these detections; for now, let's just call this Pc, the probability of a detection. It takes the detection with the largest probability, which in this case is 0.9, as shown in fig.7.3.(c). Having done that, the non-max suppression step looks at all of the remaining rectangles, and all the ones with a high overlap (a high IoU) with the box just output get suppressed. So the two rectangles with probabilities 0.6 and 0.7, which both overlap a lot with the light blue rectangle, are suppressed and darkened to show that they have been removed. Next, we go through the remaining rectangles and find the one with the highest probability, the highest Pc, which in this case is the one with 0.8. We commit to that one and again suppress any other rectangles with a high IoU with it. Now every rectangle has been either highlighted or darkened, and if we discard the darkened rectangles we are left with just the highlighted ones, which are the two final predictions. This is non-max suppression: we output the maximal-probability classifications and suppress the close-by ones that are non-maximal, hence the name.
In short, non-max suppression keeps the highest-probability box and suppresses the remaining boxes based on the IoU ratio, so that each desired object ends up enclosed in a single box.
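A minimal sketch of IoU and non-max suppression in plain Python (an illustration of the procedure described above, not the YOLOv5 implementation):

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much,
    then repeat with the remaining boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep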
Fig. YOLO detection of cycle and dog using bounding box
The YOLOv5 algorithm consists of the following components: -
· Backbone
· Neck
· Head
Basically, all object detectors take an image as input and compress features down through a convolutional neural network backbone. In image classification, these backbones are the end of the network and predictions can be made from them. In object detection, multiple bounding boxes need to be drawn around objects along with classification, so the feature layers of the convolutional backbone need to be mixed and held up in light of one another. The combination of backbone feature layers happens in the neck.
It is also useful to split object detectors into two categories: one-stage detectors and two stage detectors. Detection happens in the head. Two-stage detectors decouple the task of object localization and classification for each bounding box. One-stage detectors make the predictions for object localization and classification at the same time. YOLO is a one-stage detector, hence, You Only Look Once.
The backbone network
for an object detector is typically pretrained on ImageNet classification.
Pretraining means that the network's weights have already been adapted to
identify relevant features in an image, though they will be tweaked in the new
task of object detection.
The authors chose the CSPDarknet53 backbone for the YOLOv5 object detector. The backbone architecture is shown below: -
Fig. Architecture of YOLO model
The CSPResNext50 and the CSPDarknet53 are both based on DenseNet. DenseNet was
designed to connect layers in convolutional neural networks with the following
motivations: to alleviate the vanishing gradient problem (it is hard to
backprop loss signals through a very deep network), to bolster feature
propagation, encourage the network to reuse features, and reduce the number of
network parameters.
Fig. Architecture of CSPResNext50 model
The next step in object detection is to mix and combine the features formed in the ConvNet backbone to prepare for the detection step. YOLOv5 uses PANet as its neck.
The components
of the neck typically flow up and down among layers and connect only the few
layers at the end of the convolutional network.
Each one of the
P(i) above represents a feature layer in the CSPDarknet53 backbone.
The image above comes from
YOLOv4's predecessor, EfficientDet. Written by Google Brain, EfficientDet uses
neural architecture search to find the best form of blocks in the neck portion
of the network, arriving at NAS-FPN. The EfficientDet authors then tweak it
slightly to make the architecture more intuitive (and probably perform better on their development sets).
Additionally, YOLOv5 adds an SPP block after CSPDarknet53 to increase the receptive field and separate out the most important features from the backbone.
Fig. Architecture of the PANet model
YOLOv5 deploys
the same YOLO head as YOLOv3 for detection with the anchor-based
detection steps, and three levels of detection granularity.
Code: -
The yaml file for YOLOv5, imported from GitHub and used via Python: -
# parameters
nc: {num_classes}  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, BottleneckCSP, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, BottleneckCSP, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, BottleneckCSP, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, BottleneckCSP, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, BottleneckCSP, [512, False]],  # 13
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, BottleneckCSP, [256, False]],  # 17 (P3/8-small)
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, BottleneckCSP, [512, False]],  # 20 (P4/16-medium)
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, BottleneckCSP, [1024, False]],  # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
{num_classes} – defines the number of classes; in our case there are two classes, namely "with mask" and "without mask".
anchors – the anchor boxes used for detecting multiple objects; the number of anchor boxes is given by the user.
Conv and BottleneckCSP – these two entries refer to the convolutional layers and the bottleneck (1×1 convolution) blocks of the CSPDarknet53 architecture (the layer size is mentioned on each line).
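The {num_classes} placeholder above is a Python format field; a minimal sketch of how the yaml template might be filled in before training (the file names are assumptions) is:

num_classes = 2  # "with mask" and "without mask"

# Read the yaml template, substitute the class count, and write the
# final model configuration used for training. File names are assumed.
with open("models/custom_yolov5s.yaml.template") as f:
    template = f.read()

with open("models/custom_yolov5s.yaml", "w") as f:
    f.write(template.format(num_classes=num_classes))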
Fig. Training YOLO on custom data
for 100 epochs
Here, we are able to pass a number of arguments:
- img: define input image size
- batch: determine batch size
- epochs: define the number of training
epochs. (Note: often, 3000+ are common here!)
- data: set the path to our yaml file
- cfg: specify our model configuration
- weights: specify a custom path to weights.
- name: result names
- nosave: only save the final checkpoint
- cache: cache images for faster training
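Since the training invocation is shown only as a screenshot, a typical YOLOv5 training command built from these arguments might look like the following notebook cell; the image size, paths and run name are assumptions.

!python train.py --img 416 --batch 16 --epochs 100 --data data.yaml --cfg models/custom_yolov5s.yaml --weights '' --name yolov5s_mask_results --cache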
Results obtained from YOLOv5: -
From the TensorBoard command, we get the following results: -
Fig. This command gives continuous callbacks and traces the data in TensorBoard.
Fig. Results obtained from YOLOv5
mAP (mean average precision) is the average of the AP values. In some contexts we compute the AP for each class and average them; in other contexts AP and mAP mean the same thing. mAP[0.5:0.95] averages the AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which gives us 10 levels.
Predictions Obtained:-
If the IoU >= 0.5, the detection is counted as a true positive.
If the IoU < 0.5, the detection is counted as a false positive.
Also, if a ground-truth object is present but our model does not predict it, it is counted as a false negative.
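For illustration, precision and recall follow directly from these counts; the numbers below are made up and only show the arithmetic.

# Hypothetical counts from one evaluation run (illustrative only).
tp, fp, fn = 90, 10, 5

precision = tp / (tp + fp)   # fraction of predicted masks that are correct
recall = tp / (tp + fn)      # fraction of actual masks that are found

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# AP averages precision over recall levels; mAP@[0.5:0.95] further averages
# AP over the 10 IoU thresholds 0.50, 0.55, ..., 0.95.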
- G.V.S. Lohit (lohit0399@gmail.com)
- Rhea Aswal (rhea.aswal@gmail.com)
6. References: -
[1] C. Liu, Y. Tao, J. Liang, K. Li and Y. Chen,
"Object Detection Based on YOLO Network," 2018 IEEE 4th
Information Technology and Mechatronics Engineering Conference (ITOEC),
Chongqing, China, 2018, pp. 799-803, doi: 10.1109/ITOEC.2018.8740604.
[2] J. Redmon, S. Divvala, R. Girshick and A. Farhadi,
"You Only Look Once: Unified, Real-Time Object Detection," 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las
Vegas, NV, 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.
[3] H. Qassim, A. Verma and D. Feinzimer,
"Compressed residual-VGG16 CNN model for big data places image
recognition," 2018 IEEE 8th Annual Computing and Communication
Workshop and Conference (CCWC), Las Vegas, NV, 2018, pp. 169-175, doi:
10.1109/CCWC.2018.8301729.
[4] K. He, X. Zhang, S. Ren and J. Sun, "Deep
Residual Learning for Image Recognition," 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi:
10.1109/CVPR.2016.90.