[Paper Review & Implementation] You Only Look Once: Unified, Real-Time Object Detection (YOLOv1, 2016)

2023, Jun 27    


Implementation with PyTorch


  • Full code here

  • I referenced the entire code from here, but made a few modifications to optimize performance, such as replacing explicit for loops with vectorized matrix operations.


YOLO : Single-Stage Object Detection



  • (a) : Two-stage object detection models like Faster R-CNN consist of two main components : a region proposal network (RPN) and a classifier.

    • The RPN generates region proposals (potential bounding box locations), which are then filtered and refined into the final target RoIs by the classifier.


  • (b) : In contrast, YOLO is a single-stage object detection model in which the region proposal stage is incorporated into the feature extractor architecture.

    • It divides the input image into a grid of cells and predicts bounding boxes and class probabilities directly from each cell instead of explicitly generating region proposals.

    • It uses a CNN to extract features from the entire image at once and predicts all object detections in a single forward pass.

   


YOLOv1 Backbone Architecture : Feature Extractor and Region Proposals


Figure 3. YOLOv1 Architecture (Darknet framework) : 24 conv layers followed by 2 fc layers

import torch
import torch.nn as nn

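# config format (as parsed by _create_conv_layers below):
#   tuple (kernel_size, out_channels, stride, padding) -> a single conv block
#   "M" -> 2x2 max-pooling with stride 2
#   list [conv1_tuple, conv2_tuple, num_repeats] -> a 1x1 / 3x3 conv pair repeated num_repeats times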
architecture_configs = [
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]


class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))


class YOLOv1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(YOLOv1, self).__init__()
        self.architecture = architecture_configs
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)

    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))

    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if type(x) == tuple:
                layers += [
                    CNNBlock(
                        in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3],
                    )
                ]
                in_channels = x[1]

            elif type(x) == str:
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]

            elif type(x) == list:
                conv1 = x[0]
                conv2 = x[1]
                num_repeats = x[2]

                for _ in range(num_repeats):
                    layers += [
                        CNNBlock(
                            in_channels,
                            conv1[1],
                            kernel_size=conv1[0],
                            stride=conv1[2],
                            padding=conv1[3],
                        )
                    ]
                    layers += [
                        CNNBlock(
                            conv1[1],
                            conv2[1],
                            kernel_size=conv2[0],
                            stride=conv2[2],
                            padding=conv2[3],
                        )
                    ]
                    in_channels = conv2[1]

        return nn.Sequential(*layers)

    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes

        return nn.Sequential(
                nn.Flatten(),
                nn.Linear(1024*S*S, 4096),
                nn.LeakyReLU(0.1),
                nn.Linear(4096, S*S*(B*5+C))
            )


  • Input : images with shape (3, 448, 448)

    • The backbone is pretrained on ImageNet at 224 x 224, and the input resolution is doubled to 448 x 448 for detection.
  • Inspired by Inception Net, but it simply uses a 1 x 1 conv layer followed by a 3 x 3 conv layer (basically just a bottleneck block) instead of concatenating parallel branches.

  • Output : (n_batch, S*S*(B*5+C) = 1470) (a quick shape check follows this list)

    • S : grid size (here, a 7 x 7 grid, i.e. 49 cells in total)

    • B : the number of bounding boxes attached to each cell (2 in the paper)

    • C : the number of classes (20)

    • 5 predictions: x, y, w, h, and confidence score (objectness)

      • x and y are normalized to fall between 0 and 1 relative to the cell; in this implementation w and h are expressed in cell units (image ratio multiplied by S), so they can exceed 1.
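
  • A quick sanity check of these shapes (a minimal sketch, assuming the YOLOv1 class and imports defined above):

model = YOLOv1(split_size=7, num_boxes=2, num_classes=20)
x = torch.randn(2, 3, 448, 448)      # dummy batch of 2 RGB images
print(model(x).shape)                # torch.Size([2, 1470]), since 7*7*(2*5 + 20) = 1470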


Customizing the PASCAL VOC Dataset


import os
import pandas as pd
import torch
from PIL import Image


class VOCDataset(torch.utils.data.Dataset):
    '''
    Args:
        - csv_file : contain annotations for each img and label file.
        - img_dir : directory that contains all img files (3 x 224 x 224)
        - label_dir : directory that contains all label files
            - each line holds [class_label, x, y, width, height]
    '''
    def __init__(self, csv_file, img_dir, label_dir, S=7, B=2, C=20, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.label_dir = label_dir
        self.transform = transform
        self.S = S
        self.B = B
        self.C = C

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        label_path = os.path.join(self.label_dir, self.annotations.iloc[index, 1])
        
        boxes = []
        with open(label_path) as f:
            for label in f.readlines():
                class_label, x, y, width, height = [
                    float(x) if float(x) != int(float(x)) else int(x)
                    for x in label.replace("\n", "").split()
                ]
                boxes.append([class_label, x, y, width, height])

        img_path = os.path.join(self.img_dir, self.annotations.iloc[index, 0])
        image = Image.open(img_path)
        boxes = torch.tensor(boxes)

        if self.transform:
            # image = self.transform(image)
            image, boxes = self.transform(image, boxes)

        # Convert To Cells
        label_matrix = torch.zeros((self.S, self.S, self.C + 5 * self.B))   # label matrix (S, S, C+B*5)
        for box in boxes:
            class_label, x, y, width, height = box.tolist()
            class_label = int(class_label)

            # i,j represents the cell row and cell column
            i, j = int(self.S * y), int(self.S * x)
            x_cell, y_cell = self.S * x - j, self.S * y - i   # relative to cell ratios

            width_cell, height_cell = (
                width * self.S,
                height * self.S,
            )

            # restrict to only one object per cell
            if label_matrix[i, j, 20] == 0:
                label_matrix[i, j, 20] = 1     # indicating that an object is already assigned to cell i, j

                # Box coordinates
                box_coordinates = torch.tensor(
                    [x_cell, y_cell, width_cell, height_cell]
                )
                label_matrix[i, j, 21:25] = box_coordinates
                label_matrix[i, j, class_label] = 1     # Set one hot encoding for class_label

        return image, label_matrix


  • Read the img and label files in ‘img_dir’ and ‘label_dir’, respectively.

  • The csv file at the indicated path ‘csv_file’ pairs each img file with its corresponding label file.

  • Read all lines of the label file with a for loop; each line contains the class_label, x, y, width, height of one bounding box.

  • Transform the image and boxes with a defined transformer (a minimal sketch of this Compose wrapper follows this list).

    • resize 224 x 224 images to 448 x 448 resolution and convert them into tensors.

        transform = Compose([transforms.Resize((448, 448)), transforms.ToTensor()])
      


  • Convert the coordinates of each bounding box from entire-image ratio into cell ratio (the image is divided into a 7 x 7 grid of cells).

  • Assign an object (bounding box) to each grid cell (one per cell).

  • Generate a 7 x 7 label matrix for the given image, where each cell (i, j) holds the labels (class, x, y, width, height) of the assigned bounding box.
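
  • Since __getitem__ calls self.transform(image, boxes) with two arguments, the Compose used above is presumably a small custom wrapper rather than torchvision's Compose; a minimal sketch under that assumption (boxes pass through unchanged because resizing preserves relative coordinates):

import torchvision.transforms as transforms


class Compose(object):
    """Minimal custom Compose : applies image-only transforms and passes the boxes through unchanged."""
    def __init__(self, transforms_list):
        self.transforms = transforms_list

    def __call__(self, img, bboxes):
        for t in self.transforms:
            img = t(img)             # Resize / ToTensor only touch the image
        return img, bboxes


transform = Compose([transforms.Resize((448, 448)), transforms.ToTensor()])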


Process the Predicted Bounding Boxes from the YOLO Model and Obtain the Final Target Boxes


def get_bboxes(n_batch, loader, model, iou_threshold, threshold, device, box_format='center'):
    pred_boxes, gt_boxes = [], []
    model = model.to(device)
    model.eval()   # make sure to turn off the train mode before getting final bboxes. 

    for batch_idx, (img, labels) in enumerate(loader):
        img = img.to(device)
        labels = labels.to(device)

        with torch.no_grad():
            preds = model(img)   # (cur_batch, S*S*(C+B*5))

        S = 7
        cur_batch = len(preds)   # actual batch size (the last batch may be smaller than n_batch)
        # global image index for every cell prediction (later used to match pred and gt boxes when calculating mAP)
        train_idxs = (torch.arange(cur_batch) + (batch_idx*n_batch)).unsqueeze(-1).unsqueeze(-1).repeat(1, S*S, 1)

        pred_bboxes = convert_cellboxes(preds).reshape(cur_batch, S*S, -1)
        gt_bboxes = convert_cellboxes(labels).reshape(cur_batch, S*S, -1)

        pred_bboxes = torch.concat([train_idxs, pred_bboxes], dim=-1)
        gt_bboxes = torch.concat([train_idxs, gt_bboxes], dim=-1)

        pred_boxes += non_max_suppression(pred_bboxes, iou_threshold, threshold, "center")
        for bb in [box[np.where(box[:, 2] > threshold)[0]] for box in gt_bboxes]:
            gt_boxes += [b for b in bb]
        # if batch_idx == 10: break

    return pred_boxes, gt_boxes


Step 1 : Convert bounding boxes from cell ratio into entire-image ratio


def convert_cellboxes(predictions, S=7):
    """
    Convert output of yolo v1
    (n_batch, 7*7*(5*B+C)) -> (n_batch, 7, 7, [pred_clss, best_confidence_score, center coordinates])
    - Convert bounding boxes with grid split size S relative to cell ratio into entire image ratio.
    """

    predictions = predictions.to("cpu")
    batch_size = predictions.shape[0]
    predictions = predictions.reshape(batch_size, 7, 7, 30)
    bboxes1 = predictions[..., 21:25]    # 1 bbox 
    bboxes2 = predictions[..., 26:30]    # 2 bbox 
    
    # select best bounding box with highest confidence score among 2 candidates
    scores = torch.cat((predictions[..., 20].unsqueeze(0), predictions[..., 25].unsqueeze(0)), dim=0)   # 2 x (n_batch x 7 x 7)
    best_box = scores.argmax(0).unsqueeze(-1)     # n_batch x 7 x 7 x 1
    best_boxes = bboxes1 * (1 - best_box) + best_box * bboxes2  

    cell_indices = torch.arange(7).repeat(batch_size, 7, 1).unsqueeze(-1)
    x = (1/S) * (best_boxes[..., :1] + cell_indices)    # (0 + x0, 1 + x1, 2 + x2, ...  6 + x6)*(1/S)
    y = (1/S) * (best_boxes[..., 1:2] + cell_indices.permute(0, 2, 1, 3))  
    w_y = (1/S) * best_boxes[..., 2:4]  
     
    converted_bboxes = torch.cat((x, y, w_y), dim=-1)
    pred_class = predictions[..., :20].argmax(-1).unsqueeze(-1)      # class with best pred scores. 
    best_conf = torch.max(predictions[..., 20], predictions[..., 25]).unsqueeze(-1)    # best confidence score 
    converted_preds = torch.cat((pred_class, best_conf, converted_bboxes), dim=-1)
    return converted_preds


  • Only select the anchor with the best confidence score out of the two attached to each cell and discard the other.

  • Normalize the coordinates, width, and height of the bounding boxes to be bounded between 0 and 1, all relative to the entire image.

  • Set the label of each bounding box to the class with the best predicted score.

  • Output : (n_batch, 7, 7, [best_pred_clss, best_confidence_score, (x, y, w, h)])
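
  • A quick shape check of convert_cellboxes with a dummy prediction tensor:

preds = torch.randn(4, 7 * 7 * 30)   # fake raw output for a batch of 4 images
out = convert_cellboxes(preds)
print(out.shape)                     # torch.Size([4, 7, 7, 6]) -> [pred_class, best_conf, x, y, w, h]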


Step 2 : Non-Maximum Suppression on Predicted Bounding Boxes


  • When multiple detections capture the same object, suppress all except the one with the highest confidence score.


def non_max_suppression(bboxes, iou_threshold, threshold, box_format='center'):
    '''
    bboxes : (n_batch, 7*7, [pred_clss, best_confidence_score, coordinates (x, y, w, h)])
    iou_threshold != threshold
    threshold here refers to the minimum confidence score to be selected as true bbox.
    '''
    n_batch = len(bboxes)
    post_nms_bboxes = []
    for j in range(n_batch):
        j_bboxes = bboxes[j]
        j_bboxes = j_bboxes[np.where(j_bboxes[:, 2] > threshold)[0]]
        
        if not j_bboxes.size(0):
            continue
        
        confidence_scores = j_bboxes[:, 2]
        order_idxs = confidence_scores.ravel().argsort(descending=True)    # sort bboxes by confidence scores in a descending order.

        keep_bboxes = []
        while (order_idxs.size(0) > 0):
            target_idx = order_idxs[0]
            keep_bboxes.append(target_idx.item())

            ious = compute_ious(box_format, j_bboxes[order_idxs[1:]], j_bboxes[target_idx])
            keep_idxs = np.where(ious <= iou_threshold)[0]
            order_idxs = order_idxs[keep_idxs+1]

        post_nms_bboxes += [box for box in j_bboxes[keep_bboxes]]

    return post_nms_bboxes


  • Initially, filter out only the bounding boxes with confidence scores higher than a specified threshold (here, 0.35).

  • Sort the bounding boxes in descending order of objectness confidence score.

  • Suppress the bounding boxes that capture the same object as the selected box.

    • Compute the IoU with the target box and remove the boxes whose IoU exceeds the IoU threshold.
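
  • The helper compute_ious used in the NMS, mAP, and loss code is not shown in this post; below is a minimal sketch of it, assuming the last four entries of each box are (x_center, y_center, w, h) when box_format is 'center' (the actual helper is defined elsewhere in the repository).

import torch

def compute_ious(box_format, boxes, target_box):
    # keep only the last four entries, assumed to be the box coordinates
    boxes, target_box = boxes[..., -4:], target_box[..., -4:]
    if box_format == 'center':
        # convert (x_center, y_center, w, h) -> corner coordinates
        b1_x1, b1_y1 = boxes[..., 0:1] - boxes[..., 2:3] / 2, boxes[..., 1:2] - boxes[..., 3:4] / 2
        b1_x2, b1_y2 = boxes[..., 0:1] + boxes[..., 2:3] / 2, boxes[..., 1:2] + boxes[..., 3:4] / 2
        b2_x1, b2_y1 = target_box[..., 0:1] - target_box[..., 2:3] / 2, target_box[..., 1:2] - target_box[..., 3:4] / 2
        b2_x2, b2_y2 = target_box[..., 0:1] + target_box[..., 2:3] / 2, target_box[..., 1:2] + target_box[..., 3:4] / 2
    else:   # 'corner' format : (x1, y1, x2, y2)
        b1_x1, b1_y1, b1_x2, b1_y2 = boxes[..., 0:1], boxes[..., 1:2], boxes[..., 2:3], boxes[..., 3:4]
        b2_x1, b2_y1, b2_x2, b2_y2 = target_box[..., 0:1], target_box[..., 1:2], target_box[..., 2:3], target_box[..., 3:4]

    # intersection area, clamped to 0 when the boxes do not overlap
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)
    area1 = (b1_x2 - b1_x1).abs() * (b1_y2 - b1_y1).abs()
    area2 = (b2_x2 - b2_x1).abs() * (b2_y2 - b2_y1).abs()
    return inter / (area1 + area2 - inter + 1e-6)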


Calculate Mean Average Precision (mAP) between Predicted Boxes and Ground Truths


  • We now have the filtered predicted bounding boxes that will be used to calculate the mAP score against the ground truth boxes.

  • Before doing that, let’s figure out what mAP score is.


Mean Average Precision (mAP)


  • mAP is the mean of the Average Precision (AP) over all classes, where each class's AP is determined by plotting its precision-recall curve and calculating the area under this curve (AUC).

   $\large \text{mAP} = \frac{1}{n} \sum_{i=1}^{n} \text{AP}_{i}$


AUC : Area underneath the Precision-Recall Curve


  • Precision-Recall Curve : Shows the trade-off between precision and recall at different classification thresholds (the threshold used to label a prediction as positive or negative for a specific class).


  


  • Precision

    • TP / (TP + FP)

    • The ratio of true positives out of all positive predictions.

  • Recall

    • TP / (TP + FN)

    • The ratio of true positives out of all positive labels. (ground truth)

  • AUC : The integral of precision with respect to recall, obtained by varying the classification threshold from 0 to 1.
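
  • A tiny worked example of this integration, using hypothetical TP/FP flags for four detections (sorted by confidence) against three ground-truth boxes:

import torch

TP = torch.tensor([1., 0., 1., 1.])        # hypothetical hit/miss flags, sorted by confidence
FP = torch.tensor([0., 1., 0., 0.])
num_T = 3                                   # number of ground-truth boxes

precisions = torch.cumsum(TP, dim=0) / (torch.cumsum(TP, dim=0) + torch.cumsum(FP, dim=0))   # [1.00, 0.50, 0.67, 0.75]
recalls = torch.cumsum(TP, dim=0) / num_T                                                     # [0.33, 0.33, 0.67, 1.00]
ap = torch.trapz(torch.cat((torch.tensor([1.]), precisions)),
                 torch.cat((torch.tensor([0.]), recalls)))
print(ap)   # area under the precision-recall curve for this class (~0.76)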


Implementation


def mean_average_precision(pred_boxes, gt_boxes, iou_threshold, num_classes, box_format="center"):
    '''
    Args:
    pred_boxes = tensor([train_idx, pred_class, best_confidence_score, x, y, w, h])
    gt_boxes = same as pred_boxes
    '''
    average_precisions = []
    for c in range(num_classes):
        c_pred_boxes = [box.unsqueeze(0) for box in pred_boxes if box[1] == c]    # check if class matched, shape : (num_detections, 7)
        c_gt_boxes = [box.unsqueeze(0) for box in gt_boxes if box[1] == c]

        num_T, num_P = len(c_gt_boxes), len(c_pred_boxes)
        if not num_T: continue
        if not num_P:
            average_precisions.append(0)
            continue

        c_pred_boxes = torch.concat(c_pred_boxes, dim=0)
        FP = torch.zeros(num_P)
        TP = torch.zeros(num_P)

        gt_cnt = Counter([int(gt[:, 0].item()) for gt in c_gt_boxes])   # counting gt boxes by train_idx (per image) within a class, gt_cnt = {0:3, 1:5}
        gt_cnt_graph = {}
        # set 1 if detected -> only one predicter is responsible for each object (gt box)
        for idx, cnt in gt_cnt.items():
            gt_cnt_graph[idx] = torch.zeros(cnt)

        c_gt_boxes = torch.concat(c_gt_boxes, dim=0)
        for pred_idx, box in enumerate(c_pred_boxes):
            train_idx = box[0].item()      # check if train_idx matched
            target_gt_idxs = np.where(c_gt_boxes[:, 0] == train_idx)[0]

            if not len(target_gt_idxs):
                FP[pred_idx] = 1
                continue

            ious = compute_ious('center', c_gt_boxes[target_gt_idxs, 3:], box[3:])
            best_iou_idx = ious.argmax(dim=0)
            best_iou = ious[best_iou_idx]

            # check 1. whether current gt_box is already detected by other pred box and 2. object detection clearly captures ground truth.
            if gt_cnt_graph[int(train_idx)][best_iou_idx] == 0 and best_iou >= iou_threshold:
                TP[pred_idx] = 1
                gt_cnt_graph[int(train_idx)][best_iou_idx] = 1
            else:
                FP[pred_idx] = 1

        average_precisions.append(get_average_precision(TP, FP, num_T))

    return sum(average_precisions)/len(average_precisions)


def get_average_precision(TP, FP, num_T):
    eps = 1e-8
    TP_cumsum = torch.cumsum(TP, dim=0)
    FP_cumsum = torch.cumsum(FP, dim=0)
    recalls = TP_cumsum / (num_T + eps)
    precisions = torch.divide(TP_cumsum, (TP_cumsum + FP_cumsum + eps))
    precisions = torch.cat((torch.tensor([1]), precisions))
    recalls = torch.cat((torch.tensor([0]), recalls))
    # torch.trapz for numerical integration
    ap = torch.trapz(precisions, recalls)

    return ap.item()


  • Compute mAP between the predicted and ground truth boxes with matched class and training index.

  • Each ground truth box can be claimed by only one predictor (predicted anchor).

    • checked by gt_cnt_graph
  • TP vs FP

    • TP : a box whose best IoU score with the gt boxes (those not already claimed by earlier predictors) is higher than the IoU threshold (0.5)

    • FP : a box that is not the case of TP.

  • Average Precision (AP)

    • Numerical integration of precisions with respect to recalls


Computing Loss and Optimizing the Model


prev_loss = 9999
early_stopping = 0

for e in range(epochs):

    pred_boxes, gt_boxes = get_bboxes(n_batch, train_loader, model, iou_threshold, threshold, device)

    mAP = mean_average_precision(pred_boxes, gt_boxes, iou_threshold, num_classes)
    
    print(f"Train mAP: {mAP}")

    mean_loss = compute_loss(train_loader, model, optimizer, yolo_loss)

    if mean_loss <= prev_loss:
        checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        save_checkpoint(checkpoint, filename=chkpt_dir)
        time.sleep(7)
    else:
        early_stopping += 1
    
    if early_stopping == 2 :
        print("----- Early Stopping -----")
        break

    prev_loss = mean_loss


def compute_loss(loader, model, optimizer, yolo_loss):
    model.train()
    model = model.to(device)

    loop = tqdm(loader, leave=True)
    loss_history = []
    for batch_idx, (img, labels) in enumerate(loop):
        img, labels = img.to(device), labels.to(device)
        preds = model(img)
        loss = yolo_loss(preds, labels)
        loss_history.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # update progress bar
        loop.set_postfix(loss=loss.item())

    mean_loss = sum(loss_history)/len(loss_history)
    print(f"  Mean loss : {mean_loss}")
    return mean_loss


  • Running through all epochs, compute the loss between predicted anchors and ground truth.

  $\large \begin{aligned} \mathcal{L} =\; & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ +\; & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\ +\; & \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} (C_i - \hat{C}_i)^2 \\ +\; & \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} (C_i - \hat{C}_i)^2 \\ +\; & \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2 \end{aligned}$

  • The multi-task loss consists of coordinate regression loss (x, y, w, h), object loss, no-object loss, and classification loss (not softmax, just MSE).


class YoloLoss(nn.Module):
    """
    Calculate the loss for yolo (v1) model
    """

    def __init__(self, S=7, B=2, C=20):
        super(YoloLoss, self).__init__()
        self.mse = nn.MSELoss(reduction="sum")

        self.S = S
        self.B = B
        self.C = C

        self.lambda_noobj = 0.5
        self.lambda_coord = 5

    def forward(self, predictions, target):
        predictions = predictions.reshape(-1, self.S, self.S, self.C + self.B * 5)
        
        ## coordinate regression loss 

        # calculate IoU for the two predicted bounding boxes with target bbox
        iou_b1 = compute_ious('center', predictions[..., 21:25], target[..., 21:25])
        iou_b2 = compute_ious('center', predictions[..., 26:30], target[..., 21:25])
        ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)

        # Take the box with highest IoU out of the two prediction -> one bounding box prediction per each object 
        iou_maxes, bestbox = torch.max(ious, dim=0)     # val, idx (argmax) = 0 (1st bbox) or 1 (2nd bbox) 
        obj_mask = target[..., 20:21]                   # object / no object (whether ground truth box holds object or not)

        # Set boxes with no object in them to 0. Only select the box with max iou with ground truth 
        box_predictions = obj_mask*(bestbox * predictions[..., 26:30] + (1 - bestbox) * predictions[..., 21:25])
        box_targets = obj_mask*target[..., 21:25]

        # Take sqrt of width, height of boxes
        box_predictions[..., 2:4] = torch.sign(box_predictions[..., 2:4]) * torch.sqrt(torch.abs(box_predictions[..., 2:4] + 1e-6))
        box_targets[..., 2:4] = torch.sqrt(box_targets[..., 2:4])

        box_loss = self.mse(torch.flatten(box_predictions, end_dim=-2),
                            torch.flatten(box_targets, end_dim=-2))
        
        ## object loss 
        pred_box = bestbox * predictions[..., 25:26] + (1 - bestbox) * predictions[..., 20:21]   
        object_loss = self.mse(
            torch.flatten(obj_mask * pred_box),
            torch.flatten(obj_mask * target[..., 20:21]),
        )
        
        ## no object loss 
        no_object_loss = self.mse(torch.flatten((1 - obj_mask) * predictions[..., 20:21], start_dim=1),
                                  torch.flatten((1 - obj_mask) * target[..., 20:21], start_dim=1))

        no_object_loss += self.mse(torch.flatten((1 - obj_mask) * predictions[..., 25:26], start_dim=1),
                                   torch.flatten((1 - obj_mask) * target[..., 20:21], start_dim=1))
        
        ## classification loss 
        class_loss = self.mse(torch.flatten(obj_mask * predictions[..., :20], end_dim=-2,),
                              torch.flatten(obj_mask * target[..., :20], end_dim=-2,))

        loss = ( self.lambda_coord * box_loss  
               + object_loss  
               + self.lambda_noobj * no_object_loss 
               + class_loss)

        return loss


  • Out of the two predicted bounding boxes, only the one with the highest IoU with the corresponding ground truth is selected to compute all subsequent losses.

  • Except for the no-objectness error (fourth line of the loss formula above), all losses are computed only if an object is present in the target grid cell.

    • $\large \mathbb{1}^{obj}_{ij}$ : an indicator that is 1 only if the jth box predictor in cell i is responsible for an object (the cell contains an object and box j has the highest IoU with it), otherwise 0.
  • Multiply by a weighting factor to differentially scale the effect of each type of error.

    • $\large \lambda_{coord}$ : set to 5, increasing the loss from bounding box coordinate predictions.

    • $\large \lambda_{noobj}$ : set to 0.5, decreasing the loss from confidence scores of no-object boxes.

  • All losses are simply computed using MSE.
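
  • A quick numeric illustration of why the square root is applied to w and h : the same absolute deviation should be penalized more for a small box than for a large one.

import torch

small_pred, small_gt = torch.tensor(0.15), torch.tensor(0.10)   # hypothetical widths of a small box
large_pred, large_gt = torch.tensor(1.05), torch.tensor(1.00)   # same absolute error on a large box

print((small_pred - small_gt) ** 2, (large_pred - large_gt) ** 2)                               # identical penalty without sqrt
print((small_pred.sqrt() - small_gt.sqrt()) ** 2, (large_pred.sqrt() - large_gt.sqrt()) ** 2)   # sqrt penalizes the small box more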


Training Result & Displaying Predicted Anchors from the Trained Model on the Image


from train import main

main()


Train mAP: 0.0
100%|██████████| 12/12 [02:40<00:00, 13.35s/it, loss=87.3]
  Mean loss : 348.82359250386554
=> Saving checkpoint
Train mAP: 0.0
100%|██████████| 12/12 [02:40<00:00, 13.38s/it, loss=122]
  Mean loss : 158.5302308400472
=> Saving checkpoint
Train mAP: 0.0
100%|██████████| 12/12 [02:40<00:00, 13.38s/it, loss=104]
  Mean loss : 122.02847099304199
=> Saving checkpoint
Train mAP: 0.8965692117810249
100%|██████████| 12/12 [02:38<00:00, 13.20s/it, loss=68.3]
  Mean loss : 91.6347204844157
=> Saving checkpoint
Train mAP: 0.9108292050659657
100%|██████████| 12/12 [02:38<00:00, 13.20s/it, loss=56.3]
  Mean loss : 81.93021202087402
=> Saving checkpoint
Train mAP: 0.8115305352956057
100%|██████████| 12/12 [02:38<00:00, 13.22s/it, loss=45.2]
  Mean loss : 71.95312404632568
=> Saving checkpoint
Train mAP: 0.8349191606044769
100%|██████████| 12/12 [02:36<00:00, 13.06s/it, loss=86.1]
  Mean loss : 66.83295917510986
=> Saving checkpoint
Train mAP: 1.2271075576543808
100%|██████████| 12/12 [02:38<00:00, 13.20s/it, loss=80.2]
  Mean loss : 59.71648979187012
=> Saving checkpoint
Train mAP: 0.752258376032114
100%|██████████| 12/12 [02:38<00:00, 13.20s/it, loss=115]
  Mean loss : 51.03803793589274
=> Saving checkpoint
Train mAP: 1.038465577363968
100%|██████████| 12/12 [02:38<00:00, 13.18s/it, loss=41.7]
  Mean loss : 49.17437616984049
=> Saving checkpoint
Train mAP: 1.2061315879225731
100%|██████████| 12/12 [02:38<00:00, 13.18s/it, loss=32.8]
  Mean loss : 44.50646352767944
=> Saving checkpoint
Train mAP: 0.9527697516115088
100%|██████████| 12/12 [02:36<00:00, 13.06s/it, loss=30.3]
  Mean loss : 41.781076431274414
=> Saving checkpoint
Train mAP: 1.3203784927725792
100%|██████████| 12/12 [02:37<00:00, 13.10s/it, loss=35]
  Mean loss : 40.97672208150228
=> Saving checkpoint
Train mAP: 1.0776823602224652
100%|██████████| 12/12 [02:39<00:00, 13.26s/it, loss=30.2]
  Mean loss : 39.26619656880697
=> Saving checkpoint
Train mAP: 1.008251892030239
100%|██████████| 12/12 [02:38<00:00, 13.18s/it, loss=73.9]
  Mean loss : 37.97222876548767
=> Saving checkpoint
Train mAP: 0.9584985218942166
100%|██████████| 12/12 [02:38<00:00, 13.22s/it, loss=33.9]
  Mean loss : 32.922889391581215
=> Saving checkpoint
Train mAP: 0.8698069430887699
100%|██████████| 12/12 [02:39<00:00, 13.29s/it, loss=27.5]
  Mean loss : 34.81727981567383

Train mAP: 0.9338142840485824
100%|██████████| 12/12 [02:37<00:00, 13.11s/it, loss=34.8]
  Mean loss : 34.44208208719889
=> Saving checkpoint
Train mAP: 0.927067095511838
100%|██████████| 12/12 [02:38<00:00, 13.17s/it, loss=35]
  Mean loss : 34.73020267486572


  • Oddly, some of the computed mAP scores are greater than 1. (I didn't figure out why this time, but I'll try in future implementations of different models.)


from train import main, plot_image

model, optimizer = main(load_model=True)
plot_image(idx=2, model=model)


  • Note there are two plot_image functions, one in train.py and one in utils.py.

  • Coordinates of Predicted Bounding Boxes

[tensor([0.4892, 0.5136, 0.7015, 0.6358]), tensor([0.5608, 0.6186, 0.6383, 0.6278]), tensor([0.8117, 0.7309, 0.3865, 0.5740])]