When You Only Look Once (YOLO) came out in 2015, it changed the world of computer vision by offering an approach to object detection that was dramatically faster than anything that had come before it. YOLOv1 could detect and classify multiple objects with just one pass through a neural network, unlike earlier approaches that needed to examine an image many times.
The Problem YOLOv1 Solved
Before YOLO, object detection was a long, complicated, multi-stage procedure. R-CNN (Region-based Convolutional Neural Networks) and other traditional approaches worked by first proposing regions of an image that might contain objects, then running a classifier on each region separately. This meant roughly two thousand forward passes through the network for a single image, which made real-time detection practically impossible.
Existing approaches treated object detection as a classification problem. They would scan an image with a sliding window, crop out small patches, and ask a classifier, "What object is in this box?" for each one. The method worked, but it was painfully slow, like inspecting each piece of a jigsaw puzzle one at a time before you can see the whole picture.
YOLO's authors took a different view: instead of treating detection as a classification challenge, they framed it as a single regression problem, mapping image pixels directly to bounding box coordinates and class probabilities.
How YOLOv1 Really Works
The Grid System Method
YOLOv1's cleverness lies in its simplicity. The technique divides an input image into a 7×7 grid of 49 cells. Each grid cell is responsible for detecting objects whose center points fall inside it. It is like cutting a picture into squares and making each square responsible for reporting what is in its region.
Each cell predicts bounding boxes, the rectangular regions that should contain objects; in YOLOv1 each cell predicts two. For each bounding box, the network outputs five values: the x and y coordinates of the box's center, its width and height, and a confidence score indicating how sure the network is that the box contains an object.
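To make the output layout concrete, here is a minimal NumPy sketch (not the original implementation) of how one grid cell's slice of the 7×7×30 output tensor decodes into boxes and a class distribution. The cell indices and the random tensor are purely illustrative:

```python
import numpy as np

# Sketch of YOLOv1's output layout: with S = 7 cells per side, B = 2 boxes
# per cell, and C = 20 classes, the network emits S * S * (B*5 + C) = 1470 values.
S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)  # stand-in for a network forward pass

row, col = 3, 4                       # pick one grid cell
cell = output[row, col]
boxes = cell[:B * 5].reshape(B, 5)    # each box: [x, y, w, h, confidence]
class_probs = cell[B * 5:]            # one shared class distribution per cell

for x, y, w, h, conf in boxes:
    # x, y are offsets within the cell; w, h are relative to the whole image
    cx = (col + x) / S                # convert to image-relative center
    cy = (row + y) / S
    print(f"center=({cx:.2f}, {cy:.2f}) size=({w:.2f}, {h:.2f}) conf={conf:.2f}")
```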
Class Prediction and Confidence Scores
Each grid cell does more than propose boxes; it also predicts class probabilities. The network assigns each possible object class (such as "car," "person," or "dog") a conditional probability of being present in that cell. The first version of YOLOv1 was trained on the PASCAL VOC dataset, which contains 20 object classes.
The final confidence score for each detection combines two factors: how sure the network is that the predicted bounding box contains an object, and how sure it is about the class of that object. Concretely, the class-specific score is the box's confidence multiplied by the cell's conditional class probability.
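As a small, hedged illustration of that scoring rule (all numbers here are made up), multiplying the box confidence by each conditional class probability gives the per-class detection scores:

```python
# Illustrative only: class-specific scores for one predicted box.
box_confidence = 0.8                                   # confidence this box holds an object
class_probs = {"car": 0.7, "person": 0.2, "dog": 0.1}  # Pr(class | object) for the cell

scores = {cls: box_confidence * p for cls, p in class_probs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # car 0.56
```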
Single Neural Network Design
What makes YOLO so special is that all of this happens in a single neural network with only one forward pass. The network uses information from the full image to predict each bounding box.
The design itself is straightforward: 24 convolutional layers followed by 2 fully connected layers. The network uses several techniques that were standard in 2015: 1×1 convolutions to reduce dimensionality, max pooling to down-sample, leaky ReLU activation functions, and dropout for regularization.
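For readers who think in code, here is a minimal PyTorch sketch of just the fully connected head, assuming (as the paper describes) that the convolutional stack has already reduced a 448×448 input to a 7×7×1024 feature map. It is a sketch of the output shape, not a faithful reimplementation:

```python
import torch
import torch.nn as nn

# Sketch of the YOLOv1 head only, not the full 24-layer backbone.
S, B, C = 7, 2, 20

head = nn.Sequential(
    nn.Flatten(),                        # 7 * 7 * 1024 = 50176 features
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1),                   # leaky ReLU used throughout the net
    nn.Dropout(0.5),                     # dropout regularizes the FC layers
    nn.Linear(4096, S * S * (B * 5 + C)),
)

features = torch.randn(1, 1024, 7, 7)    # stand-in for backbone output
out = head(features).view(1, S, S, B * 5 + C)
print(out.shape)                         # torch.Size([1, 7, 7, 30])
```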
Advantages Over Previous Methods
Speed and Performance in Real Time
The best thing about YOLOv1 was its speed. Previous state-of-the-art approaches, such as R-CNN, could take around 47 seconds to process a single image, while YOLOv1 ran at 45 frames per second (and the smaller Fast YOLO at 155). The speed came from the basic design choice of a single network pass instead of the thousands of evaluations that region-proposal approaches need. This efficiency made object detection usable in real-world situations where latency matters, such as self-driving cars, live video monitoring, and interactive apps.
Understanding the Global Context
Sliding window methods only ever see small patches of a picture, but YOLO sees the entire image during training and testing. This global context helps the algorithm reason about how objects relate to their surroundings, so it makes fewer false positive mistakes on background regions than approaches that only look at small crops.
End-to-End Optimization
Because the entire detection pipeline consists of a single neural network, YOLOv1 can be optimized end-to-end directly on detection performance. Previous methods often involved multiple stages that were optimized separately, which could lead to suboptimal overall performance.
Understanding YOLOv1's Limitations
Problems with Localization Accuracy
The biggest problem was localization: YOLO was not as precise at placing bounding boxes as slower approaches like Fast R-CNN. Several architectural decisions caused this restriction. First, because down-sampling layers were used throughout the network, YOLO predicted boxes from relatively coarse features rather than fine spatial detail, making exact object boundaries harder to pin down. Second, its sum-squared loss treated errors in small boxes much like errors in large boxes, even though the same pixel error hurts a small object far more; the paper partially compensated by predicting the square root of the width and height.
Finding Small and Grouped Objects
YOLOv1 had trouble finding small objects and objects that appeared in groups. Because of the 7×7 grid design, each grid cell predicts only two bounding boxes and a single class, so it can detect at most two objects, and only of one class. When several objects of the same class were close together, such as birds in a flock, their centers could all fall into the same grid cell, and the algorithm could not report them all.
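A short illustrative sketch of this failure mode, with made-up pixel coordinates: two nearby object centers map to the same grid cell, so they must compete for that cell's two boxes and single class:

```python
# Illustrative sketch: mapping object centers to YOLOv1 grid cells.
S = 7
image_w, image_h = 448, 448  # YOLOv1's input resolution

def grid_cell(cx, cy):
    """Return the (row, col) of the cell that owns an object center."""
    return int(cy / image_h * S), int(cx / image_w * S)

# Two birds in a flock, centers only ~20 pixels apart:
print(grid_cell(100, 130))  # (2, 1)
print(grid_cell(120, 140))  # (2, 1) -- same cell: the detections collide
```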
Aspect Ratio Generalization
Another big problem was that the algorithm struggled with objects whose aspect ratios (width-to-height ratios) were poorly represented in the training data. If YOLO was trained mostly on pictures of cars about twice as wide as they are tall, it might have trouble finding very long, thin vehicles or unusually shaped objects.
The Technical Innovations That Make It Fast
Designing the Loss Function
The loss function design was one of the smartest things about YOLOv1. A single loss had to juggle several tasks at once: predicting bounding box coordinates, confidence scores, and class probabilities. It assigned different weights to different types of predictions, recognizing that not all mistakes are equally important.
The loss heavily weighted coordinate errors for boxes that contain objects (λ_coord = 5 in the paper) and down-weighted confidence errors for grid cells that contain nothing (λ_noobj = 0.5). This careful balancing let the network learn accurate predictions without the many empty cells overwhelming the gradient and pushing every confidence toward zero.
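Here is a hedged, heavily simplified sketch of that weighting for a single grid cell with one predicted box; the real loss sums over all cells and box predictors and assigns each object to the predictor with the highest overlap:

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5   # weights from the paper

def cell_loss(pred, target, has_object):
    """pred/target: dicts with keys x, y, w, h, conf, class_probs."""
    if has_object:
        # Coordinate error, heavily weighted; sqrt on w/h so small boxes
        # are not drowned out by errors on large boxes.
        coord = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
        coord += (np.sqrt(pred["w"]) - np.sqrt(target["w"])) ** 2
        coord += (np.sqrt(pred["h"]) - np.sqrt(target["h"])) ** 2
        conf = (pred["conf"] - target["conf"]) ** 2
        cls = np.sum((pred["class_probs"] - target["class_probs"]) ** 2)
        return LAMBDA_COORD * coord + conf + cls
    # Empty cell: only penalize (down-weighted) confidence.
    return LAMBDA_NOOBJ * pred["conf"] ** 2

pred = {"x": 0.5, "y": 0.5, "w": 0.2, "h": 0.3, "conf": 0.8,
        "class_probs": np.zeros(20)}
target = dict(pred, conf=1.0)
print(cell_loss(pred, target, has_object=True))  # 0.04 (confidence term only)
```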
Feature Extraction Strategy
YOLOv1's convolutional network was based on GoogLeNet but modified for the detection task. In place of GoogLeNet's inception modules, it uses alternating 1×1 and 3×3 convolutional layers, with the 1×1 layers reducing the feature depth so the computational cost stays manageable. This design choice was essential to the speed that made real-time detection possible.
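As a rough illustration of the pattern (the channel sizes here are one plausible pairing, not copied from the paper's exact layer table), a 1×1 convolution squeezes the channel depth before the 3×3 convolution does the expensive spatial work:

```python
import torch.nn as nn

# Illustrative bottleneck: 1x1 reduces channels, 3x3 extracts spatial
# features at the reduced depth, cutting the multiply count substantially.
block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),              # 1x1: reduce channels
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),   # 3x3: spatial features
    nn.LeakyReLU(0.1),
)
```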
When making detection decisions, YOLO could draw on information from the whole image instead of just the local features around each predicted object, which further reduced false positives.
Conclusion
YOLOv1 had its problems, but it pioneered single-pass detection and set the stage for everything that followed. Its fundamental ideas guided the development of YOLOv2, YOLOv3, YOLOv4, and beyond: treat detection as a single regression problem, use global image context, and optimize the whole pipeline end-to-end for detection performance.
If you're interested in computer vision, studying YOLOv1 is a great way to see how questioning basic assumptions and approaching a challenge from a different angle can lead to groundbreaking research. Its success shows that the best new ideas sometimes come from rethinking old problems rather than making incremental tweaks to existing solutions.