The ‘Why’ Behind a Cuckoo’s Egg


February 12, 2024


Anastassia Lauterbach

illustration of Romy and Roby stealing a cuckoo's egg from a nest in a tree.

Human vision is impressive and complex. Our kind of eye, which is common across vertebrates, evolved in less than 100 million years. It started about 600 million years ago, when tiny organisms developed mutations that made them sensitive to light. Optically and neurologically sophisticated organs, the eyes, appeared around 500 million years ago. If you are curious about how the eye came to be, I recommend the Scientific American article ‘Evolution of the Eye’.

Humans learned to mimic how the eye captures light and colour a long time ago. The first photographic experiments date to around 1816, when Joseph Nicéphore Niépce used a light-sensitive material, silver chloride: while the camera shutter was open, the silver chloride darkened wherever light struck it. Now, 200 years later, we have far more advanced systems that capture photos in digital form. But it turns out that replicating the eye’s mechanics was the easy part. Making a machine recognise what is in the photo is much more difficult. Consider a picture of a cat: a human brain can look at it and immediately know what it is. Our brains have the advantage of millions of years’ worth of evolutionary context to help us understand what we see. A robot, or a software program, doesn’t have that advantage.

To an algorithm, the image looks like a massive array of integer values representing intensities across the colour spectrum. There’s no context here, just an enormous pile of numbers. And context is the crux of getting algorithms to understand image content the way the human brain does.
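To make the "pile of numbers" concrete, here is a minimal sketch with a tiny made-up greyscale image; a real photo is the same idea at a vastly larger scale, with three such grids (red, green, blue) for colour:

```python
# A tiny 4x4 greyscale "image": each number is a pixel intensity
# (0 = black, 255 = white). The values are invented for illustration.
image = [
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

# To an algorithm this is just 16 integers -- nothing in the data
# itself says "cat" or "egg".
pixel_count = sum(len(row) for row in image)
print(pixel_count)  # 16
```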

Machine learning allows scientists and engineers to train that context into an algorithm from a dataset, so that the algorithm can understand what all those numbers, in a specific arrangement, actually represent.

In the late 1980s, Yann LeCun, now Chief AI Scientist at Meta, introduced a technique called convolution. Widely adopted today, it makes object-recognition systems more efficient by building in an array of connections that lets a machine recognise an object no matter where it appears in the picture. In the book, our Roby masters the resulting Convolutional Neural Networks (CNNs) perfectly. A CNN works by breaking an image down into smaller groups of pixels and scanning them with a filter, a small matrix of weights. The network does a series of calculations on these pixels, comparing them against the specific patterns the filter is looking for. The first layers of a CNN detect low-level patterns like rough edges and curves. As the network performs more convolutions, it can begin to identify specific objects like faces, animals, flowers, or birds’ eggs.
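The sliding-filter idea can be sketched in a few lines of plain Python. This is a toy example, not Roby's actual network: we slide a 3×3 edge-detecting filter over an invented image and, at each position, multiply the overlapping pixels by the filter weights and sum the results.

```python
def convolve(image, kernel):
    """Valid (no-padding) 2D convolution of a 2D list by a 3x3 kernel."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            total = 0
            for ky in range(3):
                for kx in range(3):
                    total += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(total)
        out.append(row)
    return out

# A classic vertical-edge filter: it responds strongly where intensity
# changes from left to right and gives zero on flat regions.
edge_filter = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# A toy image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
]

feature_map = convolve(image, edge_filter)
print(feature_map[0])  # [27, 27, 0] -- strong where the edge is, zero where it's flat
```

The output is called a feature map: it tells later layers where in the image this particular pattern (here, a vertical edge) was found.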

How does a CNN know what to look for? And how does it know whether its prediction is accurate? Through a large amount of labelled training data. When the CNN starts, all its filter values are randomised, so its initial predictions make little sense. Each time the CNN makes a prediction on labelled data, it uses an error (or loss) function to compare how close the prediction is to the actual label. The CNN then updates its filter values and starts the process again. Ideally, each iteration is slightly more accurate.
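The randomise-predict-measure-update loop can be shown with a deliberately tiny toy: a single learnable value instead of millions of filter weights, and made-up "labelled data" where the right answer is simply three times the input.

```python
import random

random.seed(0)

# Toy "labelled data": inputs paired with the answers we want (target = 3 * input).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = random.uniform(-1, 1)  # like a CNN's filters, we start from a random value

for step in range(200):
    grad = 0.0
    for x, target in data:
        pred = w * x                      # the model's prediction
        error = pred - target             # how far off the label it is
        grad += 2 * error * x             # direction that would reduce the error
    w -= 0.01 * grad                      # update the value, then repeat

print(round(w, 3))  # 3.0 -- after many iterations the value fits the labels
```

Real networks repeat exactly this cycle, only with millions of values updated at once and a far richer notion of error.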

Most probably, Roby applies the YOLOv8 algorithm (You Only Look Once) for real-time detection of a bird’s egg. Even if he uses another algorithm, he works with the same inputs and outputs as any other computer vision system. An image is Roby’s input. His output is threefold:

  1. a bounding box, or location (in this case, a bird’s nest and the position of an egg in it)
  2. a confidence score (ranging from 0 to 1)
  3. an object category (or class name, e.g., the cuckoo’s egg)
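Put together, one detection in the style a YOLO-like model returns might look like the sketch below. All the numbers are invented for illustration; Roby and his nest are, after all, fictional.

```python
# One detection, with the three outputs listed above. Values are made up.
detection = {
    # 1. bounding box: top-left corner (x, y) plus width and height in pixels,
    #    locating the egg inside the nest image
    "box": {"x": 212, "y": 148, "w": 54, "h": 70},
    # 2. confidence score between 0 and 1
    "confidence": 0.91,
    # 3. object category (class name)
    "class_name": "cuckoo_egg",
}

assert 0.0 <= detection["confidence"] <= 1.0
print(detection["class_name"], detection["confidence"])
```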

There is another difficulty Roby must overcome: correctly differentiating between the egg of an intruder and the eggs of the expectant bird family.

What if we want to analyse a video using machine learning instead of a single image? Our Roby looks into the birds’ nests in real time; he does not study photographs of eggs that someone hands him. At its core, a video is just a series of image frames. When we move to video, things get harder: the items we identify might change over time, and, more importantly, the context between the frames becomes crucial to labelling.

For example, suppose one frame shows a half-full cardboard box. We might want to label it ‘packing a box’ or ‘unpacking a box,’ depending on the frames before and after. This is where CNNs fall short. They can only consider the spatial features of an image; they can’t handle temporal features, how a frame relates to the one before it. To address this, our Roby takes the output of his CNN and feeds it into another model that can handle the temporal nature of video. This type of model is called a Recurrent Neural Network, or RNN. While a CNN treats groups of pixels independently, an RNN can retain information about what it has already processed and use it in its decision-making.

RNNs can handle many types of input and output data. In the example of the boxes, we train the RNN by exposing it to a sequence: empty box, open box, closed box, and finally the label ‘packing.’ The RNN processes each sequence, using a loss (or error) function to compare its predicted output with the correct label. Then it adjusts its weights and processes the sequence again, until it achieves higher accuracy.
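The key trick, a hidden state carried from frame to frame, can be sketched with toy numbers. The per-frame "features" below are invented stand-ins for what a CNN would extract; the two sequences end on the same half-full-box frame but arrive there from opposite directions.

```python
def rnn_step(hidden, frame_feature, w_h=0.5, w_x=1.0):
    # The new state blends what the network remembers (hidden)
    # with what it sees in the current frame. Weights are toy values.
    return w_h * hidden + w_x * frame_feature

# Two "videos" with the same final frame but different histories:
packing   = [0.0, 0.3, 0.5]   # empty box -> filling up -> half full
unpacking = [1.0, 0.7, 0.5]   # full box  -> emptying   -> half full

def run(frames):
    hidden = 0.0
    for f in frames:
        hidden = rnn_step(hidden, f)
    return hidden

# The final frames are identical, yet the final states differ -- the network
# "remembers" how it got here, which is what separates packing from unpacking.
print(run(packing), run(unpacking))
```

A CNN fed only the last frame would produce the same answer for both videos; the recurrent state is what makes the two histories distinguishable.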

In the book, Roby relies on Romy, who needs to take him up the tree and help him look into the nest so he can do his object-detection work. We know that Romy would never leave the house in bad weather, so our Roby works under perfect conditions. Nothing can spoil his dataset: the eggs are dry, and no wind blows leaves into the nest to obscure Roby’s vision. He deals with a clean dataset indeed!

As the computer vision professor Fei-Fei Li, probably the most famous woman in Artificial Intelligence, has said: just as hearing is not the same as listening, taking pictures is not the same as seeing. Computer vision systems have progressed phenomenally in the past fifteen years. Still, recognising and detecting objects does not mean that robots and software programs understand what is in front of them. Does it matter? I leave it to my audience to decide. I can only disclose what is essential to me.

Computer vision is taking on increasingly complex challenges, and in some image-recognition tasks it now rivals humans. I would celebrate every machine capable of correctly identifying cancer cells in a tissue biopsy and freeing doctors from the repetitive, tiresome task of looking at thousands of X-ray images. I applaud drones that can spot cracks in a bridge and pass the information to engineers, so they can repair the damage before an accident happens. My Roby is imaginary, but myriad robots and software programs work alongside humans daily.


Book 1

Romy & Roby And the Secrets Of Sleep.

