Studio – VITAL CAPACITIES

With the video pipeline established, I turned my attention to processing the visual data. The core of this stage is the YOLO (You Only Look Once) object detection model. YOLO is a real-time object detection system that identifies and classifies objects in a single pass over an image, which makes it fast enough to run on live video. For this project, I am intentionally using a model pre-trained on the COCO (Common Objects in Context) dataset. COCO is a large-scale collection of images depicting common objects in everyday scenes and is a standard benchmark for training and evaluating computer vision models.

My goal is not to achieve flawless object recognition but rather to play with the inherent “mistakes” and misinterpretations the machine makes. A model trained on the default COCO dataset is well suited to this, as its general-purpose training can lead to incorrect predictions when applied to novel or ambiguous scenes. To manipulate the image data, which is essentially a grid of pixels, I am using NumPy (Numerical Python). NumPy is a fundamental library for scientific computing in Python that allows for efficient manipulation of large, multi-dimensional arrays and matrices, the very structure that represents digital images.

What is Object Detection?

Object detection is a field of computer vision and image processing concerned with identifying and locating instances of objects within images and videos. Unlike simple image classification, which assigns a single label to an entire image, object detection models draw bounding boxes around each detected object and assign a class label to it, providing more detailed information about the scene.
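To make the difference between classification and detection concrete, here is a minimal sketch in Python. It is not the code used in this project: the `Detection` class, the helper functions and all the values are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                               # class name, e.g. "cup"
    confidence: float                        # model's score between 0 and 1
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels


# Image classification: a single label for the whole frame.
classification = "kitchen"

# Object detection: several labelled boxes for the same frame (made-up values).
detections = [
    Detection("cup", 0.82, (120.0, 200.0, 180.0, 270.0)),
    Detection("person", 0.31, (125.0, 190.0, 185.0, 260.0)),
    Detection("dining table", 0.77, (0.0, 250.0, 640.0, 480.0)),
]


def keep_confident(dets: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only the detections the model is fairly sure about."""
    return [d for d in dets if d.confidence >= threshold]


def keep_uncertain(dets: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep the low-confidence detections, where the odd guesses tend to live."""
    return [d for d in dets if d.confidence < threshold]


confident_labels = [d.label for d in keep_confident(detections)]
uncertain_labels = [d.label for d in keep_uncertain(detections)]
```

A typical application would keep only the confident detections. For a project interested in the machine's mistakes, the uncertain ones may be the more interesting half.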

What are NumPy and the COCO Dataset?

NumPy: A Python library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. In image processing, an image is treated as a 3D array (height, width, colour channels), making NumPy an indispensable tool for any pixel-level manipulation.
COCO Dataset: Standing for “Common Objects in Context,” this is a massive dataset designed for object detection, segmentation, and captioning tasks. It contains hundreds of thousands of images with millions of labelled object instances across 80 “thing” categories and 91 “stuff” categories, providing a rich foundation for training computer vision models.
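
As a small illustration of the "image as a 3D array" idea, the sketch below builds a tiny synthetic image rather than loading a real frame. The sizes and colours are arbitrary, and it assumes RGB channel order (OpenCV, by contrast, stores frames as BGR).

```python
import numpy as np

# A tiny synthetic image: 4 pixels high, 6 pixels wide, 3 colour channels.
image = np.zeros((4, 6, 3), dtype=np.uint8)

image[:, :, 0] = 255            # fill the first channel (red, assuming RGB order)
image[1:3, 2:4] = (0, 255, 0)   # paint a 2x2 green patch in the middle

inverted = 255 - image          # invert every pixel value at once
flipped = image[:, ::-1]        # mirror the image horizontally
```

Each operation acts on the whole array in one step, with no explicit loop over pixels, which is what makes NumPy practical for per-frame manipulation of video.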

Objects Detectable by a COCO-Trained Model:

A model trained on the COCO dataset can identify the following 80 common object categories:

People: person
Vehicles: bicycle, car, motorcycle, airplane, bus, train, truck, boat
Outdoor: traffic light, fire hydrant, stop sign, parking meter, bench
Animals: bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe
Accessories: backpack, umbrella, handbag, tie, suitcase
Sports: frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket
Kitchen: bottle, wine glass, cup, fork, knife, spoon, bowl
Food: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake
Furniture: chair, couch, potted plant, bed, dining table, toilet
Electronics: tv, laptop, mouse, remote, keyboard, cell phone
Appliances: microwave, oven, toaster, sink, refrigerator
Indoor: book, clock, vase, scissors, teddy bear, hair drier, toothbrush

Weaving a World Model with Reinforcement Learning Concepts

With a large dataset of generated policies, the next step is to import them back into the primary software application that displays the 360-degree video. This integration allows the dynamically generated rules to influence the visual output or behaviour of the system in real time. My use of the term “policy” is a deliberate nod to its origins in the field of Reinforcement Learning (RL), whose modern form took shape in the 1980s. In RL, a policy is the strategy an agent employs to make decisions and take actions in its environment. It is the core component that dictates the agent’s behaviour as it learns through trial and error to maximise cumulative reward. By generating policies based on visual input, my system is, in a sense, creating its own world model: a simplified, learned representation of its environment and the relationships within it. This process echoes the fundamental principles of how an AI agent learns to react to and make sense of the real world, a topic I have delved into in more detail in some of my earlier writings.

Generating AI Policies from Object Proximity

In a more experimental turn, I developed a separate piece of software to explore the concept of emergent behaviour based on the object detection output. This program uses a Large Language Model (LLM) to generate “policies” when objects from the COCO dataset are detected in close proximity on the screen. The system calculates the normalised distance between the bounding boxes of detected objects. This distance value is then fed to the LLM, which has been prompted to generate a policy or rule based on the perceived danger or interaction potential of the objects being close together. For instance, if a “person” and a “car” are detected very close to each other, the LLM might generate a high-alert policy, whereas a “cup” and a “dining table” would result in a benign, functional policy. This creates a dynamic system where the AI is not just identifying objects, but also creating a narrative or a set of rules about their relationships in the environment.
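
The proximity step can be sketched roughly as follows. This is an assumption about one reasonable way to do it, not the project's actual code: it measures the distance between box centres and divides by the frame diagonal, and the function names, the 0.2 threshold and the example boxes are all hypothetical. It also ignores the fact that an equirectangular 360-degree frame wraps around at its left and right edges.

```python
import math
from itertools import combinations


def box_centre(box):
    """Return the (x, y) centre of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def normalised_distance(box_a, box_b, frame_width, frame_height):
    """Centre-to-centre distance divided by the frame diagonal (0 = same spot)."""
    ax, ay = box_centre(box_a)
    bx, by = box_centre(box_b)
    return math.hypot(ax - bx, ay - by) / math.hypot(frame_width, frame_height)


def close_pairs(detections, frame_width, frame_height, threshold=0.2):
    """Return (label_a, label_b, distance) for every pair closer than threshold."""
    pairs = []
    for (label_a, box_a), (label_b, box_b) in combinations(detections, 2):
        d = normalised_distance(box_a, box_b, frame_width, frame_height)
        if d < threshold:
            pairs.append((label_a, label_b, d))
    return pairs


# Made-up detections in a 640x480 frame: (label, (x1, y1, x2, y2)).
detections = [
    ("person", (100, 100, 200, 300)),
    ("car", (180, 150, 380, 250)),
    ("cup", (600, 400, 640, 480)),
]

pairs = close_pairs(detections, 640, 480)
```

Only the pairs returned by `close_pairs` would then be passed on, together with their distance, as input to the LLM prompt.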

Setting Up the Software with 360-Degree Vision

The initial phase of this project involved tackling the technical groundwork required to process 360-degree video. I began by using OpenCV, a powerful open-source computer vision library, to stitch together the two separate video feeds from my 360-degree camera. OpenCV is an essential tool for real-time image and video processing, providing the necessary functions to merge the hemispheric views into a single, equirectangular frame. After successfully connecting the camera to my computer, I set up a basic Python workspace within my integrated development environment (IDE). The next step was to write a script that could access the camera’s video stream and display it in a new window, confirming that the foundational hardware and software were communicating correctly. This setup provides the visual canvas upon which the subsequent layers of AI-driven interpretation will be built.
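
For readers unfamiliar with the equirectangular format, the sketch below shows how a viewing direction maps to a pixel in such a frame: horizontal position is proportional to yaw, vertical position to pitch. It does not perform any stitching. The function name and the angle conventions (yaw 0 at the centre of the frame, pitch +90 at the top) are my own assumptions for illustration.

```python
def direction_to_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to (x, y) in an equirectangular frame.

    yaw_deg:   -180 to 180, where 0 is the horizontal centre of the frame
    pitch_deg: -90 (straight down) to 90 (straight up)
    """
    x = (yaw_deg + 180.0) / 360.0 * width
    y = (90.0 - pitch_deg) / 180.0 * height
    # The frame wraps around horizontally; vertically it is clamped to the last row.
    return (int(x) % width, min(int(y), height - 1))


# Looking straight ahead lands in the middle of a 3840x1920 frame.
centre = direction_to_pixel(0, 0, 3840, 1920)
# Looking directly behind wraps around to the left edge.
behind = direction_to_pixel(180, 0, 3840, 1920)
```

The horizontal wrap is the detail worth keeping in mind: an object that straddles the seam of the 360-degree frame appears split between the far left and far right of the image.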

Embracing the Algorithmic Uncanny

https://docs.ultralytics.com

I am revisiting a creative process that has captivated my interest for some time: enabling an agent to perceive and learn about its environment through the lens of a computer vision model. In a previous exploration, I experimented with CLIP (Contrastive Language-Image Pre-Training), which led to the whimsical creation of a sphere composed of text, a visual representation of the model’s understanding. This time, however, my focus shifts to the YOLO (You Only Look Once) model. My prior experiences with YOLO, using the default COCO dataset, often yielded amusingly incorrect object detections—a lamp mistaken for a toilet, or a cup identified as a person’s head. Instead of striving for perfect accuracy, I intend to embrace these algorithmic errors. This project will be a playful exploration of the incorrectness and the fascinating illusions generated by an AI model, turning its faults into a source of creative inspiration.

* visualization using CLIP and Blender for artwork “Golem Wander in Crossroads”

Ultralytics YOLO

Ultralytics YOLO is a family of real-time object detection models known for their speed and efficiency. Unlike two-stage detectors, which first propose candidate regions and then classify each one, YOLO processes the entire image in a single pass to identify and locate objects, making it well suited to applications like autonomous driving and video surveillance. In the original YOLO design, the image is divided into a grid, and each grid cell is responsible for predicting bounding boxes and class probabilities for objects centered within it. Over the years, YOLO has evolved through numerous versions, each aiming to improve on the speed and accuracy of its predecessors.
(Text from Gemini-2.5-Pro and edited by artist)
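
The grid idea can be sketched in a few lines. This follows the scheme of the original YOLO paper, which used a 7x7 grid; later versions predict at several scales and differ in detail, so treat it as an illustration of the concept rather than of how current Ultralytics models work. The function name and example box are hypothetical.

```python
def responsible_cell(box, image_width, image_height, grid_size=7):
    """Return the (row, col) of the grid cell containing the box centre."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2.0
    cy = (y1 + y2) / 2.0
    # Scale the centre into grid units; clamp so a centre on the far edge stays inside.
    col = min(int(cx / image_width * grid_size), grid_size - 1)
    row = min(int(cy / image_height * grid_size), grid_size - 1)
    return (row, col)


# A made-up box in a 448x448 image, the input size used by the original YOLO.
cell = responsible_cell((100, 100, 200, 300), 448, 448)
```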

CLIP

https://github.com/openai/CLIP

CLIP (Contrastive Language-Image Pre-Training), developed by OpenAI, is a neural network that learns visual concepts from natural language descriptions. It consists of two main components: an image encoder and a text encoder, which are trained jointly on a massive dataset of 400 million image-text pairs from the internet. This allows CLIP to create a shared embedding space where similar images and text descriptions are located close to one another. A key capability of CLIP is “zero-shot” classification, meaning it can classify images into categories it wasn’t explicitly trained on, simply by providing text descriptions of those categories.

(Text from Gemini-2.5-Pro and edited by artist)
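
A toy sketch of the zero-shot idea, in plain Python. The three-dimensional vectors below are invented by hand for illustration; real CLIP embeddings have hundreds of dimensions and come from the trained image and text encoders, and real CLIP also scales the similarities by a learned temperature before the softmax. Nothing here calls the actual CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the text prompt whose embedding lies closest to the image embedding."""
    scores = {
        prompt: cosine_similarity(image_embedding, vec)
        for prompt, vec in text_embeddings.items()
    }
    # Softmax turns the similarity scores into values that sum to 1.
    top = max(scores.values())
    exps = {prompt: math.exp(s - top) for prompt, s in scores.items()}
    total = sum(exps.values())
    probs = {prompt: e / total for prompt, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hand-made stand-ins for encoder outputs (not real CLIP embeddings).
text_embeddings = {
    "a photo of a lamp": [0.9, 0.1, 0.2],
    "a photo of a toilet": [0.2, 0.9, 0.1],
    "a photo of a cup": [0.1, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best, probs = zero_shot_classify(image_embedding, text_embeddings)
```

The categories are supplied only as text at classification time, which is what "zero-shot" means here: changing the list of prompts changes what the model can say about the image, with no retraining.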

COCO

https://cocodataset.org/#home

https://docs.ultralytics.com/datasets/detect/coco

COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset. It is designed to encourage research on a wide variety of object categories and is commonly used for benchmarking computer vision models. It is an essential dataset for researchers and developers working on object detection, segmentation, and pose estimation tasks.

(Text from Ultralytics YOLO Docs)

Studio

More moments

A hand hangs out of the car window and is reflected in the wing mirror. The road behind doesn't look too busy.

A hand hangs out of the car window and is reflected in the wing mirror. The fingers are stretched out as if trying to reach or perhaps gently asking for attention. The road behind gets busier. There is a double decker bus and lorry approaching.

A hand hangs out of the car window and is reflected in the wing mirror. The fingers are straight but relaxed as if gently asking for attention or maybe feeling a breeze. The road behind gets busier. There is a double decker bus and the lorry passes by.

Sometimes accidents can be so difficult to recreate. Sometimes in trying to recreate them, they can lead to new things. I loved the gentleness of the hand compared to all the noise and traffic in the background. Not sure if the gesture is quite what I’d like it to be yet.

Studio

Lullaby

Rebekah Ubuntu has been encouraging us to consider the process of this residency to be ‘the thing’. A big part of my process has been working around looking after my son, so to mark this here is a lullaby that I’ve been singing to him since he was born, recorded on my phone a while ago so he can hear it even on the rare occasion I’m not there.

Video Description: The subtitles of the song are in white text at the bottom of the screen. They overlay a background of refracted light ripples moving leisurely across sand at the bottom of the ocean. Occasionally they bounce backwards and reverse their direction.

Audio Caption: Leah sings in a bathroom – slightly echoey but small-sounding. The mic is just a phone, and sometimes distorts with the sound of breathing. The song is slow, with lots of space in between lines.

Song to the Siren | Tim Buckley

(Lyrics are slightly altered by Leah from the original)

All afloat on the shipless ocean

I did all my best to smile

‘Til your singing eyes and fingers

Drew me loving to your isle

For you sang

Sail to me

Sail to me let me unfold you

Here I am

Waiting to hold you

Did I dream

You dreamed about me?

Were you hare while I was fox?

Now my foolish boat is leaning

Torn lovelorn on your rocks

For you sang

Touch me not

Touch me not come back tomorrow

Oh my heart

Shies from the sorrow

I’m as puzzled as a newborn child

I’m as riddled as the tide

Should I stand amid the breakers?

Should I lie with death for my bride?

And you sang

Swim to me

Swim to me let me unfold you

Here I am

Waiting to hold you.

Studio

Had a little moment earlier in the week feeling the breeze on my hand while in traffic. I wanted to recreate this image or at least experiment with this idea a little more but then some barriers got in the way (broken lifts) so I’ve been stuck indoors for a couple days.

An arm hangs out a car window on a tree-lined street in London. The car's side mirror reflects the hand feeling a gentle breeze. It is a sunny day.

Studio

A solid muted coral-pink color background

A solid warm beige or tan color background

Some colours I noticed dominating previous images I have taken. I had it in my head that I wanted to use more colour in any potential new work. I wondered if some previous work could inform what those colours could be…?

Studio

Experimenting…

Today I’ve been testing out an experimental way of filming, which I’ll use to make part of the work. Here is a little peek at what’s to come…

ID: Out of the darkness appears a stream of white light. It refracts into rainbow colours, then reassembles its original colour, slipping in and out of pink yellow and blue and back to white again. Waves of more defined lines flow like ripples, but smokey – lighter than water.

Studio

At the Altar

‘Those seeking divine help for an illness or affliction might rest overnight in special temple buildings. On waking, priests of the Roman god of healing, Aesculapius, helped them interpret their dreams or visions’

I made it to the temple.

A museum banner in a pillared 18th century interior. On it is a Roman sculpted face, obscured in shadow and picked out in dramatic light. It reads ‘the goddess awaits you at the temple of Sulis Minerva’

A stone sculpted head in a dark space. It’s on a plinth and its mouth and nose have eroded away, leaving blank eyes and an impressive plaited hair arrangement like a crown over her head.

A coffin underfoot, under glass. It is small and yellowish, made of an unknown material. It is enclosed at one end, and warped by its two thousand years.

Behind a statue, its cape hanging in folds, we look down upon the green bath from a height. The statue’s counterparts face it opposite, along the walkway which follows around the edge of the bath. Each of them is permanently posed, guarding or adorning the watery centre.

Up close at the corner of the bath. The cut stone corner descends in steps, the water consumes them in its milky green opacity.

A central view of the bath from the bathside: a green rectangular body of water, Roman pillars surrounding it. A small walkway runs behind the pillars, ending at ancient walls. Above, statues line a balcony, and the windows of other old (but perhaps less ancient) buildings surround it.

Image IDs in Alt Text, Video IDs here:

A green body of water, edged in stone. At this corner, a flat rock – perhaps an ancient seat – is laid over a stream trickling underneath it, from a source behind, into the milky green pool. We zoom out and see more of the walkway behind, and the length of the bath. Pillars surround the edge of the pool, receding into darkness behind. We zoom back in to the gentle trickling.
Hot steaming water gushes out of an arched hole. Dark and underground, its surfaces stained orange by sulphur or some mysterious element.
A hot, bubbling green thermal spring. It is contained by straight stone edges, cut into a square with a corner lopped off where the wall of a building in the same material meets it. We zoom into the bubbles, becoming consumed by it.
Water rippling gently in the sun, down its shallow stone path. The surface underneath is stained orange by something, something invisible in the clear water. It flows underneath a stone slab, and into its destination: the large body of green water. We follow its small journey.

REST without AI

This week, Hong Kong was battered by heavy rain, and I took the chance to take a breather and recharge. The last few weeks have been manic. I’ve been working on three software projects at once. The non-stop pace had left me totally overloaded, so this rain break was just what I needed. I decided to visit my wife’s home village, a place to recharge in the middle of the city’s forests. The air smelt of earth, and the quiet beauty of the landscape was a nice change. I could feel the tension of my tightly wound days begin to unravel, replaced by a sense of calm that felt long overdue. The mountains were like silent guards, making me think about the balance between creativity and rest.

I might have got myself a little stuck in my searches. I tried a few online image libraries… trawling the many pages of the Wellcome Collection’s catalogue which I still haven’t reached the end of.

I already have examples of what I would like from some of my previous work that I have shared so I’m giving myself a reminder that the task isn’t impossible. I am considering the thought that maybe I’m already surrounded by the images I’m looking for. For example, I have a mug with this John William Waterhouse painting on it.

A 19th-century painting depicting Saint Cecilia seated in a garden, eyes closed in serene contemplation, with an open book resting on her lap. Two angels kneel before her, one playing a violin and the other holding an instrument. Behind them, a stone balustrade overlooks a harbor with ships and distant mountains. The scene is filled with lush roses and greenery, evoking a peaceful, spiritual atmosphere. — *Saint Cecilia* (1895) by John William Waterhouse

Books on Drawing™️

Wanted to include some phone images I took from a book on drawing people that I found in the local library. There are loads of them on drawing people, cats, dogs, flowers, buildings… It’s all very Drawing™️.

An image of an open book. The left page shows a man's trunk sketched and his head turned to the side. The right page includes a few sketches each focusing on different sections of the trunk.

A little fascinated by the eery “perfection” of it all. Especially in this book which was full of sketches and descriptions of muscles that make up a body part and how to combine it all together on a super athletic male body. It’s quite the opposite to what I was hoping to find when I set out on this search for images. It’s almost too healthy and tense. There’s no ease.

An image of an open book. The left page has text explaining how different parts of the head come together. The right page includes a few sketches each focusing on different aspects of a head such as the skull and different perspectives.

A page from a book showing a drawing of a shoulder with every muscle clearly highlighted.

Studio

The Temple of Sulis Minerva

Daytime, a central view across a green rectangular body of water, Roman pillars surrounding it. A small walkway runs behind the pillars, ending at ancient walls. Above, statues line a balcony, and the windows of other old (but perhaps less ancient) buildings surround it.

I’m pilgrimaging to the Temple of Sulis Minerva, otherwise known as the Roman Baths. More soon…

The same view but from the opposite side. The main two differences are: there is only sky behind the statues lining the balcony, and the sky reflects a warm setting sun. And: torches flame around the pool, illuminating the walkways behind and reflecting in the water.

The same place but at night. The colour of the water is now totally obfuscated by reflections - the torches burn brighter in the dark. We're at the corner of the water, looking across it diagonally. The statues at the top seem to project shadows onto the building behind them, but perhaps these are from living people who are out of view.

On the left, a vintage ad shows a woman lighting a "Metro" gas burner in a classic interior. On the right, a modern photo in a black frame depicts a hand holding a rain-soaked handrail.

I was digging through old images and enjoyed how, upon opening the image on the left, it brought up the one on the right which was buried under windows and tabs on my screen.

Rough selection of images

A detailed black-and-white engraving of two hands preparing food. The left hand holds a spoon inside a cooking pot, stirring, while the right hand pours liquid from a small jug into the pot. The pot sits on a flat surface, and the image is captioned below: “Position of hands mixing the liaison.” — Position of hands mixing the liaison. Pouring the mixture from one saucepan into another

A traditional Japanese woodblock print depicting a person wearing a red kimono decorated with pink and white flowers and green inner layers. The figure is seated and holds a small tea bowl delicately in one hand, with their gaze directed toward it. The style is minimal yet detailed, with soft colors and fine outlines typical of ukiyo-e art. — Close up of Seated lady holding cup in one hand, leaning on other hand

Vintage black-and-white photo of a bedroom with two women at a vanity—one seated in a gown, the other standing and assisting her. Large curtained windows fill the background. Caption reads: “A Bedroom – The Show House, Telford Avenue, Streatham Hill, S.W.”

The Question: 11AUG2025

The modern AI, most prominently represented by Large Language Models (LLMs), prompts a fundamental question: does it contain consciousness? To pose the question another way: the original wellspring of the AI concept is found where brain scientists, computer scientists and mathematicians began to explore whether consciousness itself could be understood as a mathematical or computational process, as a system. This inquiry delves into whether today’s advanced automation is merely sophisticated mimicry or a genuine step towards the sentient machines envisioned by pioneers of the field. Noam Chomsky, for one, openly criticises the GPT model as a fake intelligence, a mere copycat. Or is there an even deeper question: is there any form of computing that can capture the differences between intelligence, awareness and consciousness? Or perhaps we simply don’t understand our own kind, and those three words are just a game of our language, a misconception for things that never existed.

Studio

Searching for the right images