What a Machine Learns When It Learns to See
There is a quiet sleight of hand in the phrase "computer vision." We say a model "sees," and the word smuggles in a whole philosophy, one we rarely stop to examine. Seeing, for us, is bound up with attention, intention, and a body that has spent a lifetime bumping into the world. A model has none of that. It has pixels and labels.
The map is not the territory
A model trained on captioned images learns the correlation between pixels and the words we chose to attach to them. When it labels a photograph "a dog on a beach," it is not recognizing a dog or a beach. It is recovering the statistical residue of a thousand people who once agreed to call certain arrangements of light by those names. It is, in a precise sense, learning us — our categories, our blind spots, the things we found worth naming and the things we walked past.
The danger is not that machines will think like people, but that people will agree to think like machines.
This is why a vision model can be confidently, fluently wrong. It has no world to check itself against — only the consensus of its training set. Show it something no one bothered to photograph, and it will reach for the nearest familiar word and hand it to you with the same calm certainty it gives to everything else.
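A toy sketch can make that calm certainty concrete. It is not how any real vision model is built. It assumes a closed list of three labels, a hypothetical two-number "feature" standing in for a learned embedding, and a softmax over distances to made-up label prototypes. The names `classify` and `PROTOTYPES` are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that always sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, prototypes):
    """Return the closest known label and the probability assigned to it.

    There is no "none of the above": every input receives a known label.
    """
    labels = list(prototypes)
    # Score each label by negative squared distance: closer means higher.
    scores = [
        -sum((f - p) ** 2 for f, p in zip(features, prototypes[label]))
        for label in labels
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical prototypes; a real model would learn these from data.
PROTOTYPES = {"dog": (1.0, 0.0), "beach": (0.0, 1.0), "cat": (-1.0, 0.0)}

familiar = classify((0.9, 0.1), PROTOTYPES)    # close to the "dog" prototype
unfamiliar = classify((8.0, 7.0), PROTOTYPES)  # far from everything it knows

print(familiar)
print(unfamiliar)
```

In this sketch both inputs come back labeled "dog," and the far-away one scores higher (roughly 0.88 against 0.81). The probabilities must sum to one over the words the model already has, so distance from everything familiar never shows up as doubt.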
Perception is compression
What we call perception is closer to compression than to contact. Both eyes and models throw most of the signal away and keep a useful summary. The difference is that we evolved our discards over millions of years of consequences, and the model inherited its discards from a dataset assembled in an afternoon.
That should make us humble in both directions. Humble about the machine, which sees only what we taught it to find. And humble about ourselves — because once you notice how much seeing is really deciding what to ignore, you start to wonder how much of your own clear-eyed view of the world is just a compression you've stopped questioning.