Artificial Inspiration – Debating The Implications Of Training AI To Create Images

There is growing debate over the ethical and legal implications of using millions of images drawn from the internet to train AI-powered software to create ‘new’ images. It feels like the beginning of a journey which could have profound implications for the creative industries – so it is perhaps predictable that the legal battles have begun.

One of the more alarming technological predictions about the 21st century is the potential automation of more and more fields of employment. Some qualified opinions – perhaps most famously at the Oxford Martin School – have suggested that almost half of all work in the United States might eventually be automated, a figure that sounds like a deliberate exaggeration, but isn’t.

That idea is worrying people well beyond the blue-collar fields traditionally most threatened by automation. It’s already visibly happening even in the creative industries, where people might have felt less exposed. An early skirmish in that fight is a lawsuit targeting AI developers Midjourney and Stability AI, as well as the vastly popular consumer-facing site DeviantArt, launched by three artists who claim that the use of their work as training data for a machine learning system infringes copyright. The group’s lawyer has suggested that the Stable Diffusion text-to-image processor “contains unauthorized copies of millions – and possibly billions – of copyrighted images.” In parallel, Getty Images has also initiated action against Stability AI.

There are certainly a lot of ways to implement machine learning – or artificial intelligence, depending on the details – but claims that such systems literally rely on keeping copies of the training data seem likely to be questioned. Exactly what a system like Stable Diffusion absorbs from its training data is difficult to express. A full technical description is beyond the scope of this article, but a simple example of modern machine learning is reading handwritten characters, a favorite early application. The input data (the brightness of dozens of pixels) is transformed into output data (one of twenty-six characters).

That transformation is done by setting up connections between those dozens of input and output nodes via (usually) several layers of intermediate nodes, with interconnections between the layers. Some of the connections have more influence over the result than others; they’re weighted. Those weights are set such that pixels from input images of (say) a handwritten letter A will tend to activate the output node representing the letter A. The process of setting those weights is how the system learns, using many different images of known characters and adjusting the weights for the desired result. Experience shows that the system is then likely to be able to interpret previously unknown images showing handwritten characters with good accuracy.
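To make that concrete, here’s a minimal sketch of the handwriting example in Python using scikit-learn. The bundled 8×8 digit images (so ten output classes rather than the twenty-six letters above), the single 32-node hidden layer and the iteration count are all illustrative choices, not a description of any system involved in the lawsuits:

```python
# A minimal sketch of the handwriting example described above, using
# scikit-learn's bundled 8x8 digit images (64 input pixels, 10 output
# classes). The dataset and layer sizes are illustrative choices only.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()  # 1,797 labelled images of handwritten digits
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# One hidden layer of 32 nodes sits between the 64 input pixels and the
# 10 output nodes. fit() is the weight-setting step described above: it
# repeatedly adjusts the connection weights so that known training
# images tend to activate the correct output node.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# The trained network generalizes to images it was never shown.
print("accuracy on unseen images:", net.score(X_test, y_test))
```

The point to notice is that the training step stores nothing image-like; only the connection weights change.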

This is a neural network, and while there are a few ways to implement software that can reasonably be described as machine learning or AI, a neural network is the prototypical example. Relating that to the court case, the key point is that the things learned by the system are represented in the configuration of weighted interconnections. Analyzing that internal state is an advanced research topic in itself; in practice, a well-trained network behaves as an impenetrable black box, and it’s notoriously difficult to relate its connection configuration to anything a human would recognize. That’s why, for instance, it can be so hard to establish why a machine learning system made a particular decision.
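Continuing the hypothetical sketch above (the `net` object and its scikit-learn `coefs_` attribute come from that example), inspecting the trained model shows why the internal state resists interpretation – it is just matrices of floating-point weights:

```python
# Everything the network 'learned' lives in these weight matrices; no
# pixel data from the training set is stored anywhere in the model.
for i, layer in enumerate(net.coefs_):
    print(f"layer {i} weight matrix shape: {layer.shape}")

# A tiny corner of the first matrix: opaque floating-point numbers with
# no obvious mapping back to any individual training image.
print(net.coefs_[0][:2, :4])
```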

Given all that, it’s tricky to claim that any machine learning system literally contains copies of an image, or to describe what it does as “collage,” as has been said. It’s just as difficult, though, to discuss what such systems do contain without resorting to vague allusions. Inevitably, they contain something of the distilled essence of the training data. Represented somewhere in that hard-to-interpret miasma of information is, ideally, some kind of understanding of the subject the system is intended to handle.

It’s certainly enough for an AI to duplicate the style of a particular artist, as at least one artist has complained. The AI might not contain image data, but if it contains enough information to create works that look like they were created by a specific person, it’s difficult to claim that there is nothing of that person’s work embedded in it. The way that happens may be complicated and poorly understood, but we can be confident it does happen. As such, the fundamental objection to this application of AI might simply be that it’s being used as a way to skirt copyright law by a process of diffusion.

The output of the AI is based on the training data – it can’t be based on anything else – but there’s a huge number of people represented in that training data, and because of the problems with determining why any AI made any particular decision, it’s very hard to associate any particular feature of a generated image with any particular part of the training data.

Crucially, this is – at least in broad terms – how human beings work. That’s why we use the word “intelligence” in the initialism “AI”. We are all a sum of our life experiences. The only way anyone learns how to draw pictures, compose music or form any opinion on anything at all is by experiencing what other people have done. We use words like “inspiration” to describe situations where the work of one person has influenced the work of another. Even that idea is contested enough to generate court cases intended to decide whether one piece of music is too directly derivative of another. It’s hard to imagine we can make that decision for AI if we can barely do it with humans.

And even if we could, most of us probably don’t extend the same moral latitude to AIs as we would to human artists. What we’re talking about here is often the work of an individual – an amateur, part-time or professional artist posting work on a user-generated content website. Having a large corporation extract the essence of that work as part of an automatic process which might become highly profitable, while circumventing the artist entirely, is something that instinctively seems wrong to a lot of people in a way it wouldn’t if the borrowing were done by another human being.

One problem is the sheer scale of what’s possible. The AI can work day and night at very low cost, flooding the world with material much faster than any competing human artist could ever hope to achieve. By contrast, a human capable of competently creating something someone wants to buy has at least put time into gaining that competence. We can argue about the exact cash value of that kind of effort, but it is at least a rate-limiting step on how much material can be published. An AI, meanwhile, can be retooled to create anything, or to duplicate any artist, or many artists, at any time, in huge volume and with few resources.

The other way to look at this is to consider not whether it’s desirable, but what might be done to prevent it. The process of exposing an AI to training data inevitably involves duplicating that data – but then again, so does sending it across a network or keeping it in a browser’s cache, so it’s dubious whether that’s something existing copyright law can reasonably control.

Even if we could control it, though, the issue is whether we’d want to. Current AI research often relies on exposing new systems to databases of training information so unimaginably vast that they could barely exist without the modern internet. Attempts could be made to restrict that, but some of those systems might plausibly be capable of genuinely world-changing things, and serious restrictions might hobble a technology with enormous potential to help solve society-scale problems.

Some sort of compromise is probably needed here. Back in the more prosaic world of film and TV production, there are certainly stakes both for established content creators and for people who want to use AI to create content. Those groups already overlap quite significantly, so many people will have a foot in both camps of this issue – and the moral questions are at least as influential as the technical ones.

On one hand, the DeviantArt lawsuit will inevitably be the first of many as society figures out how it will interact with ascendant AI. On the other, new technologies have frequently been heralded as much more influential than they turn out to be. As we spend the early twenty-first century struggling to clean up the various deadly messes left by the dawn of the nuclear age, there’s searing irony in a 1954 quote from the then chairman of the Atomic Energy Commission, who claimed that nuclear power would make electricity too cheap to meter. Reality, it seems, has a way of making things mundane.

The website CNET Money recently found this out when it began publishing articles described cautiously as “AI assisted,” and discovered that some of those articles contained clear factual errors. Self-driving cars are already establishing a checkered history, although possibly only because people are deliberately defeating the safety interlocks. Science fiction has often shown us what the worst possible case of AI could be. In an ideal world, we’ll spend the next few decades discovering what the best case might be. Perhaps the most realistic expectation is that we’ll get the usual complicated mix of both.
