The battle between copyright holders and generative AI platforms is heating up. What happens when I ask OpenAI’s ChatGPT for some copyright advice?
Copyright holders are starting to sue AI vendors
Whether it’s screeds of text or pretty pictures, generative AI needs a large bank of training data to work from. In some instances, we know what data a system is using, such as Stable Diffusion’s use of the LAION-5B dataset. In other instances, such as ChatGPT, we have no idea what data was used to train the system… or where OpenAI got it from. All we know is that the dataset is large and “scraped” from the internet. This means there’s a very good chance that it contains material protected by copyright. Whether OpenAI has the right to use this data and, furthermore, to monetise it is highly questionable.
People are starting to question AI vendors on where they get their data, from publications that have seen their work plagiarized by AI to large corporations like Getty Images, which is taking the makers of Stable Diffusion to court. A group of artists has even started a class action against AI art makers that you can join here.
Is Generative AI stealing from copyright holders?
The question of whether or not an AI vendor has breached copyright seems to hang on two things: the definition of “fair use” and the definition of a “derivative work”.
In the United States, for example, copyright is limited by the doctrine of “fair use,” under which certain uses of copyrighted material, including but not limited to criticism, commentary, news reporting, teaching, scholarship, or research, may be considered fair. A derivative work is a work based on or derived from one or more already existing works. Common derivative works include translations, musical arrangements, motion picture versions of literary material or plays, art reproductions, abridgements, and condensations of preexisting works.
Arguably, what ChatGPT has produced is an impressive and elaborate piece of research work, answering the question of “can this be done with AI technology”. However, with investment now flooding into the company, a monetisation strategy is clearly not far behind. In the meantime, whilst OpenAI continues to provide ChatGPT for free, there are already lots of people monetising it for themselves.
Every time this happens, it impacts the ability of a working human creator to sell their content or services. Furthermore, it devalues the work of others by driving costs down towards zero.
Regardless of whether or not ChatGPT is “fair use”, the content being created with it, if agreed to be a derivative work, is subject to copyright law. That means the original creators should be paid to license their work.
It’s a complex question and would be pretty tricky to untangle for a layperson… unless I had access to a well-trained generative AI that I could ask to sort it all out for me.
I decided to ask ChatGPT what it thought the answer was…
Why text generated by ChatGPT should be considered a derivative work for copyright
That all sounds pretty compelling, right? Gold star for ChatGPT on that one. In the interests of balance, I also asked for an argument against text generated by ChatGPT being a derivative work…
This may be some bias on my part, but I find this far less compelling than the previous argument. It’s also revealed another problem – as the work generated does not have a human author, it’s not eligible for copyright. This came up recently in the case of an AI-generated comic book that was denied copyright protection in the US.
As far as I can see, ChatGPT seems to agree – these are derivative works.
But whatever is to be done about it? Well, perhaps we could start by paying the creators of the content that was used to train the model…
Five reasons OpenAI should pay the creators of content used to train ChatGPT
Wow. Powerful stuff there. But we know that AI vendors have already argued that it’s not feasible for them to pay creators or even to track them down in the first place. What should we do about that?
If copyright holders cannot be contacted, should their work be excluded from the training dataset?
ChatGPT clearly thinks that copyright material should be excluded from the training dataset if the copyright holder cannot be contacted. This reminds me of a debate I was involved in recently online regarding Stability AI introducing support for an “opt-out” tag that could be used to prevent your content from being scraped into their dataset. Personally, though, I don’t think we should have to “opt-out” of being stolen from.
What should AI vendors be saying to copyright holders?
Working from the assumption that copyright holders did need to be contacted and their permission sought to include their work in any training dataset, I decided to ask ChatGPT to write me a letter to a copyright holder asking for that permission and offering fair compensation. Although ChatGPT didn’t put a numeric figure on the compensation, it did offer some stark warnings to the copyright holder about how they should value their own content.
Another interesting feature of this letter is that it sets out that OpenAI would properly credit the copyright holder and inform the user of the work being used. This is something that ChatGPT currently refuses to do – if you ask it for a citation it often responds that it has drawn from “many sources” and can’t tell you exactly where any information came from. In that respect, it sometimes reminds me of Del Boy in “Only Fools and Horses” who would say of his somewhat questionable merchandise “Where it all comes from is a mystery…”.
The warnings and caveats in this letter from ChatGPT gave me pause. What other problems might a copyright holder face if they decided to allow their content to be included in an AI training dataset?
There was only one way to find out…
Five reasons copyright holders should refuse permission for OpenAI to use their material in its training dataset
Yikes. Based on that, I don’t think I’d ever hand over a license to an AI vendor without major compensation. Maybe that’s why they haven’t been asking…
Whether or not OpenAI or any other AI vendor will ever proactively reach out to copyright holders seems to now rest on the outcome of court cases that will test some of the assumptions and beliefs that people have about the material that has gone into the AI training datasets and the material that is coming out of them.
The problem with this whole approach is, of course, that it places the onus back on individual copyright holders to protect their copyrights by opting out and chasing down companies that they think may have used their work without permission. That’s one thing if you’re Getty Images; it’s quite another for individual writers and artists.
Thankfully, ChatGPT can come to the rescue once again…
What should you do if you think OpenAI has used your copyrighted material without permission?
And, just in case you were wondering…
How do I write a DMCA request asking OpenAI to confirm if my copyright material is included in their dataset and, if so, remove it?
And on that note…
I’ll leave you with this final thought. Everything and anything you get from ChatGPT started out somewhere else on the internet in some form. Perhaps just as ingredients waiting to be pulled together, bricks just waiting to be put into the right arrangement, perhaps as a fully-fledged answer to your question that merely needs to take a spin through the ChatGPT machine to come out looking shiny and new.
The point is – humans made that content. Not a machine.
Don’t let ChatGPT do your thinking and your research for you. Search online, read stuff, and maybe leave a comment or two on the websites that you find most helpful. If you don’t, then there’s a chance that the future includes a lot more members-only paywalled content, even from humble little bloggers like me.
Until then, here’s The Streets and some of their Original Pirate Material. Seems apt.