Artificial intelligence firm Anthropic hits out at copyright lawsuit filed by music publishing corporations, claiming the content ingested into its models falls under ‘fair use’ and that any licensing regime created to manage its use of copyrighted material in training data would be too complex and costly to work in practice
GenAI tools ‘could not exist’ if firms are made to pay copyright::undefined
It likely doesn’t break the law. You should check out this article by Kit Walsh, a senior staff attorney at the EFF, and this one by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries.
Headlines like these let people assume that it’s illegal, rather than educate people on their rights.
The Kit Walsh article purposefully handwaves around a couple of issues that could present larger issues as law suits in this arena continue.
He says that due to the size of training data and the model, only a byte of data per image could be stored in any compressed format, but this assumes all training data is treated equally. It’s very possible certain image artifacts are compressed/stored in the weights more than other images.
I think some of the points he makes are valid, but they’re making a lot of assumptions about what is actually going on in these models which we either don’t know for certain or have evidence to the contrary.
I didn’t read Katherine’s article so maybe there is something more there.
I’m not sure she does, just read the article and it focuses primarily what models can train on. However, the real meat of the issue, at least I think, with GenAI is what it produces.
For example, if I built a model that just spit out exact frames from “Space Jam”, I don’t think anyone would argue that would be a problem. The question is where is the line?
It’s not surprising that the complaints don’t include examples of substantially similar images. Research regarding privacy concerns suggests it is unlikely it is that a diffusion-based model will produce outputs that closely resemble one of the inputs.
According to this research, there is a small chance that a diffusion model will store information that makes it possible to recreate something close to an image in its training data, provided that the image in question is duplicated many times during training. But the chances of an image in the training data set being duplicated in output, even from a prompt specifically designed to do just that, is literally less than one in a million.
The linked paper goes into more detail.
On the note of output, I think you’re responsible for infringing works, whether you used Photoshop, copy & paste, or a generative model. Also, specific instances will need to be evaluated individually, and there might be models that don’t qualify. Midjourney’s new model is so poorly trained that it’s downright easy to get these bad outputs.
It likely doesn’t break the law. You should check out this article by Kit Walsh, a senior staff attorney at the EFF, and this one by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries.
Headlines like these let people assume that it’s illegal, rather than educate people on their rights.
The Kit Walsh article purposefully handwaves around a couple of issues that could present larger issues as law suits in this arena continue.
He says that due to the size of training data and the model, only a byte of data per image could be stored in any compressed format, but this assumes all training data is treated equally. It’s very possible certain image artifacts are compressed/stored in the weights more than other images.
These models don’t produce exact copies. Beyond the Getty issue, nytimes recently released an article about a near duplicate - https://www.nytimes.com/interactive/2024/01/25/business/ai-image-generators-openai-microsoft-midjourney-copyright.html.
I think some of the points he makes are valid, but they’re making a lot of assumptions about what is actually going on in these models which we either don’t know for certain or have evidence to the contrary.
I didn’t read Katherine’s article so maybe there is something more there.
She addresses both of those, actually. The Midjourney thing isn’t new, It’s the sign of a poorly trained model.
I’m not sure she does, just read the article and it focuses primarily what models can train on. However, the real meat of the issue, at least I think, with GenAI is what it produces.
For example, if I built a model that just spit out exact frames from “Space Jam”, I don’t think anyone would argue that would be a problem. The question is where is the line?
This part does:
The linked paper goes into more detail.
On the note of output, I think you’re responsible for infringing works, whether you used Photoshop, copy & paste, or a generative model. Also, specific instances will need to be evaluated individually, and there might be models that don’t qualify. Midjourney’s new model is so poorly trained that it’s downright easy to get these bad outputs.