Why training AI can't be IP theft

blog.giovanh.com

45 points by OuterVale 3 months ago

blagie 3 months ago

I asked AI to complete an AGPL code file I wrote a decade ago. It did a pretty good job. What came out wasn't 100% identical, but clearly a paraphrased copy of my original.

Even if we accept the house-of-cards of shaky arguments this essay is built on, even just for the sake of argument, where Open AI breaks my copyright is by having a computer "memorize" my work. That's a form of copy.

If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.

On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all other companies are breaking copyright law as written.

Usually, when reality and law diverge, law eventually shifts; not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion of what it should look like. That's a big discussion. I'll make a few claims:

- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.

- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds

- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important I know what those are.

That's the bigger discussion to have.

protimewaster 3 months ago

> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto.
Yeah, that's something that I've not seen a good answer to from the "everything AI does is legal" people. Even if the training is completely legal, how do you verify that the generated output is not illegally similar to a copyrighted work that was ingested? Humans get in legal trouble if they produce a work that's too similar. Does AI not? If AI doesn't, can I just write an AI whose job is to reproduce copyrighted content and now I have a loophole to reproduce copyrighted content?
- the_snooze 3 months ago
  
  Seems like so much tech "innovation" these days is really just to sneak around laws and social norms in pursuit of a rent-seeking position.
  - mrbadguy 3 months ago
    
    Nailed it. All of the “progress” in the last, say, 20 years is exactly this. They call it “disruption” and wear the title “disruptor” as a badge of honour.
  - backWurdz 3 months ago
    
    [dead]
- bko 3 months ago
  
  I think you have to be practical. It would be difficult to train an AI to consume Harry Potter and compress it but prevent it from recreating it. You can try and people do, but there are always ways around it.
  But it's on an individual prompt basis. It's not like ChatGPT can produce the entirety of its text and sell it as a pdf. It's just a device that could reproduce it much like a word processor is a device that you can read the book and type out the contents.
  So the question is one of practicality. Do we ensure that no copyrighted material is in the training data? Difficult but probably not impossible. But what you can't do is target the content in all its various other forms, from descriptions of the plot, reviews, fan fiction, etc. So in the end its pretty much a lost cause.
  So what to do about it? I don't know. In the utilitarian sense, I think the world in which this technology exists in a non-crippled form is a better richer world than one in which there are all these procedural steps to try to prevent this (and ultimately failing).
  Whats the harm here? Are people not buying Harry Potter books and just having an LLM painfully recreate the plot? I would imagine Harry Potter fans would be able to explore their love of the media through LLMs and that would drive more revenue to Harry Potter media, much like fan fiction and pirated music lead to more engagement and concert sales.
  In the case of new art, maybe fewer artists get commissioned, but let's be real, Mike Tyson wasn't going to contract out an artist to create a ghibli style animation of him anyway, so there's really little harm in LLMs here to artists. If anything it expands the market and interest.
  - thankyoufriend 3 months ago
    
    I'm just going to briefly respond to the part you wrote about art in particular.
    We may not have a way to actually quantify the harm that GenAI is doing to creative industries because some of the damage is long-term. Choices are being made right now based on the state of the world. Why would anyone start an art career in this climate? What does art as a profession look like in 5 years? 15 years?
    Art is not just the final artifact, and I feel we're surrendering part of our humanity in service of enriching big tech companies.
  - mdp2021 3 months ago
    
    > So what to do about it
    We proceed towards AGIs that implement proper understanding, and have them read all of the masterpieces and essays and textbooks - otherwise they will be useless -, as is fully legitimate in any system that foresees libraries.
- mdp2021 3 months ago
  
  > Humans get in legal trouble if they [???] a work that's too similar
  If they sell a work that's too similar.
  What intellectuals do is quoting. Of course legal.
- morkalork 3 months ago
  
  If gen-AI had been used to produce Nosferatu there still would have been a case, right?
naming_the_user 3 months ago

Cleanroom implementation comes to mind.
If I just remember the source code of a 100 line program and then reproduce it verbatim a week later that doesn’t suddenly make it a new work.
- franktankbank 3 months ago
  
  This is why I don't believe in restrictive licensing of open work.
SilasX 3 months ago

>Even if we accept the house-of-cards of shaky arguments this essay is built on, even just for the sake of argument, where Open AI breaks my copyright is by having a computer "memorize" my work. That's a form of copy.
No, it isn't, unless you're also going to call it copyright infringement when search engines store internal, undistributed copies of websites for purposes of helping answer your search queries.
Edit: or, for that matter, accessing a website at all, which creates a copy on your computer.
- blagie 3 months ago
  
  You're making a classical error programmers make when encountering legal systems: Treating law with the same rules of logic as code.
  I can cite case law carving out lines around things like copy in RAM/SDD/etc. accessing a web site (very clearly legal) or internal copies for search engines (very close to the border, but still legal), but courts -- at least in systems based on common law -- don't work like a computer program or a mathematical argument.
  So the general argument "X is like Y" doesn't work if you're looking purely at the technical action. The exact same action at the level of bits and bytes can be legal in one context and illegal in another.
  What matters to courts are things which are less easy to formalize: intent, impact, etc.
  Just as with computer science, there is a rigorous and formal analysis, but very different from the one you're making, and operating under very different rules.
  - SilasX 2 months ago
    
    That's an unfair reading of my comment.
    For one thing, I was replying to your own attempt to do a programmer style legalism, of "welp, you instantiated a copy, therefore it must be copyright infringement, case closed". I was already basing that on the principle you're now appealing to, to say that that isolated fact alone doesn't settle the matter (which is a much lower bar than the broader point that, all things considered, generative AI isn't infringement).
    It's great that you recognize that programmer-style legalisms don't work in the law! It just wish you'd had that insight in mind when making your original comment.
    Second, to the extent that I endorsed the general argument (generative AI not infringing), I wasn't claiming that the point worked because of some mechanistic technicality, but because of the very broad trend of accepting internal copies for a wide range of saleable data products. That is, in fact, how analogical reasoning works in law: courts compare the case at hand to similar ones.
    Is any one point definitive? Of course not. But brushing off all analogies is unjustified deflection. If you think there are more pressing considerations that override the pattern I've pointed to -- like why the "intent, impact, etc" matter here -- great! I'm happy to have that conversation. But you can't just gesture at "legal rigor is hard" as if that's some kind of counterargument.
- toshinoriyagi 3 months ago
  
  Correct, the above posted ignored the four prongs of fair use:
  1. Is the work more educational or commercial.
  2. What is the nature of the underlying work (creative works are more protected).
  3. Is the work transformative.
  4. What will be the impact on the underlying work's market.
  Search engines do not make their internal copies available, compete in an entirely different market (that benefits the makers of the underlying works) and are considered quite transformative because they enable discovering vast information on the internet.
  On the other hand, almost zero LLMs/Text-to-Image generators are educational in nature (certainly none of the ones being sued for copyright infringement). They frequently are trained on highly creative works like art and writing. Some of the work could be transformative, depending on where the learned data manifold the output of your request lies on, but a huge amount is similar to the training data. Lastly, these models have an outsized negative impact on the underlying markets, by being vastly cheaper compared to a human's labor, and for dubious quality at that.
  - SilasX 3 months ago
    
    What is that replying to? I was addressing the argument that "well, you stored the work in memory, therefore you infringed copyright". It seems like you agree that that jump doesn't follow, just as I argued?
    
    toshinoriyagi 3 months ago
    
    You are totally correct, I misread your original comment and thought you took the opposite stance.
mdp2021 3 months ago

> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation
Of course not: in fact, memorizing has always been a right. (Edit: that would depend on what is meant with "reproduction" though. As written elsewhere, properly done quoting is of course fair use.)
> If I can paraphrase it, ditto
Even more so: people have lost legal actions because they sued authors of parodies.
- mdp2021 3 months ago
  
  Afterthought: I realize my reading of the original post should have been more attentive;
  nonetheless, I would still object that the ideas proposed (e.g. "reproduction" and "paraphrasing") cannot be taken without further specification. The fault is not at that generic level. And the context we should be after should not be this provisional one, of an unfinished technology.
HPsquared 3 months ago

Maybe the infringement occurs when a user uses the model to produce the facsimile output.
- SilasX 3 months ago
  
  I agree that you should call the output copyright infringement (or not) without regard to how you got there. So, if it produces an identical copy of the text, or of e.g. Indiana Jones, that you then distribute it, sure, that is copyright infringement.
  But the mere act of using them for training and producing new works shouldn't be! In fact, until 2022, pretty much no one regarded it as a copyright violation to "learn from copyrighted works to create new ones" -- just the opposite! That's how it's supposed to work!
  Only when hated corporations did it with bots, did the internet hive mind suddenly decide that's stealing, and take this expansive view of IP rights (while, of course, having historically screamed bloody murder about any attempts to fight piracy).
  - LocalH 3 months ago
    
    Scale matters. There's a difference between individual humans taking time to study copyrighted works, and so-called "AI" doing it on the scale of the equivalent of millions of man hours.
    
    SilasX 3 months ago
    
    But what about that kind of scale matters to this particular area? The article spends a lot of time showing the substantive parallels between how humans and AIs learn e.g. in how they never store an exact copy, but a high-level understanding that rarely produces anything verbatim.
- throwaway173738 3 months ago
  
  Good idea. Let’s make it a minefield of copyright infringement for the user so they never know whether it’s emitting something novel or it’s emitting AGPL code.
  - formerly_proven 3 months ago
    
    That's what the legal department at my employer (huge multinational corp) came up with: When an employee uses one of the approved gen-ai models it's on the employee to check the output isn't infringing - legal argues that not doing so would be grossly negligent, making the employee personally liable.
    
    throwaway173738 3 months ago
    
    If my company had a policy like that I would just not use AI tools at all for anything.
    
    mtndew4brkfst 3 months ago
    
    Best possible outcome. Unlikely to be, but maybe even the intended outcome from legal's POV.
  - CuriouslyC 3 months ago
    
    Why the heavy handed internet sarcasm? Youtube has handled this exact issue with fingerprints for a while.
pitaj 3 months ago

> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation.
Yes.
> If I can paraphrase it, ditto.
Not necessarily. Summarizing, for instance, is typically free use.
- dfawcus 3 months ago
  
  How about not simply summarising, but creating a broadly similar novel, with similar characters which undergo broadly similar story arcs?
  I can see how it could end up as infringement, but also how one could avoid infringement. The issue would seem to be if the original author wishes to sue.
  As an example, compare Brooks "The Sword of Shannara" with Tolkein's "Lord of the Rings". It is widely accepted that the former is heavily derivative for the latter; in my view essentially being an example of the above broadly similar novel.
  Yet AFAIK, Tolkein's estate didn't sue Brooks. However if they had, how likely would a victory have been?
  Now in the case of "Generative LLM", we may have an even closer "derivative work". However the LLM users could well get away with it if they consume a wide enough canon, and the source authors either do not learn of the derivation, or do not have the means to sue.
  - blagie 3 months ago
    
    One of the key lessons learned in the past half-century is that lawsuits like that are:
    (a) Often won
    (b) Pyrrhic victories
    Right now, it's the policy of Paramount not to sue over very clear violations of Star Trek copyrights (see: Star Trek Continues), Rowling to not do the same for Harry Potter (see: Methods of Rationality), etc.
    A lawsuit like that results in a modest payout and take-down, together with a loss of brand worth many times that.
    On the other hand, the derivative works tend to drive brand value, recognition, and sales.
    That's why many works have clear guidelines supporting and encouraging derivative works, with specific boundaries to make sure they don't undermine those works.
    I don't know The Sword of Shannara, but each time someone says "inspired by Tolkein" or "building on themes from," Tolkein increases in brand value as a classic and then THE classic. There is a legal line as to where inspiration becomes copying, but if we were to assume, arguendo, that it's clearly over that line, it still probably wouldn't make sense to sue. At best, that would have a chilling impact on more works in the "inspired by" category.
wiseowise 3 months ago

> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation.
By your logic anyone with good enough memory violates copyright law just by act of remembering something.
- Apreche 3 months ago
  
  No, because you don’t actually violate copyright law until you produce and distribute copies.
  It’s perfectly legal to memorize a book, type up a copy from memory, and to even print out a copy that you keep for yourself. But as soon as you start trying to sell or distribute copies, even for free, now you’re breaking the law as written.
- great_wubwub 3 months ago
  
  No - it's the reproduction, not the memorization.
  - wiseowise 3 months ago
    
    So training AI on copyrighted data isn’t a problem unless it spits out the data verbatim. Correct?
    
    lelanthran 3 months ago
    
    Close.
    It isn't a problem until it spits out a similar enough copy.
    Copyright violation doesn't have to be verbatim; taking a 4k movie and reencoding it to 320x200 before distribution isn't legal.
  - greyface- 3 months ago
    
    It's reproduced in the memorizer's neural connectome.
- earthnail 3 months ago
  
  That was actually the case in music until a decade ago or so. Led to ridiculous lawsuits (for example Ed Sheeran’s).
  Previously, artists needed to prove they hadn’t heard the song that they were accused of infringing. That was virtually possible because there’s a lot of music you could hear anywhere, even just a car driving by. Artists continuously lost these court cases.
  Nowadays the burden of proof is luckily no longer on the defendant. But I think that only changed a decade ago or so, thanks to some efforts by music industry lawyer Damien Riehl. I know, ridiculous.
- shakna 3 months ago
  
  Memory? No.
  Reproduction? Yes.
  If you write down a sizeable quote, that can be a copyright violation.
- techpineapple 3 months ago
  
  This doesn’t seem true. I mean, it might be true if memory could be seen or manipulated, but what would you bring into a court of law to prove that I remembered something too clearly?
- ArinaS 3 months ago
  
  I don't see that point in the original comment. Remembering copyrighted content ≠ reproducing a verbatim of it.
dietr1ch 3 months ago

What's stopping me from paraphrasing movies by peppering the least significant color bits? Would that make copying them legal?
- wongarsu 3 months ago
  
  By that reasoning any VHS copy would be legal. Pretty sure Hollywood takes a slightly broader view on what constitutes a copy
  - dietr1ch 3 months ago
    
    I'm sure that Hollywood would charge me for randomly remembering scenes if they could. What makes them the ones to dictate the rule?
tempodox 3 months ago

Even if we define “AI” as lossy storage with random error insertions, it still amounts to unlicensed reproduction.
- mysterydip 3 months ago
  
  Like saving an image as a JPEG doesn't make it a new work
- mdp2021 3 months ago
  
  > if we define “AI” as
  something completely oblivious of the past, it would be a great disservice.
  > as lossy storage
  well that would just be a great misunderstanding of what "learning" is.
  - Jensson 3 months ago
    
    > well that would just be a great misunderstanding of what "learning" is.
    You need to assume a malicious actor here, its trivial to use an AI as lossy storage, if that is legal copyright washing then large corporations will do so at industrial scale.
    
    mdp2021 3 months ago
    
    > its trivial to use an
    You mean, "a Neural Network as function approximator for the reproduction of texts". But in fact, AI cannot be confused with that idea - and
    be it AI ("system for the replacement of a professional"),
    or AGI ("system implementing intelligence"),
    or LLM (which can only be useful as "system with emergent properties from text generation"),
    "learning" is strongly opposite to "memorizing" (as would be evident in a student - trivial example, one that would memorize a multiplication but be unable to perform it).
    So, that somebody places a sticker on his apparel to fake a profession for «malicious act[ion]» cannot make us forget the real thing: "learning" is what is sought, and for learning we need the teaching material.
    ...Which, societally, is rightly decided as free (i.e. stored in libraries).
stuaxo 3 months ago

Exactly- the information wasn't created from whole cloth, but rather by reading (copying inside a computer from disk to ram to gpu) the copyrighted information in the first place.

basch 3 months ago

"I think the unambiguous answer to this question is that the act of training is viewing and analysis, not copying. There is no particular copy of the work (or any copyrightable elements) stored in the model. While some models are capable of producing work similar to their inputs, this isn’t their intended function, and that ability is instead an effect of their general utility. Models use input work as the subject of analysis, but they only “keep” the understanding created, not the original work."

The author just seems to have decided the answer and worked backwards. When in reality this is very much a ship of theseus type problem. At what point does a compressed jpeg not become the original image but a transformation? The same thing applies. If i ask a model to recite frankenstein and it largely does, is that not a lossy compression of the original. Would the author argue an mp3 isnt a copy of a song because all the information isnt there?

Calling it "training" instead of compression lets the author play semantic games.

Retr0id 3 months ago

There clearly is a point when a compressed jpeg becomes a transformation, even if the precise point is ambiguous.
Take 'The Bee Movie at 3000% speed except when they say "bee"', for example - https://www.youtube.com/watch?v=7apltfVJBwU. It hasn't been taken down for over 5 years so I'm going to assume it's considered acceptable/transformational use.
Personally, I'd say what matters is whether you'd plausibly use the transformed/compressed version as a drop-in substitute for the original. ChatGPT can probably reproduce the complete works of shakespeare verbatim if prompted appropriately, but is anyone seriously going to read it that way?
- kemotep 3 months ago
  
  Do you know that YouTube hasn’t reassigned the video’s ad revenue to the IP holder and the IP holder hasn’t requested it be taken down because they now receive compensation for it from YouTube?
- basch 3 months ago
  
  Agreed. So not EVERY LLM is automatically copying in principle, but most of the current implementations probably retain TOO MUCH of the original sources to NOT be copies.
mdp2021 3 months ago

The author in the quote wrote «understanding», whereas the poster here is talking of «compress[ion]», and the two are very different.
Understanding a text is not memorizing a details-fuzzy version of it.
- basch 3 months ago
  
  at some threshold understanding becomes memorization which becomes the ability to recite it. it's not two different things, its points on a spectrum.
  - mdp2021 3 months ago
    
    > becomes
    Why should it? "The artist sees in the wild horse a metaphor of death, as expressed in the fleeting image painted with solemnity, stillness and evanescence". That could be a (sketchy) example of understanding - and it does not need a verbatim or interpolated text to be there.
Calwestjobs 3 months ago

yeah, but is exact wording/data IP or is IP those patterns underneath?

TimorousBestie 3 months ago

The assumption that human learning and “machine learning” are somehow equivalent (in a physical, ethical, or legal sense—the domain shifts throughout the essay) is not supported with evidence here. They spend a long time describing how machine learning is different from human learning on a computational level, but that doesn’t seem to impact the rest of the argument.

I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.

armoredkitten 3 months ago

Exactly. In particular, when I train a model, I have a defined process for training, and I can flip the switch between "learning" and "not learning" to define exactly when the model adjusts its weights as a result of inputs. Humans can't do that with their brains. Thus, for humans, learning can't be decoupled from viewing, but it absolutely can be for AI.

hyperman1 3 months ago

In the EUCD, a copy in RAM falls under copyright, but there is an exception defined (art 5) if the copy is transitory and the target use is legal under copyright. Neither is true for AI, so this article is probably wrong in the EU.

Apart from that, I wonder uf an AI is learning in the legal sense of the word. I'd suspect removing copyright trough learning is something only humans can do, seen trough legal glasses. An AI would be a mechanical device creating a mashup of multiple works, and be a derived work of all of them.

Main problem with this rebuttal is how you prove the AI copied your work specifically, and finding out which of the zillions of creative works in that mashup are owned by who.

gavinhoward 3 months ago

Copyright laws (in the US) added fair use, which has four tests. Not all of the tests need to fail for fair use to disappear. Usually two are enough.

The one courts love the most is if the copy is used to create something commercial that competes with the original work.

From near the top of the article:

> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.”

So essentially, the author admits that AI fails this test.

Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.

The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.

Edit: still reading, but the author talks about enumerated rights. Most lawsuits target the distribution of model outputs because that is reproduction, an enumerated right.

Edit 2: the author talks about sunstantive differences, admits they happen aboit 2% of the time, but then seems to argue that means they are not infringing at all. No, they are infringing in those instances.

Edit 3: the author claims that model users are the infringing ones, but at least one AI company (Microsoft?) had agreed to indemnify users, so plaintiffs have full right to go after the company instead.

djoldman 3 months ago

There are a few stages involved in delivering the output of a LLM or text-to-image model:

1. acquire training data

2. train on training data

3. run inference on trained model

4. deliver outputs of inference

One can subdivide the above however one likes.

My understanding is that most lawsuits are targeting 4. deliver outputs of inference.

This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.

The issue of whether or not it's legal to train on training data to which one does not hold copyright is probably moot - businesses don't care too much about what you do unless you're making money off it.

mdp2021 3 months ago

> businesses don't care too much
Not really so, since the deranged application of an idea of "loss of revenue" decades ago.

EdwardDiego 3 months ago

That's a lot of words to justify what I presume to be the author's pre-existing viewpoint.

Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit up hill with the word "can't".

mdp2021 3 months ago

> Given that "training" on someone else's [would] lead to a regurgitation of some slight permutation
That is not necessary. It may happen with "bad" NNs.
seanhunter 3 months ago

Yup. Nothing quite like someone who clearly has no legal background trying to use first principles reasoning + bullshit to make a quasi-legal argument that justifies their own prior opinion.

ConspiracyFact 3 months ago

The problem is that model outputs are wholly derivative. This is easy to see if you start with a dataset of one artistic work and add additional works one at a time. Clearly, at the start the outputs are derivative. As more inputs are added, there’s no magical transformation from derivative to non-derivative at any particular point. The output is always a deterministic function of the inputs, or a deterministic output papered over with randomness.

“But,” you say, “human art is derivative too in that case!”

No. A human artist is influenced by other artists, yes, but he is also influenced by the totality of his life experience, which amounts to much more in terms of “inputs”.

prophesi 3 months ago

I think it can be IP theft, and also require labor negotiations. And global technical infrastructure for people to opt-in to having their data trained on. And a method for creators to be compensated if they do opt-in and their work is ingested. And ways for their datasets to be audited by third parties.

It sounds like a pipedream, but ethical enforcement of AI training across the globe will require multifaceted solutions that still won't stamp out all bad actors.

tete 3 months ago

Like any kind of copyright law?

light_hue_1 3 months ago

This it totally the wrong analysis.

Think of AI tools like any other tools. If I include code I'm not allowed to use, like reading a book I pirated, that's copyright infringement. If I include an image as an example in my image editor, that's ok if I am allowed to copy it.

If someone decides to use my image editor to create an image that's copyrighted or trademarked, that's not the fault of the software. Even if my software says "hey look, here are some cool logos that you might want to draw inspiration from".

People are getting too hung up on the AI part. That's irrelevant.

This is just software. You need a license for the inputs and if the output is copyrighted that's on the user of the software. It's a significant risk of just using these models carelessly.

alganet 3 months ago

That's a lot of text.

Where is AI disruptive? If it is disruptive in some area, should we apply old precedents to a thing so radically new? (rethorical).

Good fresh training data _will end_. The entire world can't feed this machine as fast as it "learns".

To make a farming comparison, it's eating the seeds. Any new content gets devoured before it has a chance to grow and give fruit. Furthermore, people are starting to manipulate the model instead of just creating good content. What exactly will we learn then? No one fucking knows. It's a power grab free for all waiting to happen. Whoever is poor in compute resources will lose (people! the majority of us).

If I am right, we will start seeing anemic LLMs soon. They will get worse with more training, not better. Of course they will still be useful, but not as a liberating learning tool.

Let's hope I am not right.

bionhoward 3 months ago

Did the article mention the part about how these companies turn around and say you’re not allowed to use the output to develop competitive models? I couldn’t find mention of this

Calwestjobs 3 months ago

look, quickest example if it IS or it IS NOT ip theft is - go to any image generation ML wizardry prompt machine and ask it this :

"generate image of jack ryan investigating nuclear bomb. he has to look like morgan freeman."

(and do it quickly before someone in FAANGM manually plays with something altering result of that prompt)

problem is opposite, is "original" work IP a original in itself or is it just remix

or someone just gave lawyer some generic text and make it arbitrarily protected for adding 0.000000001% to previous work.

EPWN3D 3 months ago

I couldn't get through it, did he actually make an argument eventually?

yniopper 3 months ago

[dead]

fithisux 3 months ago

[flagged]

mdp2021 3 months ago

That only teaches us your opinion: too little information and too much.
- fithisux 3 months ago
  
  If you want to protect interests of the rich feel free, it is good if you are rich. I will not play your game.
  - mdp2021 3 months ago
    
    Again: anybody can come and say "A is B". It is gratuitous. The argument is all that counts. You are speaking out of your inner representation and interpretaton of "A". We are those who are outside your mind.

hulitu 3 months ago

> Why training AI can't be IP theft

Because Microsoft is part of BSA. /s

If you steal our software, it is theft. If we still your software, it is fair use. Can we train AI on leaked Windows source code ?

techpineapple 3 months ago

“If humans were somehow required to have an explicit license to learn from work, it would be the end of individual creativity as we know it“

What about text books, in order to train on a textbook, I have to pay a licensing fee.

pitaj 3 months ago

If you pirate a text book, learn from it, and then apply that knowledge to write your own textbook: your textbook would not be a copyright violation of the original, even though you "stole" the original.
realharo 3 months ago

Even if you accept the premise, what does it matter? AI are not humans.
Laws were made up by people at a specific time for a specific purpose. Obviously our existing laws are "designed" around human limitations.
As for future laws, it's just a matter of who is powerful and persuasive enough to push through their vision of the future.
Ylpertnodi 3 months ago

>What about text books, in order to train on a textbook, I have to pay a licensing fee.
Would that also apply if you bought the text books second-hand (or were given it)?
- Ekaros 3 months ago
  
  First sale doctrine applies in those cases. That is original buyer can transfer the license that is sell the book.
  With AIs I think it is clear it would fall under some other limitations like you cannot broadcast CD over radio. And with movies not start movie theatre and play movies without paying creators...
mdp2021 3 months ago

> I have to pay
Fortunately, others have libraries. There is no need to pay for the examination of material stored in libraries (and similar).

re-thc 3 months ago

The argument in the article breaks down by taking marketing by definition and try to apply it to a technical argument.

You might as well start by saying that the "cloud" as in some computers really float in the sky. Does AWS rain?

This "AI" or rather program is not "training" or "learning" - at least not the way these laws conceived by humans were anticipated or created for. It doesn't fit the usual dictionary term of training or learning. If it did we'd have real AI, i.e. the current term AGI.

Latty 3 months ago

I agree you can't just say it's learning and be done with it, but I think there is a discussion to be had about what training a model is.
When they made the MP3 format, for example, they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data. Is that a copyright violation? I think the answer is obviously no, so there is a way to use copyrighted material to produce something new based on it, that isn't reproduction.
The obvious answer is that MP3 doesn't replace the music itself commercially, it doesn't damage the market, while the things produced by an AI model can, but by that logic, is it a copyright violation for an instrument manufacturer to go and use a bunch of music to tailor an instrument to be better, if that instrument could be used to create music that competes with it? Again, no, but clearly there is a difference in how much that instrument would have drawn from the works. AI Models have the potential to spit out very similar works which makes them much more harmful to the original works' value.
I think looking at it through the lens of copyright just isn't useful: it's not exactly the same thing, and the rules around copyright aren't good for managing it. Rather, we should be asking what we want from models and what they provide to society. As I see it, we should be asking how we can address the artists having their work fed into something that may reduce the value of their work, it's clearly a problem, and I don't think pushing the onus onto the person using the model not to create anything that infringes is a strategy that will actually work.
I do think the author correctly calls out gatekeeping as a huge potential issue. I think a reasonable route is that models shouldn't be copyrightable/patentable themselves, companies should not be allowed to rent-seek on something largely based on other people's work, they should be inherently in the public domain like recipes. Of course, legislating something like that is hard at the best of times, and the current environment is hostile to passing anything, let alone something pro-consumer.
- re-thc 3 months ago
  
  > they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data
  That's redefining history. MP3 didn't evolve like this. There was a series of studies, experiments etc and it took many steps to get there.
  MP3 was not created by dumping music somewhere to get back an algorithm.
  > I think a reasonable route is that models shouldn't be copyrightable/patentable themselves
  Why? Why can't they just pay. AI companies have the highest valuation and so have the most $$$ and yet they can't pay? This is the equivalent of the rich claiming they are poor and then stealing from the poor.
  - Latty 3 months ago
    
    > MP3 was not created by dumping music somewhere to get back an algorithm.
    This wasn't what I was trying to suggest, clearly I wasn't clear enough given the context, but my point was to give a very distant example where humans are using copyrighted works to test their algorithms as a starting point, as I later go on to say, I think the two cases are fundamentally different, but the point was to make the case that there are different types of "using copyrighted works to create tools", which is distinct from "learning".
    > Why? Why can't they just pay. AI companies have the highest valuation and so have the most $$$ and yet they can't pay? This is the equivalent of the rich claiming they are poor and then stealing from the poor.
    I don't think them paying solves the problem.
    1) These are trained on such enormous amounts of data that is sourced unreliably, how are these companies going to negotiate with all of the rights holders?
    2) How do you deal with the fact the original artists who previously sold rights to companies will now have their future work replaced in the market by these tools when they sold a specific work, not expecting that? Sure, the rights owners might make some money, but the artists end up getting nothing and suffering the impact of having their work devalued.
    3) You then create a world where only giant megacorps who can afford to get the training rights can make models, they can then demand all work made with them (potentially necessary to compete in future markets) give them back the rights, creating a viscous cycle of rent-seeking where a few companies control the tools necessary to be a commercial artist.
    Paying might, at best, help satisfy current rights holders, which is a fraction of the problems at hand, in my opinion. I think making models inherently public domain solves far more of them.
    
    re-thc 3 months ago
    
    > 1) These are trained on such enormous amounts of data that is sourced unreliably, how are these companies going to negotiate with all of the rights holders?
    i.e. the business was impossible? Then don't do it. That's like saying I can't do the exam reliably sir, so I cheated and you should accept.
    > 2) How do you deal with the fact the original artists who previously sold rights to companies will now have their future work replaced
    Charge differently / appropriately. This has already been done, e.g. there are different prices / licenses for once-off individual use, vs business vs unlimited use vs SaaS etc.
    > 3) You then create a world where only giant megacorps who can afford to
    Isn't this currently the case? Who can afford the GPUs? What difference does that make? These AI companies are already getting sky high valuations with the excuse of cost...
    > I think making models inherently public domain solves far more of them.
    How is that even enforceable? Companies just don't have to even announce there is a new model or the model they used and life progresses.
    
    Latty 3 months ago
    
    > i.e. the business was impossible? Then don't do it. That's like saying I can't do the exam reliably sir, so I cheated and you should accept.
    Sure, I think that's also a potential option, like I say we need to ask what the benefit to society is and if it is worth the cost.
    > Charge differently / appropriately. This has already been done, e.g. there are different prices / licenses for once-off individual use, vs business vs unlimited use vs SaaS etc.
    If we are just bundling it into copyright, you can't retroactively set that up, people have already given away copyright licenses, and their work will then be sold/used to train and they can't change that.
    > Isn't this currently the case? Who can afford the GPUs? What difference does that make? These AI companies are already getting sky high valuations with the excuse of cost...
    Right, but currently they get to own the models. If we forced the models to be public domain, anyone could run inference with them, inference hardware isn't cheap, but it's not anywhere near the barrier paying the rights for all the data to train a model is.
    > How is that even enforceable? Companies just don't have to even announce there is a new model or the model they used and life progresses.
    Patents require a legal framework and co-operation and enforcement. Yeah, enforcing it isn't trivial but that doesn't mean it can't be done. If you want to have a product relying on an ML model, you have to submit your models to some public repository, if you don't, then the law can force you. Of course a company can claim their product isn't using a model or whatever, or try and mislead, but if that's illegal, no company is watertight, a solid kickback to whistleblowers you pull from the hefty fine when it's found out, etc... This is off the top of my head, obviously you'd apply more thought to it if you were implementing it.
    
    re-thc 3 months ago
    
    > we need to ask what the benefit to society is
    Business doesn't operate on the benefit of society. It's not a charity.
    > people have already given away copyright licenses, and their work will then be sold/used to train and they can't change that.
    They have? No they haven't. There are already ongoing law suits.
    > If we forced the models to be public domain
    How do you? Send the NSA and FBI to everyone's house? How do you know I have a model?
    > enforcing it isn't trivial but that doesn't mean it can't be done
    Just like stopping drugs? Good luck.
    
    Latty 3 months ago
    
    > Business doesn't operate on the benefit of society. It's not a charity.
    Which is why we have legislation to force them to act in given ways. I'm not suggesting companies should choose to do it, I'm saying we should pass laws to force them.
    > They have? No they haven't. There are already ongoing law suits.
    That was in the context of the hypothetical where training models is just considered a type of copying and handled by existing copyright law. You seem to be misunderstanding my points as a cohesive argument about the current state of the world: I was describing possible ways to handle legislating around machine learning models.
    > How do you? Send the NSA and FBI to everyone's house? How do you know I have a model? > Just like stopping drugs? Good luck.
    I answered this in the rest of my comment? I don't believe the comparison to drugs is at all reasonable: it's much easier to manage something large companies who have to operate in the commercial space do than something people do on the street corner.
    It appears you aren't really engaging with what I'm saying, but giving knee-jerk responses, so I don't think this is going anywhere.
    
    re-thc 3 months ago
    
    > I'm saying we should pass laws to force them
    To leave the country? That's what would happen. They can develop elsewhere and "license" back to the US as needed or even leave the US market completely. The US no longer has a 10-200x GDP compared to the rest of the world it once did. This notion needs to stop and to see the world has changed.
- lelanthran 3 months ago
  
  > When they made the MP3 format, for example, they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data
  Your entire argument is predicated on this incorrect assertion. Your argument, as is, is therefore completely invalid.
  - Latty 3 months ago
    
    Even if we accept it's wrong (which I suspect is me being unclear: I wasn't suggesting MP3 is some kind of trained algorithm, just that humans developed it while testing on a range of music—which is well documented, with Tom's Diner famously being the first song encoded—which is how any such product gets developed, I accept the context makes it read like I was implying something else, my bad), I give a separate examples with varying degrees of similarity to training and then make my own comments, I explicitly say after this that I don't think the MP3 example is very comparable.
    While I get why you'd read what I said that way given context, I wasn't clear, maybe don't reject my entire post immediately after making an assumption about my point barely any way into it.