WW 862: One Click to Malware

Beep boop - this is a robot. A new show has been posted to TWiT…

What are your thoughts about today’s show? We’d love to hear from you!

I find @Leo’s attitude a little overbearing with regard to AI.

My view is: if the information is in the public domain, it can be used. If it is an orphaned work, it can be used.

But if the information is copyrighted, the wishes of the owner should be respected and permission obtained. Just because something is freely available on the Internet doesn’t mean it is in the public domain or that the creator has given their permission. Gobbling up illegal content doesn’t suddenly make it legal; two wrongs don’t make a right, and all that.

That is why I fall on the side of the authors in their case, for example. If the AI companies had obtained permission, many authors would probably have let them use their works; others would have wanted to license them, and that is their right. If the AI companies don’t want to pay the licensing fees, that is their right as well, and they can leave the material out of their training model.

But the AI companies are being “typical Big Tech” or “typical Silicon Valley”, in that they ignore the law until the lawyers and fines cost more than actually obeying the law and doing the right thing. Often, by the time the law catches up, they complain that they are too big and that “doing it properly” would be too expensive, so the people being wronged have to come up with another way of being compensated that suits Big Tech.

YouTube is the classic example of this. They knew they were hosting copyrighted material, but they turned a blind eye to it, because it got them more views and made their platform popular. The copyright holders came along and sued, but as long as the fines and lawyers cost less than the money being made on the platform, it wasn’t worth complying with the law. Once it got expensive, they were too big (too much content) to obey the law even if they wanted to, so they came up with automated takedowns (which Leo has been on the receiving end of) and pittance payments to the copyright owners for infringing content. It even got to the point where rights holders were pushed to put ads on the infringing content and take a share of the revenue, or they couldn’t make any more claims!

What they should have done, while they were still small, was to put in place ways of filtering out the illegal content, then scale that system up as the company grew. The company wouldn’t have expanded as quickly as it did, but it would have been in compliance with the law and could have grown organically. Instead, they ignored the problem until implementing a proper solution would have crippled the company and crashed profits: all the money earned on illegal content would be gone, and a system to check content before it went online would have cost a one-off fortune to develop, as opposed to having been written, expanded, and incorporated into the cost model from the beginning.

Security is the same. Many startups ignore security because they are “growing fast” and “moving quickly and breaking things”. Then the security measures, which would have been easy to build in while the system was small and still growing, suddenly become a huge, expensive headache, because the system is so big and ungainly that security has to somehow be tacked on as an afterthought.

(This is one of the areas where GDPR helps: the Data Protection Officer has to be included at the planning stage and has to ensure that the security of the data is taken into account from then on. If you have a new project and you don’t include the DPO from the planning stage onwards, the new system is automatically out of compliance!)

Now AI is trying to do the same thing. It is easier to point the LLM at the whole web and let it get on with it than to work out what information is of poor quality, is inaccurate, is illegal content, or is copyright infringement. (Oh look, we found a full set of Harry Potter on this warez site. Should we exclude it? Nah, let’s see what else is on the warez site!)
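Even a crude filter at ingestion time would be a start. As a purely hypothetical sketch (the domain names, licence labels, and document fields below are all made up), weeding out known infringing sources before training is not rocket science:

```python
# Hypothetical sketch: filter scraped documents before they reach the
# training set. Every domain and licence label here is invented.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"warez.example", "piratemirror.example"}  # known infringing hosts (made up)
ALLOWED_LICENCES = {"public-domain", "cc-by", "licensed"}    # labels we trust (made up)

def keep_for_training(doc: dict) -> bool:
    """Keep a document only if its source and licence check out."""
    host = urlparse(doc["url"]).hostname or ""
    if host in BLOCKED_DOMAINS:
        return False  # drop the Harry Potter haul from the warez site
    return doc.get("licence") in ALLOWED_LICENCES

corpus = [
    {"url": "https://warez.example/harry-potter.txt", "licence": None},
    {"url": "https://news.example/story", "licence": "licensed"},
]
print([d["url"] for d in corpus if keep_for_training(d)])  # only the licensed story survives
```

The point isn’t that a toy filter like this would be enough; it’s that the check is cheap to build in from day one and nearly impossible to retrofit once the pipeline has swallowed the whole web.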

We need to get away from this attitude of “laws are there to be ignored until they become too expensive to ignore any more.” Laws are laws, and you have to follow them.

To go back to the comparison of Google and ChatGPT: the difference is that Google shows you a snippet and a link to the original material, so the content creator gets some benefit from Google scraping their sites (even if some publishers think otherwise). ChatGPT and the other AIs don’t display a snippet and a link; they recite sections of text verbatim and summarise the rest with enough information that the reader doesn’t need to visit the original sources, so the creators aren’t getting any revenue at all - no subscription, no advertising.

Yes, on the one hand we need this information in the LLMs; on the other, if the AI companies “steal” this information wholesale, there soon won’t be any information for them to scour for current news, because the news companies will all have gone out of business and nobody will be creating news any more… OK, that’s a bit drastic, and the reality is probably somewhere in the middle.

A new solution needs to be found for compensating creators for the information they are providing to LLMs. But until that new model is found, the AI companies should follow the existing laws and license the material if they want or need it; if it is too expensive, they can exclude it from their models.

5 Likes

(Not a lawyer, but) Paul’s argument that the NYT has to sue to protect its copyright, or else lose that protection, is false.

https://copyright.byu.edu/copyright-myths

Yes, that applies to trade marks. Although, if it is such a blatant attack on their copyright and licensing terms (OpenAI & Co. aren’t paying even a normal reader’s subscription, let alone striking an agreement as a commercial entity that wants to reuse their works), there may be something in the small print. And the linked page does say they should take action when they discover material has been stolen, to prevent even more widespread copying.

1 Like

I think Leo’s argument was based on the question: are these LLMs actually violating copyright by training on the material? As he said during the show, you’d be hard-pressed to make a query and have the LLM spit out the exact article. You get a derivation of the content rather than a copy of it.

When I was at university, every single paper I wrote was created in this fashion. I would review copyrighted works and then create my own paper based on what I read, being careful to avoid plagiarism. So what’s the difference between what I did and what LLMs are doing?

I think there’s an onus on the content owners to protect works they don’t want to be consumed publicly; place the content behind a paywall.
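There are already mechanisms for this that don’t even need a paywall. As a minimal sketch (GPTBot is the crawler user agent OpenAI publishes; the example.com URLs are placeholders), a site can opt out in robots.txt, and a well-behaved crawler can check that with nothing more than the Python standard library:

```python
# Minimal sketch: a crawler checking a site's robots.txt before fetching
# pages. GPTBot is the user agent OpenAI publishes for its crawler; the
# example.com URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A site that doesn't want to feed AI training can publish:
#   User-agent: GPTBot
#   Disallow: /
if rp.can_fetch("GPTBot", "https://example.com/articles/some-story"):
    print("allowed to crawl")
else:
    print("site has opted out - skip it")
```

Of course, robots.txt only works if the crawler chooses to honour it, which brings us back to the paywall.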

I really don’t know where I fall in this argument. We need to decide how machine learning models fit into copyright law. Should ML programs be treated the same way humans are? Or should there be different rules for such programs?

Yes, but those copyrighted works were either purchased by you or by the university/library, and therefore you were allowed to read them.

If you had walked into the local bookshop and stolen a copy of the book to read for your paper, that would have been illegal. Even if you weren’t copying the book word for word and the contents of the paper were fair use, the way you obtained the book would still have been wrong.

If we say we can ignore copyright and licensing just so AI companies can get rich quick, we are going to end up with nobody bothering to actually research and write new material (specifically news, in this case), because the AI gobbles it up straight away, people don’t need to read the original article, and the creator doesn’t get paid.

While the LLMs might need the material now to be useful, they will require new material on an ongoing basis if they want to remain relevant. The problem is, if they “steal” all the content now without compensating the creators, there won’t be any creators left in a few years to take information from. Then there will be no new information, and the AIs won’t be able to produce anything current about what is happening in the world. They will become redundant, because nobody is reporting on the news, because nobody was paying the reporters to do their jobs.

At the moment, the AI companies are parasites intent on destroying their host. If they want to remain relevant, they have to learn to stop being destructive parasites and to become symbiotic with the entities that provide them with their sustenance…

2 Likes

I went to university in the early 2010s, so probably 90% of my source material was from publicly available internet sources, much in the same way these LLMs pull content.

This is interesting: Yale and Stanford researchers did a review of LLMs on legal topics.

ChatGPT came “top of the class”, beating PaLM 2 (Google Bard) and Llama 2 (Meta), but it didn’t come up smelling of roses: it still got a failing mark. It was wrong or hallucinated 69% of the time! But that was head and shoulders above the other two, which were wrong 72% (PaLM 2) and 88% (Llama 2) of the time.

We all saw the case last year where two lawyers were reprimanded for using ChatGPT, which gave them totally made-up information.

They always try to side with the user, wanting to please them, and make stuff up when they don’t have a valid answer, for example.

https://forums.theregister.com/forum/all/2024/01/10/top_large_language_models_struggle/

I agree with you fully. One small note: when I use the Copilot function in Bing, it does provide links and footnotes.

The difference is that an LLM is doing it. The impact on any potential market due to your copying is necessarily limited: you have to eat, sleep, do your taxes. An LLM can churn out info forever based on the copied work.

Given that one of the factors determining fair use is the impact on potential future markets, I feel that a “perpetual ability to riff on the original work for the purposes of individual entertainment” is a pretty large detriment to the potential future market.

4 Likes