serial_dev 8 days ago

The main barriers for me would be:

1. Why? Who would use that? What’s the problem with the other search engines? How will it be paid for?

2. Potential legal issues.

The technical barriers are at least challenging and interesting.

Providing a service that needs significant upfront investment, has no product or service vision, and that I’ll likely be sued over a couple of times a year, probably losing with who knows what kind of punishment… I’ll have to pass, unfortunately.

bbor 8 days ago

1. It'd be for the scientific community (broadly-construed). Converting media that is currently completely un-indexed into plaintext and offering a suite of search features for finding content within it would be a game-changer, IMO! If you've ever done a lit review for any field other than ML, I'm guessing you know how reliant many fields are on relatively-old books and articles (read: PDFs at best, paper-only at worst) that you can basically only encounter via a) citation chains, b) following an author, or c) encyclopedias/textbooks.

2. I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it. GoodReads doesn't need legal permission to index popular books, for example.

In general I get the sense that your comment is written from the perspective of an entrepreneur/startup mindset. I'm sure that's brought you meaning and maybe even some wealth, but it's not a universal one! Some of us are more interested in making something to advance humanity than something likely to make a profit, even if we might look silly in the process.

Aachen 8 days ago

> I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it.

You don't need to host copyrighted material. It's all about intent. The Pirate Bay is (imo correctly, even if I disagree with other aspects of copyright law and its enforcement) seen as a place where people go to find ways to not pay authors for their content. They never hosted a copyrighted byte but they're banned in some form (DNS, IP, domain seizures) in many countries. Proxies of TPB are banned too, so being like an ISP for such a site is already enough, whereas nobody is ordering blocks of Comcast's IP addresses for providing access to websites with copyrighted material, because Comcast has no somewhat-provable intent to facilitate copyright infringement.

When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub, but I think you'd have to spin it as a general purpose search page and ideally not even mention AA was one of the sources, much less have links

(Don't get me wrong: everyone wants this except the lobby of journals that presently own the rights)

It would be a real shame if an anonymous third party that's definitely not the website operator made a Firefox add-on that illegitimately inserts these links into the search results page though

DaSHacka 8 days ago

> When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub

You could just give users ISBNs or link to the book's metadata on openlibrary[0], both of which AA's native search already does.

[0] https://openlibrary.org/

carlosjobim 7 days ago

Exactly.

1. The ISBN in cleartext

2. An isbn://123123123 link

3. A link to the book on a legal library borrowing service

4. A link to buy the book on Amazon
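To make the first two options concrete, here is a minimal sketch (not anything from AA or Open Library themselves) of validating an ISBN-13 and turning it into link-free or metadata-only output. The function names are made up for illustration, the `isbn://` scheme is the hypothetical one from the list above rather than a registered URI scheme, and the `https://openlibrary.org/isbn/<isbn>` URL pattern is assumed to resolve to the book's metadata page.

```python
def normalize_isbn(isbn: str) -> str:
    """Drop the hyphens and spaces that ISBNs are usually printed with."""
    return isbn.replace("-", "").replace(" ", "")


def isbn13_is_valid(isbn: str) -> bool:
    """Verify the ISBN-13 check digit: weights alternate 1, 3 and the sum must be divisible by 10."""
    s = normalize_isbn(isbn)
    if len(s) != 13 or not (s.isascii() and s.isdigit()):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(s))
    return total % 10 == 0


def isbn_links(isbn: str) -> dict[str, str]:
    """Build search-result outputs that never point at the infringing source."""
    if not isbn13_is_valid(isbn):
        raise ValueError(f"not a valid ISBN-13: {isbn!r}")
    clean = normalize_isbn(isbn)
    return {
        "cleartext": clean,
        # Hypothetical scheme taken from the list above, not a registered one.
        "isbn_uri": f"isbn://{clean}",
        # Assumed Open Library URL pattern for looking a book up by ISBN.
        "openlibrary": f"https://openlibrary.org/isbn/{clean}",
    }


if __name__ == "__main__":
    for label, value in isbn_links("978-0-306-40615-7").items():
        print(f"{label}: {value}")
```

The borrowing-service and Amazon links are left out on purpose, since their URL formats aren't something I'd want to guess at.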

coolThingsFirst 8 days ago

Yeah but how does the search work, does it show a portion of the text? If it's a portion of the text isn't that also a part of the book?

1vuio0pswjnm7 8 days ago

But he did not mention anything about creating a "service"

It could be his own copy for personal use

What if computers continue to become faster and storage continues to become cheaper; what if "large" amounts of data continue to become more manageable

The data might seem large today, but it might not seem large or unmanageable in the future

namlem 8 days ago

It would be incredible for LLMs. Searching it, using it as training data, etc. Would probably have to be done in Russia or some other country that doesn't respect international copyright though.

jxjnskkzxxhx 8 days ago

Do you have a reason to believe this ain't already being done? I would assume that the big guys like openai are already training on basically all text in existence.

IlikeKitties 8 days ago

In fact, Facebook torrented Anna's Archive and got busted for it, because of course they did:

https://torrentfreak.com/meta-torrented-over-81-tb-of-data-t...

HDThoreaun 8 days ago

Every LLM maker probably did the same. Facebook just has disgruntled employees who leaked it

gpm 8 days ago

Google goes around legally scanning every book they can get their hands on with books.google.com. Legally scanning every paper they can get their hands on with scholar.google.com.

I doubt they'd resort to piracy for what is basically the same information as what they've already legally acquired...

lcnPylGDnU4H9OF 8 days ago

That is a good reason to think they did not but it doesn't necessarily override reasons for them to do so. Perhaps it's dubious that the subset of data they could not legally get their hands on is an advantage for training but I really don't know, and maybe nobody does. Given that, Google's execs may have been in favor of similar operations as Facebook's and their lawyers may have been willing to approve them with similar justifications.

sneak 8 days ago

Downloading a torrent isn't piracy if you are a license holder for the information that you are downloading.

gpm 8 days ago

*If the license you have authorizes you to make a copy in that fashion.

But here, Google isn't a license holder. Google doesn't license the text in Google Books (unless something has changed since the lawsuits). Google simply legally acquires (buys, borrows, etc) a copy of the book and does things with it that the US courts have found are fair use and require no license.

Incidentally I believe the French courts disagreed and fined them half a million dollars or so and ordered them to stop in France.

ar_lan 8 days ago
andrepd 8 days ago

> Would probably have to be done in Russia or some other country that doesn't respect international copyright though.

Incredible, several years of major American AI companies showing that flouting copyright only matters if it's college kids torrenting shows or enthusiasts archiving bootlegs on whatcd, but if it's big corpos doing it it's necessary for innovation.

Yet some people still believe "it would have to be done in evil Russia".

DataDaoDe 8 days ago

OP does have an exaggerated statement - it's not like there aren't laws in Russia or something, and I largely agree with your sentiment. I think there are levels to this though, and it's pretty clear that Russia is much riskier than the USA when it comes to IP - just look up anything to do with insuring IP risk in Russia (here's one such example: https://baa.no/en/articles/i-have-ip-in-russia-is-my-ip-at-r...)

Also, according to the Office of the US Trade Representative, Russia is on the priority watch list of countries that do not respect IP [1], and post-2022, largely due to the war, Russia implemented measures negatively affecting IP rights. [2,3]

If you think it isn't the case and Russia is just as risky as the US when it comes to copyright and IP, I would be interested to know why.

1. https://ustr.gov/about/policy-offices/press-office/press-rel...

2. https://www.papula-nevinpat.com/executive-summary-the-ip-sit...

3. https://www.taftlaw.com/news-events/law-bulletins/russia-iss...

mdp2021 8 days ago

> evil

In this case and context, a label like "evil" is a twisted interpretation.

executesorder66 8 days ago

> or some other country that doesn't respect international copyright though.

Like the US? OpenAI et al. don't give a shit.

TeMPOraL 8 days ago

There's a difference between feeding massive amounts of copyrighted material to a training process that blends them thoroughly and irreversibly, and doing all that in-house, vs. offering people a service that indexes (and possibly partially rehosts) that material, enabling and encouraging users to engage directly in pirating concrete copyrighted works.

sellmesoap 8 days ago

Ironically, the low-tech infringing proposal would lead to more reliable results grounded in the raw contents of the data, using less computing power and without the confidently incorrect sycophancy we see from the LLMs.

TeMPOraL 8 days ago

Nah. It would just lead to more of classical search. Which is okay, as it always has been.

LLMs are not retrieval engines, and thinking of them as such is missing most of their value. LLMs are understanding engines. Much like for humans, evaluating and incorporating knowledge is necessary to build understanding - however, perfect recall is not.

Another, arguably equivalent way of framing it: the job of an LLM isn't to provide you with the facts; its main job is to understand what you mean. The "WIM" in "DWIM". Making it do that does require stupid amounts of data and tons of compute in training. Currently, there's no better way, and the only alternative systems with similar capabilities are... humans.

IOW, it's not even an apples to oranges comparison, it's apples to gourmet chef.

corgi912 8 days ago

There's this famous phrase in Russian that was born out of a short interview with a woman, a strong Putin supporter, that's often been used as a sarcastic remark for pointing out someone's double standards and/or hypocrisy.

It can be roughly translated to "you don't understand, it's a completely different situation". That's what's constantly on my mind when I'm reading discussions like this one.

Everybody and their dog torrenting petabytes of data and getting away with it (Meta is the only one that got caught and they've still gotten away with doing it)?

The very same data poor American students were forced to commit suicide over? The same data that average American housewives were sued over for millions of dollars of "damages"? The same data that often gets random German plumbers or steelworkers to pay thousands of euros of "fines" to the copyright mafia so they won't get sued and have their lives ruined?

Yet when giant corporations are doing the exact same thing on a massive scale, it's fine? It's not even the same thing, an American student torrenting books isn't making any money off it, while Meta very much is.

Of course it's not the same, a simple-minded and poorly educated person like me isn't capable of understanding the difference. You keep believing in your moral superiority, the rest of the world has finally woken up.

TeMPOraL 8 days ago

Is there also a famous Russian phrase that translates to "details are irrelevant, it kinda looks similar to me therefore it's the same"? If not, there definitely should be.

The details are the entire point. Arguing that a corporation can get away with doing something, while an individual can't, isn't useful, because there are a great many such somethings, and in most cases it turns out to be perfectly reasonable once you dig into the details.

southernplaces7 7 days ago

>The same data that average American housewives were sued over for millions of dollars of "damages"? The same data that often gets random German plumbers or steelworkers to pay thousands of euros of "fines" to the copyright mafia so they won't get sued and have their lives ruined?

Honestly curious. Could you share any examples of these cases?

sneak 8 days ago

> The very same data poor American students were forced to commit suicide over

Leaving the rest of your argument aside, precisely nobody forced aaronsw to commit suicide.

TeMPOraL 7 days ago

There's also the matter of 'aaronsw being a single student, not many "poor American students" as GP implies. As far as I know, this was the only case of this type[0][2].

Honestly was too tired to point that out in my earlier reply, but that's exactly the kind of argument you get when people are not willing (or purposefully refusing) to consider details. Intentionally or not, you get bogus and highly manipulative statements.

A single case of a student activist fighting for freedom of communication and access to public goods for citizens, ending up breaking under pressure from public/non-profit institutions MIT, JSTOR, FBI over copyright, is not the same as what GP implied - many students, regular folks just like you and me, being forced to take their own lives due to legal consequences of pirating books in bulk. Nothing like the latter ever happened anyway.

We can do better than this.

(And even if we can't, I trust the courts can.)

--

[0] - Curiously, while doing some search now to be sure I didn't miss any similar case, I learned that JSTOR incident wasn't the first for 'aaronsw - apparently, he did the same thing a few years earlier with public court documents[1]; FBI investigated this too, and concluded he was legally in the clear. It's probably well-known to everyone here, but I somehow missed it, so #TodayILearned.

[1] - https://en.wikipedia.org/wiki/Aaron_Swartz#PACER

[2] - https://en.wikipedia.org/wiki/Edwin_Howard_Armstrong was the only one I could find that was even remotely related - an engineer and inventor who, in big part due to prolonged fighting over patents consuming all his time and money, suffered from a mental breakdown and committed suicide at 63.

Exoristos 8 days ago

There are those who are in charge and those who aren't.

r14c 8 days ago

That's Uber's Gambit. Nothing is illegal for large enough corporations with strong network effects and deep pockets.

TeMPOraL 8 days ago

That's not Uber's Gambit.

Uber was blatantly ignoring the local laws in order to break into the market and quickly defeat local competition. They used their infinite VC money supply to interfere with and delay investigations and enforcement, betting that if they do it fast enough, they'll have the general population on their side.

LLM vendors found and exploited[0] a legal uncertainty - correct me if I'm wrong, but AFAIK it still isn't settled whether or not their actions were actually illegal. Unlike Uber, LLM vendors aren't breaking into markets by ignoring the laws to outcompete incumbents, and burning stupid amounts of money just to get away with it. On the contrary, LLM vendors are simply providing an actually useful product, and charging a reasonable price for it, while reinvesting it into improving the product. Effects it has on other markets aside[1], their business model is just providing actual value in exchange for money. That's much more direct and honest than most of the tech industry.

The product itself is also different. Uber is selling a mirage, a "miracle" improvement that quickly turns not so, and is destined to eventually destroy the markets it disrupted. LLM vendors are developing and serving systems that provide actual value to users, directly and obviously so.

--

[0] - Probably walked into this without initially realizing it. No one complained 5-10 years ago, when the datasets were smaller and the resulting models had no real-world utility. It's only when the models became useful that some people started looking for ways to make them go away.

[1] - That's an unfortunate effect of it being a general AI tool, and would be the same regardless of how it was created.

gosub100 8 days ago

> that blends them thoroughly and irreversibly

It's okay, you can say 'laundering'

TeMPOraL 8 days ago

I can, but I don't, because that's at best an unintended side effect.

freedomben 8 days ago

> > or some other country that doesn't respect international copyright though.

> Like the US? OpenAI et al. don't give a shit.

OpenAI is not a country and therefore cannot make laws that don't respect international (or domestic) copyright. Also the US is a lot bigger than OpenAI and the big tech corps, and the law is very much on the side of copyright holders in the US.

diggan 8 days ago

> the law is very much on the side of copyright holders in the US.

Remind me again what the status of the case is with Meta/Facebook using pirated material to train their proprietary LLMs, and even seeding the data back to the community while downloading it?

SR2Z 8 days ago

In progress. Nobody is expecting the original protections afforded by copyright to apply here, but the fact that the material is pirated is less relevant than whether or not an LLM is a transformative use of the material.

We will almost certainly see copyright law weakened by the case, but I do not believe that FB will get off with no penalties.

gosub100 8 days ago

The money is definitely on the side of big tech vs book publishers. There may be a nominal settlement to end the matter, perhaps after a decade of litigation.

sam_lowry_ 8 days ago

LLMs already use it, dude )

exe34 8 days ago

I think one use would be to search for information directly from a book, rather than get a garbled/half-hallucinated version of it.

jdironman 8 days ago

You don't need AI for that. I get the optimistic spirit of what you mean though.

mdp2021 8 days ago

Optimized information retrieval of complex text is AI.

echollama 8 days ago

Garbled/half-hallucinated is probably what you would've gotten 8-12 months ago, but nowadays I'm sure that with good prompting you can pull value from any book.

carlosjobim 8 days ago

> 1. Why? Who would use that?

Rather, who would use a traditional search engine instead of a book search engine, when the quality of the results from the latter will be much superior?

People who need or want the highest quality information available will pay for it. I'd easily pay for it.