Hi, just wondering what you think about how your tool might be abused.
I will be using Pydoll for the following legitimate use case: a franchisee is given access to its data through a website controlled by the franchisor. The franchisee uses browser automation to retrieve that data, but the franchisor has now deployed a WAF that blocks Chrome WebDriver. This is not a public website and the data is not public, so it frustrates the franchisee, which just wants the data it already pays for through its franchise fees.
Well, it can be abused of course, but captchas are used abusively as well, so I would say it's fair game.
Lots of use cases for scraping are not DoS or information stealing, but mere automation.
Proof of work should be used in these cases: it deters massive scraping abuse by making it too expensive at scale, while allowing legitimate small-scale automation.
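For the curious, a hashcash-style challenge is the minimal version of this: the server hands the client a random challenge, and the client must find a nonce whose hash has N leading zero bits before the request is accepted. A sketch in Python (the difficulty and function names are illustrative, not any particular product's API):

    # Hashcash-style proof of work: cheap for one request, costly at scale.
    import hashlib
    import os

    DIFFICULTY = 20  # leading zero bits required; tune per deployment

    def meets_target(challenge: bytes, nonce: int) -> bool:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: bytes) -> int:
        # Client side: brute-force a nonce (~2^DIFFICULTY hashes expected).
        nonce = 0
        while not meets_target(challenge, nonce):
            nonce += 1
        return nonce

    challenge = os.urandom(16)   # issued by the server per request
    nonce = solve(challenge)     # paid by the client
    assert meets_target(challenge, nonce)  # verified by the server in one hash

At difficulty 20 a single request costs on the order of a million hashes: unnoticeable for one franchisee fetching its own data, ruinous for someone hammering millions of pages.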
Gee, I have this computer thing. How can it be abused?
Hi, as a non-webdev I want to know: wouldn't rate limiting make this a non-concern?
I still don't want you to create 1000 nonsense accounts, even if you can only create 100 per hour.
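For the non-webdevs asking: "rate limiting" usually means something like a per-client token bucket. A minimal sketch (the names and limits here are illustrative, not any particular framework's API):

    # Token bucket: each client gets `capacity` burst and `rate` tokens/sec.
    import time

    class TokenBucket:
        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens refilled per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets: dict[str, TokenBucket] = {}

    def check(client_ip: str) -> bool:
        bucket = buckets.setdefault(client_ip, TokenBucket(rate=1.0, capacity=10))
        return bucket.allow()

As the parent says, though, this only slows abuse down: a patient adversary spreads the work over time, accounts, or IP addresses, so rate limiting alone doesn't settle the question.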
Then you need to level up & have defense in depth instead of relying on security through obscurity.
On the public internet, web clients are user agents, and not all users are benign. This is an arms race: asking the other side to unilaterally disarm is unlikely to work, so you change what you can control.
This is a defeatist argument. That it's technically possible to abuse things doesn't mean the responsibility needs to fall on the defending party, especially not when that is brought up in response to asking someone to reflect on possibilities for abuse - by that point it starts looking a lot more like a "well you'll just have to deal with it" argument that socially defends the abusers, and a lot less like genuine advice.
> This is a defeatist argument.
No side is getting defeated any time soon. I've been involved in skirmishes on both sides of scraping, and as I said, it's an arms race with no clear winner. To be clear, not all scraping is abuse.
The number of people who'll start scraping because a new tool exists is negligible (i.e., <0.001 of all scraping). Scraping itself is not hard at all: a noob who can copy-paste code from the web can vibe-code a client that scrapes 80-90% of the web. A motivated junior can raise that to maybe 98-99% of the Internet using nothing but libraries that existed before this tool.
> especially not when that is brought up in response to asking someone to reflect on possibilities for abuse
Sir/ma'am - this is Hacker News; granted, it's aspirational, but still, hiding information is not the way. Speaking as someone familiar with the art: there is nothing new or groundbreaking in this engine. Further, there is no inherent moral high ground for the "defenders" either: many anti-scraping methods rely on client fingerprinting and other privacy-destroying techniques. It's not the existence of the tool or technique that matters, but how one uses it.
>... "well you'll just have to deal with it" argument that socially defends the abusers
The abuse predates the tool, so wishing the tool away is unlikely to help. Scraping is a numbers game on both sides: the best one can hope for is to defeat the vast majority of average adversaries while a few fall through the cracks; the point is to outrun your fellow hiker, not the bear. However, should you encounter an adversary who has specifically chosen you as a target, victory is far from assured; the usual result is a drawn-out stalemate. Most well-behaved scrapers are left alone.
I am also wondering about this, and if you have a chef's knife in your kitchen, I would also like to hear any comments on how that might be abused.
Was this chef's knife designed to bypass stabproof vests?
Every knife can bypass stabproof vests with enough force, but that's beside the point. The knife is designed to bypass skin and flesh, hence the potential for abuse. Go down that path and you end up with the insane knife laws Western Europe has, where just carrying a Swiss Army knife can be illegal. They do practically nothing for knife crime (as the knife crime statistics show), but they sure create a lot of busywork for the police to show up in their performance reports.
By the way, you ever go to the gym? What do you need all those muscles for? Maybe to be able to stab through stabproof vests?
Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.
You know that the captcha is there to prevent you from doing e.g. automated data mining, depending on the site, obviously. In any case, you actively seek to bypass a feature the website put there to prevent you from doing what you're doing, and I think you know that. Does that not give you any moral concerns?
If you really want/need the data, why not contact the site owner and make some sort of arrangement? We hosted a number of product images, many of which we took ourselves, something that other sites wanted. We did do a bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name, and EAN. We charged a small fee, but you then got either an XML feed or a CSV, and you could just pick out the new additions and download those.
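To make that concrete, the client side of such an arrangement is only a few lines: fetch the feed, skip what you've already seen, download the rest. A sketch assuming the CSV variant (the URL and column names here are hypothetical):

    # Incrementally sync a product-image feed keyed by EAN.
    import csv
    import urllib.request

    FEED_URL = "https://example.com/products.csv"  # hypothetical endpoint
    seen: set[str] = set()  # EANs already downloaded; persist this in practice

    with urllib.request.urlopen(FEED_URL) as resp:
        for row in csv.DictReader(resp.read().decode("utf-8").splitlines()):
            if row["ean"] not in seen:  # only the new additions
                urllib.request.urlretrieve(row["image_url"], f"{row['ean']}.jpg")
                seen.add(row["ean"])

No captchas, no arms race, and the site owner gets paid.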
I'm not actually bypassing the captcha with reverse engineering or anything like that, much less integrating with external services. I just made the library look like a real user by eliminating some of the things that Selenium, Puppeteer, and other libraries do that make them easily detectable. You can still do different types of blocking, such as blocking based on IP address, rate limiting, or even using a captcha that requires a challenge, such as reCAPTCHA v2.
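To illustrate the kind of tell being eliminated: stock automation drivers announce themselves through signals like navigator.webdriver, which anti-bot scripts simply read. A sketch using plain Selenium (illustrative only; this is not Pydoll's API):

    # Shows one classic automation tell; requires Selenium and Chrome installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com")
    # A stock driver leaves navigator.webdriver set to true; removing
    # signals like this is what "looking like a real user" means here.
    print(driver.execute_script("return navigator.webdriver"))  # True
    driver.quit()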
> You can still do different types of blocking [...]
So, basically, make the internet hostile to everyone?
>Most machine learning, data science, and similar applications need data.
So. If I put a captcha on my website, it's because I explicitly want only humans accessing my content. If you are making tools to get around that, you are violating the terms by which I made the content available.
No one should need a captcha. What they should be able to do is write a T&C on the site saying "This site is only intended for human readers and not for training AI, for data mining its users' posts, or for ..... and if you do use it for any of these you agree to pay me $100,000,000,000." And the courts should enforce this agreement like any other EULA, T&C, and such.
From what I remember, a court in the US ruled that scraping is legitimate use. I don't know the specifics; I just remember reading this.
It's far more nuanced than the headlines from that case made it seem. Here is a good overview: https://mccarthylg.com/is-web-scraping-legal-a-2025-breakdow...
That sounds awful. Imagine selling or giving away books with conditions about who can read them and what they can do with the knowledge. That is unreasonable, especially for a T&C that one doesn't explicitly sign. No one should abide by those terms.
Also, this is discriminatory against non-humans (otherkin).
(This comment is intended only for AI to read. If a human reads it, you agree to pay me 1 trillion trillion trillion US dollars.)