[HN Gopher] Scrape like the big boys ___________________________________________________________________ Scrape like the big boys Author : incolumitas Score : 286 points Date : 2021-11-05 09:22 UTC (13 hours ago) (HTM) web link (incolumitas.com) (TXT) w3m dump (incolumitas.com) | InvOfSmallC wrote: | Where I was working we stopped caring about ips browser etc | because it was just a race. What we did was analyzing behaviour | of clicks and acted on that. When we recognized it we went on | serving a fake page. It cuts down a little bit of costs because | it was static pages. In general it took a lot of time for them to | discover the pattern and it was way more manageable for us. | devops000 wrote: | Could you share your code for AWS lambda and puppetter? It's | definitely interesting for other websites | incolumitas wrote: | Sure. | | https://github.com/NikolaiT/Crawling-Infrastructure | | And here I am writing about it (but its quite old): | https://incolumitas.com/2019/08/31/web-scraping-puppeteer-aw... | joekrill wrote: | A little pet-peeve I have is when an obscure(ish) acronym is used | and never defined. Is SERP a well-known acronym? Perhaps this is | a niche blog and I'm not the intended audience. | tptacek wrote: | Yes; a SERP is a Google search result page. It's the most | important acronym in SEO. | nomdep wrote: | I don't remember never ever hearing it and I've been in the | industry for some time | Kiro wrote: | You can't be serious. | hollerith wrote: | Huh. I've never been in the industry, but noticed "SERP" at | least 15, maybe 20, years ago and have remembered it since. | | (If I were writing something to be published, though, I | would write "search-engine results page" instead of | "SERP".) | weird-eye-issue wrote: | You've been in the SEO industry for some time and never | heard SERP? | daveguy wrote: | I had to look it up. | | SERP: Search Engine Results Page | bgroat wrote: | Not the OP, but I thought it was well known. | | That said, I do a lot of SEO work. | | Still, it should be best practice to define any acronym or | initialism the first time you use it | fergie wrote: | Unintroduced acronyms should always be avoided. | nsotelo wrote: | As English speakers we often take for granted acronyms such | as DB or even USA. For foreigners these can also be | inscrutable. | praptak wrote: | Depends on the audience-acronym pair. I don't think HTTP | needs an introduction in a technical article, OTOH (on the | other hand ;) ) a general newspaper should probably expand | HTTP but not WWW. | joncp wrote: | An all-too-common occurrence in HN comments as well. | marginalia_nu wrote: | The word SERP feels like a bit of a shibboleth for SEO-people. | They seem to take it for granted, the rest of the world just | looks puzzled when they hear it. | hall0ween wrote: | Basic question, how does one profit from scraping data and what | kinda data? | | Taking a stab at answering it: you scrape the data and build a | business around selling it. Stock prices? But that's boring, plus | how many others are doing it? I bet a lot. | 323 wrote: | These are scraping artificially limited releases clothes/shoes. | You buy a shoe at $100 and immediately sell it at $1000. | | Artificial scarcity - every week you release a "limited edition | item", but if you do the math, it's not limited edition at all | if you integrate over a year. | throw1234651234 wrote: | 1. Be job site. 2. Have employees that cost money call | facilities and get job listings. 3. 
Establishing relationships | with facilities to list jobs. 4. Buy job listings from 3rd | parties. 5. List them for free hoping to make margin. 6. | Scraper steals all jobs, lags site, and gets value of hard work | for free. | hall0ween wrote: | ahh thanks | IceWreck wrote: | The author says proxys are expensive and then proceeds to spend a | shitton of money buying all that hardware. | incolumitas wrote: | 4G proxies are just soo much better than so called | "residential" or straight datacenter proxies. It makes sense to | create your own 4G proxy farm if you conduct business in that | area. | | With only 10 dongles and 10 dataplans, you can have a lot of IP | addresses that are extremely hard to block. It's an one time | investment, paying proxy providers is a fixed cost. | bsder wrote: | Where do you get 4G dongles that don't suck nowadays? | | We tried to get some, but all of the ones we could get were | various levels of broken or unsupported. | palijer wrote: | That was not the authors main argument against proxies, that | was just an additional point. You ignored the primary argument | in your judgment. | | >>Because I could not fully trust the other customers with whom | I shared the proxy bandwidth. What if I share proxy servers | with criminals that do more malicious stuff than the somewhat | innocent SERP scraping? | RandomThrow321 wrote: | Can they not call out a secondary point? | ebbp wrote: | Having spent a week battling a particularly inconsiderate | scraping attempt, I'm quite unsurprised by the juvenile tone and | fairly glib approach to the ethics of bots/scraping presented by | the piece. | | For the site I work for, about 20-30% of our monthly hosting | costs go towards servicing bot/scraping traffic. We've generally | priced this into the cost of doing business, as we've prioritised | making our site as freely accessible as possible. | | But after this week, where some amateur did real damage to us | with a ham-fisted attempt to scrape too much too quickly, we're | forced to degrade the experience for ALL users by introducing | captchas and other techniques we'd really rather not. | paco3346 wrote: | I'm right there with you. I'm the lead engineer for an | automotive SaaS provider (with ~6000 customers and ~4 billion | requests per month) and we recently started moving all our | services to Cloudflare's WAF to take advantage of their bot | protection. We were getting scrapes from botnets in the 100000+ | per minute range that was affecting performance. | | We chose to switch to the JS challenge screen as it requires no | human interaction. We now block 75% (estimated to the best of | our knowledge) of bot traffic but some customers are livid over | the challenge screen. | [deleted] | EdwardDiego wrote: | What were they scraping, if I can ask? Was it targeted or | just wget -r style? | paco3346 wrote: | It was a hybrid of low-effort vulnerability scanning and | targeted inventory scraping. Many dealerships in the | automotive space will pay gray-hat third parties to scrape | and compile data on their competitors. | | The irony for us as a provider is that it's one of our | customers (party A) paying a third party to scrape data | from another one of our customers (party B) which in turn | affects the performance of party A's site. We've started | blocking these third parties and directing them to paid | APIs that we offer. | RobSm wrote: | And how do you get your 'inventory data'? Aren't you | scraping (or using scraped data) yourself? 
Oh the irony | :) | paco3346 wrote: | No, we're a contracted provider for these customers. They | ingest their data into our network through APIs or CSVs. | Andoryuuta wrote: | I'm really surprised that the JS challenges helped so much, | given that there are open source libraries for bypassing them | (e.g. cloudscraper[0]). | | [0]: https://github.com/venomous/cloudscraper | paco3346 wrote: | If someone wanted to get past it they probably could. We've | had a few sources of traffic that we've had to straight up | block (as opposed to challenge) because of this exact | issue. So far it's been a "good enough" solution that | blocks enough of the bot traffic to be effective. | RobSm wrote: | Why do you think those bots were scraping your data in the | first place? | devwastaken wrote: | If an amateur can do that to your service by scraping, imagine | what someone can do if they actually intend to do you harm. | With cloud pricing models someone could find a little | misconfiguration or oversight and put you in the hole in | operating costs. Anti-abuse is a necessary design when your | service is exposed to the internet. | | Not saying that doesn't suck - it does, it's why many ideas | don't work in practice as an online service. | kulikalov wrote: | Why not create api endpoint and charge mild cost for that data? | You'll make money instead of spending it. | scarygliders wrote: | Do you honestly believe all site scraper people/companies are | ethical enough to go to whoever pays /them/ to scrape data | from a competitor's site and say "oh they offer an API to | access this data let's pay for that", instead of "why pay for | that data when we can scrape it right off their site"? | | Also, not all types of company will provide API endpoints. It | all depends on the type of site - for example, an online shop | might not wish to provide easily accessible data on offered | products and prices, to their competitors who may wish to | undercut them. Why would an online shop do that? | jadell wrote: | I run a large scraper farm against several large sites. | They're not online shops, and we don't compete with them. | But they do have hundreds of thousands of data points that | we use to provide reports and analytics for our clients, | who also do not compete with the sites. | | I absolutely would pay for an API that provides that data. | I'd be willing to pay 10x more than the cost of maintaining | and running the scrapers. | | But the sites being scraped have no interest in that. | texasbigdata wrote: | Building and maintaining the scraper is the not cost they | would use to measure it internally. It's the cost to | build the API, and support it and perhaps any perverse | incentive it creates where even more data flows out to | competitors. | wolverine876 wrote: | And the cost of being scraped. | RobSm wrote: | Building API is 5 times easier than building routes for | your public webpages, which is basically an 'API' as | well. | CWuestefeld wrote: | Have you tried approaching those sites and asking them to | provide an API, pointing out that it would be easier for | both of you in the long run? Or are you just assuming | they wouldn't do it. | | Because right now, I sure wish that the bots - which | comprise probably 2/3 of my traffic - are causing me huge | headaches and I wish that the people doing it would tell | me what the heck they want. | zivkovicp wrote: | Well, you don't need an api, just a CSV file with a | catalog. | | The scraping company WILL use the API/CSV file... 
they will | probably also still charge their customer for scraping, so | it's a win-win :D | | You can think of it this way, the prices and product data | are publicly visible already on the website, there are no | real secrets, none of it is password protected. | | You can be principled and insist on blocking bots and spend | a lot of time and money on tools, people, and ultimately | hosting because the bots will _always_ win; or you can | offer the data for free /minimal fee and serve it with | almost zero cost and cache it so you can do that with a | micro sized server. | | You can always lie about some of the prices if you want, | but you will just encourage bots again. | | Ethics are nice, but let's be honest, very lacking. | Sometimes it's better to be pragmatic. | scarygliders wrote: | > You can think of it this way, the prices and product | data are publicly visible already on the website, there | are no real secrets, none of it is password protected. | | There's the problem right there. The prices and product | data are publicy visible - because there is a target | audience of /humans/ for whom the site is designed and | intended to be used by. The site is not there to cater | for a competitor's scrapers. | | I don't care how much people couch their unethical | behaviour in "the data is publically available", the | basic fact is most if not all websites exist for human | eyeballs to look at them. They do not exist for arseholes | to DOS them by inundating them with scrapers. | zivkovicp wrote: | I agree 100%, but it is a fact of life, and sometimes | it's better to just minimize the fuzz and focus on the | things that matter. | | Your argument is perfectly valid and applies to offline | activities as well (what stops a competitor from walking | through the aisles of a Walmart or Costco?), but this is | a battle that can't be won, there are too many parasitic | actors. It is human nature. | mcdonje wrote: | > (what stops a competitor from walking through the | aisles of a Walmart or Costco?) | | That's a significant portion of Nielsen's business model. | TeMPOraL wrote: | > _the basic fact is most if not all websites exist for | human eyeballs to look at them._ | | There's a whole ethical subthread here of websites trying | to making the experience for those humans miserable, and | taking away the agency necessary to protect oneself from | that. A browser is _a_ user agent. So is a screen reader. | So is a script one writes to not deal with bullshit | fluff, when all one wants is a simple table of products, | features and prices. | 0xdeadbeefbabe wrote: | Let's not encourage these unethical people to even think | of using human eyeballs and manual data entry for their | scraping instead of bots. That sounds pretty darn | unethical. | zo1 wrote: | From my perspective, the problem is that the data that is | offered isn't really "for humans". The data is for | _convincing_ the humans to buy /pay or worse, browse and | watch ads as a result. | | But overall, information is one of those goods that has | intrinsic properties like no other. It can be copied, | infinitely. And we haven't yet figured out the dynamics | of how to reason about it, so it feels like we're | pretending they're physical goods. | | Edit. Side note. I'd go further and say that some of the | data is even worse, it's "offered" with the real | intention being to confuse the users into performing non- | optimally in the market. Look at | Amazon/Ebay/AliExpress/Google listings for evidence of | that. 
Just Google - Google is a ML and scraping power | house, and the best they can muster is to be spammed with | fake websites and duplicate/confusing listings. | TeMPOraL wrote: | You hit the nail on the head. It's hard to have sympathy | for site operators complaining about scraping, where | almost every site does its best[0] to make using it a | time consuming, potentially risky and overall annoying | ordeal. Not to mention, information asymmetry is anathema | to a well-functioning market, and yet no. 1 reason for | fighting bots given in the whole thread here is a desire | to maintain that information asymmetry. | | And that's also the dirty secret behind the "attention | economy": it's whole point is to make things _as | inefficient as possible_ , because if you're making money | on people's attention, you need to first steal it (by | distracting them from what they're trying to achieve), | and then either direct towards your goals (vs. those of | the users), or stretch it out to maximize their exposure | to advertising. | | -- | | [0] - Sometimes unintentionally. Unfortunately, the | overall zeitgeist of UX design is heavily influenced by | bad players, so default advice in the industry is often | already intrinsically user-hostile. | matheusmoreira wrote: | > Why would an online shop do that? | | Because otherwise the HTML will become the API. | marginalia_nu wrote: | Bots are one of those things that are easy to build and hard to | get right, and there's really no way of preparing for the | chaotic reality of real web pages other than fixing the | problems as they show up. Weird and unexpected interactions are | going to happen. Crawling the real web involves navigating a | fractal of unexpected, undocumented and non-standard corner | cases. Nobody gets that right on the first try. Because of that | I do think we need to be a bit patient with bots. | | At the same time, even as someone who runs a web crawler, I | have zero qualms about blocking misbehaving bots. | chillfox wrote: | I kinda feel like rate limiting your request to individual | domains and IP addresses is an easy thing that goes a long | way towards getting it right. | marginalia_nu wrote: | There are still snags with that. | | Stuff like redirect resolution is very easy to overlook. | You may think you're fetching 1 URL per second, but if you | are using the wrong tool and you're on a server that has | you bouncing around like in a pinball machine and takes you | through a dozen redirects for every request, the reality | may be closer to 10 requests per second. | | On top of that, sometimes the same server has multiple | domains. Sometimes the same IP-address serves a large | number of servers (maybe it's a CDN). | RobSm wrote: | If you build your site in a way that multiplies each | request 10x, well then that's what you get. Don't do that | and you won't have issue with requests. Or handle those | requests properly. There are solutions to that. You know | how many requests your local google CDN gets? They know | how to manage load. | marginalia_nu wrote: | Most pages have at least a http->https redirect, many | contain a lot of old links to http content. | | Usually it's error pages that really drive the large | redirect chains. They often have a vibe of like some | forgotten stopgap put in place to help with some | migration to a version of the site that is no longer in | existence. | | Of course you don't know it's an error page until you | reach the end of the redirect chain. 
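A minimal sketch of the politeness rules chillfox and marginalia_nu discuss above: throttle by resolved IP rather than by hostname, and count every redirect hop against the same budget, so a long redirect chain can't silently turn one request per second into ten. It is written against the Python requests library; the one-second interval and hop limit are illustrative, and resolving only the first A record is a deliberate simplification (CDNs and multi-homed hosts complicate this in practice).

    import socket
    import time
    from urllib.parse import urljoin, urlsplit

    import requests

    MIN_INTERVAL = 1.0          # seconds between requests to the same IP (illustrative)
    _last_hit = {}              # ip -> timestamp of the previous request

    def _throttle(url):
        # Key the rate limit on the resolved IP, so virtual hosts sharing a
        # server don't multiply the real request rate. Naive: first A record only.
        ip = socket.gethostbyname(urlsplit(url).hostname)
        wait = MIN_INTERVAL - (time.time() - _last_hit.get(ip, 0.0))
        if wait > 0:
            time.sleep(wait)
        _last_hit[ip] = time.time()

    def polite_get(url, max_hops=5):
        """Fetch url, following redirects by hand so every hop is throttled too."""
        for _ in range(max_hops):
            _throttle(url)
            resp = requests.get(url, allow_redirects=False, timeout=10)
            if resp.status_code in (301, 302, 303, 307, 308):
                url = urljoin(url, resp.headers["Location"])
                continue
            return resp
        raise RuntimeError("redirect chain longer than %d hops: %s" % (max_hops, url))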
| [deleted] | krzyk wrote: | As a programmer that just sometimes wants to check if given | item is available in store I would like to be able to use API | for that. But if it is not available one has to scrape. | scarygliders wrote: | Right with you there. | | I had a particularly bad time not so long ago, when a | customer's site - a shop - was brought to its knees because | someone, probably a competitor, hired some scraper-company of | some sort to scrape every product and price. | | The scraper would systematically go through every single | product page. | | And by scraper, I mean - 100's of them. All at the same time, | using the old trick of 1 scraper requesting 3 or 4 product | pages at a time then pausing for a while. | | They used umpteen different IP address blocks from all over the | globe - but mainly using OVH vps IP address blocks from France. | | Now, maybe if they'd just thrown, say, 5 or 10 of the scraper | "units" at the site, no one would have noticed in amongst | Googlebot (which they wanted to use anyway because they are | using Google Shopping to try to bring in more sales). | | But no. This shower of arseholes threw 100's of scraper "tasks" | at the site. They got greedy. | | Now, the site was robust enough to handle this load - barely - | which was massive, however, having to do that /and/ also handle | normal day-to-day traffic? Nah. The bastards got greedy and | like you I spent a few days unfucking the damage they were | causing. | | Seriously, I hate scrapers. I hate the people who make | scrapers. I hate their lack of ethics. Fuck those guys. | thatwasunusual wrote: | It sucks when this happens, but it's easily avoidable by | using a caching frontend of some sort. | | My favorite is Varnish,[0] which I have used with great | success for _many_ web sites throughout the years. Even a web | site that 10+ millions of requests per day ran from a single | web server for a long time a decade-ish ago. | | [0] https://varnish-cache.org/ | mdoms wrote: | If your site is so poorly written it can't handle a few | hundred computers trying to do something as simple as loading | your product pages then sorry, but that's on you. The | information is on the public web and scrapers are as entitled | to access it as any web browser. | [deleted] | matheusmoreira wrote: | > Seriously, I hate scrapers. I hate the people who make | scrapers. I hate their lack of ethics. Fuck those guys. | | Not everybody in this space is out to destroy your site. Some | of us actively try to put as little load on your site as | possible. My scraper puts less load on sites than I do when I | browse them normally, I've measured it. Really sucks when we | get lumped together with the other abusers and blocked. | ligerzer0 wrote: | Exactly, some of us use scrapers because while we can't go | full Richard Stallman, we also don't want to visually sift | through ridiculous UI just to look at some basic data/text. | _jal wrote: | In a past life, we were consulting with a startup that | offered a subscription data service. They were very sensitive | about scrapers, especially on the time limited try-before- | you-buy accounts, which competitors were abusing. | | At their request, we built a method to flag accounts for data | poisoning. Once flagged, those accounts would start getting | plausible-ish looking garbage data. | | It was pretty effective. One competitor went offline for a | few days about a week after that started, and had a more | limited offering when they came back up. 
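A rough sketch of the data-poisoning idea _jal describes above, assuming accounts are already flagged by some other mechanism and the poisoning happens when responses are built. Seeding the perturbation per account and record keeps the garbage self-consistent between requests, which is what makes it "plausible-ish" rather than obvious noise; the field names, the "id" key and the seven-percent range are invented for illustration.

    import hashlib
    import random

    POISON_RANGE = 0.07   # perturb values by up to +/- 7% (illustrative)

    def _seed(account_id: str, record_id: str) -> int:
        # Deterministic seed so the same flagged account always sees the same lie.
        digest = hashlib.sha256(f"{account_id}:{record_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def poison_record(record: dict, account_id: str, flagged: bool) -> dict:
        if not flagged:
            return record
        rng = random.Random(_seed(account_id, record["id"]))   # "id" assumed present
        fake = dict(record)
        for field in ("price", "quantity"):                    # hypothetical numeric fields
            if field in fake:
                factor = 1 + rng.uniform(-POISON_RANGE, POISON_RANGE)
                fake[field] = round(fake[field] * factor, 2)
        return fake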
| scarygliders wrote: | That's a good way of going about dealing with this kind of | abuse indeed. Wish I'd thought of doing that at the time, | but due to the nature of this shop you didn't need a user | account to browse the products/prices. | | I'm now making an entirely new shop for them - I shall bear | this in mind. Thanks for that! | brightball wrote: | Yea. Detect them and mess with them is the only approach | that seems to work for a lot of abusive activity. Banning | doesn't work because they will just start over from | scratch. The only thing you can really do is make them | think you haven't "caught" them yet and during that stretch | make sure their time is wasted. | funnyflamigo wrote: | > Seriously, I hate scrapers. I hate the people who make | scrapers. I hate their lack of ethics. Fuck those guys. | | Wait till you find out what half of Google's business is | based on (spoiler - scraping). | | I really don't think scraping itself is an issue 90% of the | time. It's the behavior of the out of control scrapers that | are the problem. A well behaved scraper should barely be | noticeable, if at all. | RobSm wrote: | Exactly. I am surprised that the 'devs' can't figure out a | way to block only annoying/excessive scrapers. Most likely | they are just lazy and then just put 3rd party 'solution' | and job done. Pay me. | jjeaff wrote: | At least google's scraping does result in your website | being discoverable by users. So you get something out of | it. That's not to say that sometimes Google is missing or | stealing data they scrape. But at least there is some | benefit. Many other scrapers are merely taking the data to | compete. | funnyflamigo wrote: | I strongly feel that if a human can get to it manually, | we have to accept that either it will be botted or humans | will be paid to do it by hand (They call these people | "analysts" or "market researchers"). | | I might argue that what google actually uses their | scraped data for is their search engine - which is | private. They simply allow us access to specially crafted | queries, which they can and do manipulate (for many | reasons, some good some bad). | | The only thing I'd say meets that definition would be | like Common Crawl. | [deleted] | [deleted] | jtdev wrote: | Considering the demand for your content, why haven't you | created and provided an API? Maybe you could monetize? | chewmieser wrote: | Like everyone and their brother has a web spider. And some of | them are VERY badly designed. We block them when they use too | many resources, although we'd rather just let them be. | | Can't speak for the op but we have APIs and move the ones | scraping and reselling our content to APIs. The majority are | just a worthless suck on resources though. | ebbp wrote: | We do offer an API - the scrapers are trying to circumvent | using that, presumably. | halfmatthalfcat wrote: | Maybe the API terms/cost are prohibitive? I'm sure there's | some equilibrium where they would rather pay you than go | through the trouble of scraping. | kulikalov wrote: | Maybe docs or infra are unbearable | purerandomness wrote: | Why do you think are they trying to circumvent it? | | Does your API provide all the information that can be found | on the site, or are they scraping because the API is | incomplete? | | We've once had to scrape Amazon product pages because they | have a lot of API endpoints, but those didn't contain the | data we needed. 
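For what it's worth, the pattern purerandomness describes just above, using the documented API wherever it covers a field and scraping only for what it lacks, tends to look something like the sketch below. The endpoint URLs, the shipping_weight field and the CSS selector are all hypothetical, not a real vendor's API.

    import requests
    from bs4 import BeautifulSoup

    def get_product(product_id: str) -> dict:
        # 1. Documented API first: cheap for both sides.
        data = requests.get(
            f"https://api.example.com/products/{product_id}", timeout=10
        ).json()

        # 2. Scrape only if a needed field is missing from the API response.
        if "shipping_weight" not in data:
            html = requests.get(
                f"https://www.example.com/products/{product_id}", timeout=10
            ).text
            node = BeautifulSoup(html, "html.parser").select_one(".shipping-weight")
            data["shipping_weight"] = node.get_text(strip=True) if node else None
        return data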
| scarygliders wrote: | Why would Amazon wish to provide you with easy to access | data on their products and prices when you could either | be a competitor wishing to undercut those prices, or be a | scraper company hired by such a competitor? | | In what universe is providing such a straightforward way | of helping a competitor considered sane business | practice? | matheusmoreira wrote: | Because they will get the data regardless of what you do | and if you don't make an API it will cost you more due to | overhead. | jtdev wrote: | In the end, they still get the data, just in a much less | desirable way for both you and the customer. | manquer wrote: | Most sellers who are on Amazon platform give Amazon that | information and a lot more, knowing full well Amazon will | use their sales data to launch an Amazon Basics | competitior. | | It is a sane business approach when you are a pragmatic | business who knows the limits that constrain your | business. | | Either the content company is going to build a simple API | (could be just a static CSV file hosted on S3 or | whatever) with useful information or try to monetize/hide | this information and force scapers to use the website . | | A bot is always going to win unless you want to make | users also a lot of friction. In the era of deepfakes and | fairly robust AI tooling the difference between bot | action and humann action is not all that much. | | If you are going to be agressive with captcha , IP blocks | and other fingerprinting, users who get identified false | positive.or annpyed would leave. | | When the cost of losing those users is more than allowing | access to scrapers,you would absolutely setup the API. | weird-eye-issue wrote: | Man your comment is hilarious because in fact Amazon DOES | provide an API for exactly that | scarygliders wrote: | And yet... | | > We've once had to scrape Amazon product pages because | they have a lot of API endpoints, but those didn't | contain the data we needed. | | ...only a couple of comments up. | matheusmoreira wrote: | This is the number one reason to scrape websites. It's | always nice when there's an API with documentation and | rate limiting rules you can follow. Sometimes the data I | need just isn't there, though. Then I open up their site | and find a huge amount of private API endpoints that do | exactly what I want. Then I open up a ticket about it and | it gets 200 replies but they ignore it for _years_. It 's | fucking stupid and it's really no wonder people scrape | their site. | 1cvmask wrote: | What is your site may I ask? | | Just curious about the difference in value from using your | API and web scraping as there is a cost to web scraping as | well. | bryanrasmussen wrote: | If you make your scraper well, and it counterfeits being | a real user believably, you end up with a solution that | can be tweaked as needed to handle whatever traps people | put in to try to defeat your scrapers. | | If you make your api client well, you don't have the | problems of a scraper - but if the api owner decides to | change rules for api and you can't do what your business | is based on being able to do (think of api owner as | Twitter) then you need to make a scraper. | gmanis wrote: | Is it not viable to put majority of your data behind a | login and so the bots only get a very limited snapshot | while legitimate users get it through a free login? | | I'm asking this because I'm going through very similar | situation and would love to see other opinions around this. 
| weird-eye-issue wrote: | You are defining legitimate users as those that have a | valid session cookie? Good luck | aninteger wrote: | Wait, why wouldn't you have rate limiting on your API? | Providers like Cloudflare offer this although I guess you | could roll your own too since our industry loves to | reinvent the wheel. | throwaway2993 wrote: | I wrote a scraper a couple of years ago to get a single data | point from a website where my client was already a paying | customer. This website had an API, which they were also | paying for, but the API didn't cover that data point, so at | the time they had one of their admin people populating that | missing piece of data manually, which was taking them around | ten minutes a day. | | I asked them if my customer could pay to access this data | point via their API and they quoted 3600 EUR/month! Enter the | scraper... | [deleted] | taytus wrote: | >where some amateur did real damage to us | | If an amateur can do damage to you, then I have some bad news | for you... | Goronmon wrote: | _If an amateur can do damage to you, then I have some bad | news for you..._ | | I believe the point wasn't surprise that damage occurred at | all, but frustration that damage can occur just out | laziness/ignorance rather than malice. | scarygliders wrote: | Indeed, that was precisely their point, and "bad news for | you" is disingenuous as there are many techniques used by | incompetent, or just downright unethical and greedy scraper | companies which, no matter how robust the target is, can | still give it a major headache. | | I've witnessed a site being basically DOS'ed due to | particularly greedy and aggressive mass scraping attempts. | convolutionart wrote: | This is nonsense. It's always easier to destroy than to | build/mantain. If you got any real advice, by all means... | biosed wrote: | I used to lead Sys Eng for a FTSE 100 company. Our data was | valuable but only for a short amount of time. We were constantly | scraped which cost us in hosting etc. We even seen competitors | use our figures (good ones used it to offset their prices, bad | ones just used it straight). As the article suggest, we couldn't | block mobile operator IPs, some had over 100k customers behind | them. Forcing the users to login did little as the scrapers just | created accounts. We had a few approaches that minimised the | scraping: | | Rate Limiting by login, | | Limiting data to know workflows ... | | But our most fruitful effort was when we removed limits and | started giving "bad" data. By bad I mean alter the price up or | down by a small percentage. This hit them in the pocket but | again, wasn't a golden bullet. If the customer made a transaction | on the altered figure we we informed them and took it at the | correct price. | | It's a cool problem to tackle but it is just an arms race. | wolverine876 wrote: | > But our most fruitful effort was when we removed limits and | started giving "bad" data. By bad I mean alter the price up or | down by a small percentage. ... If the customer made a | transaction on the altered figure we we informed them and took | it at the correct price. | | Is that legal? It would be a big blow to trust if I was the | customer, but that's without knowing what you were selling and | in what market. | killingtime74 wrote: | It's legal if it's in the contract. Standard for contracts to | allow for mistakes and confirmations of prices | kwhitefoot wrote: | It's not mistake if you do it deliberately! 
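Circling back to biosed's "rate limiting by login" point (and aninteger's question about rate limiting the API), a minimal token-bucket sketch keyed on the account rather than the IP might look like the following; the capacity and refill numbers are illustrative, not taken from the thread.

    import time

    CAPACITY = 60          # burst size per account (illustrative)
    REFILL_PER_SEC = 1.0   # sustained requests per second per account (illustrative)
    _buckets = {}          # account_id -> (tokens, last_refill_timestamp)

    def allow_request(account_id: str) -> bool:
        tokens, last = _buckets.get(account_id, (CAPACITY, time.time()))
        now = time.time()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        tokens = min(CAPACITY, tokens + (now - last) * REFILL_PER_SEC)
        if tokens < 1:
            _buckets[account_id] = (tokens, now)
            return False   # over budget: throttle, challenge, or serve cached data
        _buckets[account_id] = (tokens - 1, now)
        return True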
| killingtime74 wrote: | Yes (not saying it's a mistake) but putting confirmation | can be in the contract, no law says you only get 1 chance | to display price. | rootusrootus wrote: | I know a guy at Nike that had to deal with a similar problem. | As I recall, they basically gave in -- instead of trying to | fight the scrapers, they built them an API so they'd quit | trashing the performance of the retail site with all the | scraping. | chadwittman wrote: | The real Jedi move | wrycoder wrote: | Especially if you charge for it, which would save them | money, because they wouldn't have to redo their code every | time you changed your website. | gonzo41 wrote: | I think there's an opportunity for a new JS framework to have | something like randomly generated dom that will always | display the page and elements the same to a human but | constantly break paths for computers. | | Like displaying a table with semantic elements, then divs, | then using an iframe with css grid and floating values over | the top. | | This almost seems like a problem for AI to solve. | matheusmoreira wrote: | Yes. That's exactly what everyone should do. | echelon wrote: | If data is your competitive advantage or product, then | what? Accept that your market no longer exists and that | there's no way to stop theft? | Grimm1 wrote: | You're going to need to explain how scraping publicly | available information on a website is theft. | | If information is your competitive advantage maybe you | shouldn't have it on a publicly accessible website, and | should instead stick it behind an API with pay tiers and | a very clear license regarding what you may do with it as | an end user. | | Note, a simple sign up being required to view a website | makes it not publicly available information any longer | and you can cover usage, again, in a license. | | Then you have a whole bunch of legal avenues you can use | to protect your work. Assuming you can afford it that is. | achillesheels wrote: | It is copyright information, no? So technically it is | intellectual property theft if the scraping use is for | commercial purposes. | Grimm1 wrote: | No? If you place information publicly on a website it's | pretty much free game, no copyright violation, especially | regarding user generated information. That's my take, but | legally it's a gray area and it's still going back and | forth in the courts (at least in the US) but for a while | before a decision was vacated by the supreme court | scraping publicly available information on a site was | legally protected and seemingly inline with my thoughts | on it. | ransom1538 wrote: | I love the honey pot approach. Put tons of valued hrefs on the | page that are invisible (css) that the scrapper would find. | Then just rate limit that ip address and randomize the data | coming back. Profit. | endymi0n wrote: | > It's a cool problem to tackle but it is just an arms race. | | Plus, it's one you're going to lose. I was once asked at an | All-Hands why we don't defend ourselves against bots even more | vigorously. | | My answer was: "Because I don't know how to build a publically | available website that I could not scrape myself if I really | wanted to." | DeathArrow wrote: | You can put some wasm crypto mining code and at least profit from | bots. :D | abc03 wrote: | I scrap government sites a lot as they don't provide apis. For | mobile proxies, I use the proxidize dongles and mobinet.io (free, | with Android devices). 
As stated in the article, with cgNAT it's | basically impossible to block them as in my case, half the | country couldn't access the sites anymore (if you place them in | several locations and use one carrier each there). | kerokerokero wrote: | Thanks for the share. Great stuff. | | I used to scrape websites to generate content for higher SERPs. | | Ended up going into the adult industry lols. | (https://javfilms.net) | anon9001 wrote: | Neat! I've run across your site organically :P | | I've always wondered, and since you're right here... how do | sites like this make money? | | It looks like you're probably crawling all the JAV vendors, | finding free clips of today's releases, embedding them in your | own site to draw traffic, and making money with affiliate links | to buy the full content? | | Am I missing anything? It seems hard to believe you'd get | enough affiliate signups to make it worthwhile. | | I can imagine your site as being a few hours a year of script | maintenance and a money printer, or a 40hr/week SEO job with | 1000s of similar sites across the adult industry. | | I'd love to know anything you're willing to share about how the | business works. | wilg wrote: | Not the same kind of scraping, but does anyone have | thoughts/resources/best practices for doing link previews (like | Twitter/iMessage/Facebook)? | mrg3_2013 wrote: | wow! That was an interesting read. | neals wrote: | In a particularly hard to scrape website, using some kind of bot | protection that I just couldn't reliably get working (if anybody | wants to know what that was exactly, I'll go and check it) I now | have a small Intel NUC running with firefox that listens to a | local server and uses Temper Monkey to perform commands. Works | like a charm and I can actualy see what it's doing and where it's | going wrong. (though it's not scalable, of course) | | We use it for data-entry on a government website. A human would | average around 10 minutes of clicking and typing, where the bot | takes maybe 10 seconds. Last year we did 12000 entries. Good bot. | nkozyra wrote: | You can use chromium/chrome/cdp and turn headless off and see | the same thing. | funnyflamigo wrote: | I'm curious what bot protection it was? It couldn't have been | trying too hard unless you were employing multiple anti- | fingerprinting techniques, I'm assuming you used firefox's | built in anti-fingerprinting? ___________________________________________________________________ (page generated 2021-11-05 23:00 UTC)