2 Ethical Challenges of Web Scraping
One thing I wrestle with regularly is the ethics of using someone's website as a tool to manipulate data. Typically web scrapers (the actual people) don't ask permission before they start using someone else's website as a meat grinder to process data. Scraping is not new, and there are bots crawling the entire internet ALL THE TIME. My intermittent scrape is like a drop of water in the ocean.
That all being said, we should always consider ethics when we tackle tasks. In this article I will highlight two ethical issues faced by scrapers.
Let's talk about the very basic process flow of scraping:
This is generally how my scraping process works. I have data and I want that data changed somehow, so I send it to a target website. The website does work on my data and outputs the result. Once the work is complete, I choose to use the result as is or continue modifying it with a different website.
Ethical Issue One:
Here comes the issue. I didn't make the website that churns out the data I input. Somebody else did all the hard work. I am just reaping all the benefits.
Is that wrong?
From an automation and efficiency standpoint it's awesome: I don't have to do all the work of reinventing the wheel. In fact, I am even using the resources of some other server on the internet.
However, I can completely understand that it would be a hard pill to swallow knowing that thousands of people are piggybacking off my hard work.
As a person who actually has AdSense approved, I literally make money when my ads are clicked on. If I had some functionality on my site that processed data, would I get money for the visits alone? No. I need engagement; I need people to want to click on the content-relevant ads. The thing about robots is that they just do what they are told to do. BEEP BEEP must visit site, input data, retrieve data BEEP BEEP --> rinse and repeat 100,000 more times. Meanwhile the site owner never gets any of the $ for those 100,000 "visits."
This is the ethical quandary: scrapers can use a site for a long period of time, but the website creator will not reap the benefits.
People do build ad-clicking robots, but this is bad. It is a surefire way to get yourself banned from AdSense. A discussion for another day.
Ethical Issue Two:
Depending on who you talk to about scrapers, you will get a variety of opinions. Some people truly hate the work of scrapers and others are indifferent. Does anyone absolutely love scrapers? I don't know.
Interesting, maybe I need a new hobby. ;)
Nah, I'm fine.
I legitimately think that some web designers have accepted the fact that their sites will/could be used as a meat grinder for data manipulation. When you visit their site, the data you need is placed in a "simple" table (these tables are exceptionally easy to scrape). Or maybe the data is in a text file, also super easy to grab and work with.
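To give a sense of just how easy a "simple" table is, here is a minimal sketch using only Python's standard library. The sample HTML, the `TableParser` class, and the `parse_table` helper are all made up for illustration; a real page may well need more careful handling (nested tables, merged cells, and so on).

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Hypothetical helper: return the table cells as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data standing in for a page with a plain table.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
```

A few dozen lines and the whole table is sitting in a list of lists, ready to be worked with.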
Other sites that have that tasty data you want put up hurdles to stop scrapers. A popular choice is limiting visits. After a certain number of visits the door closes and your IP is remembered.
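I don't know how any particular site implements this, but the idea can be sketched in a few lines. The `VisitLimiter` class, the limit of 3, and the IP addresses (taken from a range reserved for documentation) are all assumptions for illustration, not a real site's code.

```python
class VisitLimiter:
    """Toy per-IP visit limiter: after max_visits, the IP is remembered and refused."""

    def __init__(self, max_visits):
        self.max_visits = max_visits
        self.visits = {}      # ip -> number of visits seen so far
        self.blocked = set()  # IPs that have gone over the limit

    def allow(self, ip):
        """Return True if this visit is let in, False if the door is closed."""
        if ip in self.blocked:
            return False
        count = self.visits.get(ip, 0) + 1
        self.visits[ip] = count
        if count > self.max_visits:
            self.blocked.add(ip)  # remember the IP from now on
            return False
        return True


limiter = VisitLimiter(max_visits=3)
results = [limiter.allow("203.0.113.7") for _ in range(5)]
other_ip_allowed = limiter.allow("203.0.113.8")
```

In this sketch the first three visits from one IP get in and every later one is refused, while a visitor from a different IP is still welcome.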
"Don't let this IP back in!!"
This is not a hard hurdle to get past, as all you need is a different IP. A VPN or free proxies would solve this immediately.
What if the new IP gets blocked? Just switch to a different one.
This is another ethical issue. If someone intentionally builds a block to stop you from scraping, is it ethical to build a workaround to grab that data anyway?
Well, you can actually get into a sticky mess, as some sites have put up legal notices saying that you can't scrape. It's a good idea to avoid these places. I've seen this on government sites and places of a similar ilk.
You have to make up your mind about the ethics on this one. If someone has put up a block, they are publicly saying that they don't like a certain action.
It reminds me of a neighbour who planted hedges to stop the mail delivery person from cutting across the lawn. They hated that the person walked across the lawn rather than walking down the path and using the sidewalk. You know what happened? The mail delivery person stepped over the hedges. My neighbours were incensed.
My example applies here. Just like those hedges, an IP block is the webmaster saying "I don't like what you are doing." Even though the hedges were easy to step over, and the IP block is easy to get past, my ethics tell me the right thing to do is to stop.
I would be lying if I said it wasn't a serious temptation for me, especially since I have built custom POC scripts to do such actions. However, when I see the block, I stop and try desperately to respect that the webmaster doesn't want a robot crawling across their website 100,000 times.
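For what it's worth, "stopping when I see the block" can be built into the script itself. This is only a sketch under some assumptions: that the site signals a block with HTTP status 403 (Forbidden) or 429 (Too Many Requests), which not every site does, and that `fetch` is whatever function your script uses to request a page. The `scrape_politely` name, the `fake_fetch` stand-in, and the example.com URLs are all hypothetical.

```python
# Assumption: the site signals a block with one of these HTTP status codes.
BLOCK_STATUSES = {403, 429}


def scrape_politely(urls, fetch):
    """Fetch urls in order and stop at the first block.

    fetch(url) must return a (status_code, body) tuple.
    Returns (pages, blocked_at): the pages collected so far, and the URL
    where the block was seen (None if there was no block).
    """
    pages = {}
    for url in urls:
        status, body = fetch(url)
        if status in BLOCK_STATUSES:
            # The webmaster has said no: stop here, no retries, no new IP.
            return pages, url
        pages[url] = body
    return pages, None


# Stand-in for a real HTTP client so the sketch runs without a network.
FAKE_SITE = {
    "https://example.com/a": (200, "page a"),
    "https://example.com/b": (429, ""),
    "https://example.com/c": (200, "page c"),
}


def fake_fetch(url):
    return FAKE_SITE[url]


pages, blocked_at = scrape_politely(list(FAKE_SITE), fake_fetch)
```

Here the script keeps the first page, sees the block on the second, and never touches the third. There is deliberately no retry and no proxy switching in it.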
Thanks for reading