Much has been said about data scraping. Here is a breakdown of what it is, why it can be problematic, and how we might deal with it going forward.
For a recent example of abusive data scraping, we do not have to go far back in time: in April 2021, researchers discovered a database containing the personal details of more than 500 million Facebook users circulating on hacker forums. Not much later, similar reports surfaced about a leaked LinkedIn database. Analysis of both incidents showed that the attackers did not even need to break into the servers of the social media platforms to get hold of the data. They used a handy technique called "data scraping". How does this technique work, and how big a danger does data scraping pose to Internet users?
Data scraping is essentially a way of transferring data from one system to another, but it differs from more conventional data transfer methods. The main difference is in the output: the scraped data does not serve as input for another computer program, but is intended for display to an end user. Data scraping is therefore a fairly crude technique, typically used only when there is no other way to extract data from a system, for example a legacy system that is no longer compatible with modern hardware. The output is often highly unstructured, because formatting, binary data and other additional information are not transferred. This can even cause programs to crash during data scraping.
There are different technical variants of data scraping. The oldest form is screen scraping. With screen scraping, a special tool is connected to an obsolete computer system. The scraping tool pretends to be a human user and simulates the keystrokes needed to navigate through the system's interface. The tool then reads the data off the screen output and passes it on to the new system. This way of working inspired more modern automation tools that operate on the same basis.
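The extraction step can be sketched as slicing fields out of a fixed-width terminal screen, which is how classic screen scrapers typically read data. The screen layout, record format and column offsets below are invented for illustration; a real legacy system would have its own screen format.

```python
# Minimal sketch of screen scraping: extracting fields from a
# fixed-width terminal screen dump. The column layout (0-9 for the ID,
# 10-29 for the name, 30-39 for the status) is invented for this example.
SCREEN = (
    "CUST-0001 ALICE EXAMPLE       ACTIVE    \n"
    "CUST-0002 BOB EXAMPLE         SUSPENDED \n"
)

def scrape_screen(screen: str):
    """Slice each terminal line at known column offsets into a record."""
    records = []
    for line in screen.splitlines():
        records.append({
            "id":     line[0:9].strip(),
            "name":   line[10:30].strip(),
            "status": line[30:40].strip(),
        })
    return records

for record in scrape_screen(SCREEN):
    print(record)
```

Note how brittle this is: if the legacy system shifts a column by one character, the scraper silently extracts garbage, which is one reason scraped output is often unstructured or corrupt.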
In addition to screen scraping, there is web scraping, which is used to extract data from web pages. The principle is more or less the same: again, you usually need a scraping tool, which makes the website believe it is dealing with an ordinary visitor using a regular browser. Most websites today have built-in detection algorithms to recognise such tools and deny them access, so large-scale scraping incidents like the one at Facebook are actually quite rare – at least as far as we know.
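As a rough illustration of the parsing half of web scraping, here is a sketch using Python's built-in html.parser module. The HTML snippet and the profile-name class are invented for the example; a real scraper would first fetch the page over HTTP (for instance with urllib.request) instead of parsing a hard-coded string.

```python
# Sketch of web scraping: pulling structured data out of raw HTML.
# The page structure below is invented; real pages vary widely.
from html.parser import HTMLParser

class ProfileScraper(HTMLParser):
    """Collects the text of every element marked class="profile-name"."""
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if ("class", "profile-name") in attrs:
            self.in_target = True

    def handle_data(self, data):
        if self.in_target:
            self.names.append(data.strip())
            self.in_target = False

page = """
<html><body>
  <div class="profile-name">Alice Example</div>
  <div class="profile-name">Bob Example</div>
</body></html>
"""

scraper = ProfileScraper()
scraper.feed(page)
print(scraper.names)  # ['Alice Example', 'Bob Example']
```

This is exactly why everything rendered for human visitors is also available to bots: the scraper works on the same HTML the browser receives.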
Data scraping is not in itself an illegal practice. Established cloud providers such as Amazon Web Services offer secure web scraping tools in the form of free APIs. Like any computer program, data scraping only becomes dangerous when the tools fall into the wrong hands – as happened at Facebook.
In the Facebook incident, the scraped database did contain personal data such as phone numbers and email addresses. If cybercriminals get hold of this data, they can use it for phishing and other types of fraud. It is true that data scraping is initially far less intrusive than hacking into someone's account, and you will probably not be directly affected by a scraping attack. In the long run, however, it can make you more vulnerable to phishing. The recent LinkedIn leak seems less intrusive and contained less sensitive data, but any kind of data can be useful to a cybercriminal. Data scraping can open the door to spear phishing attacks: attackers can learn the names of superiors, ongoing projects, trusted companies or organizations, and so on – essentially everything they need to craft a message plausible enough to provoke the desired response from their victims.
As a user of a website, there is basically not much you can do against a scraping attack, except carefully manage what information you share about yourself on that website. On Facebook, for example, do a regular privacy check to find out what you are actually sharing. Ultimately, the responsibility lies in what you share yourself – and that is probably not always easy, given all the problems we see these days. Also bear in mind that the effects of someone accessing your personal information might not manifest for a long time. By the time someone abuses your data, you might already have forgotten that you ever shared it with the network.
As a website owner, you must keep in mind that everything that is visible and accessible to human visitors is potentially also visible to scraping bots. There are technical tricks that can be applied to protect the content, although they have their limitations. You can often recognise a scraping attempt by a high number of requests sent to your website from a single IP address (not to be confused with a DDoS attack, which also floods a site with requests); you can then block that suspicious IP address. In other cases, locking content behind login credentials can go a long way: the scraper then has to expose a part of itself in order to get access to the content. Regularly changing your HTML can confuse scrapers to such an extent that they go scrape elsewhere, although the downside is that this approach can also confuse your own web developers. The use of CAPTCHAs or lots of media files can also discourage scraping attempts, but bear in mind that bots are sometimes coded to break specific CAPTCHA patterns, or may rely on third-party services that use human labour to solve CAPTCHA challenges in real time. On the legal side, companies can take action against data scrapers by explicitly forbidding the practice in their terms of service. That does not stop scraping by itself, but it can be used in lawsuits.
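The request-counting heuristic described above can be sketched as a sliding-window counter per IP address. The threshold and window values below are illustrative, not recommendations; production systems would also persist state and handle clock sources properly.

```python
# Sketch of rate-based scraper detection: flag an IP address that sends
# more requests inside a time window than a chosen threshold.
from collections import defaultdict, deque

class ScraperDetector:
    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip: str, now: float) -> bool:
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_requests

detector = ScraperDetector(max_requests=5, window_seconds=10)
# A burst of 6 requests in 6 seconds from one IP trips the check.
flags = [detector.is_suspicious("203.0.113.7", t) for t in range(6)]
print(flags[-1])  # True
```

A real bot operator can of course spread requests across many IP addresses, which is why rate limiting is only one layer of defence alongside logins, CAPTCHAs and markup changes.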
All kinds of actors use web scraping bots – unscrupulous competitors, internet upstarts, cybercriminals, hackers and spammers – to steal whatever content the bots are programmed to find. These bots often mimic regular user behaviour, which makes them hard to detect and even harder to block. Web scraping poses a serious challenge to a website's brand: it can threaten sales and conversions, lower SEO rankings, and undermine the integrity of content that took time and resources to produce. An even bigger problem lies behind it: the growth of phishing attempts and ransomware attacks built on the stolen, scraped data of the attacked website's users. That is why web designers and social media companies should think hard about taking the necessary measures against this kind of attack. Understanding the intrusive nature of today's web scraping danger not only raises awareness of this growing challenge, it also allows website owners to take action to protect their proprietary content and the privacy of their users! Let's hope they all read this blog.