Web Scraping Fact Sheet

What web data aggregation is, and what it does

What is “web scraping”?

In essence, web data aggregation (“web scraping”) is simply going to a web page, extracting data from it, and organizing that data to make it understandable. When done at scale, the results can be powerful. Whenever you search for something on Google, what you see is the result of web scraping at massive scale. There are, in other words, some very compelling positive use cases.

How does it work?

If you’ve ever copied and pasted something from a website, you’ve operated as a web scraper, only without the scale. Modern web data aggregation uses bots and database software to gather large amounts of data, parse that data, and present it in a way that’s understandable to humans. There are open-source data aggregation tools available, as well as proprietary tools and companies that specialize in web scraping.
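
To make that gather, parse, and present pipeline concrete, here is a minimal sketch in Python. It is illustrative only: it assumes the third-party requests and beautifulsoup4 libraries are installed, and the URL and CSS class names are hypothetical. A real aggregator would add error handling, scheduling, and a proper database.

    # A minimal sketch of the fetch -> parse -> structure workflow.
    # The URL and CSS selectors below are hypothetical examples.
    import csv

    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/products"  # hypothetical product-listing page

    # 1. Fetch: download the raw HTML, identifying ourselves politely.
    response = requests.get(
        URL, headers={"User-Agent": "example-aggregator/1.0"}, timeout=10
    )
    response.raise_for_status()

    # 2. Parse: turn the HTML into a navigable tree and pull out the fields we need.
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for item in soup.select(".product"):  # assumed class name; depends on the site's markup
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name and price:
            rows.append({"name": name.get_text(strip=True), "price": price.get_text(strip=True)})

    # 3. Structure: store the results in a machine-readable format (here, CSV).
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)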

Wait, What’s a Bot?

A bot is an autonomous program that performs (often repetitive) tasks on the Internet much faster than a human could. There are good bots and bad bots, depending on how they’re configured and how they’re deployed. And there are a lot of bots out there: over 40% of Internet traffic is made up of bot activity. A brief sketch of a well-behaved bot follows the examples below.

Examples of good bot behavior: Web crawlers, customer service chatbots

Examples of bad bot behavior: Spambots, DDoS attacks, click fraud automation
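
To illustrate what separates good bot behavior from bad, here is a minimal sketch of a well-behaved bot written in Python using only the standard library. The site, paths, and user-agent string are hypothetical; the point is that a good bot consults robots.txt before fetching and pauses between requests instead of hammering the server.

    # A sketch of a "good bot": it checks robots.txt before fetching and waits
    # between requests so it doesn't overload the site. The URLs are hypothetical.
    import time
    import urllib.robotparser
    import urllib.request

    BASE = "https://example.com"
    USER_AGENT = "example-monitor-bot/1.0"
    PAGES = ["/news", "/weather", "/listings"]  # hypothetical pages to check repeatedly

    # Read the site's robots.txt so we only visit paths the operator allows.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(BASE + "/robots.txt")
    robots.read()

    for path in PAGES:
        url = BASE + path
        if not robots.can_fetch(USER_AGENT, url):
            print(f"Skipping {url}: disallowed by robots.txt")
            continue
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=10) as response:
            print(f"Fetched {url}: {response.status} ({len(response.read())} bytes)")
        time.sleep(5)  # rate-limit: a polite pause between requests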

What can be done with aggregated web data?

  • Search engine functionality
  • Gauging customer sentiment
  • E-commerce competitor analysis
  • Hotel and flight price comparison
  • Market research
  • Academic research
  • Real estate listings
  • Weather data monitoring
  • Search engine optimization (SEO)
  • Website change detection
  • Online reputation management
  • Data visualization

Is web scraping legal?

Yes. In fact, the modern Internet wouldn’t work very well without it! Different jurisdictions have laws about which types of data can be aggregated, how they can be aggregated, and how that data can be used. The European Union’s General Data Protection Regulation (GDPR) is the most widely known example, but there are many more, and new laws are being written as we speak.

Web Scraping Does Not Equal Hacking

Web scraping involves extracting data from a website and storing it in a structured format. Hacking involves unauthorized access or manipulation of a computer or network. If web scraping is the equivalent of street photography, hacking would be like setting up a camera in someone’s house.

Why does web data aggregation often have a bad reputation?

Web data aggregation can be done maliciously. For example, aggressive scraping that ignores rate limits can overwhelm a website’s servers, with effects much like a DDoS attack. The collection of personal data at scale also often runs afoul of both local laws and the bounds of accepted business behavior. However, this bad reputation comes from a few bad actors; most web scrapers conduct ethical web data aggregation for one of the reasons listed above.

What is the good that comes out of it?

When collected and sifted in certain ways, a dizzying multitude of anecdotes becomes actionable data. As companies and individuals, we can accomplish much more when we’re able to make sense of the vast amounts of disparate bits of information out there. However, like any technique, web scraping can be misused. That’s why we’re forming this group: to prevent bad actors from sullying an industry that can accomplish a whole lot of good.

The Public Deserves Digital Peace of Mind

The Ethical Web Data Collection Initiative (EWDCI) is an international, industry-led consortium of web data collectors focused on strengthening public trust, promoting ethical guidelines, and helping businesses make informed data aggregation choices.

The EWDCI is dedicated to defining positive, beneficial uses for the capabilities and potential of web data collection and aggregation at scale.

What EWDCI Does:

  • Advocate for responsible web data collection and use of personal data
  • Educate and guide the industry on ethical resources and tools for web data collection
  • Foster consumer confidence in data collection through transparency and accountability
  • Enable commercial innovation
  • Promote online safety

Our goal is to prevent harmful legislation from passing worldwide and, where appropriate, to seek inclusion of our principles in federal law. In addition, we are building a framework to establish an open, participatory process for developing legal and ethical principles for web scraping providers.

“Scraping” doesn’t have to be a dirty word.

About EWDCI + i2Coalition

The Ethical Web Data Collection Initiative (EWDCI) seeks to foster cooperation in the web data collection and aggregation industry and to leverage collective first-hand knowledge and insights to advocate for beneficial technical standards and business best practices regarding the aggregation of data. The EWDCI is dedicated to serving as the voice of the industry, collaboratively strengthening public trust in the practice of data aggregation, promoting ethical guidelines, and helping businesses make informed data aggregation choices.

The Internet Infrastructure Coalition (i2Coalition, i2C) is the leading voice for web hosting companies, data centers, domain registrars and registries, cloud infrastructure providers, managed services providers, and related tech. The i2C works with Internet infrastructure providers to advocate for sensible policies, design and reinforce best practices, help create industry standards, and build awareness of how the Internet works. The i2Coalition also spearheaded the creation of the VPN Trust Initiative, which defined and promoted best practices for that vital industry.