The emergence of modern AI tools has brought a new urgency to defining an ethical set of best practices for web data collection that will both protect everyday web users and maintain an innovation-friendly environment for those who gather and use the vast amounts of publicly available data generated each day.

An international, industry-led consortium of web data collectors focused on strengthening public trust, promoting ethical guidelines, and helping businesses make informed data aggregation choices, the Ethical Web Data Collection Initiative (EWDCI) engages in discussions around all aspects of public data gathering and usage. This document narrowly focuses on establishing guidelines for the use of public personal data in training AI, as voluntarily-shared data increases and large language models (LLMs) grow hungrier for learning material.

Our position addresses three core elements of this complex issue: legitimate interest in collecting publicly available data online; exceptions for personal data made public; and striking a balance between protecting the privacy rights of individuals and creating an environment where businesses can use AI to thrive.

Legitimate Interest

It doesn’t make sense to pretend that all the data willingly made public is not there. A legitimate interest exists for using publicly available personal data to train an AI model, as long as safeguards such as public notices, subject access rights, and retention policies are built into the process.

Public Personal Data Exceptions

Given the vast amounts of personal data that people voluntarily make public, there should be a reasonable expectation that public data could be used to train AI models. Allowing AI companies to lawfully train AI models on this data, under a carve-out in the law for personal data manifestly made public, strikes us as the most reasonable way to protect personal data without chilling commerce.

Balance Between Privacy and Commerce

Machine behavior should be guided by human intention and human-centric values. Achieving the proper balance between maintaining personal privacy and achieving smarter artificial intelligence outcomes is not only the right thing to do, but will also shape the very nature of tomorrow’s AI. 

As legislative efforts and public conversations around the regulation of artificial intelligence continue, the EWDCI looks forward to taking an active role in establishing guidelines that will lead to a healthy online environment for Internet users and the companies that serve them.

About EWDCI + i2Coalition

The Ethical Web Data Collection Initiative (EWDCI) seeks to foster cooperation in the web data collection and aggregation industry and to leverage collective first-hand knowledge and insights to advocate for beneficial technical standards and business best practices regarding the aggregation of data. The EWDCI is dedicated to serving as the voice of the industry, collaboratively strengthening public trust in the practice of data aggregation, promoting ethical guidelines, and helping businesses make informed data aggregation choices.

The Internet Infrastructure Coalition (i2Coalition) is the leading voice for web hosting companies, data centers, domain registrars and registries, cloud infrastructure providers, managed services providers, and related tech. The i2Coalition works with Internet infrastructure providers to advocate for sensible policies, design and reinforce best practices, help create industry standards, and build awareness of how the Internet works. The i2Coalition also spearheaded the creation of the VPN Trust Initiative, which determined and promoted best practices for that vital industry.