By: Gor Bagumyan
Data scraping is a tool that companies have used for years to develop their Artificial Intelligence (AI) programs and software. In its most basic form, it is the extraction of data from one source and its transfer to another. This was originally done manually, but years of development have led to automated data scraping programs that can scrape all publicly available data from websites. Many companies, such as Nvidia and OpenAI, use data scraping programs for their own benefit. However, these two companies, along with many others, are facing lawsuits regarding their data scraping practices.
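To make the idea of automated extraction concrete, the following is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not a description of any tool used by the companies discussed in this post: the sample page, the `LinkScraper` class, and the `scrape_links` function are all hypothetical, and the sketch reads a hard-coded page rather than downloading one from a website.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would first download this HTML.
SAMPLE_PAGE = """
<html><head><title>Example Channel</title></head>
<body>
  <a href="/watch?v=abc">First video</a>
  <a href="/watch?v=def">Second video</a>
</body></html>
"""


class LinkScraper(HTMLParser):
    """Collects the address and visible label of every link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # extracted records, in page order
        self._href = None    # address of the link currently being read
        self._text = []      # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        # Start recording when an <a> tag opens (ignored if it has no href).
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # When the link closes, store it as a structured record.
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip()
            self.links.append({"url": self._href, "label": label})
            self._href = None


def scrape_links(html):
    """Return a list of {"url", "label"} records found in the HTML text."""
    parser = LinkScraper()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for link in scrape_links(SAMPLE_PAGE):
        print(link)
```

The sketch turns unstructured page markup into structured records, which is the "one source to another" step described above. Automated programs repeat this step across very large numbers of pages.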
One of the most notable developments in recent times is OpenAI’s ChatGPT, which is built on a type of AI program better known as a Large Language Model (“LLM”). LLMs, along with many other forms of AI, are trained on vast amounts of data in order to produce “human-like” responses when prompted by their users. They do so by recognizing patterns and keywords between what is being asked of them and the large amount of data they have been trained on. Naturally, a significant factor in developing AI programs that can consistently produce accurate, human-like responses is how much data they are trained on. For companies at the forefront of developing this sector of technology, the ability to scrape the internet for data is paramount.
AI developers also focus on different types of data, employing different data scraping programs to collect information from social media websites, news articles, and other sources. While collecting data, these programs may gather both copyrighted and non-copyrighted material created by other people. YouTube, for example, is a platform where thousands of creators upload unique content daily, making it a gold mine for developers looking to build LLMs and other AI models that mirror complex human responses.
Recently, a federal judge in California ordered that two class action lawsuits filed by a YouTube content creator be treated as related cases. The lawsuits claim that Nvidia and OpenAI were unjustly enriched at the expense of the plaintiffs after scraping “millions of YouTube videos” without the consent of the videos’ creators. The class action complaints against Nvidia and OpenAI further allege that the data scraping methods the companies used to bypass YouTube’s blocks against such activities clearly violate YouTube’s terms.
While there have been several cases dealing with similar issues of public data scraping, Meta Platforms, Inc. v. Bright Data Ltd. is one of the more recent decisions coming out of the United States District Court for the Northern District of California. Although Meta was seeking damages under different legal theories than the plaintiffs in the cases involving Nvidia and OpenAI, all of these cases concern the legality of scraping publicly available data. In the Meta case, Bright Data Ltd. scraped data from Facebook and Instagram for its own benefit. However, the data it scraped was all publicly available, and it collected that data without creating an account on either website. This effectively allowed Bright Data to avoid accepting any terms of service agreements, which would otherwise have prohibited scraping password-protected data without the permission of Facebook or Instagram.
Despite the user agreements of Facebook and Instagram, there is a clear line between prohibiting the automated scraping of content that is only available through an account and prohibiting the scraping of content that is not password-protected. Although the court has yet to interpret the relevant clauses within YouTube’s Terms of Service, precedent suggests that Nvidia and OpenAI may never have needed the permission of the videos’ creators in the first place if the creators made their videos publicly available.
Although courts have yet to outright prohibit the scraping of publicly available data, there are significant implications for everyone involved. Access to digital data plays a vital role for AI developers looking to advance this sector of technology. Without the ability to employ programs that scrape vast amounts of publicly available data for the purpose of training AI programs, there would be very little development. However, content creators have legitimate privacy concerns and business interests that they wish to protect from companies systematically taking their content with no compensation.
While it is important to protect the interests of content creators and others alike, restricting the scraping of public data would significantly limit the amount of data available to smaller businesses looking to develop their own AI programs. Contrast this with the position of industry powerhouses such as Google or Meta, whose websites are the ones being scraped by other companies for endless data to build and advance AI technology. Allowing companies to shield publicly available data from harmless data scraping could lead to a monopolization of the industry, as the small percentage of larger companies that have built massive networks would be in control of this practice.
Ultimately, these cases could have significant ramifications for both smaller and larger businesses, since the ability to freely scrape publicly available data is instrumental for developing AI programs. Thus, as our legal system continues to navigate the complex realm of data privacy in an ever-changing industry, it is important to protect users’ data while not hindering the rapid development of AI.
Student Bio: Gor Bagumyan is a second-year law student at Suffolk University Law School and a staff member for the Journal of High Technology Law. Gor received a Bachelor of Arts degree in Economics, with minors in Political Science and Business Management, from Clark University in 2023.
Disclaimer: The views expressed in this blog are the views of the author alone and do not represent the views of JHTL or Suffolk University Law School.