International Regulators’ Unease With AI Data-Scraping Creates Gulf With U.S.
Mass data scraping of personal information can constitute a reportable data breach in many jurisdictions.
International privacy regulators are calling on social media giants to protect their users’ publicly available information from web-scraping, a common business practice in the United States but one that in other jurisdictions could constitute a reportable data breach.
“Social media companies have obligations under U.K. data protection law to protect the information people post on their platforms,” said Stephen Bonner, head of Britain’s top privacy watchdog, the Information Commissioner’s Office (ICO). “We are seeing increased reports of mass data-scraping from social media and remind organizations that such incidents may require reporting to the ICO as a personal data breach.”
The ICO was joined by 11 other data protection authorities—from China, Australia, Switzerland, Norway, Jersey, New Zealand, Colombia, Argentina, Mexico, and Morocco—in a statement that Bonner said should help provide “certainty, and consistency across borders, in how data protection applies to information people post online.”
It’s also a warning that companies must have a lawful reason for collecting personal data, even when it is publicly available. Social media platforms and website operators that host publicly accessible personal data have certain obligations with respect to third-party data-scraping, the statement says. “These obligations will generally apply to personal information, whether that information is publicly accessible or not. Mass data-scraping of personal information can constitute a reportable data breach in many jurisdictions.”
The statement was sent directly to companies running the world’s major social media platforms: Alphabet, Meta, ByteDance, Weibo, Microsoft, and X Corp. The watchdogs say they are worried about identity fraud, spam marketing, and profiling and surveillance by authorities, as well as targeted cyberattacks carried out by malicious actors using swathes of personal data posted to hacking forums.
Scraped data can also be exploited by foreign governments and intelligence agencies, or harvested without permission for dubious purposes, as with British consulting firm Cambridge Analytica, whose misuse of Facebook data for targeted political advertising triggered one of the most infamous data scandals in recent history.
Privacy regulators outline a litany of measures companies should take to mitigate these risks, including cracking down on bots, limiting the number of times per day that an account can visit other account profiles, and dedicating in-house teams to monitor and respond to scraping activity.
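One of those measures, capping how many profiles an account can view per day, can be illustrated with a short sketch. The regulators’ statement does not prescribe any implementation, so everything here is an assumption made for illustration: the class name `DailyProfileViewLimiter`, the `max_views_per_day` threshold, and the in-memory counter are all hypothetical, and a real platform would likely keep such counts in a shared data store.

```python
from collections import defaultdict
from datetime import date


class DailyProfileViewLimiter:
    """Hypothetical sketch: caps profile views per account per calendar day."""

    def __init__(self, max_views_per_day: int):
        # The threshold is an assumed, platform-chosen value.
        self.max_views_per_day = max_views_per_day
        # Maps (account_id, day) to the number of views recorded that day.
        self._counts = defaultdict(int)

    def allow_view(self, account_id: str, today: date) -> bool:
        """Record one view and return True, or return False if the cap is reached."""
        key = (account_id, today)
        if self._counts[key] >= self.max_views_per_day:
            return False
        self._counts[key] += 1
        return True


limiter = DailyProfileViewLimiter(max_views_per_day=2)
day = date(2023, 8, 24)
results = [limiter.allow_view("account-a", day) for _ in range(3)]
# The third request on the same day is refused once the cap of two is reached.
```

A cap like this only slows scraping from a single account. It would not, on its own, stop a scraper that spreads requests across many accounts, which may be why the regulators list it alongside bot detection and dedicated monitoring teams.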
In an interview with Treasury & Risk sister publication Law.com, privacy scholar Omer Tene, a Goodwin Procter partner and senior fellow at the Future of Privacy Forum, said the statement reflects the attitude “that access to information, even if it’s publicly available, if it’s done at scale could lead to different harms.”
The statement also comes amid surging commercial demand for data to train machine learning algorithms and the attendant fight over data ownership.
As lawsuits against artificial intelligence (AI) companies alleging copyright and privacy violations pile up, social media platforms are looking to courts to prevent data-scrapers from accessing public information on their sites. “The data is incredibly valuable for the platforms, especially today with generative AI and LLMs [large language models]. A lot of players want to access data to teach algorithms,” said Tene, who also noted that entire businesses have been built on scraping publicly accessible information.
In his forthcoming “Field Guide to AI Law,” scheduled to be published in January 2024, privacy expert Lothar Determann writes that data-scraping has been a longstanding practice. “Companies have been scraping so commonly that the Practising Law Institute (PLI) hosted a conference in 2014 with the title ‘Everyone’s Doing It, But Is It Legal? Web Scraping and Online Data Harvesting.’”
He added: “Many online companies scrape content off other sites while complaining about being victims of scraping themselves. … Most companies welcome the robots deployed by search engines to ensure their websites are easily found. At the same time, however, they typically try to protect their sites from scraping and deep-linking that causes users to bypass portal sites (and their advertisements) or that otherwise affects their business interests adversely.”
In a recent interview, Determann said data-scraping in itself is neither good nor bad. “Accessing a website with automated tools is not per se illegal or problematic—far from it,” he said. “So long as website users respect applicable terms and technical limitations, they may use automation to access information online.”
But the recent watchdog statement on privacy could be taken as an invitation for companies to restrict access to public data, clashing with the U.S. principle of open access to public information, one that federal case law has so far staunchly protected.
Earlier this year, a federal judge in South Carolina ruled that the state court bureaucracy’s blanket ban on data-scraping its repository of case records violated the First Amendment, as it undermined a local chapter of the NAACP’s efforts to fight evictions and bring Fair Housing Act litigation.
Business interests also collided in hiQ Labs’ long-running court battle with LinkedIn, in which the now-defunct data analytics firm faced an anti-hacking Computer Fraud and Abuse Act claim for scraping LinkedIn members’ public profiles. HiQ prevailed on that issue in 2022, when the U.S. Court of Appeals for the Ninth Circuit ruled in its favor on a motion for a preliminary injunction, noting that on a platform like LinkedIn “the default is free access.”
The appeals court panel also expressed deep concern over the risk of creating an “information monopoly in the U.S. where companies have free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use.”
Tene drew a distinction between American and European views on privacy and data protection.
“In Europe, a company needs a legal basis (that is, positive permission) to process data. You are allowed to do only what the law explicitly sanctions,” he wrote in a blog post on the case for the International Association of Privacy Professionals. “Whereas in the U.S., the opposite is true. A company (anyone, really) is allowed to do anything with data, as long as the law doesn’t prohibit it.”
Without commenting on any specific case or company, Determann said the recent joint privacy statement presents a “conundrum,” where “on one hand, in the U.S., companies are being told they should be opening up more, and for competition-law purposes they should be sharing nicely. And on the other hand, they’re being told to lock things down.”
Determann, who has argued that data shouldn’t be subject to property rights, said the privacy regulators are “basically warning everyone that they should be more mindful of what other people do on their websites.” While “no one owns data, companies own responsibilities with respect to personal data.”
Social media companies should reply by the end of the month, the regulators said, with feedback “demonstrating how they comply with the expectations outlined in this joint statement.”
Tene said it will be tough for companies to balance their commercial interests with the regulators’ expectations that they protect public information from data-scraping. But they shouldn’t brush it off, he said. “If I’m subject to the jurisdiction of any of these dozen regulators that tell you they view it as a privacy violation, I would take it seriously.”
From: Corporate Counsel