When Reddit said last month that it would block unauthorized data mining from its site, everyone’s first (justified) reaction was “ai, ai, ai.” Now that the change has gone into effect, however, chatbot creators may not be the only ones left without access. The widely used forum also appears to be blocking all major search engines except for Brave and Google, the latter of which reportedly signed a deal earlier this year with Reddit worth $60 million annually. However, a Reddit spokesperson told Engadget that the empty search results are because Google’s rivals don’t agree with the company’s requirements for ai training. It says it’s in talks with several of them.
404 Media ai-deal/” rel=”nofollow noopener” target=”_blank” data-ylk=”slk:reported on Wednesday;elm:context_link;elmt:doNotAffiliate;cpos:2;pos:1;itc:0;sec:content-canvas”>reported on Wednesday (And Engadget confirmed in our queries) that last week’s Reddit results search on rival engine Bing (using “site:reddit.com”) returned empty results. The publication reported that DuckDuckGo produced seven links with no description at all, providing only the note: “We’d like to show you a description here, but the site won’t let us.” Now, the engine appears to have removed even those, as our test only produced an empty page that said, “No results found.”
When Reddit technology/reddit-update-web-standard-block-automated-website-scraping-2024-06-25/” rel=”nofollow noopener” target=”_blank” data-ylk=”slk:said last month;elm:context_link;elmt:doNotAffiliate;cpos:3;pos:1;itc:0;sec:content-canvas”>said last month While Google announced it would update its Robots Exclusion Protocol (robots.txt) to block automated data scraping, it is now apparent that it was not just aimed at thwarting ai companies like Perplexity and its controversial “answer engine.” Currently, Google appears to be the only search engine allowed to crawl Reddit and produce results from “the front page of the internet.”
A Reddit spokesperson told Engadget on Wednesday that it’s not accurate to say the missing search results are a result of its agreement with Google. “We block all crawlers who are unwilling to commit to not using crawl data for ai training, which is in line with enforcement of our Public Content Policy and updated robots.txt file,” the company said. “Anyone accessing Reddit content must comply with our policies, including those put in place to protect redditors. We are selective about who we work with and who we trust with broad-scale access to Reddit content.”
Meanwhile, a source familiar with Reddit’s thinking told Engadget on Wednesday that Bing’s omission is due to Microsoft refusing to agree to Reddit’s terms regarding ai crawling. Instead, Bing’s creator allegedly claimed that its standard web controls were sufficient. The source claims that Microsoft’s stance conflicts with Reddit’s data privacy policy, leading to the impasse and empty search results.
The ubiquitous robots.txt file is the web standard that communicates which parts of a site can be crawled. While many crawlers are known to ignore its instructions, Google's standard procedure is to respect it. So, from a technical standpoint, the companies colluding in this lucrative arrangement appear to have implemented some form of manual override.
The saga could be seen as a domino effect of ai chatbots crawling the live web for results. As courts are slow to determine how much of the open web is fair game to train chatbots, companies like Reddit — whose bottom line now depends on protecting their data from non-payers — are building walls at the expense of the open web. (Though, given the integral role Microsoft has played in this ai era, by cosying up to OpenAI early on, it seems ironic that Bing finds itself on the losing end of at least one aspect of the fallout.)
Colin Hayhurst, CEO of the lesser-known “no-tracking” search engine Mojeek, said 404 Media that Reddit is “killing everything related to search except Google.” The executive also said his attempts to contact Reddit were ignored. “It’s never happened to us before,” he said. “As we do this, we get blocked, usually out of ignorance or stupidity or whatever, and when we contact the site, we can certainly resolve it, but we’ve never had a response from anyone before.”
Reddit has made no secret of its desire to prevent ai companies from using its trove of data in this burgeoning age of artificial intelligence. Last year, CEO Steve Huffman risked alienating a large portion of its user base by blocking third-party API requests, leading to the demise of beloved apps like Christian Selig’s Apollo. Despite widespread protests among moderators and forum-goers, the company only temporarily lost a negligible number of users.
The gamble seemed to pay off, and Reddit recovered. It went public in March.
Update, July 24, 2024, 5:00 PM ET:This story has been updated to add statements from Reddit and additional context from sources familiar with the company's thinking.