Freelancer has accused Anthropic, the AI startup behind the Claude large language models, of ignoring its robots.txt “do not crawl” protocol to scrape data from its websites. Meanwhile, iFixit CEO Kyle Wiens said Anthropic has ignored the website's policy prohibiting the use of its content for training AI models. Freelancer CEO Matt Barrie told The Information that Anthropic's ClaudeBot is “the most aggressive scraper by far.” His website reportedly received 3.5 million hits from the company's crawler in a four-hour span, which is “probably about five times the volume of the number two AI crawler.” Similarly, Wiens posted on X/Twitter (x.com/kwiens/status/1816304897484284007) that Anthropic's bot hit iFixit's servers a million times in 24 hours. “Not only are they taking our content without paying, they are tying up our development resources,” he wrote.
In June, Wired accused another AI company, Perplexity, of crawling its website despite the presence of the Robots Exclusion Protocol, or robots.txt. A robots.txt file typically contains instructions for web crawlers about which pages they can and cannot access. Compliance is voluntary, and bad bots have mostly just ignored it. After Wired's piece came out, a startup called TollBit, which connects AI companies with content publishers, reported that it's not just Perplexity that's bypassing robots.txt signals. While TollBit didn't name names, Business Insider said it learned that OpenAI and Anthropic were also ignoring the protocol.
Barrie said Freelancer tried to refuse the bot's access requests at first, but ultimately had to block Anthropic's crawler entirely. “This is an egregious crawl [that] makes the site slower for everyone operating on it and ultimately impacts our revenue,” he added. As for iFixit, Wiens said the website has set alarms for high traffic, and its staff were woken up at 3 a.m. by Anthropic's activities. The company's crawler stopped scraping iFixit after the site added a line to its robots.txt file that specifically disallows Anthropic's bot.
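To make that kind of robots.txt rule concrete, here is a minimal sketch in Python. The file contents are an assumption for illustration: the article does not quote the line iFixit actually added, and `SomeOtherBot`, `is_allowed` and the example.com URL are hypothetical. The sketch uses the standard library's `urllib.robotparser` to show how a crawler that honors the protocol would read a rule aimed at one bot. Nothing in it forces compliance, which is the point of the complaints above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (assumption: the exact line iFixit
# added is not quoted in the article). "ClaudeBot" is the crawler name
# mentioned above; the first group targets it, the second covers the rest.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a site named in the article.
    page = "https://example.com/guide/123"
    for agent in ("ClaudeBot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(agent, page) else "disallowed"
        print(f"{agent}: {verdict}")
```

A well-behaved crawler would run a check like this before each request and skip disallowed pages. A site that cannot rely on that has to fall back on server-side blocking, as Freelancer says it did.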
The AI startup told The Information that it respects robots.txt and that its crawler “respected that signal when iFixit implemented it.” It also said that its goal is “to achieve minimal disruption by being careful about how quickly it crawls the same domains,” and that it is now looking into the case.
AI companies use crawlers to collect content from websites that they can use to train their generative AI technologies. As a result, they have been the target of multiple lawsuits, with publishers accusing them of copyright infringement. To head off further lawsuits, companies like OpenAI have been striking deals with publishers and websites. OpenAI's content partners, so far, include News Corp, Vox Media, the Financial Times and Reddit. iFixit’s Wiens also seems open to the idea of signing a deal for the repair instruction website’s articles, telling Anthropic in a tweet that he’s willing to have a conversation about licensing content for commercial use.
<div class="twitter-tweet-wrapper" data-embed-anchor="97d06742-daf4-58cd-a178-dd16b2031211"><blockquote placeholder="" data-theme="light" class="twitter-tweet">
If any of those requests had accessed our terms of service, they would have told you that use of our content is expressly prohibited. But don't ask me, ask Claude!
If you'd like to discuss licensing our content for commercial use, we're here. pic.twitter.com/CAkOQDnLjD
Kyle Wiens (@kwiens), July 24, 2024 (twitter.com/kwiens/status/1816136485785186335?ref_src=twsrc%5Etfw)
<script async src="//platform.twitter.com/widgets.js" charset="utf-8">