Perplexity, a company that describes its product as “a free AI search engine,” has come under fire in recent days. Shortly after Forbes accused it of stealing its story and republishing it across multiple platforms, Wired reported that Perplexity has been ignoring the Robots Exclusion Protocol, or robots.txt, and has been scraping its website and other Condé Nast publications. Technology website The Shortcut also accused the company of scraping its articles. Now, Reuters has reported that Perplexity is not the only AI company that is bypassing robots.txt files and scraping websites for content that is then used to train its technologies.
Reuters said it saw a letter addressed to publishers from TollBit, a startup that pairs them with AI companies so they can reach licensing deals, warning them that “AI agents from multiple sources (not just one company) are choosing to avoid the robots.txt protocol to retrieve content from sites.” The robots.txt file tells web crawlers which pages they can access and which they cannot. Web developers have been using the protocol since 1994, but compliance is completely voluntary.
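To make the voluntary nature of the protocol concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard-library urllib.robotparser. The rules, the ExampleAIBot and OtherBot user-agent names, and the example.com URLs are all hypothetical and do not come from any site or company named in this story.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked from the whole site,
# while every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "ExampleAIBot", "https://example.com/story"))    # False
print(is_allowed(parser, "OtherBot", "https://example.com/story"))        # True
print(is_allowed(parser, "OtherBot", "https://example.com/private/doc"))  # False
```

Nothing in this check is enforced by the web server. A crawler that never runs it, or that ignores the result, can still request the page, which is why compliance depends entirely on the crawler's operator.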
TollBit's letter does not name any companies, but Business Insider says it has learned that OpenAI and Anthropic (the creators of the ChatGPT and Claude chatbots, respectively) are also bypassing robots.txt signals. Both companies previously proclaimed that they respect the “do not crawl” instructions that websites place in their robots.txt files.
During its investigation, Wired discovered that a machine on an Amazon server “certainly operated by Perplexity” was bypassing its website's robots.txt instructions. To confirm whether Perplexity was scraping its content, Wired provided the company's tool with headlines from its articles or short prompts describing its stories. The tool reportedly returned results that closely paraphrased its articles “with minimal attribution.” At times, it even generated inaccurate summaries of its stories. In one case, Wired says, the chatbot falsely claimed that the publication had reported on a specific California police officer committing a crime.
In an interview with Fast Company, Perplexity CEO Aravind Srinivas told the publication that his company “doesn't ignore the Robot Exclusion Protocol and then lie about it.” That doesn't mean, however, that it isn't benefiting from crawlers that do ignore the protocol. Srinivas explained that the company uses third-party web crawlers in addition to its own, and that the crawler Wired identified was one of them. When Fast Company asked whether Perplexity had told the crawler provider to stop scraping Wired's website, he only responded that “it's complicated.”
Srinivas defended his company's practices, telling the publication that the Robots Exclusion Protocol “is not a legal framework” and suggesting that publishers and companies like his may have to establish a new type of relationship. He also reportedly hinted that Wired deliberately used prompts to make the Perplexity chatbot behave the way it did, so regular users won't get the same results. As for the inaccurate summaries the tool had generated, Srinivas said, “We've never said we've never hallucinated.”