Indybay Feature
Developer Creates Infinite Maze That Traps AI Training Bots
"Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself."
A pseudonymous coder has created and released an open source “tar pit” to indefinitely trap AI training web crawlers in an infinite, randomly generated series of pages to waste their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources.
“It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself,” Aaron B, the creator of Nepenthes, told 404 Media.
“Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time,” they added. “But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop.”
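The loop Aaron B describes can be illustrated with a short Python sketch. This is not Nepenthes's actual code: the `/maze/` prefix and the names `links_for`, `render_page` and `simulate_crawl` are hypothetical, and the deliberately slow page delivery is left out. It only shows the idea that every generated page links to more generated pages under the same prefix, so a crawler that follows every link it sees never runs out of work.

```python
import hashlib
import random
from collections import deque

PREFIX = "/maze/"  # hypothetical URL prefix served by the tarpit


def links_for(path: str, n_links: int = 5) -> list[str]:
    """Return n_links pseudo-random URLs that all stay under PREFIX."""
    # Seed from the requested path so a given page always yields the same links.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [PREFIX + f"{rng.getrandbits(64):016x}" for _ in range(n_links)]


def render_page(path: str, n_links: int = 5) -> str:
    """Build a bare HTML page containing nothing but links back into the maze."""
    items = "\n".join(
        f'<li><a href="{url}">{url}</a></li>' for url in links_for(path, n_links)
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


def simulate_crawl(start: str, max_fetches: int) -> int:
    """Naive crawler: fetch a page, queue every unseen link it contains.

    Returns how many URLs are still waiting in the queue when the fetch
    budget runs out.
    """
    queue = deque([start])
    seen = {start}
    fetched = 0
    while queue and fetched < max_fetches:
        path = queue.popleft()
        fetched += 1
        for url in links_for(path):
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return len(queue)


if __name__ == "__main__":
    print(render_page(PREFIX + "start"))
    print("URLs still queued after 100 fetches:", simulate_crawl(PREFIX + "start", 100))
```

In this sketch each fetch adds more URLs to the queue than it removes, so the backlog grows for as long as the crawler keeps going. A crawler would have to recognize the pattern itself, for example by capping how many pages it takes from one site, to get out.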
Human users can try Nepenthes for themselves, though I must warn that the page loads incredibly slowly (on purpose) and links endlessly to pages that load the same way.
Aaron B’s website says “THIS IS DELIBERATELY MALICIOUS CODE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN’T FULLY COMFORTABLE WITH WHAT YOU’RE DOING.” It also notes it can be deployed “defensively” to “flood out valid URLs within your site’s domain name, making it unlikely the crawler will access the real content” and “offensively” to actively trap and waste computing power: “Let's say you've got horsepower and bandwidth to burn, and just want to see these AI models burn. Nepenthes has what you need … In short, let them suck down as much bullshit as they have diskspace for and choke on it.”
We have previously written about the difficulty that website owners have had in blocking the web crawlers that train large language models. It is possible to use robots.txt to ask specific bots not to crawl a webpage, but different companies use different bots, the names of those bots often change, and some companies do not honor robots.txt requests or find ways to get around them. Nearly endless internet art projects have proven particularly difficult for bots to crawl; last year, the man who wrote The Internet for Dummies had “the world’s lamest content farm,” a website made up of billions of interconnected single-page sites, hit more than 3 million times by OpenAI’s training bot in a single day. Anthropic’s AI scraper later hit the DIY repair company iFixit more than a million times in a day.
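The limits of robots.txt can be seen in a small sketch using Python's standard `urllib.robotparser` module. The bot names `ExampleBot` and `RenamedBot`, the `example.com` URL and the helper `is_allowed` are made up for illustration. The sketch shows that a rule only covers the bot name it lists, and that the check happens on the crawler's side, so it only works if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that asks one named bot to stay away entirely.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    # The bot named in the file is asked not to crawl.
    print("ExampleBot allowed:", is_allowed("ExampleBot", url))
    # The same crawler under a new name matches no rule, so nothing stops it.
    print("RenamedBot allowed:", is_allowed("RenamedBot", url))
    # A crawler that never performs this check is not affected at all.
```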
“Hearing these stories recently definitely pushed me into putting out a release,” Aaron B said. “It's also sort of an art work, just me unleashing sheer unadulterated rage at how things are going. I was just sick and tired of how the internet is evolving into a money extraction panopticon, how the world as a whole is slipping into fascism and oligarchs are calling all the shots - and it's gotten bad enough we can't boycott or vote our way out, we have to start causing real pain to those above for any change to occur.”
Since they made and deployed a proof-of-concept, Aaron B said their pages have been hit millions of times by internet-scraping bots. On a Hacker News thread, someone claiming to be an AI company CEO said a tarpit like this is easy to avoid; Aaron B told 404 Media “If that’s true, I’ve several million lines of access log that says even Google Almighty didn’t graduate” to avoiding the trap.
For more information:
https://www.404media.co/email/7a39d947-4a4...