Indybay Feature

Developer Creates Infinite Maze That Traps AI Training Bots

by Jason Koebler
"Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself."
A pseudonymous coder has created and released an open source “tar pit” that indefinitely traps AI training web crawlers in an infinite, randomly generated series of pages to waste their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources.

“It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself,” Aaron B, the creator of Nepenthes, told 404 Media.

“Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time,” they added. “But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop.”
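The loop Aaron B describes can be illustrated with a small in-memory simulation. This is only a sketch of the general idea, not Nepenthes' actual code: the function names `tarpit_page` and `naive_crawl`, the link format, and the fetch limit are all assumptions made for illustration, and the deliberate response delay that the real tool adds is left out.

```python
import random
import string

def tarpit_page(path, links_per_page=5):
    """Hypothetical tarpit: any requested path yields random links back into the tarpit."""
    rng = random.Random(path)  # seed by path so the same URL always returns the same page
    return [
        "/" + "".join(rng.choices(string.ascii_lowercase, k=8))
        for _ in range(links_per_page)
    ]

def naive_crawl(start, max_fetches):
    """Follow every discovered link; the only safeguard is exact-URL deduplication."""
    queue, seen, fetched = [start], {start}, 0
    while queue and fetched < max_fetches:
        url = queue.pop(0)
        fetched += 1
        for link in tarpit_page(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched, len(queue)

# The crawl only stops because of the artificial fetch limit;
# the queue of unvisited links keeps growing faster than it drains.
fetched, pending = naive_crawl("/", max_fetches=100)
print(fetched, pending)
```

Because every page yields several never-before-seen URLs, deduplicating visited URLs does not help the crawler; it would need some other signal, such as a per-site page budget, to notice that it is stuck.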

Human users can see how Nepenthes works by clicking here, though I must warn that the page loads incredibly slowly (on purpose) and links endlessly to pages that load the same way. It looks like this, in practice:

Aaron B’s website says “THIS IS DELIBERATELY MALICIOUS CODE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN’T FULLY COMFORTABLE WITH WHAT YOU’RE DOING.” It also notes it can be deployed “defensively” to “flood out valid URLs within your site’s domain name, making it unlikely the crawler will access the real content” and “offensively” to actively trap and waste computing power: “Let's say you've got horsepower and bandwidth to burn, and just want to see these AI models burn. Nepenthes has what you need … In short, let them suck down as much bullshit as they have diskspace for and choke on it.”

We have previously written about the difficulty that website owners have had in blocking the web crawlers that train large language models. It is possible to use robots.txt to ask specific bots not to crawl a webpage, but different companies use different bots, the names of those bots often change, and some companies do not honor robots.txt requests or find ways to get around them. Nearly endless internet art projects have proven particularly difficult for bots to crawl; last year, the man who wrote The Internet for Dummies had “the world’s lamest content farm,” a website made up of billions of interconnected single-page sites, hit more than 3 million times by OpenAI’s training bot in a single day. Anthropic’s AI scraper later hit the DIY repair company iFixit more than a million times in a day.
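To make the robots.txt limitation concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names `ExampleAIBot` and `RenamedCrawler` and the URL are hypothetical, chosen only for illustration. The sketch shows what a well-behaved crawler would conclude from a rule that names one bot; a crawler that ignores robots.txt simply never performs this check.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks one named (hypothetical) bot to stay away
# and allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url="https://example.com/article"):
    """Return whether a compliant crawler with this user agent may fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("ExampleAIBot"))    # the bot named in the rule
print(allowed("RenamedCrawler"))  # the same operator under a new, unlisted name
```

The rule only binds the name it lists, so a bot that changes its user agent falls through to the permissive default until the site owner notices and adds another entry.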

“Hearing these stories recently definitely pushed me into putting out a release,” Aaron B said. “It's also sort of an art work, just me unleashing sheer unadulterated rage at how things are going. I was just sick and tired of how the internet is evolving into a money extraction panopticon, how the world as a whole is slipping into fascism and oligarchs are calling all the shots - and it's gotten bad enough we can't boycott or vote our way out, we have to start causing real pain to those above for any change to occur.”

Since they made and deployed a proof-of-concept, Aaron B said their pages have been hit millions of times by internet-scraping bots. On a Hacker News thread, someone claiming to be an AI company CEO said a tarpit like this is easy to avoid; Aaron B told 404 Media “If that’s true, I’ve several million lines of access log that says even Google Almighty didn’t graduate” to avoiding the trap.