radikal.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
radikal.social was created by a group of activists to offer federated social media for the radical left in and around Denmark.

#webscraping

I'm having trouble figuring out what kind of botnet has been hammering our web servers over the past week. Requests come in from tens of thousands of addresses, just once or twice each (so they don't get blocked by fail2ban), with different browser strings (Chrome versions ranging from 24.0.1292.0 to 108.0.5163.147) and ridiculous cobbled-together paths like this:

/about-us/1-2-3-to-the-zoo/the-tiny-seed/10-little-rubber-ducks/1-2-3-to-the-zoo/the-tiny-seed/the-nonsense-show/slowly-slowly-slowly-said-the-sloth/the-boastful-fisherman/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/brown-bear-brown-bear-what-do-you-see/pancakes-pancakes/pancakes-pancakes/the-tiny-seed/pancakes-pancakes/pancakes-pancakes/slowly-slowly-slowly-said-the-sloth/the-tiny-seed

(I just put together a bunch of Eric Carle titles as an example. The actual paths are pasted together from valid paths on our server but in invalid order, with as many as 32 subdirectories.)

Has anyone else been seeing this and do you have an idea what's behind it?
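
One way to pick such requests out of an access log is to look at the path itself rather than at the client address, since each address only shows up once or twice. The following is a minimal sketch, not a tested defence: the function name `looks_stitched` and both thresholds (`max_depth`, `max_repeats`) are assumptions made up for illustration and would need tuning against real paths on a given server.

```python
from collections import Counter


def looks_stitched(path: str, max_depth: int = 8, max_repeats: int = 2) -> bool:
    """Return True if a request path is unusually deep or repeats a segment.

    max_depth and max_repeats are illustrative guesses, not tuned values.
    """
    # Split into non-empty segments, ignoring leading/trailing slashes.
    segments = [s for s in path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A segment that appears more than max_repeats times suggests a path
    # pasted together from valid pieces in an invalid order.
    return any(count > max_repeats for count in Counter(segments).values())


if __name__ == "__main__":
    for p in ["/about-us/the-tiny-seed", "/a/b/a/b/a/b"]:
        print(p, looks_stitched(p))
```

A heuristic like this only flags candidates; a site whose legitimate URLs are deep or repetitive would need different limits.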

#legal #WebScraping #EU #BestPractices

'At The Markup, some of our data journalists recently had questions about the legal risks involved in scraping websites hosted in the European Union. We conducted our own research to answer this question, and offer a summary of what we learned below. Our goal is to help other journalists, researchers, and advocates come up with a low-risk strategy for scraping in the EU.'

hackernoon.com/a-guide-on-how-

hackernoon.com · A Guide on How to Legally Web Scrape EU Data | HackerNoon
Scraping has long existed in a legally gray area, so journalists and other researchers tend to approach it cautiously.

"On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack.

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.”

OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.

It sells the 3D object files, as well as photos — everything from hands to hair, skin, and full bodies — to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics."

techcrunch.com/2025/01/10/how-

TechCrunch · How OpenAI's bot crushed this seven-person company's web site ‘like a DDoS attack’ | TechCrunch
OpenAI was sending “tens of thousands” of server requests trying to download Triplegangers' entire site which hosts hundreds of thousands of photos.

"The success of generative AI relies heavily on training on data scraped through extensive crawling of the Internet, a practice that has raised significant copyright, privacy, and ethical concerns. While few measures are designed to resist a resource-rich adversary determined to scrape a site, crawlers can be impacted by a range of existing tools such as robots.txt, NoAI meta tags, and active crawler blocking by reverse proxies.

In this work, we seek to understand the ability and efficacy of today’s networking tools to protect content creators against AI-related crawling. For targeted populations like human artists, do they have the technical knowledge and agency to utilize crawler-blocking tools such as robots.txt, and can such tools be effective? Using large scale measurements and a targeted user study of 182 professional artists, we find strong demand for tools like robots.txt, but significantly constrained by significant hurdles in technical awareness, agency in deploying them, and limited efficacy against unresponsive crawlers. We further test and evaluate network level crawler blockers by reverse-proxies, and find that despite very limited deployment today, their reliable and comprehensive blocking of AI-crawlers make them the strongest protection for artists moving forward."

arxiv.org/html/2411.15091v1#S3

arxiv.org · Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers

"Now that the seal is broken on scraping Bluesky posts into datasets for machine learning, people are trolling users and one-upping each other by making increasingly massive datasets of non-anonymized, full-text Bluesky posts taken directly from the social media platform’s public firehose—including one that contains almost 300 million posts.

Last week, Daniel van Strien, a machine learning librarian at open-source machine learning library platform Hugging Face, released a dataset composed of one million Bluesky posts, including when they were posted and who posted them. Within hours of his first post—shortly after our story about this being the first known, public, non-anonymous dataset of Bluesky posts, and following hundreds of replies from people outraged that their posts were scraped without their permission—van Strien took it down and apologized."

404media.co/bluesky-posts-mach

404 Media · Your Bluesky Posts Are Probably In A Bunch of Datasets Now
After a machine learning librarian released and then deleted a dataset of one million Bluesky posts, several other, even bigger datasets have appeared in its place—including one of almost 300 million non-anonymized posts.

I'm not sure these kinds of tools are legal in the European Union...

"Fast-forward to today, and millions of artists have deployed two tools born from that Zoom: Glaze and Nightshade, which were developed by Zhao and the University of Chicago’s SAND Lab (an acronym for “security, algorithms, networking, and data”).

Arguably the most prominent weapons in an artist’s arsenal against nonconsensual AI scraping, Glaze and Nightshade work in similar ways: by adding what the researchers call “barely perceptible” perturbations to an image’s pixels so that machine-learning models cannot read them properly. Glaze, which has been downloaded more than 6 million times since it launched in March 2023, adds what’s effectively a secret cloak to images that prevents AI algorithms from picking up on and copying an artist’s style. Nightshade, which I wrote about when it was released almost exactly a year ago this fall, cranks up the offensive against AI companies by adding an invisible layer of poison to images, which can break AI models; it has been downloaded more than 1.6 million times.

Thanks to the tools, “I’m able to post my work online,” Ortiz says, “and that’s pretty huge.” For artists like her, being seen online is crucial to getting more work. If they are uncomfortable about ending up in a massive for-profit AI model without compensation, the only option is to delete their work from the internet. That would mean career suicide."

technologyreview.com/2024/11/1

MIT Technology Review · The AI lab waging a guerrilla war over exploitative AI
By Melissa Heikkilä

#CyberSecurity #BitTorrent #P2P #OrpheusNetwork #WebScraping #Privacy #DataProtection: "Orpheus Network, a popular and private music torrent tracker, experienced a “massive peer scraping attack” that may have exposed the IP addresses, files shared, and other information about users earlier this month, site administrators told its roughly 19,000 users.

“With great displeasure we need to inform you that a malicious actor has successfully carried out a massive peer scraping attack on our tracker on Thursday,” a note from admins posted to the site on September 18 read. “The unknown actor has downloaded the majority of our torrent files and corresponding peer lists. This means the malicious third party is now in possession of most of our users' torrent client information (seeding IP, client port, torrents seeded). As far as we can observe their immediate goal is downloading a huge part of our library, but we do not know if they have further plans with the collected data.”

The attack is notable because it comes against a private torrent tracker that requires users to be invited or to pass through an interview process."

404media.co/major-private-musi

404 Media · Major Private Music Torrenting Site Suffers ‘Massive Peer Scraping Attack’
Orpheus Network tells users: "With great displeasure we need to inform you that a malicious actor has successfully carried out a massive peer scraping attack on our tracker."

#AI #GenerativeAI #AITraining #CyberSecurity #Botnets #WebScraping: "In the race to build the world's most advanced AI, tech companies have fanned out across the web, releasing botnets like a plague of digital locusts to scour sites for anything they can use to fuel their voracious models.

It's often high quality training data they're after, but also other information that may help AI models understand the world. The race is on to collect as much information as possible before it runs out, or the rules change on what's acceptable.

One study estimated that the world's supply of usable AI training data could be depleted by 2032. The entire online corpus of recorded human experience may soon be inadequate to keep ChatGPT up to date.

A resource like the Game UI Database, where a human has already done the painstaking labor of cleaning and categorizing images, must have looked like an all-you-can-eat-buffet.

For small website owners with limited resources, the costs of playing host to a swarm of hungry bots can present a significant burden."

businessinsider.com/openai-ant

Insider · OpenAI and Anthropic AI bots cause havoc and raise costs for websites
By Darius Rafieyan

#Australia #SocialMedia #AI #Facebook #GenerativeAI #AITraining #WebScraping: "Facebook is scraping the public data of all Australian adults on the platform, it has acknowledged in an inquiry.

The company does not offer Australians an opt-out option like it does in the EU, because it has not been required to do so under privacy law.

What's next?
Facebook representatives could not say whether an opt-out option would be offered to Australians in the future."

abc.net.au/news/2024-09-11/fac

ABC News · Facebook admits to scraping every Australian adult user's public photos and posts to train AI, with no opt-out option
By Jake Evans

#AI #GenerativeAI #Nvidia #YouTube #NetFlix #WebScraping: "A YouTube creator filed a class action lawsuit against Nvidia on Wednesday, claiming that Nvidia profited significantly from his and others’ videos, violated California’s Unfair Competition Law, and unjustly enriched the company at his and other creators’ expense, after a recent 404 Media investigation revealed Nvidia scraped YouTube and other platforms en masse to build its own AI systems.

YouTuber David Millette claims that Nvidia has enriched itself unjustly and broken federal labor laws in building a new video model based on content scraped from YouTube without his or other creators’ permission."

404media.co/nvidia-sued-for-sc

404 Media · Nvidia Sued for Scraping YouTube After 404 Media Investigation
David Millette, a YouTube creator, filed a class action lawsuit against Nvidia citing “unjust enrichment and unfair competition” for how the company built its training data for the “Cosmos” project video model.

#AI #GenerativeAI #AITraining #Anthropic #WebCrawlers #WebScraping #Robotstxt: "Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy/pasting outdated instructions to their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt.

In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic’s real (and new) scraper bot unblocked.

This is an example of “how much of a mess the robots.txt landscape is right now,” the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly-shifting landscape of web crawlers and scrapers—many of them operated by AI companies—and which helps website owners regularly update their robots.txt files to prevent specific types of scraping. The site has seen a huge increase in popularity as more people try to block AI from scraping their work."

404media.co/websites-are-block

404 Media · Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)
Hundreds of sites have put old Anthropic scrapers on their blocklist, while leaving a new one unblocked.
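
The mismatch described in this post, where a robots.txt names retired crawlers but not the current one, can be checked mechanically. Below is a rough sketch using Python's standard `urllib.robotparser`. The user-agent names `OldCrawler` and `NewCrawler` and the sample file are placeholders invented for illustration, not the names of any real company's bots; the helper `blocked_agents` is likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only a retired crawler name.
ROBOTS_TXT = """\
User-agent: OldCrawler
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str = "/") -> dict[str, bool]:
    """Map each user-agent name to True if robots.txt disallows it for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # Placeholder names; substitute the crawler names you actually care about.
    for agent, blocked in blocked_agents(ROBOTS_TXT, ["OldCrawler", "NewCrawler"]).items():
        print(f"{agent}: {'blocked' if blocked else 'NOT blocked'}")
```

Note that this only reports what the file asks for. As the posts in this feed point out, robots.txt has no effect on crawlers that ignore it.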

#AI #GenerativeAI #AITraining #Anthropic #iFixit #WebScraping: "The web scraper bot for Anthropic’s AI chatbot Claude hit iFixit’s website nearly a million times in a single day, despite the repair database having terms of service provisions that state “reproducing, copying or distributing any Content, materials or design elements on the Site for any other purpose, including training a machine learning or AI model, is strictly prohibited without the express prior written permission of iFixit.”

iFixit CEO Kyle Wiens tweeted Wednesday “Hey @AnthropicAI: I get you're hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours? You're not only taking our content without paying, you're tying up our devops resources. Not cool.”"

404media.co/anthropic-ai-scrap

404 Media · Anthropic AI Scraper Hits iFixit’s Website a Million Times in a Day
“We're just the largest database of repair information in the world, no big deal if they take it all without asking and swamp our servers in the process.”

#UK #SocialMedia #TikTok #WebScraping #Media #News #Journalism: "The Chinese owner of TikTok has been accused of using UK news sites to train up its rival to ChatGPT without permission or fair payment.

Publishers including The Guardian, Daily Mail and The Telegraph are believed to have been targeted by a bot operated by the Beijing-based tech giant Bytedance.

The company has said its bot, dubbed Bytespider, has been deployed for “search optimisation” purposes.

However, news organisations are concerned that their articles are being used without permission to train chatbots and have raised concerns about copyright violations."

telegraph.co.uk/business/2023/

The Telegraph · TikTok owner ‘scraping’ UK news sites to train ChatGPT rival
Bytedance’s bot is said to have targeted publishers including The Guardian and Daily Mail