
WILDLABS downtime and performance issues due to AI bot attack

Hi everyone,

Some of you will have noticed that WILDLABS was inaccessible or frustratingly slow on Friday (April 26th, 2024). Beyond explaining the downtime, what happened will probably interest those of you on the technology side of things.

The database was under heavy strain from the morning onwards, quickly saturating the server's CPU so that page loads either failed entirely or took a very long time to return results.

There are thousands of pieces of content here (members, groups, discussions, articles, comments, reactions...), all with many references to other content, so a single page can require a lot of database queries. This has been one of the biggest challenges in building WILDLABS. To speed things up, most page loads don't hit the database directly; they hit a cache that is refreshed every time something changes. Since the load was straining the database itself, these requests must have been direct database hits: searches for dynamic results that aren't necessarily cached...
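
For anyone curious about the pattern, here is a minimal sketch of that "cache that updates when something changes" idea, in Python. All names are hypothetical; this is an illustration of the general technique, not the actual WILDLABS code.

```python
import time

class PageCache:
    """Serve rendered pages from memory; rebuild only when content changes."""

    def __init__(self, render_fn):
        self._render_fn = render_fn   # expensive function that runs the database queries
        self._store = {}              # page key -> (rendered_html, cached_at)

    def get(self, page_key):
        if page_key in self._store:
            return self._store[page_key][0]          # cache hit: no database work
        html = self._render_fn(page_key)             # cache miss: hit the database once
        self._store[page_key] = (html, time.time())
        return html

    def invalidate(self, page_key):
        """Call this whenever the content behind a page changes."""
        self._store.pop(page_key, None)

# Usage sketch (hypothetical keys and renderer)
cache = PageCache(render_fn=lambda key: f"<html>rendered {key}</html>")
cache.get("discussion/123")          # miss: database queries run here
cache.get("discussion/123")          # hit: served straight from memory
cache.invalidate("discussion/123")   # e.g. after a new comment is posted
```

The relevant point: searches with arbitrary parameters rarely share a cache key, so a flood of unique search requests goes straight through to the database.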

After a long day of debugging, restarting services, and poring over log files, we spotted the cause in the server access logs.

An AI bot associated with Claude, the model built by the AI company Anthropic, was making millions of search requests to the WILDLABS platform (many per second), using the search interface to discover and then scrape content. Exactly how the scraped data will be used for training is unclear, but it was running searches across every part of the site.
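
If you ever need to do this kind of detective work yourself, a rough sketch of the idea is below: tally requests per user agent in an access log. It assumes the common "combined" log format, where the user agent is the last double-quoted field on each line; the path is hypothetical, not our actual setup.

```python
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical location

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        quoted = re.findall(r'"([^"]*)"', line)   # "request", "referer", "user agent"
        if quoted:
            counts[quoted[-1]] += 1               # user agent is the last quoted field

# The heaviest hitters float straight to the top.
for agent, n in counts.most_common(10):
    print(f"{n:>10}  {agent}")
```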

Once we identified it we blocked it, and it has now stopped making requests. While we were in the logs we noticed a few other bots (also now blocked) attempting similar things, though nowhere near the level of "Claude".
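
Blocking is normally done at the web server or proxy level, but the same idea expressed in application code looks roughly like the sketch below. The agent strings are illustrative examples only, not our actual block list.

```python
BLOCKED_AGENT_SUBSTRINGS = (
    "ClaudeBot",   # Anthropic's crawler reportedly identifies itself with this token
    "GPTBot",      # further examples only; not our actual configuration
    "CCBot",
)

def is_blocked(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in BLOCKED_AGENT_SUBSTRINGS)

# In a web framework you would call this early in the request cycle and
# return a 403 before any database work happens, for example:
#
#   if is_blocked(request.headers.get("User-Agent", "")):
#       return Response(status=403)
```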

In previous years the main threats to websites were security attacks and spam, with denial of service attacks being rare on sites that no one would want to bring down (such as a conservation technology community site).

Unfortunately, we now face a new type of attack: one that is effectively a denial of service, and that at the same time takes content without permission to feed into algorithms.

We'll continue to monitor the access logs and block any new bots, and also look into ways of mitigating this in the future.
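
One mitigation worth considering is rate limiting per client. As a hedged sketch (the numbers are illustrative and nothing here reflects what is actually deployed on WILDLABS), a simple token-bucket limiter keyed by IP might look like this:

```python
import time
from collections import defaultdict

RATE = 2.0     # requests refilled per second, per client (illustrative number)
BURST = 20.0   # maximum requests a client can burst (illustrative number)

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    bucket = _buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True    # within limits: serve the page
    return False       # over the limit: return 429 instead of hitting the database
```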

A big thank you to community member and friend @Alasdair for chatting this through late at night on Friday while we deployed the solution.

I used to love hand-coding scrapers for very specific purposes, but this is a whole other situation: it is getting ever easier to deploy something like this at great scale and let it run wild.

As many of you have a deep knowledge of AI matters, insights into this are welcome.

(Serves me right for temporarily bringing down a historical database of UK seaports by scraping it years ago, though at least I asked for permission beforehand and apologised afterwards.)




I noticed the site being annoyingly slow some time last week. Thank you for finding the cause, fixing the issue, and explaining what happened.

I'm not claiming deep knowledge of AI, but as a member of this community, I'd be happy to give you my insights.

For starters: I am not categorically against bots scraping 'my' content, whether for AI training, search engines, or something else. In principle, I find it a good thing that this forum is open to non-members, and to me that openness extends to non-member uses such as scraping. Obviously, there are some exceptions. For example, when the locations of individual animals of endangered species are discussed, that should stay behind closed doors.

Continuing down this line of reasoning, it apparently does matter to me how 'my' content is being used. So, if someone wants to build an AI that aids nature conservation, I say, let them have it. There is the practical problem of scraping blocking or slowing down the site, but there may be practical solutions for that: say, special 'opening hours' for such activity, or having the site engine prepare the data and make it available as a data dump somewhere.

Since purpose matters, organizations or individuals wanting to scrape the site should be vetted and granted or denied access accordingly. That is far more easily said than done, but every step in that direction would be worthwhile, because most of the technology publicly discussed here has good uses for nature conservation and equally bad uses for nature destruction. For example, acoustic monitoring of bird calls is great for tracking species, but it also comes in handy when you are in the exotic bird trafficking business.

One could argue that, since we allow public access, we should not care why bots are scraping the site either. I would not go that far. After all, an individual browsing the site with nefarious purposes in mind is something quite different from a bot systematically scraping the entire site (or sections of it) for bad purposes. It's a matter of scale.