I self-host some of my git repositories to keep sovereignty and independence
from large Internet corporations. Public-facing repositories are for everybody,
and today that means for robots. Robots are the main consumers of my work. With
the AI hype, I wanted to have a look at what those AI companies are collecting
from my work. It is worse than everything, it is idiotically everything. They
can’t even recognize that they are parsing git repositories and use the appropriate
way of downloading them.
I analyzed the Apache log files of my cgit service for the period from
2025-01-01 to 2025-04-20.
Table 1 shows the top users of my public-facing git repositories. The
leading AI companies, OpenAI and Anthropic, with their respective bots
GPTBot and ClaudeBot, simply dominate the load on the service. I found it
unbelievable that each could extract roughly 7 GiB of data. That is a lot of
bandwidth out of my server for a few git repositories behind a lightweight web
interface.
Table 1: Top 10 users ranked by bandwidth usage (Tx). The User Agent column is how they identify themselves.

| Requests | Tx MiB | User Agent |
|---------:|-------:|------------|
| 3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) |
| 1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected]) |
| | | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; Google Other) |
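For reference, the numbers in table 1 come from a straightforward aggregation of the access log. Below is a minimal sketch of how such a per-agent tally can be computed, assuming Apache’s standard combined log format; the log file path is a placeholder.

```python
import re
from collections import defaultdict

# Apache "combined" log format; lines that don't match are skipped.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def aggregate(logfile):
    """Sum requests and transmitted bytes per user-agent string."""
    requests = defaultdict(int)
    tx_bytes = defaultdict(int)
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None:
                continue
            agent = m.group("agent")
            requests[agent] += 1
            size = m.group("bytes")
            tx_bytes[agent] += 0 if size == "-" else int(size)
    return requests, tx_bytes

if __name__ == "__main__":
    reqs, tx = aggregate("access.log")  # path is an assumption
    for agent in sorted(tx, key=tx.get, reverse=True)[:10]:
        print(f"{reqs[agent]:>9} {tx[agent] / 2**20:>9.1f} MiB  {agent}")
```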
What does it look like as a function of time? Figure 1 shows the
load on the CGit frontend service by each visiting agent over time. Hover over the
plot to read the exact value for each agent at a given time in the legend. You
can highlight a specific curve by hovering over it or its legend entry, and toggle
a curve by clicking on its legend entry.
Figure 1: Load on the CGit frontend service by each visiting agent. The black dashed line shows the total requests at the server and uses the right axis scale. All other solid lines use the left axis and represent the bandwidth usage.
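The curves in a plot like figure 1 are just the same log binned per day. A sketch of that binning, under the same assumptions as above; the substrings used to label the agents are my own grouping, not an exact reproduction of the plot.

```python
import re
from collections import defaultdict
from datetime import datetime

LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" '
    r'\d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings used to label the curves; anything else falls into "Rest".
BOTS = ["GPTBot", "ClaudeBot", "Barkrowler", "AhrefsBot", "PetalBot",
        "Bytespider", "Scrapy", "SeekportBot"]

def daily_bandwidth(logfile):
    """MiB transmitted per (day, agent label)."""
    series = defaultdict(float)
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None or m.group("bytes") == "-":
                continue
            # e.g. "20/Apr/2025:06:25:24 +0000" -> keep only the date part
            day = datetime.strptime(m.group("time").split()[0],
                                    "%d/%b/%Y:%H:%M:%S").date()
            label = next((b for b in BOTS if b in m.group("agent")), "Rest")
            series[(day, label)] += int(m.group("bytes")) / 2**20
    return series
```

`series[(date(2025, 1, 15), "ClaudeBot")]` then holds the MiB that agent pulled on that day.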
You can see how aggressively ClaudeBot scrapes pages, using a lot of
bandwidth in a short time. OpenAI’s GPTBot, on the other hand, seems rate-limited,
because it scrapes over a longer period of time. However, as seen in
table 1, it performs more than twice the number of requests and consumes
30% more bandwidth.
The rest of the visitors are bots too. Barkrowler is a regular visitor
gathering metrics for online marketing. AhrefsBot is of the same type, yet
only started crawling in March. Macintosh is certainly a bot hiding behind a
browser user agent and constantly probing. Scrapy is also a scraper; it came at the
start of the year and never came back.
PetalBot belongs to a search engine with AI recommendations by Huawei; it lingers
and slowly scrapes everything. Seekport is a search engine; it came all of a
sudden, took as much as it found useful, <1% of what the big AI bots take, and
swiftly left again.
Bytespider is almost background noise, but it also exists to train an LLM, this
time for ByteDance, the Chinese owner of TikTok.
The last one, Google, doesn’t even seem to be the bot that indexes its search
engine, but rather one that tests how its Chrome browser renders pages.
Rest is all the remaining robots and users. They have consumed around
≈400 MiB, placing them in aggregate in the same league as Macintosh, Scrapy,
PetalBot and AhrefsBot. Mostly these are hacker bots probing the site. Which also means
that ~400 MiB is roughly what it takes to crawl the site. AI crawlers siphoning 10X
that amount is abusive.
CGit is a web interface for git repositories. You can browse some of my
code, a few files, or a diff in isolation; that is its use. If you want
everything, the correct way to use this service is through the git client.
Download my publicly available software with it.
That also makes the data a lot more useful, even for those AI companies,
because isolated code is far easier to clean up and manipulate. They should
use their own AI to recognize what kind of page they are visiting and act
accordingly, instead of stupidly scraping everything.
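To make “acting accordingly” concrete: a crawler can probe whether a URL is served over git’s smart HTTP protocol and, if so, clone once instead of walking thousands of generated cgit pages. A hedged sketch of that idea; the probe uses the standard /info/refs?service=git-upload-pack endpoint, the clone is delegated to the ordinary git client, and the example URL is hypothetical (the clone URL of a repository is not necessarily its cgit page URL).

```python
import subprocess
import urllib.request

def is_git_repo(url):
    """Return True if `url` answers git's smart-HTTP discovery request."""
    probe = url.rstrip("/") + "/info/refs?service=git-upload-pack"
    try:
        with urllib.request.urlopen(probe, timeout=10) as resp:
            ctype = resp.headers.get("Content-Type", "")
            # Smart HTTP servers advertise themselves with this content type.
            return ctype.startswith("application/x-git-upload-pack-advertisement")
    except OSError:
        return False

def fetch(url, dest):
    """Clone once with the git client instead of scraping every generated page."""
    if is_git_repo(url):
        subprocess.run(["git", "clone", "--depth=1", url, dest], check=True)
    else:
        ...  # fall back to an ordinary page fetch

# Hypothetical example:
# fetch("https://git.example.org/hugo-minimalist-theme", "hugo-minimalist-theme")
```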
How have the good citizens behaved? That is shown in table 2. The Software Heritage keeps mirrors of git repositories; it watches for updates and
then downloads them. Other people besides them downloaded too, but in
total they only transferred ≈21 MiB. That is 0.3% of what ClaudeBot took.
They are requesting the web front end of the git repositories, of course, but is there a pattern?
Table 3 shows the status codes of all requests performed by the users.
The failure rate of OpenAI is alarming: of its 3.5 million requests, 15%
are client errors (404 Not Found), and those errors alone consume ≈2 GiB of
bandwidth. What is their scraper doing so wrong? ClaudeBot, as noted earlier,
manages to scrape with half the requests and an error rate of 1.6%.
Everybody else aggregates all the remaining users. They do have an error rate of
25%, but that is normal, as they are generally hacker robots scanning for
vulnerabilities. You are always under attack on the internet.
Table 3: Pages served and their HTTP status codes per user agent.

| Agent | 2XX | 3XX | 4XX | 5XX | 4XX MiB | 4XX % |
|-------|----:|----:|----:|----:|--------:|------:|
| OpenAI-GPTBot | 3017848 | 0 | 554511 | 121 | 2060.23 | 15.52 |
| Everybody else | 99066 | 467 | 34630 | 14 | 101.96 | 25.81 |
| ClaudeBot | 1591179 | 26 | 25611 | 446 | 162.67 | 1.58 |
| Barkrowler | 272343 | 0 | 1618 | 7 | 5.35 | 0.59 |
| Macintosh | 79071 | 2 | 1086 | 0 | 7.87 | 1.35 |
| Bytespider | 13609 | 0 | 531 | 2 | 3.94 | 3.75 |
| PetalBot | 69223 | 0 | 473 | 1 | 3.2 | 0.68 |
| Scrapy | 207240 | 0 | 348 | 183 | 1.14 | 0.17 |
| AhrefsBot | 59733 | 0 | 90 | 9 | 0.61 | 0.15 |
| Google | 3576 | 0 | 2 | 0 | 0.02 | 0.06 |
| SeekportBot | 2500 | 0 | 0 | 0 | 0.0 | 0.0 |
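Table 3 falls out of the same access log by classifying each response by its status code. A minimal sketch, again assuming the combined log format:

```python
import re
from collections import Counter, defaultdict

LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def status_classes(logfile):
    """Count 2XX/3XX/4XX/5XX responses and 4XX bytes per user agent."""
    classes = defaultdict(Counter)
    err_bytes = defaultdict(int)
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None:
                continue
            agent, status = m.group("agent"), m.group("status")
            classes[agent][status[0] + "XX"] += 1
            if status.startswith("4") and m.group("bytes") != "-":
                err_bytes[agent] += int(m.group("bytes"))
    return classes, err_bytes

# The 4XX % column is then classes[a]["4XX"] / sum(classes[a].values()) * 100.
```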
Let’s have a look at the most requested not-found pages. Table 4 lists
each page path and the requests per bot. With one exception, all
pages are placeholder links used in website theme templates. The repository
hugo-minimalist-theme is a Hugo theme. Within the curly braces {{ }} the
rendering engine replaces values. Evidently the bots’ HTML parsers read them
raw from the link’s a tag and request the page. ClaudeBot seems to keep track of
error pages and doesn’t query them repeatedly. OpenAI is incapable of doing
that, and stubbornly tries over and over again.
If you grep for the string href="{{ .RelPermalink }}" over the entire git
history of that repository, you find it appears 954 times up to today. It is
surprising and annoying that OpenAI manages to request it triple that many times.
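For the curious, this is roughly how such a count can be reproduced: sum the matches in the tree of every commit. A sketch using the ordinary git client through subprocess; the repository path in the example is a placeholder.

```python
import subprocess

PATTERN = 'href="{{ .RelPermalink }}"'

def count_in_history(repo):
    """Sum occurrences of PATTERN in the tree of every commit of `repo`."""
    commits = subprocess.run(
        ["git", "-C", repo, "rev-list", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    total = 0
    for commit in commits:
        # `git grep -c` prints "<commit>:<path>:<count>" per matching file
        # and exits with status 1 when there is no match at all.
        out = subprocess.run(
            ["git", "-C", repo, "grep", "-c", "-F", PATTERN, commit],
            capture_output=True, text=True,
        ).stdout
        total += sum(int(line.rsplit(":", 1)[1]) for line in out.splitlines())
    return total

# print(count_in_history("hugo-minimalist-theme"))
```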
What about the hackers? Table 5 excludes the AI bots’ idiotic
crawling and thus exposes what the rest are looking for. You instantly recognize
that they are probing the attack surface of the site. First comes producing a 400 Bad Request against the main site. Then come attempts to steal or find environment secrets in the
.env file, or the git configuration. Finally, the most common type of attack aims
to exploit the remote code execution vulnerability in PHPUnit by looking for the file
eval-stdin.
Table 5: Top 10 attacks leading to error pages, ranked by number of requests. The Agents and IPs columns count the distinct agents and IPs making the requests.
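A table like this comes from the same log by keeping only the error responses from non-AI agents and counting, per requested path, the hits plus the distinct agents and IPs behind them. A minimal sketch; the list of agents to exclude is a simplification of what I actually filtered.

```python
import re
from collections import defaultdict

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?:\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)
AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider", "PetalBot")  # excluded agents

def attack_paths(logfile):
    """Per requested path: request count, distinct agents, distinct IPs."""
    hits = defaultdict(lambda: [0, set(), set()])
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None or not m.group("status").startswith("4"):
                continue
            if any(bot in m.group("agent") for bot in AI_BOTS):
                continue
            parts = m.group("request").split()
            path = parts[1] if len(parts) >= 2 else "(malformed request)"
            entry = hits[path]
            entry[0] += 1
            entry[1].add(m.group("agent"))
            entry[2].add(m.group("ip"))
    return hits

# top = sorted(attack_paths("access.log").items(), key=lambda kv: -kv[1][0])[:10]
```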
Quite a few webmasters have been annoyed by this abusive scraping by AI bots. The
project Anubis implements a proof-of-work tax on the visitors of a webpage, with
the expectation of reducing the abusive scraping by AI bots.
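The mechanism is essentially hashcash: the server hands out a challenge, the visitor burns CPU until it finds a nonce whose hash clears a difficulty threshold, and only then gets the page. A generic sketch of that idea, not Anubis’s actual protocol or parameters:

```python
import hashlib
import secrets

def pow_challenge():
    """Server side: issue a random challenge string."""
    return secrets.token_hex(16)

def solve(challenge, difficulty_bits=20):
    """Client side: search for a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge, nonce, difficulty_bits=20):
    """Server side: one hash to check work that cost the client ~2**difficulty_bits hashes."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```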
I personally dislike that idea. It does create an extra expense for the AI
companies, which indiscriminately crawl the internet, but no one really wins from
it. It is a failure of our internet ecosystem that micropayments aren’t yet a
reality. They would be the proper way to implement the tax, giving the website
operator some revenue.
For me, that means being part of the change and bringing my bitcoin lightning
tipping system back online, this time with real coins. We need to get people
used to paying for resources on the internet. For that we need a working
infrastructure; we can’t wait for the banking system to provide it. In my opinion,
the main reason our internet so aggressively invades our privacy is that
the banking system never provided a way to move money across the internet. The
only people motivated enough to endure the inconvenience of traditional payments
were advertising companies.
Knowing how stupid the AI crawlers are, I believe poisoning the training data is
a better way to hurt AI companies for such aggressive and mindless crawling than
a proof-of-work tax. Projects like Iocane provide a way to do it, and that is what
I’ll implement in the future.
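Without claiming anything about how Iocane works internally, the general shape of such poisoning is a filter in front of the real pages that recognizes AI user agents and feeds them plausible-looking nonsense instead of the repository content. A toy sketch of the idea; the agent list and word list are placeholders:

```python
import random

AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider")  # agents to poison
WORDS = "commit branch merge rebase entropy teapot nonsense gradient".split()

def babble(n_words=400, seed=None):
    """Plausible-looking filler text with no relation to the real content."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def respond(user_agent, real_page):
    """Serve the real page to humans and generated garbage to known AI scrapers."""
    if any(bot in user_agent for bot in AI_BOTS):
        return babble(seed=hash(user_agent))
    return real_page
```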