I self-host some of my git repositories to keep sovereignty and independence
from large Internet corporations. Public-facing repositories are for everybody,
and today that means for robots. Robots are the main consumers of my work. With
the AI hype, I wanted to have a look at what those AI companies are collecting
from my work. It is worse than everything, it is idiotically everything. They
can’t even recognize that they are parsing git repositories and use the appropriate
way of downloading them.
I analyzed the Apache log files of my cgit service for the period from
2025-01-01 to 2025-04-20.
Table 1 shows the top users of my public-facing git repositories. The
leading AI companies, OpenAI and Anthropic, with their respective bots
GPTBot and ClaudeBot, simply dominate the load on the service. I found it
unbelievable that each could extract roughly 7 GiB of data. That is a lot of
bandwidth out of my server for a few git repositories behind a lightweight web
interface.
Table 1: Top 10 users ranked by bandwidth usage (Tx). The User Agent column is how they identify themselves.

| Requests | Tx MiB | User Agent |
|---------:|-------:|------------|
| 3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) |
| 1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected]) |
| | | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; Google Other) |
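For reference, the numbers in table 1 come from a straightforward aggregation of the access log. Below is a minimal sketch of how such a per-agent tally can be computed, assuming Apache’s standard combined log format; the log file path is a placeholder.

```python
import re
from collections import defaultdict

# Apache "combined" log format; lines that don't match are skipped.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def aggregate(logfile):
    """Sum requests and transmitted bytes per user-agent string."""
    requests = defaultdict(int)
    tx_bytes = defaultdict(int)
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None:
                continue
            agent = m.group("agent")
            requests[agent] += 1
            size = m.group("bytes")
            tx_bytes[agent] += 0 if size == "-" else int(size)
    return requests, tx_bytes

if __name__ == "__main__":
    reqs, tx = aggregate("access.log")  # path is an assumption
    for agent in sorted(tx, key=tx.get, reverse=True)[:10]:
        print(f"{reqs[agent]:>9} {tx[agent] / 2**20:>9.1f} MiB  {agent}")
```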
What does it look like as a function of time? Figure 1 shows the
load on the CGit frontend service by each visiting agent over time. Hover over the
plot to read the exact value for each agent at a given time in the legend. You
can highlight a specific curve by hovering over it or its legend entry, and toggle
a curve by clicking on its legend entry.
Figure 1: Load on the CGit frontend service by each visiting agent. The black dashed line shows the total requests at the server and uses the right axis scale. All other solid lines use the left axis and represent the bandwidth usage.
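The curves in a plot like figure 1 are just the same log binned per day. A sketch of that binning, under the same assumptions as above; the substrings used to label the agents are my own grouping, not an exact reproduction of the plot.

```python
import re
from collections import defaultdict
from datetime import datetime

LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" '
    r'\d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings used to label the curves; anything else falls into "Rest".
BOTS = ["GPTBot", "ClaudeBot", "Barkrowler", "AhrefsBot", "PetalBot",
        "Bytespider", "Scrapy", "SeekportBot"]

def daily_bandwidth(logfile):
    """MiB transmitted per (day, agent label)."""
    series = defaultdict(float)
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None or m.group("bytes") == "-":
                continue
            # e.g. "20/Apr/2025:06:25:24 +0000" -> keep only the date part
            day = datetime.strptime(m.group("time").split()[0],
                                    "%d/%b/%Y:%H:%M:%S").date()
            label = next((b for b in BOTS if b in m.group("agent")), "Rest")
            series[(day, label)] += int(m.group("bytes")) / 2**20
    return series
```

`series[(date(2025, 1, 15), "ClaudeBot")]` then holds the MiB that agent pulled on that day.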
You can see how aggressively ClaudeBot scrapes pages, using a lot of
bandwidth in a short time. OpenAI’s GPTBot, on the other hand, seems rate-limited,
because it scrapes over a longer period of time. However, as seen in
table 1, it performs more than twice the number of requests and consumes
30% more bandwidth.
The rest of the visitors are bots too. Barkrowler is a regular visitor
gathering metrics for online marketing. AhrefsBot is of the same type, yet
only started crawling in March. Macintosh is certainly a bot hiding behind a
browser user agent and constantly probing. Scrapy is also a scraper; it came at the
start of the year and never came back.
PetalBot belongs to a search engine with AI recommendations by Huawei; it lingers
and slowly scrapes everything. Seekport is a search engine; it came all of a
sudden, took as much as it found useful, <1% of what the big AI bots take, and
swiftly left again.
Bytespider is almost background noise, but it also exists to train an LLM, this
time for ByteDance, the Chinese owner of TikTok.
The last one, Google, doesn’t even seem to be the bot that indexes its search
engine, but rather one that tests how its Chrome browser renders pages.
Rest is all the remaining robots and users. They have consumed around
≈400 MiB, placing them in aggregate in the same league as Macintosh, Scrapy,
PetalBot and AhrefsBot. Mostly these are hacker bots probing the site. Which also means
that ~400 MiB is roughly what it takes to crawl the site. AI crawlers siphoning 10X
that amount is abusive.
CGit is a web interface for git repositories. You can browse some of my
code, a few files, or a diff in isolation; that is its use. If you want
everything, the correct way to use this service is through the git client.
Download my publicly available software with it.
That also makes the data a lot more useful, even for those AI companies,
because isolated code is far easier to clean up and manipulate. They should
use their own AI to recognize what kind of page they are visiting and act
accordingly, instead of stupidly scraping everything.
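To make “acting accordingly” concrete: a crawler can probe whether a URL is served over git’s smart HTTP protocol and, if so, clone once instead of walking thousands of generated cgit pages. A hedged sketch of that idea; the probe uses the standard /info/refs?service=git-upload-pack endpoint, the clone is delegated to the ordinary git client, and the example URL is hypothetical (the clone URL of a repository is not necessarily its cgit page URL).

```python
import subprocess
import urllib.request

def is_git_repo(url):
    """Return True if `url` answers git's smart-HTTP discovery request."""
    probe = url.rstrip("/") + "/info/refs?service=git-upload-pack"
    try:
        with urllib.request.urlopen(probe, timeout=10) as resp:
            ctype = resp.headers.get("Content-Type", "")
            # Smart HTTP servers advertise themselves with this content type.
            return ctype.startswith("application/x-git-upload-pack-advertisement")
    except OSError:
        return False

def fetch(url, dest):
    """Clone once with the git client instead of scraping every generated page."""
    if is_git_repo(url):
        subprocess.run(["git", "clone", "--depth=1", url, dest], check=True)
    else:
        ...  # fall back to an ordinary page fetch

# Hypothetical example:
# fetch("https://git.example.org/hugo-minimalist-theme", "hugo-minimalist-theme")
```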
How have the good citizens behaved? That is shown in table 2. The Software Heritage keeps mirrors of git repositories; it watches for updates and
then downloads them. Other people besides them downloaded too, but in
total they only transferred ≈21 MiB. That is 0.3% of what ClaudeBot took.
They are requesting the web front end of the git repositories, of course, but is there a pattern?
Table 3 shows the status codes of all requests performed by the users.
The failure rate of OpenAI is alarming: of its 3.5 million requests, 15%
are client errors (404 Not Found), and those errors alone consume ≈2 GiB of
bandwidth. What is their scraper doing so wrong? ClaudeBot, as noted earlier,
manages to scrape with half the requests and an error rate of 1.6%.
Everybody else aggregates all the remaining users. They do have an error rate of
25%, but that is normal, as they are generally hacker robots scanning for
vulnerabilities. You are always under attack on the internet.
Table 3: Pages served and their HTTP status codes per user agent.

| Agent | 2XX | 3XX | 4XX | 5XX | 4XX MiB | 4XX % |
|-------|----:|----:|----:|----:|--------:|------:|
| OpenAI-GPTBot | 3017848 | 0 | 554511 | 121 | 2060.23 | 15.52 |
| Everybody else | 99066 | 467 | 34630 | 14 | 101.96 | 25.81 |
| ClaudeBot | 1591179 | 26 | 25611 | 446 | 162.67 | 1.58 |
| Barkrowler | 272343 | 0 | 1618 | 7 | 5.35 | 0.59 |
| Macintosh | 79071 | 2 | 1086 | 0 | 7.87 | 1.35 |
| Bytespider | 13609 | 0 | 531 | 2 | 3.94 | 3.75 |
| PetalBot | 69223 | 0 | 473 | 1 | 3.2 | 0.68 |
| Scrapy | 207240 | 0 | 348 | 183 | 1.14 | 0.17 |
| AhrefsBot | 59733 | 0 | 90 | 9 | 0.61 | 0.15 |
| Google | 3576 | 0 | 2 | 0 | 0.02 | 0.06 |
| SeekportBot | 2500 | 0 | 0 | 0 | 0.0 | 0.0 |
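Table 3 falls out of the same access log by classifying each response by its status code. A minimal sketch, again assuming the combined log format:

```python
import re
from collections import Counter, defaultdict

LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def status_classes(logfile):
    """Count 2XX/3XX/4XX/5XX responses and 4XX bytes per user agent."""
    classes = defaultdict(Counter)
    err_bytes = defaultdict(int)
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None:
                continue
            agent, status = m.group("agent"), m.group("status")
            classes[agent][status[0] + "XX"] += 1
            if status.startswith("4") and m.group("bytes") != "-":
                err_bytes[agent] += int(m.group("bytes"))
    return classes, err_bytes

# The 4XX % column is then classes[a]["4XX"] / sum(classes[a].values()) * 100.
```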
Let’s have a look at the most requested not-found pages. Table 4 lists
each page path and the requests per bot. With one exception, all
pages are placeholder links used in website theme templates. The repository
hugo-minimalist-theme is a Hugo theme. Within the curly braces {{ }} the
rendering engine replaces values. Evidently the bots’ HTML parsers read them
raw from the link’s a tag and request the page. ClaudeBot seems to keep track of
error pages and doesn’t query them repeatedly. OpenAI is incapable of doing
that, and stubbornly tries over and over again.
If you grep for the string href="{{ .RelPermalink }}" over the entire git
history of that repository, you find it appears 954 times up to today. It is
surprising and annoying that OpenAI manages to request it triple that many times.
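For the curious, this is roughly how such a count can be reproduced: sum the matches in the tree of every commit. A sketch using the ordinary git client through subprocess; the repository path in the example is a placeholder.

```python
import subprocess

PATTERN = 'href="{{ .RelPermalink }}"'

def count_in_history(repo):
    """Sum occurrences of PATTERN in the tree of every commit of `repo`."""
    commits = subprocess.run(
        ["git", "-C", repo, "rev-list", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    total = 0
    for commit in commits:
        # `git grep -c` prints "<commit>:<path>:<count>" per matching file
        # and exits with status 1 when there is no match at all.
        out = subprocess.run(
            ["git", "-C", repo, "grep", "-c", "-F", PATTERN, commit],
            capture_output=True, text=True,
        ).stdout
        total += sum(int(line.rsplit(":", 1)[1]) for line in out.splitlines())
    return total

# print(count_in_history("hugo-minimalist-theme"))
```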
What about the hackers? Table 5 excludes the AI bots’ idiotic
crawling and thus exposes what the rest are looking for. You instantly recognize
that they are probing the attack surface of the site. First comes producing a 400 Bad Request against the main site. Then come attempts to steal or find environment secrets in the
.env file, or the git configuration. Finally, the most common type of attack aims
to exploit the remote code execution vulnerability in PHPUnit by looking for the file
eval-stdin.
Table 5: Top 10 attacks leading to error pages, ranked by number of requests. The Agents and IPs columns count the distinct agents and IPs making the requests.
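A table like this comes from the same log by keeping only the error responses from non-AI agents and counting, per requested path, the hits plus the distinct agents and IPs behind them. A minimal sketch; the list of agents to exclude is a simplification of what I actually filtered.

```python
import re
from collections import defaultdict

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?:\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)
AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider", "PetalBot")  # excluded agents

def attack_paths(logfile):
    """Per requested path: request count, distinct agents, distinct IPs."""
    hits = defaultdict(lambda: [0, set(), set()])
    with open(logfile, errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m is None or not m.group("status").startswith("4"):
                continue
            if any(bot in m.group("agent") for bot in AI_BOTS):
                continue
            parts = m.group("request").split()
            path = parts[1] if len(parts) >= 2 else "(malformed request)"
            entry = hits[path]
            entry[0] += 1
            entry[1].add(m.group("agent"))
            entry[2].add(m.group("ip"))
    return hits

# top = sorted(attack_paths("access.log").items(), key=lambda kv: -kv[1][0])[:10]
```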
Quite a few webmasters have been annoyed by this abusive scraping by AI bots. The
project Anubis implements a proof-of-work tax on the visitors of a webpage, with
the expectation of reducing the abusive scraping by AI bots.
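The mechanism is essentially hashcash: the server hands out a challenge, the visitor burns CPU until it finds a nonce whose hash clears a difficulty threshold, and only then gets the page. A generic sketch of that idea, not Anubis’s actual protocol or parameters:

```python
import hashlib
import secrets

def pow_challenge():
    """Server side: issue a random challenge string."""
    return secrets.token_hex(16)

def solve(challenge, difficulty_bits=20):
    """Client side: search for a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge, nonce, difficulty_bits=20):
    """Server side: one hash to check work that cost the client ~2**difficulty_bits hashes."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```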
I personally dislike that idea. It does create an extra expense for the AI
companies, which indiscriminately crawl the internet, but no one really wins from
it. It is a failure of our internet ecosystem that micropayments aren’t yet a
reality. They would be the proper way to implement the tax, giving the website
operator some revenue.
For me, that means being part of the change and bringing my bitcoin lightning
tipping system back online, this time with real coins. We need to get people
used to paying for resources on the internet. For that we need a working
infrastructure; we can’t wait for the banking system to provide it. In my opinion,
the main reason our internet so aggressively invades our privacy is that
the banking system never provided a way to move money across the internet. The
only people motivated enough to endure the inconvenience of traditional payments
were advertising companies.
Knowing how stupid the AI crawlers are, I believe poisoning the training data is
a better way to hurt AI companies for such aggressive and mindless crawling than
a proof-of-work tax. Projects like Iocane provide a way to do it, and that is what
I’ll implement in the future.
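Without claiming anything about how Iocane works internally, the general shape of such poisoning is a filter in front of the real pages that recognizes AI user agents and feeds them plausible-looking nonsense instead of the repository content. A toy sketch of the idea; the agent list and word list are placeholders:

```python
import random

AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider")  # agents to poison
WORDS = "commit branch merge rebase entropy teapot nonsense gradient".split()

def babble(n_words=400, seed=None):
    """Plausible-looking filler text with no relation to the real content."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def respond(user_agent, real_page):
    """Serve the real page to humans and generated garbage to known AI scrapers."""
    if any(bot in user_agent for bot in AI_BOTS):
        return babble(seed=hash(user_agent))
    return real_page
```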