I self-host some of my git repositories to keep sovereignty and independence from large Internet corporations. The public-facing repositories are there for everybody, and today that means for robots: robots are the main consumers of my work. With the AI hype, I wanted to have a look at what those AI companies are collecting from my work. The answer is worse than "everything", it is idiotically everything. They can't even recognize that they are parsing git repositories and use the appropriate way of downloading them.

Who is visiting

I analyzed the Apache log files of my cgit service for the period from 2025-01-01 to 2025-04-20.
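The aggregation behind the tables below is nothing fancy. Here is a minimal sketch of it in Python, assuming Apache's default combined log format and a stand-in access.log path rather than my exact setup:

```python
import re
from collections import defaultdict

# Apache "combined" log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

requests = defaultdict(int)   # number of requests per user agent
tx_bytes = defaultdict(int)   # bytes sent per user agent

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        agent = m.group("agent")
        requests[agent] += 1
        if m.group("bytes") != "-":
            tx_bytes[agent] += int(m.group("bytes"))

# Rank user agents by bandwidth, as in Table 1.
for agent, tx in sorted(tx_bytes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{requests[agent]:>9}  {tx / 2**20:>8.1f} MiB  {agent}")
```

Grouping the same parsed lines by status code or by request path gives the later tables as well.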

Table 1 shows the top users of my public-facing git repository. The leading AI companies, OpenAI and Anthropic, with their respective bots GPTBot and ClaudeBot, simply dominate the load on the service. I found it unbelievable that each extracted about 7 GiB of data. That is a lot of bandwidth out of my server for a few git repositories behind a lightweight web interface.

Table 1: Top 10 users ranked by bandwidth usage (Tx). User Agent is how they identify themselves.
Requests | Tx MiB | User Agent
3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])
273968 | 721.4 | Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
80159 | 498.3 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
207771 | 475.8 | Scrapy/2.11.2 (+https://scrapy.org)
69697 | 466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
59832 | 416.4 | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot)
14142 | 83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; [email protected])
2500 | 53.7 | Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com)
3578 | 30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; Google Other)

What does it look like as a function of time? Figure 1 shows the load on CGit frontend service by each visiting agent over time. Hover over the plot to read the exact value for each agent at a given time on the legend. You can highlight a specific curve by hovering over it or its legend. You can toggle the display of a curve by clicking on its legend.

Figure 1: Load on the CGit frontend service by each visiting agent. The black dashed line shows the total requests at the server and uses the right axis scale. All other solid lines use the left axis and represent bandwidth usage.

You can see how aggressively ClaudeBot scrapes pages, using a lot of bandwidth in a short time. OpenAI-GPTBot, on the other hand, seems rate limited, because it scrapes over a longer period of time. However, as seen in Table 1, it performs more than twice the number of requests and consumes 30% more bandwidth.

The rest of the visitors are bots too. Barkrowler is a regular visitor gathering metrics for online marketing. AhrefsBot is of the same type, yet it only started crawling in March. Macintosh is certainly a bot disguising itself as a browser and constantly probing. Scrapy is also a scraper; it came at the start of the year and never came back.

PetalBot feeds a search engine with AI recommendations by Huawei; it lingers and slowly scrapes everything. Seekport is a search engine; it came all of a sudden, took as much as it found useful, less than 1% of what the big AI bots take, and swiftly left again.

Bytespider is almost background noise, but it also exists to train an LLM, this time for ByteDance, the Chinese owner of TikTok.

The last one, Google, doesn't even seem to be the bot that indexes for its search engine, but rather one that tests how its Chrome browser renders pages.

Rest is all the remaining robots and users. In aggregate they consumed around 400 MiB, which puts them in the same league as Macintosh, Scrapy, PetalBot & AhrefsBot. Mostly they are hacker bots probing the site. Which also means that ~400 MiB is roughly what it takes to crawl the site; AI crawlers siphoning over ten times that amount is abusive.

How should they visit?

CGit is a web interface for git repositories. You can browse my code, inspect a file, or view a diff in isolation; that is its use. If you want everything, the correct way to use this service is through the git client: download my publicly available software with it.

That way they get the data as the isolated code, which is a lot more useful, even for those AI companies, because data cleanup and manipulation become easier. They themselves should use their AI to recognize what kind of page they are visiting and act accordingly, instead of stupidly scraping everything.
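For illustration, here is a minimal sketch of what a better-behaved crawler could do, using dulwich, the pure-Python git client that also shows up in my logs (Table 2). The repository URL is only a placeholder:

```python
# A single "git clone" transfers the whole repository, history included,
# as one packfile, instead of thousands of scraped HTML pages.
from dulwich import porcelain

# Placeholder URL, only for illustration.
url = "https://git.example.org/hugo-minimalist-theme"
repo = porcelain.clone(url, target="hugo-minimalist-theme")

# On later visits, only the changes since the last clone need to travel.
porcelain.fetch(repo, url)
```

One clone of each repository, plus an occasional fetch, is a tiny fraction of the gigabytes the scrapers pulled through cgit.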

How have the good citizens behaved? That is shown in Table 2. Software Heritage keeps mirrors of git repositories; it watches for updates and then downloads. Other people besides them downloaded as well, but in total they all pulled only about 21 MiB. That is 0.3% of what ClaudeBot consumed.

Table 2: Git users
User Agent | Hits | Tx KiB
git/2.40.3 | 1075 | 12821.3
git/2.34.1 | 1149 | 3687.1
Software Heritage dumb Git loader | 337 | 2533.6
git/2.48.1 | 115 | 1908.6
Software Heritage cgit lister v6.9.3 (+https://www.softwareheritage.org/contact) | 8 | 21.3
Software Heritage cgit lister v6.9.2 (+https://www.softwareheritage.org/contact) | 8 | 21.0
git/dulwich/0.22.7 | 2 | 8.5
git/dulwich/0.22.6 | 1 | 4.2
Total | 2695 | 21005.6

What are they looking at?

The web front end of the git repositories, of course, but is there a pattern?

Table 3 shows the status codes of all requests performed by the users. The failure rate of OpenAI is alarming: of its 3.5 million requests, 15% are client errors (404 page not found), consuming about 2 GiB of bandwidth. What is their scraper doing so wrong? ClaudeBot, as noted earlier, manages to scrape with half the requests and an error rate of 1.6%.

Everybody else aggregates all the remaining users. They do have an error rate of 25%, but that is normal, as they are mostly hacker robots scanning for vulnerabilities. You are always under attack on the internet.

Table 3: Pages served and the HTTP status codes returned, per user agent
Agent | 2XX | 3XX | 4XX | 5XX | 4XX MiB | 4XX %
OpenAI-GPTBot | 3017848 | 0 | 554511 | 121 | 2060.23 | 15.52
Everybody else | 99066 | 467 | 34630 | 14 | 101.96 | 25.81
ClaudeBot | 1591179 | 26 | 25611 | 446 | 162.67 | 1.58
Barkrowler | 272343 | 0 | 1618 | 7 | 5.35 | 0.59
Macintosh | 79071 | 2 | 1086 | 0 | 7.87 | 1.35
Bytespider | 13609 | 0 | 531 | 2 | 3.94 | 3.75
PetalBot | 69223 | 0 | 473 | 1 | 3.2 | 0.68
Scrapy | 207240 | 0 | 348 | 183 | 1.14 | 0.17
AhrefsBot | 59733 | 0 | 90 | 9 | 0.61 | 0.15
Google | 3576 | 0 | 2 | 0 | 0.02 | 0.06
SeekportBot | 2500 | 0 | 0 | 0 | 0.0 | 0.0

Let's have a look at the most requested not-found pages. Table 4 lists each page path and the number of requests per bot. With one exception, all pages are placeholder links used in website theme templates. The repository hugo-minimalist-theme is a Hugo theme. Within the curly braces {{ }} the rendering engine substitutes values. Evidently the bots' HTML parsers read them raw from the link's a tag and request the page verbatim. ClaudeBot seems to keep track of error pages and doesn't query them repeatedly. OpenAI is incapable of doing that, and stubbornly tries over and over again.

If you grep for the string href="{{ .RelPermalink }}" over the entire git history of that repository, it appears 954 times up to today. It is surprising and annoying how OpenAI managed to request it roughly triple that amount.
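For reference, a count like that can be reproduced by walking every commit with the git command line; the following is a sketch, not the exact invocation I used, and it has to run inside a clone of the repository:

```python
# Count occurrences of the placeholder link across every commit of the
# hugo-minimalist-theme repository.
import subprocess

PATTERN = 'href="{{ .RelPermalink }}"'

# Every commit reachable from any ref.
revs = subprocess.run(
    ["git", "rev-list", "--all"],
    capture_output=True, text=True, check=True,
).stdout.split()

total = 0
for rev in revs:
    # "git grep -c" prints "<rev>:<file>:<count>" per matching file and
    # exits with status 1 when nothing matches, hence no check=True here.
    out = subprocess.run(
        ["git", "grep", "--fixed-strings", "-c", PATTERN, rev],
        capture_output=True, text=True,
    ).stdout
    total += sum(int(line.rsplit(":", 1)[1]) for line in out.splitlines())

print(total)
```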

Table 4: Top 10 pages returning 404 Not Found.
Page | OpenAI | ClaudeBot | Rest
/hugo-minimalist-theme/plain/layouts/partials/{{ .RelPermalink }} | 2805 | 3 | 7
/hugo-minimalist-theme/plain/layouts/partials/{{ .URL }} | 1629 | 1 | 13
/hugo-minimalist-theme/plain/layouts/partials/{{ . }} | 1559 | 1 | 4
/hugo-minimalist-theme/plain/layouts/partials/{{ $href }} | 1209 | 4 | 5
/hugo-minimalist-theme/plain/layouts/partials/{{ .Permalink }} | 1060 | 2 | 15
/hugo-minimalist-theme/plain/layouts/partials/{{ $pag.Next.URL }} | 916 | 1 | 7
/hugo-minimalist-theme/plain/layouts/partials/{{ $pag.Prev.URL }} | 912 | 0 | 7
/hugo-minimalist-theme/plain/layouts/partials/{{ if ne .MediaType.SubType | 817 | 1 | 0
/hugo-minimalist-theme/plain/layouts/{{ .Type }} | 798 | 5 | 4
/hugo-minimalist-theme/plain/layouts/taxonomy/{{ .Name | urlize }} | 745 | 5 | 0

What about the hackers? Table 5 excludes the AI bots' idiotic crawling and thus exposes what the rest are looking for. You instantly recognize that they are probing the attack surface of the site. First come requests producing a 400 Bad Request against the main site. Then attempts to steal or find environment secrets in the .env file, or the git configuration. The most common type of attack tries to exploit the remote code execution in PHPUnit by looking for the file eval-stdin.php.

Table 5: Top 10 attacks leading to error pages, ranked by number of requests. The Agents and IPs columns count the distinct agents and IPs making the requests.
Requests | Agents | IPs | Errors | Methods | Path
3482 | 61 | 139 | 400,421,408 | GET,POST | /
744 | 368 | 256 | 404 | GET,POST | /.env
409 | 1 | 11 | 404 | GET | cgi-bin/luci;stok=/locale
381 | 182 | 121 | 404 | GET | /.git/config
222 | 12 | 167 | 404 | GET,POST | /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php
195 | 1 | 1 | 404 | GET | /actuator/gateway/routes
173 | 2 | 137 | 404 | POST | /hello.world?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input
157 | 2 | 127 | 404 | GET | /vendor/phpunit/phpunit/Util/PHP/eval-stdin.php
152 | 2 | 123 | 404 | GET | /vendor/phpunit/src/Util/PHP/eval-stdin.php
148 | 2 | 119 | 404 | GET | /vendor/phpunit/phpunit/LICENSE/eval-stdin.php

Future plans

Quite a few webmasters have been annoyed by this abusive scraping by AI bots. The Anubis project implements a proof-of-work tax on visitors to a web page, with the expectation of reducing the abusive AI bot scraping.
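The principle is simple: the server hands the visitor a challenge, and the visitor has to burn CPU on hashes until it finds a nonce below a difficulty target before the page is served. A toy sketch of that idea, not Anubis's actual code:

```python
# Toy proof-of-work: the visitor must find a nonce whose SHA-256, together
# with the server's challenge, falls below a difficulty target.  This only
# illustrates the concept, it is not Anubis's implementation.
import hashlib
import itertools

def solve(challenge: bytes, difficulty_bits: int = 20) -> int:
    """The work the visitor burns CPU on before getting the page."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int = 20) -> bool:
    """The cheap check the server does on the submitted nonce."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

challenge = b"random-per-visitor-token"
nonce = solve(challenge)
assert verify(challenge, nonce)
```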

I personally dislike that idea. It does create an extra expense for the AI companies, which indiscriminately crawl the internet, but nobody really wins from it. It is a failure of our internet ecosystem that micropayments aren't yet a reality. That would be the proper way to implement the tax, giving the website operator some revenue.

For me that means being part of the change and bringing my bitcoin lightning tipping system back online, this time with real coins. We need to get people used to paying for resources on the internet, and for that we need a working infrastructure; we can't wait for the banking system to provide it. In my opinion, the main reason why our internet so aggressively invades our privacy is that the banking system never provided a way to move money across the internet. The only people motivated enough to endure the inconvenience of traditional payments were advertising companies.

Knowing how stupid the AI crawlers are, I believe poisoning the training data is a better way to hurt AI companies for such aggressive and mindless crawling than a proof-of-work tax. Projects like Iocane provide a way to do it, and that is what I'll implement in the future.
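The rough idea, sketched below with made-up helper names and without any of Iocane's actual machinery: recognize the AI user agents and hand them endless generated nonsense instead of the real pages.

```python
# Toy version of the poisoning idea: known AI crawlers get an endless supply
# of generated nonsense instead of the real repository pages.  The agent
# list, word list and helpers are invented for this sketch.
import random

AI_AGENTS = ("GPTBot", "ClaudeBot", "Bytespider")
WORDS = ["commit", "branch", "merge", "rebase", "conflict", "tag", "blob"]

def is_ai_crawler(user_agent: str) -> bool:
    return any(bot in user_agent for bot in AI_AGENTS)

def garbage_page(seed: int, sentences: int = 200) -> str:
    """Nonsense derived from the URL, so repeated visits look consistent."""
    rng = random.Random(seed)
    return " ".join(
        " ".join(rng.choices(WORDS, k=rng.randint(5, 12))).capitalize() + "."
        for _ in range(sentences)
    )

def respond(path: str, user_agent: str, real_page: str) -> str:
    # hash() is fine here: the garbage only has to be plausible, not stable
    # across server restarts.
    return garbage_page(seed=hash(path)) if is_ai_crawler(user_agent) else real_page
```

Unlike a proof-of-work tax, the cost here lands where I want it: in the quality of the data these crawlers take home.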