Which provider is immune to this? GitLab? Bitbucket?
Or is it better to self-host?
Self-hosted GitLab with a self-hosted LLM provider connected to GitLab, powering GitLab Duo. This should ensure that the data never leaves your network and is never used as training data, while still allowing you and your staff to use LLMs. If you don’t want to self-host an LLM, you could use something like Amazon Q, but then you’re trusting Amazon to do right by you.
https://docs.gitlab.com/administration/gitlab_duo_self_hoste...
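As a rough sketch of what that setup looks like: you serve an open-weights model inside your own network (vLLM is one of the serving options the GitLab Duo Self-Hosted docs mention), then point GitLab’s self-hosted model configuration at that in-network endpoint. The model name, host, and port below are illustrative assumptions, not a verified config; check the linked docs for the models and settings your GitLab version actually supports.

```shell
# Illustrative sketch only: serve a model in-network with vLLM's
# OpenAI-compatible server. Model, host, and port are placeholders.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 --port 8000

# GitLab's Duo Self-Hosted admin settings are then pointed at the
# in-network endpoint (e.g. http://llm.internal:8000/v1), so prompts
# and completions never leave your network.
```

The point of the exercise is the network boundary: every hop (GitLab, the AI gateway, the model server) runs on hardware you control, so there is no third party whose training-data policy you have to take on faith.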
GitHub won’t use private repos for training data. You’d have to believe that they were lying about their policies and coordinating a lot of engineers into a conspiracy in which not a single one of them would blow the whistle.
Copilot won’t send your data down a path that incorporates it into training data. Not unless you do something like Bring Your Own Key and then point it at one of the “free” public APIs that are only free because they use your inputs as training data. (EDIT: Or if you explicitly opt-in to the option to include your data in their training set, as pointed out below, though this shouldn’t be surprising)
It’s somewhere between myth and conspiracy theory that using Copilot, Claude, ChatGPT, etc. subscriptions will take your data and put it into their training set.
“GitHub Copilot for Individual users, however, can opt in and explicitly provide consent for their code to be used as training data. User engagement data is used to improve the performance of the Copilot Service; specifically, it’s used to fine-tune ranking, sort algorithms, and craft prompts.”
- https://github.blog/news-insights/policy-news-and-insights/h...
So it’s a “myth” that GitHub explicitly says is true…
> can opt in and explicitly provide consent for their code to be used as training data.
I guess if you count users explicitly opting in, then that part is true.
I also covered the case above where someone opts in to a “free” LLM provider that uses prompts as training data.
There are definitely ways to get your private data into training sets if you opt in to it, but that shouldn’t surprise anyone.
You say in another comment that “It would involve thousands or tens of thousands of engineers to execute. All of them would have to keep the conspiracy quiet.” Yet if the pathway exists, there is ample opportunity for un-opted-in data to take that pathway, with plausible deniability of “whoops, that’s a bug!” No need for thousands of engineers to be involved.
Or instead of a big conspiracy, maybe this code which was written for a client was later used by someone at the client who triggered the pathway volunteering the code for training?
Or the more likely explanation: That this vague internet anecdote from an anonymous person is talking about some simple and obvious code snippets that anyone or any LLM would have generated in the same function?
I think people like arguing conspiracy theories because you can jump through enough hoops to claim that it might be possible if enough of the right people coordinated to pull something off and keep it secret from everyone else.
My point is less “it’s all a big conspiracy” and more that this can fall into Hanlon’s razor territory. All it takes is not actually giving a shit about un-opted-in code leaking into the training set for this to happen.
The existence of the AI-generated Studio Ghibli meme proves AI models were trained on copyrighted data. Yet nobody’s been fired or sued. If nobody cares about that, why would anybody care about some random nobody’s code?
https://www.forbes.com/sites/torconstantino/2025/05/06/the-s...
Companies lie all the time; I don’t know why you have such faith in them.
Anonymous Internet comment section stories are confused and/or lie a lot, too. I’m not sure why you have so much faith in them.
Also, this conspiracy requires coordination across two separate companies (GitHub for the repos and the LLM providers requesting private repos to integrate into training data). It would involve thousands or tens of thousands of engineers to execute. All of them would have to keep the conspiracy quiet.
It would also permanently taint their frontier models, opening them up to millions of lawsuits (across all GitHub users) and making them untouchable in the future, guaranteeing their demise as soon as a single person involved decided to leak what was happening.
I know some people will never trust any corporation for anything and assume the worst, but this is the type of conspiracy that requires a lot of people from multiple companies to implement and keep quiet. It also has very low payoff for company-destroying levels of risk.
So if you don’t trust any companies (or you make decisions based on vague HN anecdotes claiming conspiracy theories) then I guess the only acceptable provider is to self-host on your own hardware.
Another thing that would permanently taint models and open their creators to lawsuits is if they were trained on many terabytes’ worth of pirated ebooks. Yet that didn’t seem to stop Meta with Llama [0]. This industry is rife with such cases; OpenAI’s CTO famously could not answer a simple question about whether Sora was trained on YouTube data. And now it seems models might be trained on video game content [1], which opens up another lawsuit avenue.
The key question from the perspective of the company is not whether there will be lawsuits, but whether the company will get away with it. And so far, the answer seems to be: "yes".
The only likely exception is private repos owned by enterprise customers. It’s unlikely that GitHub would train LLMs on those, as the customers might walk away if they found out. And Fortune 500 companies have far more legal resources to sue with than random internet activists. But if you are not a paying customer, well, the cliché is that you are the product.
[0]: https://cybernews.com/tech/meta-leeched-82-terabytes-of-pira... [1]: https://techcrunch.com/2024/12/11/it-sure-looks-like-openai-...
With the current admin I don't think they really have any legal exposure here. If they ever do get caught, it's easy enough to just issue some flimsy excuse about ACLs being "accidentally" omitted and then maybe they stop doing it for a little while.
This is going to be the same disruption as Airbnb or Uber. Move fast and break things. Why would you expect otherwise?
I really don't see how tens of thousands of engineers would be required.
I work for <company>; we lie. In fact, many of us in our industry lie, to each other, but most importantly to regulators. I lie for them because I get paid to. I recommend you vote for any representative that is hostile towards the marketing industry.
And companies are conspirators by nature: plenty of large movie/game production companies manage to keep pretty quiet about game details and release dates (and they often don’t even pay well!).
I genuinely don’t understand why you would “trust” a corporation at all, especially when it relates to them forgoing revenue or market share they could otherwise capture.