“GitHub Copilot for Individual users, however, can opt in and explicitly provide consent for their code to be used as training data. User engagement data is used to improve the performance of the Copilot Service; specifically, it’s used to fine-tune ranking, sort algorithms, and craft prompts.”
- https://github.blog/news-insights/policy-news-and-insights/h...
So it’s a “myth” that GitHub explicitly says is true…
> can opt in and explicitly provide consent for their code to be used as training data.
I guess if you count users explicitly opting in, then that part is true.
I also covered, above, the case where someone opts in to a “free” LLM provider that uses prompts as training data.
There are definitely ways to get your private data into training sets if you opt in, but that shouldn’t surprise anyone.
In another comment you say “It would involve thousands or tens of thousands of engineers to execute. All of them would have to keep the conspiracy quiet.” Yet if the pathway exists, there is ample opportunity for un-opted-in data to take that pathway, with the plausible deniability of “whoops, that’s a bug!” No thousands of engineers required.
Or, instead of a big conspiracy, maybe this code written for a client was later used by someone at the client who triggered the pathway and volunteered the code for training?
Or the more likely explanation: this vague internet anecdote from an anonymous person is about some simple, obvious code snippets that any person or any LLM would have generated in the same situation.
I think people like arguing conspiracy theories because you can jump through enough hoops to claim something might be possible if enough of the right people coordinated to pull it off and kept it secret from everyone else.
My point is less “it’s all a big conspiracy” and more that this can fall into Hanlon’s razor territory. All it takes for this to happen is nobody actually giving a shit about un-opted-in code leaking into the training set.
The existence of AI-generated Studio Ghibli memes proves AI models were trained on copyrighted data. Yet nobody’s been fired or sued. If nobody cares about that, why would anyone care about some random nobody’s code?
https://www.forbes.com/sites/torconstantino/2025/05/06/the-s...