Comments Page - When you're asking AI chatbots for answers, they're data-mining you

When you're asking AI chatbots for answers, they're data-mining you (theregister.com)
Submitted by rntn 4 hours ago

roscas 4 hours ago
Always good to remember people of this.
But not just AI bots or interfaces. Everything is saved and never deleted.
Remember Facebook? "We will never delete anything" that is their business.
So anything that you put on those "services" is gone out of your hands. But we still have an option: stop using these ad companies and let them die.
Back to AI, there are loads of offline models we can use. Many like Ollama that will even download it. Install Ollama, on the ollama site find a model name and "ollama run model-name" and you can use it.
OK, it is not as good as ChatGPT 5, but it can help you so much that you might not even need ChatGPT.
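For anyone who would rather script this than type into the CLI, here is a minimal sketch. It assumes Ollama is already running locally and exposing its HTTP API on the default port 11434, and that "model-name" is replaced by a model you have already pulled. The helper names build_request and ask_local_model are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Calling ask_local_model("model-name", "your question") should then return the reply as a string, with the request going only to localhost.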
- Phemist 3 hours ago
  Indeed, and asking Facebook to delete the data or to not use it for AI training is just another data point indicating you care about it. Your preferences will eventually be stripped through redesigns, refactors, careless usage or Facebook's crooked idea of consent. The data will remain and be used again.
  lowwave 2 hours ago
  It is better to NOT delete facebook, but spam your profile with other data and just leave it.
  Phemist an hour ago
  Maybe, but that depends on Facebook's ability to filter that data. The filtering should be easy for my inactive-for-10-years FB account that suddenly uploads a bunch of garbage data. Mixing in genuine data seems antithetical, especially considering the garbage may be filtered out.
  actionfromafar an hour ago
  And/or change friends to random spam accounts first, then unfriend your real friends.
- dylan604 31 minutes ago
  Facebook doesn't just get data from direct input from users though. So if people stop using FB, that's a good first step, but it does not stop the firehose of data.
- Sophira an hour ago
  There are also things like Oobabooga's text-generation-webui[0] which can present a similar interface to ChatGPT for local models.
  I've had great success in running Qwen3-8B-GGUF[1] on my RTX 2070 SUPER (8GB VRAM) using Oobabooga (everyone just calls it via the author's name, it's much catchier) so this is definitely doable on consumer hardware. Specifically, I run the Q4_K_M model as Oobabooga loads all of its layers into the GPU by default, making it nice and snappy. (Testing has shown that I can actually load up to the Q6_K model before some layers have to be loaded into the CPU, but I have to manually specify that all those layers should be loaded into the GPU, as opposed to leaving it auto-determined.)
  It does obviously hallucinate more often than ChatGPT does, so care should be taken. That said, it's really nice to have something local.
  There's a subreddit for running text gen models locally that people might be interested in: https://www.reddit.com/r/LocalLlama
  [0] https://github.com/oobabooga/text-generation-webui
  [1] https://huggingface.co/Qwen/Qwen3-8B-GGUF
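  The arithmetic behind "which quant fits in 8GB" can be sketched roughly as below. Everything numeric here is an assumption rather than a measurement from the setup described above: the bits-per-weight values are approximate figures commonly quoted for llama.cpp quant types, the parameter count for Qwen3-8B is taken as about 8.2 billion, and the 1 GiB allowance for context cache and buffers is a placeholder.

```python
# Rough size estimate for quantized GGUF weights.
# Assumed, approximate bits-per-weight for some llama.cpp quant types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}

# Assumed parameter count for an "8B" model.
N_PARAMS = 8.2e9


def weights_gib(n_params: float, quant: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


def fits(n_params: float, quant: str, vram_gib: float,
         overhead_gib: float = 1.0) -> bool:
    """True if weights plus a placeholder overhead fit in the given VRAM."""
    return weights_gib(n_params, quant) + overhead_gib <= vram_gib


for quant in BITS_PER_WEIGHT:
    size = weights_gib(N_PARAMS, quant)
    print(f"{quant}: ~{size:.1f} GiB, fits in 8 GiB: {fits(N_PARAMS, quant, 8.0)}")
```

  Under these assumptions Q4_K_M and Q6_K fit on an 8 GiB card and Q8_0 does not, which is at least consistent with the experience above. Real usage also depends on context length, so treat it as a ballpark only.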
- throwaway29246 39 minutes ago
  > Back to AI, there are loads of offline models we can use. Many like Ollama that will even download it. Install Ollama, on the ollama site find a model name and "ollama run model-name" and you can use it.
  A privilege that is limited to the top 1%. It may come as a surprise, but most people don't have 32GB of VRAM [0]. The rest of us with normal people hardware are stuck with AI cloud providers or good old searching, which is a lot harder now that those same AI providers have ruined search results.
  [0] There are some lightweight models you can run on normal people hardware, but they are just too unreliable even for casual usage and are likely to waste more of your time than they save.
- lm28469 3 hours ago
  That's why you should use multiple accounts and bullshit about 30% of what you post. LLMs are a godsend for that; they poison their own well.
  SoftTalker 3 hours ago
  I assume that companies like Facebook know pretty well which accounts are really the same person. Even if you are careful about keeping cookies in separate browser profiles, your machine can be fingerprinted, your posting habits and writing style can be fingerprinted, and Facebook/Google have the resources to do it.
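  To give a feel for how writing style can be fingerprinted, here is a toy sketch that compares two texts by their character-trigram frequencies. This is only an illustration of the general idea; nothing here is known to be what Facebook or Google actually run, and the function names are invented.

```python
import math
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams after normalising case and spacing."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + 3] for i in range(len(normalised) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

  Scores closer to 1.0 mean the two texts use similar character sequences. A real system would presumably combine many more signals, such as posting times and device fingerprints, than this toy does.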
  mgh2 2 hours ago
  The risk is the externalities to actual users who don't know the difference and get affected by your 30% BS.
- BolexNOLA 2 hours ago
  I recently set up LM Studio and have run OpenAI's 20B model locally using an AMD 9070 + 9800X3D. I honestly assumed it would be way more work than it was to set it up. It has limitations, but given it took me all of 5 minutes and I can easily attach docs for it to reference, as it all runs locally, it's fantastic. I've got a Claude model I've been messing with too.
- notpushkin 3 hours ago
  > Always good to remember people of this.
  You mean “remind”?
glitchc 2 hours ago
Everyone knows this. Every layperson I talk to is aware that these companies are siphoning their information. When free email was introduced over two decades ago, the behaviour was the same. Everyone knew Microsoft and Google could read your emails. Then, like now, people think it's worth it. It is too useful a tool to have and the price is palatable.
What people don't want to do is sign up for yet another subscription. There's immense subscription fatigue among the general population, especially in tough economic times such as now.
- rafark 26 minutes ago
  Agreed. Not only do I think it’s worth it, I actually like that I can contribute. I’m getting so much good value for free I think it’s fair. It’s a win-win situation. The AIs get better and I get better answers.
avmich 21 minutes ago
I have an issue with the "stupidity" suggestion. Clicking "Agree" without full analysis is a tried-and-true Internet tradition; it's sad that somebody assumes it's serious and attempts to use it. We should have legal protections against wringing quasi-agreements from customers and then using those agreements against them.
smjburton 2 hours ago
> The more data you give any of the AI services, the more that information can potentially be used against you.
It may seem obvious, but Sam Altman also recently emphasized that the information you share with ChatGPT is not confidential, and could potentially be used against you in court. [1][2]
[1] https://www.pcmag.com/news/altman-your-chatgpt-conversations...
[2] https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no...
- Jalad 2 hours ago
  This is always true though. Any data that a cloud company has on you can be subpoenaed.
  It would be weird for him not to be transparent about that.
ceroxylon 2 hours ago
What about the people who did not opt to share or index their chats, and the companies that claim to not train on user chats?
https://privacy.anthropic.com/en/articles/10023555-how-do-yo...
> We do not actively set out to collect personal data to train our models
The 'snarky tech guy' tone of the article is a bit like nails on a chalkboard.
- hazKu4 2 hours ago
  (At least to me) that language doesn’t feel particularly reassuring… especially given the duplicitous nature of data collection - i.e. “we don’t sell your data” translates to “we create a sophisticated advertising profile about you, and monetize that”
  boesboes 2 hours ago
  That line is about data they find on the internet. Soooo, completely not relevant.
Kim_Bruning 3 hours ago
Earlier discussion on the "ChatGPT chats in google" angle:
https://news.ycombinator.com/item?id=44778764
Interesting how much traction
```
     "[x] Make this chat discoverable (allows it to be shown in web searches)" 
```
gets in news articles.
People don't seem to have the same intuition for the web that they used to!
falcor84 3 hours ago
> So, kids, let's not be asking any AI chatbot whether you should divorce your husband, how to cheat on your taxes, or if you should try to get your boss fired. That information will be kept, it may be revealed in a security breach, and, if so, it will come back to bite you in the buns.
Just as a PSA - there's nothing unique to AIs here - whenever you ask a question of anyone, in any way, they then have the memory of you having asked it. A lot of sitcoms and comedic plays build their plot on a question that a person voiced eventually reaching (either accurately or inaccurately) the person they were hiding it from.
And as someone who's into spy stories, I know that a big part of tradecraft is formulating your questions in a way that divulges the least about your actual intentions and current information.
If anything, LLM-driven AIs are the first technology that in principle allows you to ask a complex question that would be immediately forgotten. The thing is that you need to be running the AI yourself; if you ask an AI controlled by another entity, then you're trusting that entity with your question, regardless of whether there's an AI on the way.
- frakt0x90 2 hours ago
  Books are also a technology that allows you to answer complex questions without recording the question.
  Jalad 2 hours ago
  Not necessarily though, it depends on where you got the book from (Amazon, the library?), and what your question is
  shadowgovt 18 minutes ago
  In general, libraries actually do go out of their way to minimize the ways circulation history can be used against card-holders.
  This isn't airtight, but it's a point of principle for most libraries and librarians and they've gone to the mat over this. https://www.newtactics.org/tactics/protecting-right-privacy-...
- y0eswddl 2 hours ago
  The questions and info you ask friends don't end up in a massive data profile on you stored in somebody's cloud to be used for future manipulation/marketing/profiling...
nachox999 an hour ago
We need a tool that creates random fake data for the data-mining web apps.
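A very rough sketch of what the core of such a tool might look like: it only generates plausible-looking decoy queries from small word lists and does not send them anywhere. The topic and template lists and the function name are placeholders invented for this example. As noted upthread, a provider may well be able to filter out noise like this.

```python
import random

# Placeholder word lists; a real tool would need far more variety.
TOPICS = ["gardening", "chess openings", "sourdough", "trail running", "birdwatching"]
TEMPLATES = [
    "how to get started with {}",
    "best books about {}",
    "{} for beginners",
    "common mistakes in {}",
]


def decoy_queries(n: int, seed=None) -> list:
    """Return n random decoy queries; pass a seed for reproducible output."""
    rng = random.Random(seed)
    return [rng.choice(TEMPLATES).format(rng.choice(TOPICS)) for _ in range(n)]
```

Using a private random.Random instance keeps the generator from disturbing any other random state in the program.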
thisisit an hour ago
If you ask a layperson the answer is - "Yes, and?". If it's free, very few people care. Sure you can run a local instance and yes, it might be as simple as downloading Ollama but not many will do it or even have a powerful enough computer to run it.
Worse yet, you might individually make a choice to do that but others might not care. They might share emails/chats with you to a chatbot to parse them or "make it think like them", and then the chatbot has info about you. So, as much as I understand this sentiment, this seems like a losing battle.
- dialup_sounds an hour ago
  Why should they care?
shadowgovt 24 minutes ago
This is also true of search engines, social media, and various other interactive systems. Google's initial search-algorithm breakthrough was the realization that they had a massive source of data for search result correctness in the form of the behavior of users querying their site.
In general, it's wise to assume that all web interactions are a two-way street between the user and the service provider.
nottorp 2 hours ago
> "How to Use a Microwave Without Summoning Satan,"
Oh, nice idea. We should all ask that.
- mystraline 2 hours ago
  Wait, you can summon Satan with a microwave?!
  Lemee ask ShatGPT how to do that!
unethical_ban 2 hours ago
Duck.ai claims to anonymize AI chats and says its conversations are not used for training. It is my go-to for casual usage.
Otherwise, I use local models for complex or potentially controversial questions.
boesboes 2 hours ago
What a terrible, utter bullshit article. Full of half truths and fear mongering. smh.
- AlexandrB an hour ago
  > fear mongering
  The last 10 years of tech "innovation" is basically what the article describes but happening to other tech products[1]. So, why is this fear mongering? It's basically inevitable unless:
  a. There's legislation. But I would bet on legislation for the opposite - storing chats forever - instead.
  b. AI moves to on-device where users have control of their own data. Also unlikely considering how much tech loves web technologies and recurring revenue streams.
  [1] https://www.cam.ac.uk/research/news/menstrual-tracking-app-d...
- actionfromafar an hour ago
  All hail centralized cloud services?
andrepd 3 hours ago
What can you do online these days without being data mined? Browsing gemini?
- y0eswddl 2 hours ago
  Start w/
  https://ssd.eff.org
  https://privacyguides.net
- em3rgent0rdr 3 hours ago
  Download stuff in bulk (for instance the entire Wikipedia torrent) and then peruse it on your own computer.
  Squeeeez 2 hours ago
  Only if you are not using an OS which has something like Windows Recall enabled, or that weird StarDict setup with automatic online lookup on select, which came up recently.
  I wonder how far back this has been going on. Did ICQ, IRC server hosters, BBSes do similar things?
  reactordev 2 hours ago
  No, back then storage was at a premium, so everything aside from config, accounts, and billing was ephemeral. It really wasn’t until Cloud came along that storage got cheap enough that you could keep everything. About the time of the social media boom.
  It wasn’t until around 2014 that I stopped building routes that did:
  DELETE FROM <table> WHERE id = ?;  -- with ON DELETE CASCADE declared on the child tables' foreign keys
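  For readers who haven't seen the pattern, here is a minimal sketch of such a hard delete using Python's built-in sqlite3 module. The users/posts schema and the delete_user helper are hypothetical. Note that ON DELETE CASCADE is normally declared on the foreign key rather than written in the DELETE statement itself, so the sketch puts it in the child table's definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and cascades) when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        body TEXT
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'hello'), (2, 1, 'again'), (3, 2, 'hi');
""")


def delete_user(db: sqlite3.Connection, user_id: int) -> None:
    """Hard delete: removes the user row and, via the cascade, their posts."""
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM users WHERE id = ?", (user_id,))


delete_user(conn, 1)
remaining = conn.execute("SELECT id, user_id FROM posts").fetchall()
print(remaining)  # only bob's post is left
```

  After the delete nothing of user 1 remains in either table, which is the opposite of the keep-everything approach that cheap storage made possible.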
panny 3 hours ago
I would expect this, but it doesn't seem to be the case.
If I ask search.brave.com to give me a list of Gini coefficients for the top ten countries by GDP, it can't do it. If I tell it the data is available in the CIA World Factbook, it can then spit that info out promptly. However, if I close the context and ask again, it hasn't learned this information and once again is unable to provide the list.
It didn't datamine me. It had no better idea where to find this information the second time I asked. This is the experience others have stated with other AIs as well. It does not seem special to Brave.
- Etheryte 3 hours ago
  Data mining doesn't mean the model is instantly updated; that would be prohibitively expensive at scale. It's way easier to batch your data together with a bunch of other data and use it later on. Even then, the model won't necessarily know where to find the information, since models are not one-to-one with their inputs, again because of size and cost.
  panny 3 hours ago
  >Data mining doesn't mean the model is instantly updated
  I'm not expecting instant. Even next week it won't be there. It's like how AI never learned to count how many times the letter r appears in strawberry. Like sure, now if you ask brave, it will tell you three, but that is only because that question went viral. It didn't "learn" anything, it was just hard coded for that particular answer. Ask it how many times the letter l appears in smallville and it will get it wrong again.
  t0md4n 3 hours ago
  It wouldn’t be instant, next week or even next month. Pre-training doesn’t happen that frequently and varies between each model provider. As for the strawberry test, this is a tokenization issue that is fundamental to LLMs; however, most models can now solve this type of question using thinking/code/tools to count the letters.
  https://imgur.com/a/NqIJEx6
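  The "code/tools" route amounts to the model writing and running something like the following instead of guessing from tokens. This is a sketch of the idea, not any provider's actual tool; count_letter is a made-up helper name.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3
print(count_letter("smallville", "l"))  # 4
```

  Plain code operates on characters rather than tokens, which is why it gets this right every time.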
  Etheryte 2 hours ago
  Both OpenAI and Anthropic average roughly one flagship release a year, and these are some of the best funded companies in the space. The bigger your model, the more expensive it is to train, so you want to do it as rarely as reasonably possible. Every other company will either work with smaller models and/or train even more rarely, aside from fine-tunes and customizations they put on top.
  simgt 3 hours ago
  I didn't think for a second you could be right, so I tried with Claude. L in smallville was correct, then it suggests it'd have gotten l in parallel wrong by answering 3 instead of 2 (but gets it right in a new chat). Then it suggests it'd get n in millennium wrong by giving the right answer, and gets it wrong in a new chat. https://claude.ai/share/93b46c3b-23a7-40ad-8a2b-ec2ed6c34a19
  Thanks, that was enlightening.
  ordersofmag 3 hours ago
  LLMs aren't retrained and released on a weekly time-scale. The data mining may only be reflected in the training of the next generation of the model.
  qwertytyyuu 3 hours ago
  Every week is still way too expensive to do at scale; at best they'll update training data with each model iteration.
- add-sub-mul-div 3 hours ago
  Brave isn't data mining you for your benefit, they're doing it for their benefit.