
AI chatbots are giving out people’s real phone numbers

New reports suggest AI chatbots are surfacing private phone numbers and contact data, raising urgent questions about training data and digital privacy.

By Pulse AI Editorial · 3 min read
AI-Assisted Editorial

This article is original editorial commentary written with AI assistance, based on publicly available reporting by MIT Technology Review. It is reviewed for accuracy and clarity before publication. See the original source linked below.

Recent reports that Google’s AI is surfacing private contact information, leaving private individuals fielding dozens of cold calls, mark a troubling transition in the generative AI era. The “hallucination” problem has long been a quirk of large language models (LLMs), but the shift from harmless factual errors to the dissemination of sensitive, real-world personal data represents a significant escalation. Users have reported being inundated with calls from strangers who believe they are reaching professional services, such as law firms or designers, all because an AI chatbot confidently provided their personal mobile number as a business contact.

This phenomenon is rooted in the “scraping” era of the early 2020s, when companies like Google, OpenAI, and Meta ingested vast swaths of the public internet to train their models. During this gold rush for data, the distinction between a business listing and a private individual’s digital footprint often blurred. Traditional search engines provide links to sources, allowing users to verify the context of a phone number; generative AI synthesizes that same data into a definitive, conversational answer. This lack of attribution strips away the surrounding context that might have signaled to a human reader that a phone number belonged to a private residence rather than a corporate office.

The mechanics of this failure lie in the associative nature of neural networks. LLMs do not “know” facts; they predict the next most likely token in a sequence based on their training data. If a specific phone number appeared in a data dump near keywords like “legal services” or “product design,” the model may have solidified that association. Because these models are black boxes, there is no simple “find and replace” function for developers to scrub a specific person’s data once the model has been trained. Standard fine-tuning or RLHF (Reinforcement Learning from Human Feedback) can nudge a model toward safer behavior, but it cannot guarantee the complete excision of a specific data point buried deep within billions of parameters. The toy sketch below illustrates how such an association can form.
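
To make the association concrete, here is a deliberately simplified, hypothetical Python sketch. The corpus, the phone number, and the co-occurrence model are invented for illustration and bear no relation to any real system: a plain co-occurrence table built from scraped snippets will happily "learn" that a private number is the strongest match for a business-sounding query.

```python
from collections import Counter, defaultdict

# Hypothetical scraped snippets: the number is a private mobile,
# but the surrounding words make it look like a business listing.
scraped_corpus = [
    "affordable legal services call 555-0142 for a consultation",
    "product design portfolio contact 555-0142",
    "family photos from the lake house",
]

# Count which tokens appear in the same snippet, a crude stand-in for
# the statistical associations a language model picks up at scale.
cooccurrence = defaultdict(Counter)
for snippet in scraped_corpus:
    tokens = snippet.split()
    for i, token in enumerate(tokens):
        for other in tokens[:i] + tokens[i + 1:]:
            cooccurrence[token][other] += 1

# A business-style query now points straight at the private number.
query = ["legal", "services"]
scores = Counter()
for word in query:
    scores.update(cooccurrence[word])

print(scores.most_common(5))
# The private number scores as highly as any genuinely business-related
# token, because nothing in the data marked it as personal.
```

A real transformer encodes these regularities in billions of weights rather than an explicit table, which is precisely why the association cannot simply be looked up and deleted after training.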

The industry implications of these privacy breaches are profound, particularly as global regulators move closer to enforcing "the right to be forgotten." Under frameworks like Europe’s GDPR or California’s CCPA, individuals have the right to demand the removal of their personal data. However, the architecture of current transformer models suggests that complete removal might be technically impossible without retraining a model from scratch—a process that costs tens of millions of dollars. This creates a looming legal collision between the technical rigidity of AI and the fluid requirements of international privacy law.

Furthermore, this development erodes the primary value proposition of "AI Overviews" and chatbot search: trust. If a user receives a wrong answer about a movie’s release date, the cost is minor. If an AI redirects a desperate legal client to a private citizen’s kitchen table, the cost is the violation of domestic peace and potential professional liability. For the tech giants, the risk is not just a PR crisis, but a series of class-action lawsuits centered on "algorithmic negligence." The defense of "we don't know why it said that" is becoming increasingly untenable as these tools are integrated into critical infrastructure.

As we look ahead, the focus will shift toward more robust "unlearning" techniques and stricter data-cleansing protocols. Researchers are currently investigating "machine unlearning," a method to remove the influence of specific training data without a full retraining cycle, though the tech is still in its infancy. In the interim, companies like Google will likely implement more aggressive "output filters"—software layers that catch sensitive numerical patterns before they reach the user. Whether these band-aid solutions can keep pace with the sheer volume of data being surfaced remains the defining question for the safety of the modern web.
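
As a rough illustration of what such an output filter might look like, here is a minimal Python sketch. The regular expression, the redact_phone_numbers helper, and the placeholder text are assumptions made for this example, not a description of any vendor's actual safeguard.

```python
import re

# Matches common North American phone formats, e.g. "(555) 014-2367",
# "555-014-2367", or "+1 555 014 2367". A production filter would need
# to cover far more formats and international numbering plans.
PHONE_PATTERN = re.compile(
    r"(\+?1[\s.\-]?)?(\(\d{3}\)|\d{3})[\s.\-]?\d{3}[\s.\-]?\d{4}"
)

def redact_phone_numbers(model_output: str) -> str:
    """Replace anything resembling a phone number with a placeholder
    before the model's answer is shown to the user."""
    return PHONE_PATTERN.sub("[phone number removed]", model_output)

if __name__ == "__main__":
    answer = "You can reach the firm at (555) 014-2367 or via its website."
    print(redact_phone_numbers(answer))
    # -> You can reach the firm at [phone number removed] or via its website.
```

The obvious limitation, as noted above, is that such a filter treats the symptom rather than the training data: it can miss unusual formats and cannot tell a legitimately public business line from a private mobile number.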

Why it matters

  • The surfacing of private phone numbers shifts the AI 'hallucination' problem from a factual nuisance to a high-stakes privacy and liability violation.
  • Current AI architectures make it technically difficult to excise specific personal data once it has been ingested into a model's training set.
  • Regulators may soon treat these leaks as violations of 'right to be forgotten' laws, potentially forcing expensive retraining or legal settlements for AI developers.
Read the full story at MIT Technology Review