Synthetic Data: The Fictional Information Powering Real-World AI
Rohan Mehta explains synthetic data, the AI-generated information that lets us train technology without sacrificing our personal privacy. A crucial concept for India.

This opinion piece was drafted with AI assistance under the editorial direction of Rohan Mehta and reviewed before publication. Views expressed are the author's own.
A few years ago, a friend of mine, a clinical psychologist, told me something that has stuck with me ever since. We were discussing her training, and she mentioned how much she learned from reading novels. Not textbooks or case studies, but fiction. She said a great novelist creates characters so rich, so full of authentic contradictions and believable histories, that a psychologist can study them to understand the patterns of the human mind. The character isn't a real person, but their psychological profile is perfectly plausible. You can learn to spot real-world conditions by studying a fictional person.
I’ve been thinking about that conversation a lot lately, especially in my role here at Pulse AI. We spend our days grappling with one of the biggest paradoxes of our time: how do we build intelligent systems that learn from the world without compromising the privacy of the people who live in it? The answer, I believe, lies in an idea very similar to my friend’s training method. It’s called synthetic data.
At its heart, synthetic data is exactly what it sounds like: data that is artificially created rather than being collected from the real world. But it's not just random numbers and gibberish. Like the novelist’s fictional character, it is created to be statistically faithful to the real thing. It has the same patterns, the same relationships, the same nuances as a genuine dataset, but with one crucial difference: none of the entries correspond to a real person, a real transaction, or a real event. It's a privacy-preserving replica.
Think back to the novelist and the psychologist. The novelist studies people, absorbing their tics, their speech patterns, their deepest motivations. Then, they create a new character, let's call him Ajay. Ajay is not a real person. He's a composite, a fiction. But his struggles with anxiety, his family history, his career ambitions—they are all drawn from the novelist's deep understanding of reality. My psychologist friend can read about Ajay and learn to identify anxiety in her actual patients, without ever needing to read a real person’s private journal.
This is precisely how synthetic data works for AI. An AI model, often called a “generator,” is first trained on a real-world dataset. Let’s say it's a set of customer transaction records from a bank in Delhi. The generator model learns the deep statistical patterns within this data: what time of day people shop, the average transaction value, the subtle signs that might indicate fraud, and so on. It learns the “psychology” of the data.
Then, we ask this generator to create a brand new dataset from scratch. It becomes the novelist. It generates fictional customers with fictional transactions. These fictional records aren't copies; they are original creations. But because the generator has learned the underlying rules of the real data, this new synthetic dataset looks and feels just like the real thing. It has the same patterns of spending, the same fraud indicators. You could give it to a data scientist, and they wouldn't be able to tell the difference just by looking at the statistics.
Now, here’s the magic. We can take this new, completely anonymous synthetic dataset and use it to train another AI model, say, a fraud detection system. This system learns to spot fraudulent patterns from the synthetic data, just as my friend learned about anxiety from the fictional character Ajay. The final AI becomes incredibly effective at its job, but it has never once seen a single piece of real, private customer information. When the generator is built with care, the link to any actual individual is effectively severed.
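The three-step pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: a scikit-learn Gaussian mixture stands in for the "generator" (real systems typically use GANs, variational autoencoders, or diffusion models), and the toy "real" transaction data and all variable names are invented for the example.

```python
# Sketch of the synthetic-data pipeline: train a generator on real data,
# sample fictional records from it, then train the downstream model on
# those fictional records only. All data here is toy data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "real" transaction records: [hour_of_day, amount]. Fraudulent
# transactions cluster at odd hours with unusually large amounts.
legit = np.column_stack([rng.normal(14, 3, 900), rng.normal(800, 200, 900)])
fraud = np.column_stack([rng.normal(3, 1, 100), rng.normal(5000, 800, 100)])

# Step 1: fit a generator per class on the real data.
gen_legit = GaussianMixture(n_components=2, random_state=0).fit(legit)
gen_fraud = GaussianMixture(n_components=2, random_state=0).fit(fraud)

# Step 2: sample brand-new, fictional records. No row is a real transaction.
synth_legit, _ = gen_legit.sample(900)
synth_fraud, _ = gen_fraud.sample(100)
X_synth = np.vstack([synth_legit, synth_fraud])
y_synth = np.array([0] * 900 + [1] * 100)

# Step 3: train the fraud detector on the synthetic data only.
detector = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)

# The detector never saw a real record, yet it performs well on real data.
X_real = np.vstack([legit, fraud])
y_real = np.array([0] * 900 + [1] * 100)
print(f"accuracy on real data: {detector.score(X_real, y_real):.2f}")
```

The design point is that steps 1–2 and step 3 can happen in different organizations: only the synthetic sample ever leaves the data owner's hands.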
For someone like me, who has grown up and now works in India, this concept feels less like a clever technical trick and more like a fundamental necessity. We live in a country that has embraced digital transformation at a scale the world has never seen. The UPI system alone processes billions of transactions a month. Our Aadhaar ID is linked to everything from our bank accounts to our mobile phones. This digital infrastructure is a marvel of convenience, but it also represents one of the largest, most sensitive collections of personal data on the planet.
Every time I use my phone to pay for chai, I have a fleeting thought about where that data goes. Who sees it? How is it being used? We are told it’s being used to build better services, to prevent fraud, to offer us more personalized experiences. And I believe that. But the risk of leaks, misuse, and breaches is always there. Synthetic data offers a way to get all the benefits of large-scale data analysis without the accompanying risk to our privacy.
Look at healthcare. Imagine we want to build an AI that can detect a rare form of liver cancer from MRI scans. To do this well, the AI needs to see thousands of scans from both healthy and sick patients. But these scans are deeply personal medical records. Sharing them, even for a good cause, is fraught with ethical and legal challenges. In a place like India, where healthcare data is just beginning to be digitized systematically, establishing privacy protocols from the start is critical.
With synthetic data, we can take a smaller, carefully anonymized set of real MRI scans. A generator AI can learn the visual texture of a liver, the shape of a tumor, and the subtle differences between a healthy and a diseased organ. It can then generate thousands of brand new, medically realistic MRI scans. Some of these synthetic scans will show healthy livers, others will show various stages of cancer. These new scans correspond to no actual patient. They are medical fictions. We can then use this vast, safe, and anonymous dataset to train our cancer-detecting AI, making it incredibly accurate without ever compromising the privacy of a single patient.
Or consider finance, a sector built on trust and confidentiality. A fintech startup in Bangalore wants to build a new credit scoring model that is more inclusive than the traditional ones. It needs access to vast amounts of financial history to find new patterns that indicate creditworthiness. Banks are, rightly, hesitant to share this data. It’s their customers’ most private information. This data scarcity can stifle innovation.
Synthetic data breaks this logjam. The bank can use its internal data to generate a synthetic dataset of customer profiles—fictional people with realistic incomes, spending habits, and loan repayment histories. It can then share this synthetic data with the startup. The startup can build and test its innovative new models on this rich, realistic dataset, all while the bank’s actual customer data remains safely locked away. It’s a way to democratize access to data's insights without democratizing access to the data itself.
Of course, synthetic data is not a silver bullet. The process is only as good as the original data it learns from. If the initial, real-world dataset is biased, the synthetic data will be biased too. If our original set of MRIs mostly features men, the AI we train on the synthetic data will be less effective at diagnosing cancer in women. The old maxim of “garbage in, garbage out” still holds true. We have to be incredibly diligent about ensuring our source data is fair, representative, and clean. The novelist, after all, can only write what they know.
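The bias-propagation point is easy to demonstrate. Below is a tiny hand-rolled generator, written purely for illustration: it "learns" the empirical sex ratio and the per-group distribution of a measurement from a skewed source dataset, then samples new fictional records. The numbers and names are invented; the takeaway is that a 90%-male source dataset yields 90%-male synthetic data.

```python
# Sketch: bias in the source data carries straight into the synthetic data.
# The generator faithfully reproduces whatever imbalance it was shown.
import numpy as np

rng = np.random.default_rng(1)

# Toy "real" source data: sex (0=male, 1=female) and a clinical measurement.
# The source is heavily skewed: roughly 90% of records are male.
real_sex = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
real_meas = rng.normal(50 + 5 * real_sex, 4)

# "Train" the generator: learn the sex ratio and per-group mean/std.
p_female = real_sex.mean()
stats = {s: (real_meas[real_sex == s].mean(), real_meas[real_sex == s].std())
         for s in (0, 1)}

# Sample 10,000 synthetic records from the learned distribution.
synth_sex = rng.choice([0, 1], size=10_000, p=[1 - p_female, p_female])
synth_meas = np.array([rng.normal(*stats[s]) for s in synth_sex])

# The imbalance survives generation almost exactly.
print(f"female share - real: {real_sex.mean():.2f}, "
      f"synthetic: {synth_sex.mean():.2f}")
```

An AI trained on this synthetic data would inherit the same blind spot, which is why auditing the source dataset for representativeness has to come before generation, not after.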
But these are challenges to be managed, not reasons for dismissal. The worldwide push for data privacy, from GDPR in Europe to India's own Digital Personal Data Protection Act, 2023, is making it harder and harder to use real data for AI training. Synthetic data is emerging as the most practical and scalable solution to this problem.
For me, it comes back to that simple, powerful analogy. We are asking our machines to learn about our world. For a long time, we thought the only way to do this was to force-feed them our private diaries. We worried about what they might learn, and what they might reveal. Now, we’re realizing there’s a better way. We can be the novelist. We can write fictional stories—rich, detailed, and statistically perfect—and let the machines learn from those instead. We can teach them about the human world without sacrificing the privacy of a single human being. And that is a story I think we can all get behind.
Why it matters
1. Synthetic data is artificially generated information that mimics the statistical properties of real-world data.
2. It allows companies to train AI models without using sensitive personal information, solving major privacy issues.
3. Crucial for industries like healthcare and finance, it can accelerate innovation while protecting individuals.