Fake It to Make It: Companies Beef Up AI Models With Synthetic Data

American Express experiments with AI-generated fake fraud patterns to sharpen its models' ability to detect rare or uncommon swindles

American Express is training artificial intelligence models to look for suspicious patterns of behavior. (Photo: Reuters)

Companies rely on real-world data to train artificial-intelligence models that can identify anomalies, make predictions and generate insights. But often, it isn't enough.

To detect credit-card fraud, for example, researchers train AI models to look for specific patterns of known suspicious behavior, gleaned from troves of data. But unique, or rare, types of fraud are difficult to detect when there isn't enough data to support the algorithm's training.

To get around that, companies are learning to fake it, building so-called synthetic data sets designed to augment training data.

At American Express Co., machine-learning and data scientists have been experimenting with synthetic data for nearly two years in hopes of improving the company's AI-based fraud-detection models, said Dmitry Efimov, head of the company's Machine Learning Center of Excellence. The credit-card company uses an advanced form of AI to generate fake fraud patterns aimed at bolstering the real training data.

Rare or uncommon types of fraud can be overlooked by the company's AI-based fraud-detection model if the algorithms don't have enough training examples of that type of fraud, he said.

"There are a lot of different kinds of patterns, the number of fraud patterns in real life is pretty big," Mr. Efimov said. "Some fraud patterns happen more often than others, and some patterns are very rare."

American Express is working on improving these models by experimenting with generative adversarial networks, a technique to create synthetic data on uncommon fraud patterns. That data is then used to augment the company's existing data set of fraud behaviors to improve its overall AI-based fraud-detection models.

"We started thinking, can we balance the presence of different fraud patterns? That's where [generative adversarial networks] come up," he said.
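The balancing question can be made concrete with a small sketch. It is an illustration only, not American Express's method: the pattern names and counts are hypothetical, and topping every pattern up to the size of the most common one is just one possible assumption about how much synthetic data to generate.

```python
def synthetic_needed(pattern_counts, target=None):
    """Return how many synthetic examples each pattern would need.

    pattern_counts maps a (hypothetical) fraud-pattern name to the number
    of real examples on hand. By default every pattern is topped up to
    the size of the most common one; pass target to choose another level.
    """
    if target is None:
        target = max(pattern_counts.values())
    # Patterns already at or above the target need no synthetic data.
    return {name: max(0, target - count)
            for name, count in pattern_counts.items()}


# Hypothetical counts, for illustration only.
counts = {"common_pattern": 1000, "rare_pattern": 40}
print(synthetic_needed(counts))
```

A lower `target` would add fewer fake examples per pattern; how many to add is a choice the sketch leaves to the caller.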

A generative adversarial network is an AI technique commonly used to create simulation data to train the underlying AI models that power self-driving cars. It is also used to create deepfakes, that is, photographs or videos of people that are often indistinguishable from reality.

One AI model acts as a "generator" that produces new data, and the second model tries to determine whether the data is real or fake, Mr. Efimov said. In the "perfect" generative adversarial network, the second model cannot tell fake data from real, he said.
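The two-model setup can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the company's system: the data is a single made-up number per example, the generator is a line (`a * z + b`), the second model is a one-input logistic classifier, and all names and settings here are hypothetical. A classifier this simple can mostly push the generator toward the average of the real data, so the sketch shows the training loop rather than realistic output.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real_samples, steps=2000, lr=0.01, seed=0):
    """Alternate single-sample updates of the two models."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b
    w, c = 0.0, 0.0  # second model: P(real) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.choice(real_samples)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b

        # Second model: raise log P(real | x_real) + log P(fake | x_fake).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator: raise log P(real | x_fake), that is, try to fool
        # the second model (the common "non-saturating" objective).
        d_fake = sigmoid(w * x_fake + c)
        a += lr * (1.0 - d_fake) * w * z
        b += lr * (1.0 - d_fake) * w
    return (a, b), (w, c)


def generate(gen_params, n, seed=1):
    """Draw n synthetic values from the trained generator."""
    a, b = gen_params
    rng = random.Random(seed)
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]


# Made-up "real" data: one numeric feature centered near 4.0.
data_rng = random.Random(7)
real = [data_rng.gauss(4.0, 0.5) for _ in range(200)]
generator, second_model = train_toy_gan(real)
print(generate(generator, 5))
```

Production systems typically replace both toy models with neural networks and train on many features at once, but the alternating pattern, one update for the model that judges and one for the model that generates, is the same.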

Personally identifiable information isn't used at any stage of the process, he said.

The effort is still in the research stages, in part because it is difficult to determine how much of each unique fake fraud pattern the AI model should generate, he said. But early tests are promising. Experiments have shown that for specific types of fraud, the fake data does improve the AI-based fraud-detection model, he said.

American Express has had the lowest U.S. fraud-loss rates among the major banks for the past 14 years, according to a February Nilson Report, a source of news and statistics on the payment industry.

Synthetic data has already found uses in other industries. Hospitals, for example, are using synthetic data based on real medical records from patients to make medical decisions.

Startup Moveworks Inc. generates synthetic data to improve its AI-based chatbots, used by corporate customers to answer employee questions related to information technology, finance and human resources, said Vaibhav Nivargi, co-founder and chief technology officer.

Moveworks' customers supply it with technical documents to help answer IT questions related to, for example, computer memory, Mr. Nivargi said. But that data is frequently insufficient to train its chatbots to answer questions.

"[Synthetic data] becomes very important because we operate in a domain with limited data," Mr. Nivargi said.

Moveworks built a machine-learning model that generates questions that could be asked by humans based on those technical documents, he said.

By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated, according to Gartner Inc. The technology research firm is receiving an increasing number of questions regarding synthetic data, said Erick Brethenoux, Gartner's AI research manager.


