A new method for evaluating the performance of models trained on synthetic data when applied to real-world data

Credit scoring models are crucial in assessing and managing credit risk within financial institutions. However, research in this area is limited by the difficulty of obtaining data from financial institutions, which must protect borrowers’ private information. Generative models can provide a solution by creating synthetic data that resembles real-world data, enabling research without compromising privacy. Synthetic data can also improve the accuracy of credit scoring models by augmenting limited real-world data.

The use of synthetic data in credit scoring has been limited primarily to addressing class imbalance in classification problems, using techniques such as SMOTE, variational autoencoders, and generative adversarial networks. Recent studies have used these methods to generate synthetic data that balances the minority class and improves the accuracy of credit scoring models. Recently, a paper presented a novel framework for training credit scoring models on synthetic data and applying them to real-world data, while also analyzing the models’ ability to handle data drift. The main findings suggest that a model trained on synthetic data can perform well, but that working in a privacy-preserving environment comes at a cost: a loss of predictive power.

The study uses a data set provided by a financial institution that includes borrowers’ financial information and social interaction features for two periods, January 2018 and January 2019, each with 500,000 people. Borrowers are labeled according to their repayment behavior over the following 12-month observation period. To generate synthetic data that mimics real-world behavior while maintaining privacy, two state-of-the-art synthetic data generators, CTGAN and TVAE, are compared under different configurations, and the best one is selected. A new synthesizer is then trained with the best settings, with the feature set extended to include social interaction features. Finally, a framework is proposed to estimate the creditworthiness of borrowers, using feature selection and a K-fold cross-validation scheme. Performance is evaluated using metrics such as AUC, KS, and F1-score.
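To make two of these metrics concrete, here is a minimal pure-Python sketch of AUC and the KS statistic for a binary credit-default label. It is an illustration, not the authors' code: the function names `auc_score` and `ks_statistic` are hypothetical, and the paper may well have used library implementations instead.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def _split_scores(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return the sorted scores of the positive (1) and negative (0) class."""
    pos = sorted(s for y, s in zip(y_true, scores) if y == 1)
    neg = sorted(s for y, s in zip(y_true, scores) if y == 0)
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    return pos, neg


def auc_score(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC: probability that a random positive is scored above a random negative."""
    pos, neg = _split_scores(y_true, scores)
    # Compare every positive with every negative; a tie counts as half a win.
    # This is quadratic in the sample size, which is fine for a small sketch.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """KS: largest gap between the empirical score CDFs of the two classes."""
    pos, neg = _split_scores(y_true, scores)
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        # Fraction of each class with a score at or below the threshold.
        cdf_pos = bisect_right(pos, threshold) / len(pos)
        cdf_neg = bisect_right(neg, threshold) / len(neg)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap
```

For labels [0, 0, 1, 1] and scores [0.1, 0.4, 0.35, 0.8], these functions give an AUC of 0.75 and a KS of 0.5.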

The authors implemented the methodology using the Python libraries NetworkX and Synthetic Data Vault. The performance of the two synthetic data generators, CTGAN and TVAE, was compared using two different architectures and different feature sets. The results show that TVAE had faster execution times and better performance in both continuous and categorical feature synthesis. In addition, a logistic regression model was trained to distinguish between real and synthetic data, and the results indicate that TVAE again achieved the best performance. Still, this performance decreased as more features were included in the synthesizer.

The authors then compared credit scoring models trained on synthetic data with models trained on real-world data. First, they trained the classifiers on real-world data and tested them on hold-out data sets. The results show that the gradient boosting algorithm performed better than logistic regression. They then trained the classifiers on synthetic data and applied them to real-world data. The results indicate that model performance was similar when training on synthetic data, except in one case. Comparing the two settings shows a cost for using synthetic data: a loss of predictive power of approximately 3% and 6% when measured in AUC and KS, respectively.
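The arithmetic behind such a "loss of predictive power" figure can be sketched as follows. This is an assumption-laden illustration: the article does not say whether the 3% and 6% are relative or absolute differences, so the sketch assumes a relative drop, the function name `relative_loss` is hypothetical, and the metric values are made up purely so that the output lands on 3% and 6%. They are not results from the paper.

```python
from typing import Dict, Mapping


def relative_loss(real_trained: Mapping[str, float],
                  synth_trained: Mapping[str, float]) -> Dict[str, float]:
    """Relative drop in each metric when training on synthetic instead of real data.

    Both arguments map a metric name to its value on the same real-world test set.
    Assumption: the loss is expressed relative to the real-trained model.
    """
    return {
        name: (real_trained[name] - synth_trained[name]) / real_trained[name]
        for name in real_trained
    }


# Made-up metric values for illustration only (not taken from the paper).
real_model = {"auc": 0.80, "ks": 0.50}
synthetic_model = {"auc": 0.776, "ks": 0.47}

for metric, loss in relative_loss(real_model, synthetic_model).items():
    print(f"{metric}: {loss:.1%} loss of predictive power")
```

Under an absolute-difference reading, the same comparison would simply subtract the two metric values without dividing by the real-trained score.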

In this article, we featured a study that uses synthetic data generation to investigate credit scoring while protecting borrowers’ privacy. The proposed framework trains models on synthetic data and applies them to real-world data while testing their ability to handle data drift. The results show that models trained on synthetic data can perform well, though at the cost of some predictive power, and that TVAE outperformed CTGAN.

Check out the Paper. All credit for this research goes to the researchers of this project.

Mahmoud is a PhD researcher in machine learning. He also has a
bachelor’s degree in physical sciences and a master’s degree in
telecommunication systems and networks. His current areas of
research concern computer vision, stock market prediction, and
deep learning. He has produced several scientific articles on person
re-identification and the study of the robustness and stability of deep
networks.