Exploiting synthetically generated data with semi-supervised learning for small and imbalanced datasets
- Year:
- 2019
- Type of Publication:
- In Proceedings
- Authors:
- Pérez-Ortiz, María
- Tino, Peter
- Mantiuk, Rafal
- Hervás-Martínez, César
- Book title:
- Proceedings of the Thirty-Third AAAI (Association for the Advancement of Artificial Intelligence) Conference on Artificial Intelligence (AAAI'19)
- Pages:
- 4715-4722
- Organization:
- Honolulu, Hawaii, USA
- Month:
- 27th February
- ISBN:
- 978-1-57735-809-1
- ISSN:
- 2159-5399
- Abstract:
- Data augmentation is rapidly gaining attention in machine learning. Synthetic data can be generated by simple transformations or through the data distribution. In the latter case, the main challenge is to estimate the label associated to new synthetic patterns. This paper studies the effect of generating synthetic data by convex combination of patterns and the use of these as unsupervised information in a semi-supervised learning framework with support vector machines, avoiding thus the need to label synthetic examples. We perform experiments on a total of 53 binary classification datasets. Our results show that this type of data over-sampling supports the well-known cluster assumption in semi-supervised learning, showing outstanding results for small high-dimensional datasets and imbalanced learning problems.
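The convex-combination step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `convex_combinations`, the uniform random pairing of patterns, and the uniformly drawn mixing weight are all assumptions made for illustration, and the semi-supervised SVM stage is omitted. The only point shown is that each synthetic pattern is produced without a label, so it can be treated as unlabeled data.

```python
import random


def convex_combinations(patterns, n_synthetic, seed=0):
    """Return n_synthetic unlabeled points, each on the segment between
    two distinct, randomly chosen input patterns (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Assumption: pairs are drawn uniformly at random, ignoring labels.
        a, b = rng.sample(patterns, 2)
        lam = rng.random()  # mixing weight in [0, 1)
        synthetic.append([lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)])
    return synthetic


# Toy 2-D patterns (illustrative values, not data from the paper).
patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unlabeled = convex_combinations(patterns, n_synthetic=5)
```

Because every synthetic point is a convex combination of two existing patterns, it stays inside the convex hull of the original data. No label is assigned at any point, which matches the abstract's idea of passing such points to a semi-supervised learner as unsupervised information.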