Generalized Wasserstein Deep Forest
An Interpretable Deep Learning Model for Imbalanced Learning
Deep learning methods have boosted the capacity of machine learning algorithms and are now used for non-trivial applications in various applied artificial intelligence (AI) domains. However, real-life datasets are often extremely imbalanced, which severely hampers a neural network's capabilities and reduces its robustness and trustworthiness. In binary pattern classification, class imbalance refers to the situation in which one class of interest (the positive or minority class) is rare relative to the other class (the negative or majority class). As a result, the classifier can be heavily biased toward the majority class.
Consider the example of online social media platforms, which encourage users to engage in different activities. These platforms have been shown to play a contributory role in several decision-making processes, such as presidential elections, product purchases, and movie selection. This has naturally encouraged fraudsters to participate in a covert ecosystem that influences the final decision. Social media fraud exhibits certain sophisticated characteristics: (a) suspicious users are active and intelligent in conducting fraudulent activities; (b) fraudulent behavior is very dynamic; (c) fraud is hidden in diversified user behavior; (d) fraud-related activities are dispersed in highly imbalanced datasets; and (e) fraud activities take place within a very limited time, which requires real-time, uninterrupted scanning of online platforms. For AI and machine learning researchers, the most important limitation of existing fraud detection methods is the data imbalance problem.
Research on "learning from imbalanced data" continues to grow, largely driven by challenging problems including fraud detection, face recognition, spam and anomaly detection, and medical diagnosis. The overarching question is how to push the boundaries of prediction on the underrepresented or minority classes while managing the trade-off with false positives. The solution space ranges from sampling approaches to new learning algorithms designed specifically for imbalanced datasets.
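To make the trade-off between minority-class prediction and false positives concrete, the following minimal sketch compares two classifiers on a made-up dataset with 10 positives among 1,000 samples. The data, the two prediction vectors, and the helper names (`confusion_counts`, `accuracy`, `recall`, `false_positive_rate`) are illustrative assumptions, not part of GWDF or of any dataset used in the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def recall(y_true, y_pred):
    """Fraction of true minority samples that were detected."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(y_true, y_pred):
    """Fraction of majority samples wrongly flagged as minority."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical data: 10 positives (e.g. fraud) followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

# Classifier A always predicts the majority class.
always_negative = [0] * 1000

# Classifier B (hypothetical) finds 8 of the 10 positives
# but raises 50 false alarms among the negatives.
detector = [1] * 8 + [0] * 2 + [1] * 50 + [0] * 940

for name, y_pred in [("always_negative", always_negative), ("detector", detector)]:
    print(
        name,
        "accuracy=%.3f" % accuracy(y_true, y_pred),
        "recall=%.3f" % recall(y_true, y_pred),
        "fpr=%.3f" % false_positive_rate(y_true, y_pred),
    )
```

In this toy setting the majority-only classifier has the higher accuracy while detecting no minority samples at all, which is why accuracy alone is a poor yardstick for imbalanced problems and why recall must be weighed against the false positive rate.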
Despite the progress deep learning methods have made, several challenges remain that they ignore or handle poorly: (a) model explainability and interpretability; (b) high dimensionality; and (c) classifying extreme samples and unseen categories. Deep learning methods for imbalanced data classification, such as ensemble deep learning models, knowledge-shot learning, and dynamic curriculum learning, incur higher time complexity than traditional neural networks and can classify unseen classes only if the knowledge vector of those classes is supplied artificially. Thus, there is an urgent need to handle high-dimensional imbalanced datasets, to address the interpretability issues of deep learning methods, and to create more robust methods that can deal with outliers.
In this project, we address the curse of imbalanced datasets and the deficiencies of the past literature discussed above, which arise in domains such as fraud detection, land cover recognition, medical imaging, and physiological data analysis, by designing a novel deep model, the Generalized Wasserstein Deep Forest (GWDF). The project aims to design an interpretable, scalable deep model that can handle high-dimensional imbalanced datasets arising in various applied AI domains.
- Prof. Uttam Kumar
- Dr. Tanujit Chakraborty (Ph.D. Indian Statistical Institute (ISI) Kolkata), Postdoctoral Fellow, IIIT Bangalore.