Conditional GAN (Generative Adversarial Network) on Tabular Data

Kundan Kumar Jha
3 min read · May 11, 2021


This blog covers the use of a conditional GAN on tabular data. I use the UNSW tabular dataset to train the generator and discriminator.

Note: If you are totally new to GANs then first read this article:

https://jonathan-hui.medium.com/gan-whats-generative-adversarial-networks-and-its-application-f39ed278ef09#:~:text=The%20main%20focus%20for%20GAN,a%20zebra%20from%20a%20horse.

Information about UNSW Dataset:

UNSW-NB15 is a network intrusion dataset. It contains nine different attack types, including DoS, Worms, Backdoors, and Fuzzers, and includes raw network packets. The training set contains 175,341 records and the testing set contains 82,332 records, spanning both attack and normal traffic.

Importing necessary modules
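The import cell is not shown here; a representative set of imports for this workflow (assuming Keras/TensorFlow and scikit-learn, which is consistent with the saved `.h5` model and the RandomForestClassifier used later) might look like:

```python
# Data handling and preprocessing.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Keras building blocks for the conditional generator and discriminator.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate
from tensorflow.keras.optimizers import Adam
```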

Loading Dataset

Using pandas to load dataset
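A sketch of the loading step. In practice you would point `pd.read_csv` at the downloaded UNSW-NB15 CSV (the exact filename depends on your copy of the dataset); here a tiny stand-in CSV is built in memory so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for the real UNSW-NB15 CSV; replace with the path to your
# downloaded training-set file in a real run.
sample_csv = io.StringIO(
    "dur,proto,service,state,spkts,label\n"
    "0.12,tcp,http,FIN,6,0\n"
    "0.05,udp,dns,CON,2,1\n"
)
train = pd.read_csv(sample_csv)
print(train.shape)   # (2, 6)
print(train.head())
```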

Performing label encoding:

Many features in the UNSW dataset are categorical, stored as text, and ML models cannot be trained on textual data directly; the text must first be converted into numerical values. Label encoding maps each category to an integer; one-hot encoding instead maps each category to a binary vector. Either way, the categorical data becomes numeric.

Code to perform encoding:
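A minimal sketch of the encoding step using scikit-learn's `LabelEncoder` on a toy frame (`proto`, `service`, and `state` are actual categorical columns in UNSW-NB15):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the UNSW-NB15 categorical columns.
df = pd.DataFrame({
    "proto":   ["tcp", "udp", "tcp", "arp"],
    "service": ["http", "dns", "-", "http"],
    "state":   ["FIN", "CON", "FIN", "INT"],
})

# Fit one LabelEncoder per categorical column, mapping each text
# category to an integer code.
encoders = {}
for col in ["proto", "service", "state"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df.dtypes)  # every column is now an integer type
```

Keeping one encoder per column lets you invert the mapping later with `encoders[col].inverse_transform(...)`.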

Normalization of data
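A sketch of the normalization step with scikit-learn's `MinMaxScaler`, which squashes every feature into [0, 1] (GANs train more stably on bounded inputs, and it matches a sigmoid output on the generator):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix standing in for the encoded UNSW features.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column scaled to [0, 1]
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```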

Separating data to feature and label

Printing shape of X_train, X_test, y_train, y_test
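The split-and-inspect step can be sketched as follows (synthetic data stands in for the preprocessed UNSW frame; the 80/20 split ratio is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 rows, 5 features, binary label in the last column.
rng = np.random.default_rng(0)
data = rng.random((100, 6))
X, y = data[:, :-1], (data[:, -1] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (80, 5) (20, 5) (80,) (20,)
```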

Hint: this dataset has 175,341 rows, so training on all of it would take a long time. For faster result visualization we train on just 2,000 samples, selected using this code:
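A minimal sketch of the subsampling step with pandas (a 10,000-row toy frame stands in for the full training set):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 175,341-row UNSW training set.
df = pd.DataFrame(np.random.rand(10_000, 4), columns=list("abcd"))

# Draw 2,000 rows at random for a much faster training run.
df_small = df.sample(n=2000, random_state=42).reset_index(drop=True)
print(df_small.shape)  # (2000, 4)
```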

Every GAN has two main components :

  1. Generator: generates fake data samples from random input.
  2. Discriminator: distinguishes generated data from real data.

Together they create a competitive environment. After every epoch, the generator improves itself based on the scores the discriminator assigns to its generated samples, trying to produce samples so realistic that the discriminator can no longer tell them apart from the originals. The discriminator, in turn, is retrained every epoch on both original and generated data so that it can classify fake and real samples accurately. As a result, both the generator and the discriminator improve after every epoch.

Defining discriminator and generator

Discriminator
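A plausible Keras sketch of a conditional discriminator (the feature width, embedding size, and layer sizes below are assumptions, not the author's exact architecture):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate

N_FEATURES = 42   # assumed width of the encoded, normalized feature vector
N_CLASSES = 2     # attack vs normal

def build_discriminator():
    features = Input(shape=(N_FEATURES,))
    label = Input(shape=(1,), dtype="int32")

    # Embed the class label and concatenate it with the feature vector,
    # so the discriminator judges (sample, label) pairs.
    label_emb = Flatten()(Embedding(N_CLASSES, 16)(label))
    x = Concatenate()([features, label_emb])
    x = Dense(128, activation="relu")(x)
    x = Dense(64, activation="relu")(x)
    out = Dense(1, activation="sigmoid")(x)  # probability the sample is real

    model = Model([features, label], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

discriminator = build_discriminator()
```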

Generator:
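A matching conditional generator sketch (again, the latent dimension and layer sizes are assumptions); the sigmoid output matches the [0, 1] range of the min-max-scaled features:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate

LATENT_DIM = 32   # assumed noise dimension
N_FEATURES = 42   # assumed width of the encoded feature vector
N_CLASSES = 2

def build_generator():
    noise = Input(shape=(LATENT_DIM,))
    label = Input(shape=(1,), dtype="int32")

    # Condition the noise on the desired class label.
    label_emb = Flatten()(Embedding(N_CLASSES, 16)(label))
    x = Concatenate()([noise, label_emb])
    x = Dense(64, activation="relu")(x)
    x = Dense(128, activation="relu")(x)
    # Sigmoid keeps outputs in [0, 1], matching the scaled features.
    out = Dense(N_FEATURES, activation="sigmoid")(x)
    return Model([noise, label], out)

generator = build_generator()
```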

Now we implement the GAN logic that creates this competitive environment:
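A self-contained sketch of the classic alternating training loop (tiny layer sizes, toy data, and two epochs are used here just so the snippet runs quickly; they are not the author's settings). The discriminator is compiled on its own, then frozen inside the stacked generator-plus-discriminator model so that the generator's updates do not also move the discriminator:

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate

LATENT_DIM, N_FEATURES, N_CLASSES = 8, 10, 2  # small sizes for a quick demo

def build_generator():
    noise, label = Input(shape=(LATENT_DIM,)), Input(shape=(1,), dtype="int32")
    lab = Flatten()(Embedding(N_CLASSES, 4)(label))
    x = Dense(32, activation="relu")(Concatenate()([noise, lab]))
    return Model([noise, label], Dense(N_FEATURES, activation="sigmoid")(x))

def build_discriminator():
    feats, label = Input(shape=(N_FEATURES,)), Input(shape=(1,), dtype="int32")
    lab = Flatten()(Embedding(N_CLASSES, 4)(label))
    x = Dense(32, activation="relu")(Concatenate()([feats, lab]))
    m = Model([feats, label], Dense(1, activation="sigmoid")(x))
    m.compile(optimizer="adam", loss="binary_crossentropy")
    return m

generator = build_generator()
discriminator = build_discriminator()

# Stacked model: noise + label -> generator -> (frozen) discriminator.
discriminator.trainable = False
noise_in = Input(shape=(LATENT_DIM,))
label_in = Input(shape=(1,), dtype="int32")
gan = Model([noise_in, label_in],
            discriminator([generator([noise_in, label_in]), label_in]))
gan.compile(optimizer="adam", loss="binary_crossentropy")

# Toy "real" data in place of the preprocessed UNSW samples.
rng = np.random.default_rng(0)
X_real = rng.random((64, N_FEATURES)).astype("float32")
y_real = rng.integers(0, N_CLASSES, size=(64, 1))

BATCH = 32
for epoch in range(2):  # a real run would use many more epochs
    # 1) Train the discriminator: real pairs -> 1, generated pairs -> 0.
    idx = rng.integers(0, len(X_real), BATCH)
    noise = rng.standard_normal((BATCH, LATENT_DIM)).astype("float32")
    fake_labels = rng.integers(0, N_CLASSES, size=(BATCH, 1))
    X_fake = generator.predict([noise, fake_labels], verbose=0)

    d_loss_real = discriminator.train_on_batch([X_real[idx], y_real[idx]],
                                               np.ones((BATCH, 1)))
    d_loss_fake = discriminator.train_on_batch([X_fake, fake_labels],
                                               np.zeros((BATCH, 1)))

    # 2) Train the generator via the stacked model to fool the discriminator.
    g_loss = gan.train_on_batch([noise, fake_labels], np.ones((BATCH, 1)))

# After training, the generator can be saved, e.g.:
# generator.save("cgan_generator_unsw.h5")
```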

After performing the above steps, the trained generator is saved as "cgan_generator_unsw.h5" in the root directory.

Using that generator to generate sample:

How this conditional generator works:

  1. We build data samples of the same dimension from random values, pair them with labels, and feed them to the generator. The generator transforms the random feature values into values resembling our dataset, while the label we provided is preserved.
  2. So it produces data carrying the same label we supplied alongside the randomly made input.

Code for producing features data from generator:
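A sketch of the generation step. An untrained stand-in generator is built inline so the snippet is self-contained; in a real run you would instead load the saved model with `load_model("cgan_generator_unsw.h5")` (the small dimensions below are illustrative):

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate

LATENT_DIM, N_FEATURES, N_CLASSES = 8, 10, 2

# Untrained stand-in; in practice:
#   from tensorflow.keras.models import load_model
#   generator = load_model("cgan_generator_unsw.h5")
noise_in = Input(shape=(LATENT_DIM,))
label_in = Input(shape=(1,), dtype="int32")
lab = Flatten()(Embedding(N_CLASSES, 4)(label_in))
x = Dense(32, activation="relu")(Concatenate()([noise_in, lab]))
generator = Model([noise_in, label_in],
                  Dense(N_FEATURES, activation="sigmoid")(x))

# Random noise plus the label we want the samples to carry (here: class 1).
n_samples = 5
noise = np.random.normal(0, 1, (n_samples, LATENT_DIM)).astype("float32")
labels = np.full((n_samples, 1), 1)

X_gen = generator.predict([noise, labels], verbose=0)
print(X_gen.shape)  # (5, 10)
```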

Testing using RandomForestClassifier:
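One common way to run this test, sketched here with synthetic stand-in data: train a RandomForestClassifier on the generated samples and evaluate it on held-out real samples. If the CGAN captured the data distribution, accuracy should approach that of training on real data. (The data below is random, so the accuracy here is only illustrative.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Toy stand-ins for CGAN-generated training data and real test data.
X_gen,  y_gen  = rng.random((500, 10)), rng.integers(0, 2, 500)
X_real, y_real = rng.random((200, 10)), rng.integers(0, 2, 200)

# Train on generated samples, evaluate on real ones.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_gen, y_gen)
acc = accuracy_score(y_real, clf.predict(X_real))
print(f"accuracy on real data: {acc:.3f}")
```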

Locate the repository at https://github.com/Kundannitp/CGAN_ON_UNSW

Credits & Reference

Thanks to Prabhkirat Singh and Navin Bharti for helping me in the development process.
