What is synthetic data?
Data generated by a computer, intended to replicate or augment existing data.
Why is it useful?
We have all experienced the success of ChatGPT, Llama and, more recently, DeepSeek. These language models are being used ubiquitously across society and have prompted many claims that we are rapidly approaching Artificial General Intelligence: AI capable of replicating any human function.
Before getting too excited, or scared, depending on your perspective, we are also rapidly approaching a hurdle to the advancement of these language models. According to a paper published by a group from the research institute Epoch [1], we are running out of data. They estimate that by 2028 we will have reached the upper limit of available data on which to train language models.

What happens if we run out of data?
Well, if we run out of data then we will not have anything new with which to train our language models, and those models will stop improving. If we want to pursue Artificial General Intelligence, we will have to come up with new ways of improving AI that do not rely simply on increasing the amount of real-world training data.
One potential saviour is synthetic data, which can be generated to mimic existing data and has already been used to improve the performance of models such as Gemini and DBRX.
Synthetic data beyond LLMs
Beyond overcoming data scarcity for large language models, synthetic data can be used in the following situations:
- Sensitive data: if we do not want to share or use sensitive attributes, synthetic data can be generated that mimics the properties of these features while maintaining anonymity.
- Expensive data: if collecting data is expensive, we can generate a large volume of synthetic data from a small amount of real-world data.
- Lack of data: datasets are biased when they contain a disproportionately low number of data points from a particular group. Synthetic data can be used to balance a dataset.
Imbalanced datasets
Imbalanced datasets can (*though not always*) be problematic, as they may not contain enough information to effectively train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as male.
In this article we show the imbalance in the popular UCI Adult dataset [2], and how we can use a variational autoencoder to generate synthetic data to improve classification in this example.
We first download the Adult dataset. This dataset contains features such as age, education and occupation, which can be used to predict the target outcome, 'income'.
# Download dataset into a dataframe
import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain",
    "capital-loss", "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
# Drop rows with missing values
data = data.dropna()
# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0}).values
# Plot distribution of income
plt.figure(figsize=(8, 6))
plt.hist(data['income'], bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
In the Adult dataset, income is a binary variable representing individuals who earn above and below $50,000. We plot the distribution of income over the whole dataset below. We can see that the dataset is heavily imbalanced, with a far larger number of individuals who earn less than $50,000.
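As a quick numeric check alongside the plot, we can also print the class proportions directly:
# Proportion of each income class; the large majority of individuals earn <=50K
print(data['income'].value_counts(normalize=True))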

Despite this imbalance, we can still train a machine learning classifier on the Adult dataset, which we can use to determine whether unseen, or test, individuals should be classified as earning above or below 50k.
# Preprocessing: One-hot encode categorical features, scale numerical features
import numpy as np
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
    "workclass", "education", "marital-status", "occupation", "relationship",
    "race", "sex", "native-country"
]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(), categorical_features)
    ]
)
X_processed = preprocessor.fit_transform(X)
# Convert to numpy arrays for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.astype(np.float32)
# Split dataset into train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)
# Train a random forest classifier on the imbalanced training data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
# Display confusion matrix
cm = confusion_matrix(y_model_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Printing out the confusion matrix of our classifier shows that our model performs fairly well despite the imbalance. Our model has an overall error rate of 16%, but the error rate for the positive class (income > 50k) is 36%, whereas the error rate for the negative class (income < 50k) is 8%.
This discrepancy shows that the model is indeed biased towards the negative class: it frequently misclassifies individuals who earn more than 50k as earning less than 50k.
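For reference, these error rates can be read straight off the confusion matrix. Below is a minimal sketch, assuming cm is laid out with the negative class first, as scikit-learn does:
# cm rows are actual classes, columns are predictions: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = cm.ravel()
print(f"Overall error rate: {(fp + fn) / cm.sum():.2%}")
print(f"Positive class error rate: {fn / (fn + tp):.2%}")   # >50k misclassified as <=50k
print(f"Negative class error rate: {fp / (fp + tn):.2%}")   # <=50k misclassified as >50k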
Below we show how we can use a variational autoencoder to generate synthetic data for the positive class to balance this dataset. We then train the same model using the synthetically balanced dataset and reduce model errors on the test set.

How do we generate synthetic data?
There are many different methods for generating synthetic data. These include more traditional techniques, such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as variational autoencoders or generative adversarial networks are naturally suited to generating new data, as their architectures learn the distribution of real data and use it to generate synthetic samples.
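As a point of comparison, a traditional oversampling approach such as SMOTE can be applied in just a few lines. Here is a minimal sketch using the imbalanced-learn library (not used in the rest of this tutorial):
# Minimal SMOTE sketch (assumes the imbalanced-learn package is installed)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
# Generate new minority-class samples by interpolating between existing minority samples
X_resampled, y_resampled = smote.fit_resample(X_model_train, y_model_train)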
In this tutorial we use a variational autoencoder to generate synthetic data.
Variational Autoencoders
Variational Autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data that closely resembles the existing data. The continuity of this space is one of their big selling points, as it means the model generalises well and does not just memorise the latent representations of specific inputs.
A VAE consists of an encoder, which maps input data to a probability distribution (mean and variance), and a decoder, which reconstructs the data from the latent space.
To achieve that continuous latent space, VAEs use the reparameterization trick, where a random noise vector is scaled and shifted using the learned mean and variance, ensuring smooth and continuous representations in the latent space.
Below we construct a BasicVAE class which implements this process with a simple architecture.
- The encoder compresses the input into a smaller hidden representation, producing both a mean and a log variance that define a Gaussian distribution, i.e. creating our magic sampling bucket. Instead of sampling directly, the model applies the reparameterization trick to generate latent variables, which are then passed to the decoder.
- The decoder reconstructs the original data from these latent variables, ensuring the generated data maintains the characteristics of the original dataset.
import torch
import torch.nn as nn

class BasicVAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super(BasicVAE, self).__init__()
        # Encoder: single small layer
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(8, latent_dim)
        self.fc_logvar = nn.Linear(8, latent_dim)
        # Decoder: single small layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 8),
            nn.ReLU(),
            nn.Linear(8, input_dim),
            nn.Sigmoid()  # Outputs values in range [0, 1]
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
Given our BasicVAE architecture, we construct our loss function and model training loop below.
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def vae_loss(recon_x, x, mu, logvar, tau=0.5, c=1.0):
    # Reconstruction loss
    recon_loss = nn.MSELoss()(recon_x, x)
    # KL divergence loss
    kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld_loss / x.size(0)

def train_vae(model, data_loader, epochs, learning_rate):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    model.train()
    losses = []
    reconstruction_mse = []
    for epoch in range(epochs):
        total_loss = 0
        total_mse = 0
        for batch in data_loader:
            batch_data = batch[0]
            optimizer.zero_grad()
            reconstructed, mu, logvar = model(batch_data)
            loss = vae_loss(reconstructed, batch_data, mu, logvar)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # Compute batch-wise MSE for comparison
            mse = nn.MSELoss()(reconstructed, batch_data).item()
            total_mse += mse
        losses.append(total_loss / len(data_loader))
        reconstruction_mse.append(total_mse / len(data_loader))
        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}, MSE: {total_mse:.4f}")
    return losses, reconstruction_mse

# Combine features and target so the VAE learns the joint distribution
combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)
# Train-test split for the VAE
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)
batch_size = 128
# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)
basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)
basic_losses, basic_mse = train_vae(
    basic_vae, train_loader, epochs=50, learning_rate=0.001,
)
# Visualise results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()
vae_loss consists of two parts: the reconstruction loss, which measures how well the generated data matches the original input using mean squared error (MSE), and the KL divergence loss, which ensures that the learned latent space follows a normal distribution.
train_vae optimises the VAE using the Adam optimizer over several epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using vae_loss. These errors are then corrected via backpropagation, where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases over training.
We can see that our model quickly learns how to reconstruct our data, evidencing efficient learning.

Now that we have trained our BasicVAE to accurately reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more samples of the positive class (individuals who earn over 50k) in order to balance the classes and remove the bias from our model.
To do this, we select all the samples from our VAE dataset where income is the positive class (earning more than 50k). We then encode these samples into the latent space. As we have only chosen samples of the positive class to encode, this latent space will reflect the properties of the positive class, which we can sample from to create synthetic data.
We sample 15,000 new points from this latent space and decode these latent vectors back into the input data space as our synthetic data points.
# Gather the VAE training data into a dataframe
# (this step is assumed here; it was not shown explicitly in the original snippet)
sample_df = pd.DataFrame(X_train)
# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names
# Define the feature value to filter on
feature_value = 1.0  # here we select rows where income is 1, i.e. over 50k
# Select all samples of the positive class
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)
basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
    mu, logvar = basic_vae.encode(selected_samples_tensor)
    latent_vectors = basic_vae.reparameterize(mu, logvar)
# Compute the mean latent vector for the positive class
mean_latent_vector = latent_vectors.mean(dim=0)
num_samples = 15000  # Number of new samples
latent_dim = 8
# Sample around the mean latent vector with a small amount of noise
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)
with torch.no_grad():
    generated_samples = basic_vae.decode(latent_samples)
Now that we have generated synthetic data for the positive class, we can combine it with the original training data to create a balanced synthetic dataset.
new_data = pd.DataFrame(generated_samples.numpy())
# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names
# Separate synthetic features and target (all synthetic samples belong to the positive class)
X_synthetic = new_data.drop(col_names[-1], axis=1)
y_synthetic = np.asarray([1 for _ in range(0, X_synthetic.shape[0])])
# Combine the synthetic data with the original training data
X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)
# Map labels back to income categories for plotting
mapping = {1: '>50K', 0: '<=50K'}
map_function = np.vectorize(lambda x: mapping[x])
y_mapped = map_function(y_synthetic_train)
plt.figure(figsize=(8, 6))
plt.hist(y_mapped, bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

We can now use our balanced synthetic training dataset to retrain our random forest classifier. We can then evaluate this new model on the original test data to see how effective our synthetic data is at reducing the model bias.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)
# Make predictions on the original test set
y_pred = rf_classifier.predict(X_model_test)
cm = confusion_matrix(y_model_test, y_pred)
# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Our new classifier, trained on the balanced synthetic dataset, makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset, and our overall error rate is now reduced to 14%.

However, we have not been able to reduce the discrepancy in errors by a significant amount: our error rate for the positive class is still 36%. This could be due to the following reasons:
- We have discussed how one of the benefits of VAEs is learning a continuous latent space. However, if the majority class dominates, the latent space may skew towards the majority class.
- The model may not have properly learned a distinct representation for the minority class due to the lack of data, making it hard to sample accurately from that region.
In this tutorial we have introduced and built a BasicVAE architecture which can be used to generate synthetic data that improves classification accuracy on an imbalanced dataset.
Follow for future articles where I will show how we can build more sophisticated VAE architectures that address the above problems with imbalanced sampling, and more.
[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325.
[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.