

Image by Author | Canva
# Introduction
Finding real-world datasets can be difficult because they are often private (protected), incomplete (missing features), or expensive (behind a paywall). Synthetic datasets solve these problems by letting you generate data based on your project needs.
Synthetic data is artificially generated information that mimics real-life datasets. You can control the size, complexity, and realism of a synthetic dataset to tailor it to your data needs.
In this article, we'll explore synthetic data generation methods. We'll then build a portfolio project by analyzing the data, creating a machine learning model, and using AI to develop a complete portfolio project with a Streamlit app.
# How to Generate Synthetic Data
Synthetic data is usually created randomly, using simulations, rules, or AI.
// Method 1: Random Data Generation
To generate data randomly, we'll use simple functions to create values without any specific rules.
This is useful for testing, but it won't capture realistic relationships between features. We'll do it using NumPy's random module and create a Pandas DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
df_random = pd.DataFrame({
    "feature_a": np.random.randint(1, 100, 5),
    "feature_b": np.random.rand(5),
    "feature_c": np.random.choice(["X", "Y", "Z"], 5)
})
df_random.head()
Here is the output.
// Method 2: Rule-Based Data Generation
Rule-based data generation is a step up from random generation: it follows a precise formula or set of rules, which makes the output purposeful and consistent.
In our example, the size of a house is directly linked to its price. To show this clearly, we will create a dataset with both size and price and define the relationship with a formula:
price = size × 300 + ε (random noise)
This way, you can see the correlation while keeping the data reasonably realistic.
np.random.seed(42)
n = 5
size = np.random.randint(500, 3500, n)
price = size * 300 + np.random.randint(5000, 20000, n)
df_rule = pd.DataFrame({
    "size_sqft": size,
    "price_usd": price
})
df_rule.head()
Here is the output.
// Method 3: Simulation-Based Data Generation
Simulation-based data generation combines random variation with rules from the real world. This mix creates datasets that behave like real ones.
What do we know about housing?
- Bigger homes usually cost more
- Some cities cost more than others
- There is a baseline price
How do we build the dataset?
- Pick a city at random
- Draw a home size
- Set bedrooms between 1 and 5
- Compute the price with a clear rule
Price rule: We start with a base price, apply a city price bump, and then add size × rate.
price_usd = base_price × city_bump + sqft × rate
Here is the code.
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
CITIES = ["los_angeles", "san_francisco", "san_diego"]
# City price bump: higher means a pricier city
CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}
def make_data(n_rows=10):
    city = rng.choice(CITIES, size=n_rows)
    # Most homes are near 1,500 sqft, some smaller or larger
    sqft = rng.normal(1500, 600, n_rows).clip(350, 4500).round()
    beds = rng.integers(1, 6, n_rows)
    base = 220_000
    rate = 350  # dollars per sqft
    bump = np.array([CITY_BUMP[c] for c in city])
    price = base * bump + sqft * rate
    return pd.DataFrame({
        "city": city,
        "sqft": sqft.astype(int),
        "beds": beds,
        "price_usd": price.round(0).astype(int),
    })
df = make_data()
df.head()
Here is the output.
// Method 4: AI-Powered Data Generation
To have AI create your dataset, you need a clear prompt. AI is powerful, but it works best when you set simple, practical rules.
In the following prompt, we will include:
- Domain: What is the data about?
- Features: Which columns do we want?
- City, neighborhood, sqft, bedrooms, bathrooms
- Relationships: How do the features connect?
- Price depends on city, sqft, bedrooms, and crime index
- Format: How should the AI return it?
Here is the prompt.
Generate Python code that creates a synthetic California real estate dataset.
The dataset should have 10,000 rows with columns: city, neighborhood, latitude, longitude, sqft, bedrooms, bathrooms, lot_sqft, year_built, property_type, has_garage, condition, school_score, crime_index, dist_km_center, price_usd.
Cities: Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
Price should depend on city premium, sqft, bedrooms, bathrooms, lot size, school score, crime index, and distance from the city center.
Include some random noise, missing values, and a few outliers.
Return the result as a Pandas DataFrame and save it to 'ca_housing_synth.csv'.
Let's use this prompt with ChatGPT.
It returned the dataset as a CSV. Here is the process that shows how ChatGPT created it.
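As a rough illustration of the kind of generation code such a prompt tends to produce (the city premiums, coordinates, price coefficients, noise levels, and missing-value rates below are assumptions, not ChatGPT's exact output):
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
n = 10_000
cities = ["Los Angeles", "San Francisco", "San Diego", "San Jose", "Sacramento"]
# Illustrative city price premiums and rough city-center coordinates
city_premium = {"Los Angeles": 1.20, "San Francisco": 1.55, "San Diego": 1.10,
                "San Jose": 1.35, "Sacramento": 0.90}
centers = {"Los Angeles": (34.05, -118.24), "San Francisco": (37.77, -122.42),
           "San Diego": (32.72, -117.16), "San Jose": (37.34, -121.89),
           "Sacramento": (38.58, -121.49)}
city = rng.choice(cities, size=n)
neighborhood = np.array([f"{c}_nbhd_{i}" for c, i in zip(city, rng.integers(1, 6, n))])
# Scatter homes around each city's center and derive an approximate distance in km
center_lat = np.array([centers[c][0] for c in city])
center_lon = np.array([centers[c][1] for c in city])
latitude = center_lat + rng.normal(0, 0.15, n)
longitude = center_lon + rng.normal(0, 0.15, n)
dist_km_center = np.sqrt((latitude - center_lat) ** 2 +
                         ((longitude - center_lon) * np.cos(np.radians(latitude))) ** 2) * 111
sqft = rng.normal(1600, 650, n).clip(350, 6000).round()
bedrooms = rng.integers(1, 6, n)
bathrooms = np.clip(bedrooms - rng.integers(0, 2, n), 1, 5)
lot_sqft = (sqft * rng.uniform(1.5, 4.0, n)).round()
year_built = rng.integers(1920, 2024, n)
property_type = rng.choice(["single_family", "condo", "townhouse"], n, p=[0.6, 0.25, 0.15])
has_garage = rng.choice([True, False], n, p=[0.7, 0.3])
condition = rng.choice(["poor", "fair", "good", "excellent"], n, p=[0.1, 0.3, 0.4, 0.2])
school_score = rng.uniform(1, 10, n).round(1)
crime_index = rng.uniform(10, 90, n).round(1)
# Price rule: a city premium scales size, room, lot, school, crime, and distance effects;
# condition adds a bump; Gaussian noise keeps the relationship imperfect
premium = np.array([city_premium[c] for c in city])
cond_bump = pd.Series(condition).map(
    {"poor": -30_000, "fair": 0, "good": 25_000, "excellent": 60_000}).to_numpy()
price_usd = (premium * (150_000 + sqft * 350 + bedrooms * 15_000 + bathrooms * 10_000
                        + lot_sqft * 5 + school_score * 12_000
                        - crime_index * 1_500 - dist_km_center * 3_000)
             + cond_bump + rng.normal(0, 40_000, n)).clip(60_000, None)
df = pd.DataFrame({
    "city": city, "neighborhood": neighborhood, "latitude": latitude,
    "longitude": longitude, "sqft": sqft, "bedrooms": bedrooms,
    "bathrooms": bathrooms, "lot_sqft": lot_sqft, "year_built": year_built,
    "property_type": property_type, "has_garage": has_garage, "condition": condition,
    "school_score": school_score, "crime_index": crime_index,
    "dist_km_center": dist_km_center.round(2), "price_usd": price_usd.round(0),
})
# Inject missing values in a few columns and a handful of price outliers
for col in ["lot_sqft", "school_score", "crime_index"]:
    df.loc[rng.choice(n, size=int(0.02 * n), replace=False), col] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "price_usd"] *= 3
df.to_csv("ca_housing_synth.csv", index=False)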
This is the most complex dataset we've generated so far. Let's see the first few rows of this dataset.
# Building a Portfolio Project from Synthetic Data
We used four different methods to create synthetic datasets. We'll use the AI-generated data to build a portfolio project.
First, we will explore the data and then build a machine learning model. Next, we will visualize the results with Streamlit by leveraging AI, and in the final step, we will cover which steps to follow to deploy the model to production.
// Step 1: Exploring and Understanding the Synthetic Dataset
We'll start exploring the data by reading it with pandas and showing the first few rows.
df = pd.read_csv("ca_housing_synth.csv")
df.head()
Here is the output.
The dataset includes location (city, neighborhood, latitude, longitude) and property details (size, rooms, year, condition), as well as the target price. Let's check the column names, shape, and data types using the info method.
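A quick version of that check:
print(df.shape)  # (rows, columns)
df.info()        # column names, non-null counts, and dtypes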
We have 15 columns, with some, like has_garage or dist_km_center, being quite specific.
// Step 2: Model Building
The next step is to build a machine learning model that predicts home prices.
We'll follow these steps:
- Define the numeric and categorical columns
- Split the data into training and test sets
- Build preprocessing pipelines for imputation, scaling, and encoding
- Train a Random Forest regressor inside a single pipeline
- Evaluate with MAE, RMSE, and R², check permutation importance, and plot actual vs. predicted prices
Here is the code.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# --- Step 1: Define columns based on the generated dataset
num_cols = ["sqft", "bedrooms", "bathrooms", "lot_sqft", "year_built",
            "school_score", "crime_index", "dist_km_center", "latitude", "longitude"]
cat_cols = ["city", "neighborhood", "property_type", "condition", "has_garage"]
# --- Step 2: Split the data
X = df.drop(columns=["price_usd"])
y = df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# --- Step 3: Preprocessing pipelines
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
# --- Step 4: Model
model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])
# --- Step 5: Train
pipeline.fit(X_train, y_train)
# --- Step 6: Evaluate
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²: {r2:.3f}")
# --- Step 7: (Optional) Permutation importance on a subset for speed
pi = permutation_importance(
    pipeline, X_test.iloc[:1000], y_test.iloc[:1000],
    n_repeats=3, random_state=42, scoring="r2"
)
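# (Assumed addition) Inspect the most influential input columns
# from the permutation importances computed above
pi_importances = pd.Series(pi.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(pi_importances.head(10))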
# --- Step 8: Plot actual vs. predicted
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.25)
vmin, vmax = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot([vmin, vmax], [vmin, vmax], linestyle="--", color="red")
plt.xlabel("Actual Price (USD)")
plt.ylabel("Predicted Price (USD)")
plt.title(f"Actual vs Predicted (MAE={mae:,.0f}, RMSE={rmse:,.0f}, R²={r2:.3f})")
plt.tight_layout()
plt.show()
Here is the output.
Model Performance:
- MAE (85,877 USD): On average, predictions are off by about $86K, which is reasonable given the variability in housing prices
- RMSE (113,512 USD): Larger errors are penalized more; the RMSE confirms the model handles sizable deviations fairly well
- R² (0.853): The model explains ~85% of the variance in home prices, showing strong predictive power for synthetic data
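The dashboard prompt in the next step refers to a saved pipeline file, real_estate_model.pkl. A minimal, assumed way to produce that file with joblib (the saving step is not shown above):
import joblib
# Persist the fitted preprocessing + model pipeline so the Streamlit app can load it
joblib.dump(pipeline, "real_estate_model.pkl")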
// Step 3: Visualize the Data
In this step, we will present our process, including the EDA and model building, in a Streamlit dashboard. Why Streamlit? Because you can build a Streamlit dashboard quickly and easily deploy it for others to view and interact with.
Using Gemini CLI
To build the Streamlit application, we will use Gemini CLI.
Gemini CLI is an AI-powered open-source command-line agent. You can write code and build applications with it. It's simple and free.
To install it, run the following command in your terminal.
npm install -g @google/gemini-cli
After installing, start it by running gemini in your terminal.
It will ask you to log in with your Google account, and then you'll see the screen where you'll build this Streamlit app.
Building a Dashboard
To build a dashboard, we need a prompt tailored to our specific data and goal. In the following prompt, we explain everything the AI needs to build a Streamlit dashboard.
Build a Streamlit app for the California Real Estate dataset by using this dataset ( path-to-dataset )
Here is the dataset information:
• Domain: California housing (Los Angeles, San Francisco, San Diego, San Jose, Sacramento).
• Location: city, neighborhood, lat, lon, and dist_km_center (haversine to city center).
• Home features: sqft, beds, baths, lot_sqft, year_built, property_type, has_garage, condition.
• Context: school_score, crime_index.
• Target: price_usd.
• Price logic: city premium + size + rooms + lot size + school/crime + distance to center + property type + condition + noise.
• Data you have: ca_housing_synth.csv (data) and real_estate_model.pkl (trained pipeline).
The Streamlit app should have:
• A short dataset overview section (shape, column list, small preview).
• Sidebar inputs for every model feature except the target:
- Categorical dropdowns: city, neighborhood, property_type, condition, has_garage.
- Numeric inputs/sliders: lat, lon, sqft, beds, baths, lot_sqft, year_built, school_score, crime_index.
- Auto-compute dist_km_center from the selected city using the haversine formula and that city's center.
• A Predict button that:
- Builds a one-row DataFrame with the exact training columns (order-safe).
- Calls pipeline.predict(...) from real_estate_model.pkl.
- Displays the Estimated Price (USD) with thousands separators.
• One chart only: a what-if sqft vs. price line chart (all other inputs fixed to the sidebar values).
- Quality of life: cache the model load, basic input validation, clear labels/tooltips, English UI.
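The prompt asks the app to auto-compute dist_km_center with the haversine formula. For reference, here is a minimal sketch of that computation (the city-center coordinates and sample inputs are illustrative assumptions):
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers between two (lat, lon) points
    r = 6371.0  # Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Assumed center for the selected city (here: San Francisco) and sample sidebar inputs
city_center = (37.7749, -122.4194)
lat, lon = 37.76, -122.45
dist_km_center = haversine_km(lat, lon, *city_center)
print(round(dist_km_center, 2))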
Next, Gemini will ask for your permission to create this file.
Let's approve and continue. Once it has finished coding, it will automatically open the Streamlit dashboard.
If not, go to the working directory of the app.py file and run streamlit run app.py to start the Streamlit app.
Here is our Streamlit dashboard.
If you click on the data overview, you can see a section representing the data exploration.
From the property features on the left-hand side, we can customize the property and make predictions accordingly. This part of the dashboard mirrors what we did in model building, but with a more responsive look.
Let's select Richmond, San Francisco, single-family, excellent condition, 1,500 sqft, and click the "Predict Price" button:
The predicted price is $1.24M. You can also see the actual vs. predicted price for the entire dataset in the second graph if you scroll down.
You can adjust more features in the left panel, like the year built, crime index, or the number of bathrooms.
// Step 4: Deploy the Model
The next step is moving your model to production. To do that, you can follow these steps:
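Assuming deployment to Streamlit Community Cloud (covered in the final thoughts), a typical path looks like this; the file names are the ones used earlier, and the repository details are yours to fill in:
- Put app.py, ca_housing_synth.csv, real_estate_model.pkl, and a requirements.txt (streamlit, pandas, numpy, scikit-learn, joblib) in one folder.
- Push that folder to a public GitHub repository.
- Sign in at share.streamlit.io, create a new app, select the repository and branch, and set app.py as the entry point.
- Deploy; Streamlit Community Cloud installs the requirements and serves the app at a shareable URL.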
# Final Thoughts
In this article, we've covered different methods for creating synthetic datasets, such as random, rule-based, simulation-based, and AI-powered generation. Next, we built a portfolio data project, starting with data exploration and building a machine learning model.
We also used an open-source command-line AI agent (Gemini CLI) to develop a dashboard that explores the dataset and predicts house prices based on selected features, including the number of bedrooms, crime index, and square footage.
Creating your own synthetic data lets you avoid privacy hurdles, balance your examples, and move fast without costly data collection. The downside is that it can mirror your assumptions and miss real-world quirks. If you're looking for more inspiration, check out this list of machine learning projects you can adapt for your portfolio.
Finally, we looked at how to push your model to production using Streamlit Community Cloud. Go ahead and follow these steps to build and showcase your portfolio project today!
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.