Beginner’s Guide to Machine Learning with Python (2024)


Image by Author

Predicting the future isn't magic; it's AI.

As we stand on the brink of the AI revolution, Python allows us to participate.

In this article, we’ll discover how you can use Python and Machine Learning to make predictions.

We’ll start with the fundamentals and work our way up to applying algorithms to data to make a prediction. Let’s get started!

What is Machine Learning?

Machine learning is a way of giving the computer the ability to make predictions. It is so popular now that you probably use it daily without noticing. Here are some technologies that benefit from Machine Learning:

  • Self Driving Cars
  • Face Detection System
  • Netflix Movie Recommendation System

However, the terms AI, Machine Learning, and Deep Learning are often confused with one another.
Here is a diagram that shows how those terms relate.

[Image: diagram of how AI, Machine Learning, and Deep Learning relate]

Classifying Machine Learning As a Beginner

Machine Learning algorithms can be grouped in two different ways. One of them depends on whether a 'label' is associated with the data points. In this context, a 'label' is the specific attribute or characteristic of the data points that you want to predict.

If there is a label, your algorithm is classified as a supervised algorithm; otherwise, it is an unsupervised algorithm.
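As a minimal sketch of this distinction, the example below fits one supervised and one unsupervised model on the same toy data. The data, the variable names, and the choice of LinearRegression and KMeans are assumptions made for illustration; they are not part of the project used later in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Hypothetical toy data: one feature, six data points
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Supervised: a label y is provided for every data point (here y = 2 * x)
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
supervised_model = LinearRegression().fit(X, y)
prediction = supervised_model.predict(np.array([[5.0]]))

# Unsupervised: only X is given, and the algorithm groups the points itself
unsupervised_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
cluster_ids = unsupervised_model.labels_

print(prediction)   # close to 10, since y = 2 * x
print(cluster_ids)  # one cluster id per data point
```

Notice that fit() receives both X and y in the supervised case, but only X in the unsupervised case.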

The other way is to group algorithms by the kind of task they perform, as scikit-learn does in the chart below.

[Image: scikit-learn's chart of machine learning algorithms]

Image source: scikit-learn.org

What is scikit-learn?

scikit-learn is the most famous machine learning library in Python, and we’ll use it in this article. With scikit-learn, you can skip defining algorithms from scratch and use its built-in functions instead, which makes building machine learning models much easier.

In this article, we’ll build a machine learning model using different regression algorithms from scikit-learn. Let’s first explain regression.

What is Regression?


Regression is a machine learning technique that predicts a continuous value, such as a price.

Now, before applying regression models, let’s look at three different regression algorithms with simple explanations:

  • Multiple Linear Regression: Predicts using a linear combination of multiple predictor variables.
  • Decision Tree Regressor: Creates a tree-like model of decisions to predict the value of a target variable based on several input features.
  • Support Vector Regression: Finds the best-fit line (or hyperplane in higher dimensions) with the maximum number of points within a certain distance.
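To make the first two descriptions more concrete, here is a small sketch on made-up data. The arrays and variable names are hypothetical and chosen only so the behavior is easy to see. The linear model recovers the linear combination that generated the data, while the decision tree predicts a constant value for each region it has learned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Multiple linear regression learns an intercept and one weight per predictor
linear = LinearRegression().fit(X, y)
print(linear.intercept_, linear.coef_)  # approximately 1 and [2, 3]

# A decision tree splits the input range and predicts a constant per region
X_tree = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y_tree = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
tree = DecisionTreeRegressor(random_state=42).fit(X_tree, y_tree)
tree_pred = tree.predict(np.array([[2.5], [11.5]]))
print(tree_pred)  # 1.0 for the first point, 5.0 for the second
```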

Before applying machine learning, you need to follow specific steps. These steps sometimes differ, but most of the time they include:

  • Data Exploration and Analysis
  • Data Manipulation
  • Train-test split
  • Building ML Model
  • Data Visualization

In this article, we’ll use a data project from our platform to predict price.


Data Exploration and Analysis

pandas provides several functions that help you become acquainted with your data.

But first of all, you should import the libraries that provide these functions.

import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor  # used for Decision Tree Regression below
from sklearn.svm import SVR  # used for Support Vector Regression below
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

Excellent, let’s load our data and explore it a little bit.

data = pd.read_csv('path')

Replace 'path' with the path of the file in your directory. pandas has three functions that will help you explore the data. Let’s apply them one by one and see the results.

Here is the code to see the first five rows of our dataset.

data.head()

Here is the output.

[Image: first five rows of the dataset]

Now, let’s use our second function to view information about our dataset’s columns.

data.info()

Here is the output.

RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   loc1    10000 non-null  object
 1   loc2    10000 non-null  object
 2   para1   10000 non-null  int64
 3   dow     10000 non-null  object
 4   para2   10000 non-null  int64
 5   para3   10000 non-null  float64
 6   para4   10000 non-null  float64
 7   price   10000 non-null  float64
dtypes: float64(3), int64(2), object(3)
memory usage: 625.1+ KB

Here is the last function, which will summarize our data statistically. Here is the code.

data.describe()

Here is the output.

[Image: summary statistics of the dataset]

Now, you are more familiar with our data. In machine learning, all your predictor variables, which means the columns you intend to use to make a prediction, should be numerical.
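One quick way to spot columns that are not yet numerical is to filter by data type. The sketch below uses a hypothetical three-row DataFrame that only borrows a few column names from our dataset; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset, using a few of its column names
sample = pd.DataFrame({
    "loc1": ["0", "2", "S"],      # digits stored as text, plus a stray letter
    "para1": [1, 3, 1],
    "dow": ["Mon", "Thu", "Sat"],
    "price": [150.0, 550.0, 300.0],
})

# Columns whose dtype is not numeric still need conversion before modeling
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Here the text columns loc1 and dow are reported, while para1 and price are already numerical.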

In the next section, we’ll make sure of that.

Data Manipulation

As we saw, the “dow” column must be converted to numbers. Before that, let’s check whether the other columns consist only of numbers, for the sake of our machine learning models.

Besides “dow”, two columns are suspect: loc1 and loc2. As you can see from the output of the info() function, they have the object data type, which can hold both numerical and string values.

Let’s use this code to check:

data["loc1"].value_counts()

Here is the output.

loc1
2    1607
0    1486
1    1223
7    1081
3     945
5     846
4     773
8     727
9     690
6     620
S       1
T       1
Name: count, dtype: int64

Two rows contain the letters S and T instead of numbers. Using the following code, you can eliminate those rows.

data = data[(data["loc1"] != "S") & (data["loc1"] != "T")]

However, we must ensure that the other column, loc2, does not contain string values. Let's use the following code to ensure that all values are numerical.

data["loc2"] = pd.to_numeric(data["loc2"], errors='coerce')
data["loc1"] = pd.to_numeric(data["loc1"], errors='coerce')
data.dropna(inplace=True)

At the end of the code above, we use the dropna() function because pandas’ to_numeric() function, when called with errors='coerce', converts non-numerical values to NaN.
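To see this conversion in isolation, here is a small sketch on a hypothetical Series that mixes numeric strings with stray letters, similar in spirit to what we found in loc1. The values and variable names are made up.

```python
import pandas as pd

# Hypothetical column that mixes numeric strings with stray letters
raw = pd.Series(["0", "7", "S", "3", "T"])

# Values that cannot be parsed become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
print(converted.isna().sum())  # number of values that could not be parsed

# dropna() then removes those entries
cleaned = converted.dropna()
print(cleaned.tolist())
```

The two letters turn into NaN and are dropped, leaving only the three numeric values.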

Excellent. With this issue solved, let’s convert the weekday column into numbers. Here is the code to do that:

# Assuming data is already loaded and 'dow' column contains day names
# Map 'dow' to numeric codes
days_of_week = {'Mon': 1, 'Tue': 2, 'Wed': 3, 'Thu': 4, 'Fri': 5, 'Sat': 6, 'Sun': 7}
data['dow'] = data['dow'].map(days_of_week)

# Invert the days_of_week dictionary
week_days = {v: k for k, v in days_of_week.items()}

# Convert dummy variable columns to integer type
dow_dummies = pd.get_dummies(data['dow']).rename(columns=week_days).astype(int)

# Drop the original 'dow' column
data.drop('dow', axis=1, inplace=True)

# Concatenate the dummy variables
data = pd.concat([data, dow_dummies], axis=1)
data.head()

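In this code, we first assign a number to each day in a dictionary and replace the day names with those numbers. We then use get_dummies() to turn the column into seven indicator columns, one per day, rename them back to the day names, and attach them to the data in place of the original “dow” column. Here is the output.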

[Image: first five rows of the data with the day-of-week columns]
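If the dummy-variable step is new to you, the sketch below applies get_dummies() to a tiny hypothetical Series of day names. Each row ends up with a 1 in the column for its day and a 0 everywhere else.

```python
import pandas as pd

# Hypothetical day-of-week values
days = pd.Series(["Mon", "Tue", "Mon", "Sun"], name="dow")

# One indicator column per distinct day, cast from booleans to 0/1 integers
dummies = pd.get_dummies(days).astype(int)
print(dummies)
```

Applied directly to day names, get_dummies() orders the columns alphabetically (Mon, Sun, Tue). Mapping the days to numbers first, as the code above does, keeps the columns in weekday order instead.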

Now, we are almost there.

Train-Test Split

Before applying a machine learning model, you must split your data into training and test sets. This allows you to objectively assess your model's performance by training it on the training set and then evaluating it on the test set, which the model has not seen before.

X = data.drop('price', axis=1)  # Assuming 'price' is the target variable
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
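To check what test_size=0.2 does, here is a quick sketch on ten hypothetical rows. The arrays are made up for illustration, and the variable names are chosen so they do not clash with the ones used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical rows with two features each, and ten targets
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

With test_size=0.2, 20% of the rows go to the test set and the remaining 80% stay in the training set. Fixing random_state makes the shuffle reproducible between runs.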

Building Machine Learning Model

Now everything is ready. At this stage, we’ll apply the following algorithms at once.

  • Multiple Linear Regression
  • Decision Tree Regression
  • Support Vector Regression

If you are a beginner, this code might seem complicated, but rest assured, it is not. In the code, we first assign model names and their corresponding scikit-learn classes to the models dictionary.

Next, we create an empty dictionary called results to store the results. In the first loop, we fit each machine learning model in turn and evaluate it using metrics such as R^2 and MSE, which assess how well the algorithms perform.

In the final loop, we print out the results that we have saved. Here is the code.

# Initialize the models
models = {
    "Multiple Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=42),
    "Support Vector Regression": SVR()
}

# Dictionary to store the results
results = {}

# Fit the models and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on the test set

    # Calculate performance metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store results
    results[name] = {'MSE': mse, 'R^2 Score': r2}

# Print the results
for model_name, metrics in results.items():
    print(f"{model_name} - MSE: {metrics['MSE']}, R^2 Score: {metrics['R^2 Score']}")

Here is the output.

Multiple Linear Regression - MSE: 35143.23011545407, R^2 Score: 0.5825954700994046
Decision Tree Regression - MSE: 44552.00644904675, R^2 Score: 0.4708451884787034
Support Vector Regression - MSE: 73965.02477382126, R^2 Score: 0.12149975134965318
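To build intuition for these two metrics, the sketch below computes MSE and R^2 by hand on four hypothetical values. The numbers are made up so the arithmetic is easy to follow; they are not taken from our dataset.

```python
from math import sqrt

# Hypothetical true values and predictions
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 310.0, 390.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Sum of squared prediction errors, and total variation around the mean
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse_by_hand = ss_res / n          # average squared error
rmse_by_hand = sqrt(mse_by_hand)  # same units as the target
r2_by_hand = 1 - ss_res / ss_tot  # share of variation explained

print(mse_by_hand, rmse_by_hand, r2_by_hand)  # 100.0, 10.0, and about 0.992
```

A lower MSE means predictions are closer to the true values, while an R^2 closer to 1 means the model explains more of the variation in the target. This is why Multiple Linear Regression, with the lowest MSE and the highest R^2, comes out ahead in the output above.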

Data Visualization

To see the results better, let’s visualize the output.

Here is the code where we first calculate RMSE (square root of MSE) and visualize the output.

import matplotlib.pyplot as plt
from math import sqrt

# Calculate RMSE for each model from the stored MSE and prepare for plotting
rmse_values = [sqrt(metrics['MSE']) for metrics in results.values()]
model_names = list(results.keys())

# Create a horizontal bar graph for RMSE
plt.figure(figsize=(10, 5))
plt.barh(model_names, rmse_values, color='skyblue')
plt.xlabel('Root Mean Squared Error (RMSE)')
plt.title('Comparison of RMSE Across Regression Models')
plt.show()

Here is the output.

[Image: horizontal bar chart comparing RMSE across the regression models]

Data Projects

Before wrapping up, here are a few data projects to start.

Also, if you want to do data projects with interesting datasets, here are a few that might interest you:

Conclusion

Our results could be better, since there are many additional steps that can improve the model's performance, but we made a great start here. Check out scikit-learn's official documentation to see what else you can do.
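As one example of such a step, Support Vector Regression is generally sensitive to the scale of its input features, so scaling them first is a common thing to try. The sketch below only shows the mechanics on synthetic data. The data, the variable names, and the choice of StandardScaler are assumptions, and we have not measured whether this improves the scores on our dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data for illustration only (not the article's dataset)
rng = np.random.default_rng(42)
X_synth = rng.uniform(0, 1000, size=(200, 3))
y_synth = 0.5 * X_synth[:, 0] + rng.normal(0, 10, size=200)

# The pipeline scales the features first and then fits the SVR on them
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_synth, y_synth)

preds = scaled_svr.predict(X_synth[:5])
print(preds.shape)  # one prediction per input row
```

Because the scaler lives inside the pipeline, the same scaling learned during fit() is applied automatically whenever you call predict().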

Of course, after learning, you need to do data projects repeatedly to improve your capabilities and learn a few more things.

Nate Rosidi is a data scientist who also works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.

