Logistic Regression, despite its name, is not a regression model in the traditional sense. It is a classification algorithm, primarily used for binary classification tasks. Imagine a scenario where we are trying to predict whether a student will pass or fail an exam based on their study hours. The outcome here is binary: pass (1) or fail (0). Logistic Regression provides a probabilistic framework that helps us make such predictions.
At the heart of this methodology lies the logistic function, also known as the sigmoid function, which transforms any real-valued number into a value between 0 and 1. This function is defined as:
import numpy as np

def sigmoid(z):
    # Squash any real-valued input into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))
Here, z represents the linear combination of the input features plus a bias term. The output of the sigmoid function indicates the probability that a given instance belongs to the positive class (1). If the output probability is greater than 0.5, we classify the instance as belonging to class 1; otherwise, we classify it as class 0.
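As a small illustration of this decision rule, the following sketch (with invented values of z) evaluates the sigmoid at a few points and shows that the 0.5 boundary corresponds to z = 0.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A few example values of z, invented purely for illustration
z_values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probabilities = sigmoid(z_values)
predicted_classes = (probabilities > 0.5).astype(int)

for z, p, c in zip(z_values, probabilities, predicted_classes):
    print(f'z = {z:+.1f}  ->  P(class 1) = {p:.3f}  ->  predicted class {c}')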
Mathematically, the model estimates P(y = 1 | x) = sigmoid(w · x + b), where w is the weight vector and b is the bias. Translating this into code, a from-scratch implementation trained with gradient descent looks like this:
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.weights = np.zeros(num_features)
        self.bias = 0

        for _ in range(self.num_iterations):
            # Linear combination of inputs, squashed into probabilities
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self.sigmoid(linear_model)

            # Gradients of the loss with respect to weights and bias
            dw = (1 / num_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / num_samples) * np.sum(y_predicted - y)

            # Gradient descent update
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
In this implementation, we initialize weights to zero and iteratively adjust them based on the gradients of the loss function, which quantifies how well our model is performing. The goal is to minimize this loss function, thereby optimizing the weights and bias.
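As a quick sanity check, here is a small usage sketch for the class above on a tiny made-up dataset. The numbers are invented purely for illustration, and since the class defines no predict method, the probabilities are computed directly from the learned weights and bias.

import numpy as np

# Toy data: study hours and pass/fail outcomes, invented for illustration
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(learning_rate=0.1, num_iterations=5000)
clf.fit(X_toy, y_toy)

# Predicted probabilities should rise with study hours
probabilities = clf.sigmoid(np.dot(X_toy, clf.weights) + clf.bias)
print(np.round(probabilities, 2))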
The logistic regression model assumes that the log-odds of the dependent variable can be expressed as a linear combination of the independent variables. This relationship is what gives logistic regression its power and flexibility, allowing it to model complex relationships in data while maintaining interpretability.
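To see the log-odds relationship numerically, the short sketch below (using made-up weights and feature values) shows that taking the log-odds of the sigmoid output recovers the linear combination exactly.

import numpy as np

# Made-up parameters and one instance (study hours, previous score)
w = np.array([0.8, 0.05])
b = -5.0
x = np.array([5.0, 80.0])

z = np.dot(w, x) + b             # linear combination of the features
p = 1 / (1 + np.exp(-z))         # predicted probability of passing
log_odds = np.log(p / (1 - p))   # recovers z, up to floating-point error

print(f'z = {z:.3f}, P(pass) = {p:.3f}, log-odds = {log_odds:.3f}')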
Preparing the Data
Before we delve into the depths of building our logistic regression model, we must first navigate the essential terrain of data preparation. This stage is akin to laying the groundwork for a grand architectural masterpiece; without a solid foundation, no structure can stand tall and proud. In the context of logistic regression, this means ensuring our data is clean, appropriately formatted, and suitable for the task at hand.
One of the first steps in preparing our data is to handle any missing values. Incomplete datasets can lead to biased results and unreliable predictions. We can approach this problem in several ways: removing rows with missing values, filling them with the mean or median of the column, or employing more sophisticated imputation techniques. For this article, let’s consider a simpler approach where we fill missing values with the mean:
import pandas as pd

# Load the dataset
data = pd.read_csv('student_data.csv')

# Fill missing values in numeric columns with the respective column mean
data.fillna(data.mean(numeric_only=True), inplace=True)
Next, we must ensure that our features are appropriately scaled. Logistic regression is sensitive to the scale of the input data, as it utilizes gradient descent for optimization. If one feature has a much larger scale than others, it can disproportionately influence the model. Standardization or normalization of features is often employed to mitigate this issue. Here’s how we can standardize our features to have a mean of 0 and a standard deviation of 1:
from sklearn.preprocessing import StandardScaler

# Standardize the numeric input features to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[['study_hours', 'previous_scores']])
Moreover, categorical variables often find their way into our datasets, and logistic regression requires numerical input. Thus, we must convert these categorical variables into a format that our model can understand. This process is known as encoding. One popular method is one-hot encoding, which creates binary columns for each category. For instance, if we have a categorical variable ‘gender’ with values ‘male’ and ‘female’, we can apply one-hot encoding as follows:
# One-hot encoding for categorical variables
data = pd.get_dummies(data, columns=['gender'], drop_first=True)
After these transformations, we arrive at a dataset that is not only clean but also well-prepared for training our logistic regression model. We also keep the features and the target variable separate, since the model expects them as distinct inputs when fitting:
# Separating features and target variable
X = data.drop('pass_fail', axis=1)  # 'pass_fail' is our target column
y = data['pass_fail']
Building the Logistic Regression Model
In this phase, we shall embark on the exciting journey of building our logistic regression model. Once the data has been meticulously prepared, we can turn our attention to the implementation of the model itself. Using scikit-learn, a powerful library that offers a rich suite of tools for machine learning, we can streamline this process and focus on the essence of logistic regression.
The construction of our model is surprisingly intuitive. With scikit-learn, we can leverage the built-in LogisticRegression class, which encapsulates all the necessary functionality we require. Let's start by importing the library and initializing our model:
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model
# (hyperparameters such as solver, C, etc. can be specified here)
model = LogisticRegression()
Next, we shall fit our model to the training data. This fitting process is where the model learns from the input features and the corresponding target labels. It’s akin to teaching a child how to recognize patterns based on examples provided. In our case, we are teaching the model how to discern whether a student passes or fails based on their study hours and previous scores:
# Fitting the model
model.fit(X, y)
Upon fitting the model, we can now utilize it to make predictions. That’s the moment where the magic happens—a transformation from abstract data into concrete classifications. Given a new set of study hours and previous scores, we can predict the likelihood of passing the exam:
import numpy as np

# Example new data for prediction:
# 5 hours of study, previous score of 80, gender encoded as 1
new_data = np.array([[5, 80, 1]])

# Making predictions
predictions = model.predict(new_data)
predicted_probabilities = model.predict_proba(new_data)

print(f'Predicted class: {predictions[0]}')
print(f'Predicted probabilities: {predicted_probabilities[0]}')
In the above code, predict gives us the predicted class, while predict_proba reveals the probabilities associated with each class. It is this probabilistic output that embodies the inherent beauty of logistic regression, allowing us to gauge not just the outcome, but the confidence with which the model makes its predictions.
Moreover, we can delve deeper into the model’s parameters—specifically the learned weights. These weights represent the influence of each feature on the final prediction. In a sense, they encapsulate the relationship between study hours, previous scores, gender, and the likelihood of passing:
# Inspecting the coefficients
coefficients = model.coef_
intercept = model.intercept_

print(f'Coefficients: {coefficients}')
print(f'Intercept: {intercept}')
This insight into the coefficients can be illuminating, as it allows us to interpret the model’s decision-making process. A positive coefficient indicates a direct relationship between the feature and the likelihood of passing, while a negative coefficient suggests an inverse relationship. Thus, we can understand the dynamics at play in our model and engage in meaningful discussions about the factors influencing student performance.
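One convenient way to read these coefficients, sketched below, is to exponentiate them: because the model is linear in the log-odds, exp(coefficient) is the multiplicative change in the odds of passing for a one-unit increase in that feature. The feature names listed here are assumptions based on the example dataset used throughout this article.

import numpy as np

# Assumed feature order after preprocessing: study hours, previous scores,
# and the one-hot encoded gender column
feature_names = ['study_hours', 'previous_scores', 'gender_male']

for name, coef in zip(feature_names, model.coef_[0]):
    print(f'{name}: coefficient = {coef:+.3f}, odds ratio = {np.exp(coef):.3f}')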
Evaluating Model Performance
As we transition from the art of constructing our logistic regression model into the realm of evaluating its performance, we find ourselves at an important juncture. The elegance of our model is not merely in its ability to predict; it lies equally in how well it performs against the backdrop of reality. To appreciate this, we must delve into the metrics that illuminate the efficacy of our model, revealing whether our endeavor has borne fruit or if we are merely engaged in an exercise of futility.
In the context of classification, the confusion matrix serves as a foundational tool, offering a visual representation of the model's performance. It delineates the true positives, true negatives, false positives, and false negatives; each category is a vital piece of the puzzle that helps us understand how our model is making its predictions. We can construct this matrix using scikit-learn's confusion_matrix function:
from sklearn.metrics import confusion_matrix

# Predictions on the training set (for illustration)
y_pred = model.predict(X)

cm = confusion_matrix(y, y_pred)
print('Confusion Matrix:')
print(cm)
This matrix will allow us to ascertain not only how many predictions were correct but also how many were misclassified. It is here that we can begin to derive other vital metrics, such as accuracy, precision, recall, and the F1 score, each offering a unique perspective on our model's performance.
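Before reaching for the convenience functions, it can help to see how these metrics fall out of the confusion matrix cells directly. The sketch below continues from the cm computed above and assumes the binary layout returned by scikit-learn, where rows are true classes and columns are predicted classes.

# Unpack the binary confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # all correct / all predictions
precision = tp / (tp + fp)                          # correct positives / predicted positives
recall = tp / (tp + fn)                             # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f'Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, '
      f'Recall: {recall:.2f}, F1: {f1:.2f}')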
In practice, scikit-learn computes each of these for us. Accuracy, the simplest of these metrics, is calculated as follows:
from sklearn.metrics import accuracy_score

# Proportion of correct predictions
accuracy = accuracy_score(y, y_pred)
print(f'Accuracy: {accuracy:.2f}')
However, accuracy can be misleading, especially in datasets where the classes are imbalanced. This is where precision and recall come into play. Precision reflects the proportion of true positive predictions against all positive predictions, while recall (also known as sensitivity) measures the model’s ability to capture all relevant cases:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y, y_pred)  # correct positives / all predicted positives
recall = recall_score(y, y_pred)        # true positive rate

print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
The F1 score, a harmonic mean of precision and recall, provides a single metric that balances both concerns, particularly useful when we seek a compromise between false positives and false negatives:
from sklearn.metrics import f1_score

# Balance between precision and recall
f1 = f1_score(y, y_pred)
print(f'F1 Score: {f1:.2f}')
Yet, the evaluation does not conclude here. Delving deeper, we can also explore the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC), which together offer a nuanced perspective on our model’s performance across various threshold levels. The ROC curve illustrates the trade-off between the true positive rate and the false positive rate, while the AUC encapsulates the model’s ability to distinguish between the classes:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Probability estimates for the positive class
y_prob = model.predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, y_prob)
auc = roc_auc_score(y, y_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line: a random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()