Building a Simple Logistic Regression Model from Scratch with NumPy

Jul 15, 2024

Logistic regression is a fundamental algorithm used for binary classification problems where the outcome belongs to one of two classes. It predicts the probability that a given input falls into a certain class. Unlike linear regression, which predicts continuous outcomes, logistic regression maps input features to probabilities between 0 and 1 using the sigmoid function.

For more details, see the source code on GitHub in the elcaiseri/Machine-Learning-from-Scratch repository (Machine Learning using NumPy).

The Sigmoid Function

At the heart of logistic regression is the sigmoid function. This function converts any real-valued number into a value between 0 and 1, representing the probability of the input belonging to the positive class. The formula for the sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
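The article's `_sigmoid` method is not shown, so here is a minimal sketch of how the function could be written in NumPy. The name `sigmoid` and the overflow-avoiding rewrite are assumptions for illustration, not the repository's code. Evaluating `exp(-|z|)` keeps the exponent non-positive, so very large inputs do not overflow.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid for scalars or arrays, written to avoid overflow in exp."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it cannot overflow.
    e = np.exp(-np.abs(z))
    # For z >= 0 use 1 / (1 + exp(-z)); for z < 0 use exp(z) / (1 + exp(z)).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

# Large negative inputs map close to 0, large positive inputs close to 1.
print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```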

The Logistic Regression Model

In logistic regression, we model the relationship between the input features (independent variables) and the binary outcome (dependent variable) by taking a linear combination of the input features and passing it through the sigmoid function. Mathematically, this can be represented as:

$$\hat{y} = \sigma(Xw + b)$$

Here:

X: the input features.
w: the weights associated with the features.
b: the bias term.
σ: the sigmoid function.
ŷ: the predicted probability of the positive class.
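To make the shapes concrete, here is a small sketch of the model applied to three samples with two features each. The helper name `predict_proba` and the numbers are made up for illustration. For a 1-D weight vector, `X @ w` gives the same result as the `self.w @ X.T` form used in the article's snippets.

```python
import numpy as np

def predict_proba(X, w, b):
    """Return the predicted probability of the positive class for each row of X."""
    z = X @ w + b                      # linear combination, shape (n_samples,)
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid maps each value into (0, 1)

# Hypothetical data: 3 samples, 2 features
X = np.array([[0.5, 1.5],
              [2.0, -1.0],
              [0.0, 0.0]])
w = np.array([1.0, -2.0])
b = 0.5

p = predict_proba(X, w, b)
print(p.shape)  # one probability per sample

# The two ways of writing the linear combination agree.
same = np.allclose(X @ w + b, w @ X.T + b)
```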

Training the Model

Training a logistic regression model involves finding the optimal weights w and bias b that minimize the error in the predictions. This is typically done using a method called gradient descent. Gradient descent is an iterative optimization algorithm used to minimize a function by adjusting the parameters in the opposite direction of the gradient of the function with respect to those parameters.
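Before applying it to logistic regression, the idea can be seen on a one-dimensional function. The sketch below minimizes f(x) = (x - 3)^2, whose gradient is 2(x - 3). The function name `gradient_descent` and its default values are illustrative assumptions.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_iters=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_iters):
        x = x - learning_rate * grad(x)  # move opposite to the gradient
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3, and its gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

Each step here shrinks the distance to the minimum by a constant factor, so the result ends up very close to 3.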

Gradient Descent in Logistic Regression

The key steps in training a logistic regression model using gradient descent are:

1. Initialize Parameters: Start with initial guesses for the weights and bias, usually set to zero.

self.w = np.zeros(m)  # Initialize weights
self.b = 0  # Initialize bias

2. Compute Predictions: Use the current weights and bias to compute the predicted probabilities for the input data using the sigmoid function.

lr = self.w @ X.T + self.b  # linear combination, one value per sample
yhat = self._sigmoid(lr)  # predicted probabilities

3. Calculate the Loss: Measure the difference between the predicted probabilities and the actual outcomes using a loss function. For logistic regression, the common loss function is the binary cross-entropy loss.

4. Compute Gradients: Calculate the gradient of the loss function with respect to each parameter. The gradients indicate how much the loss would change if the parameter were adjusted slightly.

db = np.sum(yhat - y) / n  # gradient of the loss w.r.t. the bias (n = number of samples)
dw = (yhat - y) @ X / n  # gradient of the loss w.r.t. the weights

5. Update Parameters: Adjust the weights and bias by moving them in the opposite direction of their respective gradients, scaled by a learning rate. This step is repeated for a specified number of iterations or until the loss converges.

self.b -= self.learning_rate * db
self.w -= self.learning_rate * dw
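The five steps above can be put together in one loop. The following is a sketch under stated assumptions, not the repository's class: `fit_logistic`, `binary_cross_entropy`, the default hyperparameters, and the toy data are all made up for illustration. The loss is the binary cross-entropy, L = -(1/n) Σ [y log(ŷ) + (1 - y) log(1 - ŷ)], with predictions clipped so that log(0) never occurs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, yhat, eps=1e-12):
    """Mean binary cross-entropy; yhat is clipped to avoid log(0)."""
    yhat = np.clip(yhat, eps, 1 - eps)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def fit_logistic(X, y, n_iters=1000, learning_rate=0.1):
    n, m = X.shape
    w = np.zeros(m)                                   # step 1: initialize
    b = 0.0
    losses = []
    for _ in range(n_iters):
        yhat = sigmoid(X @ w + b)                     # step 2: predictions
        losses.append(binary_cross_entropy(y, yhat))  # step 3: loss
        dw = (yhat - y) @ X / n                       # step 4: gradients
        db = np.sum(yhat - y) / n
        w -= learning_rate * dw                       # step 5: update
        b -= learning_rate * db
    return w, b, losses

# Toy data (hypothetical): one feature, label 1 when the feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)

w, b, losses = fit_logistic(X, y)
train_acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(losses[0], losses[-1], train_acc)
```

With zero-initialized parameters every prediction starts at 0.5, so the first recorded loss equals log(2); tracking the loss this way is a simple check that training is making progress.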

Making Predictions

After training the model, making predictions on new data involves computing the linear combination of the input features using the learned weights and bias, applying the sigmoid function to get predicted probabilities, and converting these probabilities into class labels based on a chosen threshold (typically 0.5).

def predict(self, X):
    lr = self.b + self.w @ X.T
    return self._sigmoid(lr) > self.th
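The effect of the threshold can be seen on a few hand-picked probabilities. This is an illustrative sketch: `predict_labels` is a hypothetical helper, and it returns integers, whereas the `predict` method above returns booleans (which still compare correctly against 0/1 labels).

```python
import numpy as np

def predict_labels(proba, threshold=0.5):
    """Turn probabilities into 0/1 labels using a strict > comparison."""
    return (np.asarray(proba) > threshold).astype(int)

proba = np.array([0.10, 0.45, 0.55, 0.90])

labels_default = predict_labels(proba)       # threshold 0.5
labels_strict = predict_labels(proba, 0.8)   # higher threshold, fewer positives
print(labels_default, labels_strict)
```

Raising the threshold makes the model more conservative about predicting the positive class, which typically trades recall for precision.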

Evaluating the Model

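To assess the performance of a logistic regression model, metrics like accuracy, precision, recall, and the F1 score are commonly used. Accuracy measures the overall fraction of correct predictions, while precision, recall, and the F1 score describe how well the positive class is identified, which matters most when the classes are imbalanced.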
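These metrics can be computed directly from the counts of true/false positives and negatives. The sketch below uses only NumPy; the function name `classification_metrics`, the choice to return 0.0 when a denominator is zero, and the example labels are assumptions for illustration.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.sum(y_true & y_pred)     # predicted positive, actually positive
    fp = np.sum(~y_true & y_pred)    # predicted positive, actually negative
    fn = np.sum(y_true & ~y_pred)    # predicted negative, actually positive
    tn = np.sum(~y_true & ~y_pred)   # predicted negative, actually negative

    accuracy = (tp + tn) / y_true.size
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 2 true positives, 1 false positive,
# 1 false negative, 2 true negatives.
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```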

Practical Application

Let’s put logistic regression into practice by predicting whether a tumor is malignant or benign using the breast cancer dataset. Here’s a step-by-step implementation:

if __name__ == "__main__":
    from sklearn.model_selection import train_test_split as tts
    from sklearn import datasets as ds

    # Load the dataset and hold out 20% of it for testing
    np.random.seed(1234)
    bc = ds.load_breast_cancer()
    X, y = bc.data, bc.target
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=1234)

    # Model (the arguments are presumably the number of iterations,
    # the learning rate, and the decision threshold)
    lr = LogisticRegression(1000, 0.01, 0.5)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

    # Evaluate
    acc = (y_test == y_pred).mean()
    print("Logistic Regression Test Accuracy:", acc)

Conclusion

Logistic regression is a powerful algorithm for binary classification problems. By understanding its mathematical foundation and the steps involved in training the model, you can appreciate how logistic regression makes predictions and why it is widely used in various fields, from medical diagnostics to marketing. Building a logistic regression model from scratch enhances your understanding and equips you to implement more complex machine learning models.

Experiment with different datasets, tweak parameters, and explore advanced techniques like regularization to prevent overfitting. The more you practice, the more proficient you will become in leveraging logistic regression to solve real-world problems. Happy learning!