How is maximum likelihood estimation used in machine learning?

Maximum likelihood estimation (MLE) is a probability-based approach to determining the values of model parameters. Parameters can be thought of as a blueprint for the model, since the algorithm's behaviour is driven by them. MLE is a widely used technique in machine learning, time series analysis, panel data, and discrete data. The goal of MLE is to find the parameter values that make the observed data most probable. Here are the topics to be discussed.

Contents

  1. What is likelihood?
  2. How maximum likelihood estimation works
  3. Maximum Likelihood Estimation in Machine Learning

To understand the concept of Maximum Likelihood Estimation (MLE), you must first understand the concept of likelihood and how it relates to probability.

What is likelihood?

The likelihood function measures how well the data support different values of a parameter. It indicates how probable it is that a particular population would produce the observed sample. For example, if we evaluate the likelihood function at two parameter values and find that the likelihood at the first is greater than at the second, we can interpret the first value as more plausible than the second; loosely speaking, likelihood tells us which hypothesis best explains the observed result. Both frequentist and Bayesian analyses make use of the likelihood function. Note that the likelihood function is not the same thing as a probability density function.
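As a quick illustration, here is a minimal sketch of comparing two candidate parameter values. The five-point sample is invented for this example and SciPy's norm.pdf is assumed for the density; neither comes from the article itself.

import numpy as np
from scipy.stats import norm

# Invented sample, assumed drawn from a normal distribution.
sample = np.array([68.0, 71.5, 69.8, 72.3, 70.1])

# Likelihood of the sample under two candidate means (sd fixed at 2.5):
# the product of the density evaluated at every observation.
lik_mu_70 = np.prod(norm.pdf(sample, loc=70, scale=2.5))
lik_mu_75 = np.prod(norm.pdf(sample, loc=75, scale=2.5))
print(lik_mu_70, lik_mu_75)  # the larger value marks the more plausible mean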

Difference Between Probability and Likelihood

Likelihood describes how well a distribution, i.e. a particular setting of its parameters, explains the observed data, while probability describes the chance of an outcome given a fixed distribution. Let us understand the difference between probability and likelihood with the help of an example.

Consider a data set containing the weight of customers. Let’s say the mean of the data is 70 and the standard deviation is 2.5.

When a probability is calculated for any situation using this dataset, the mean and standard deviation are held constant. Suppose the probability of weight > 70 kg is to be calculated for a random record in the dataset; the equation will then contain the weight value together with the fixed mean and standard deviation. Considering the same dataset, if we now need to calculate the probability of weight > 100 kg, only the weight part of the equation changes and the rest stays the same.

But in the case of likelihood, the roles in the equation are reversed: the observed data are held fixed, while the parameters, i.e. the mean and standard deviation of the distribution, are varied to find the values that maximize the likelihood of the observed weights.
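A minimal sketch of both calculations, assuming the normal distribution with mean 70 and standard deviation 2.5 from the example. SciPy's norm is used for the density, and the observed weight of 72 kg is invented for illustration.

import numpy as np
from scipy.stats import norm

mean, sd = 70, 2.5  # fixed parameters of the weight distribution

# Probability: parameters fixed, outcome varies.
print(norm.sf(70, loc=mean, scale=sd))    # P(weight > 70)  = 0.5
print(norm.sf(100, loc=mean, scale=sd))   # P(weight > 100) ~ 0

# Likelihood: data fixed, parameters vary. For an observed weight of
# 72 kg, scan candidate means and keep the one with the highest density.
observed = 72.0
candidate_means = np.linspace(60, 80, 201)
densities = norm.pdf(observed, loc=candidate_means, scale=sd)
print(candidate_means[np.argmax(densities)])  # 72.0, the ML estimate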


How maximum likelihood estimation works

The goal of MLE is to find the parameter values that maximize the likelihood of the observed data. Let's understand this with an example. Consider a binary classification problem in which we need to classify data into two categories, 0 or 1, based on a single feature called "salary".

MLE computes, for each data point, the probability of it belonging to class 1 given its salary, and uses those probabilities to rank the points as 0 or 1. It then adjusts the parameters and repeats the computation until the line of best fit is found. This process is known as likelihood maximization.

In this scenario there is a threshold of 0.5: if the predicted probability is higher than the threshold, the point is labeled 1, otherwise 0. A toy sketch of this idea follows, after which we will see how MLE is used for classification in practice.
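The sketch below makes the procedure concrete. It is a toy version only: the six salary/label pairs are invented, and a crude grid search stands in for the numerical optimizers a real library would use.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented toy data: label 1 = positive class, 0 = negative class.
salary = np.array([20.0, 35.0, 50.0, 65.0, 80.0, 95.0])
label = np.array([0, 0, 0, 1, 1, 1])

def neg_log_likelihood(w, b):
    p = sigmoid(w * salary + b)          # P(y = 1 | salary)
    p = np.clip(p, 1e-12, 1 - 1e-12)     # avoid log(0)
    return -np.sum(label * np.log(p) + (1 - label) * np.log(1 - p))

# Grid search over (w, b): the pair with the smallest negative
# log-likelihood is the maximum likelihood estimate.
w_best, b_best = min(((w, b) for w in np.linspace(0.01, 1.0, 50)
                             for b in np.linspace(-40.0, 0.0, 50)),
                     key=lambda wb: neg_log_likelihood(*wb))

# Classification: probability above the 0.5 threshold -> class 1.
p = sigmoid(w_best * salary + b_best)
print((p >= 0.5).astype(int))  # recovers the labels 0 0 0 1 1 1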

Maximum Likelihood Estimation in Machine Learning

MLE is the basis of many supervised learning models, one of which is logistic regression. Logistic regression uses the maximum likelihood technique to classify data. Let's see how. Specific MLE procedures have the advantage of being able to exploit the structure of the estimation problem for better efficiency and numerical stability, and they can often produce explicit confidence intervals. The "solver" parameter of logistic regression selects among different optimization strategies for the underlying MLE problem.
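As a small sketch of that parameter (using synthetic data from make_classification rather than the article's CSV, which is loaded further below), each solver is a different numerical strategy for maximizing the same log-likelihood, so the fitted models should largely agree:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for the comparison.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for solver in ["lbfgs", "liblinear", "newton-cg", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X, y)
    print(solver, clf.score(X, y))  # near-identical training accuracy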

Import Library:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

Read data:

df=pd.read_csv("Social_Network_Ads.csv")
df.head()

The data relates to advertisements on social networks and records the gender, age and estimated salary of the network's users. Gender is a categorical column that needs to be label encoded before the data is passed to the learner.

Data encoding:

le = preprocessing.LabelEncoder()
df['gender'] = le.fit_transform(df['Gender'])  # Female/Male -> 0/1 integer codes

The encoded results are stored in a new feature called "gender" so that the original "Gender" column remains unchanged. Now split the data into training and test sets to train and validate the learner.

Split data:

X = df.drop(['Purchased', 'Gender'], axis=1)  # keep the encoded 'gender', drop the raw column and the target
y = df['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

The data is split in a 70:30 ratio, as is conventional.

Fitting the learner:

lr = LogisticRegression(max_iter=100, solver="lbfgs")  # lbfgs iteratively maximizes the log-likelihood
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
df_pred = pd.merge(X_test, pd.DataFrame(lr_pred, columns=['predicted']),
                   left_index=True, right_index=True)

The predicted results are added to the test dataset under the “predicted” feature.

Draw the learner line:

sns.regplot(x="Age", y='predicted', data=df_pred, logistic=True, ci=None)

In the plot above, which shows the "Age" feature against the prediction, the learner's line is formed using the principle of maximum likelihood estimation, and this is what lets the logistic regression model classify the results. Behind the scenes, the algorithm computes, for each data point, the probability of observing "1" given its age, and from that the probability of observing "0". It does this for every data point and multiplies all of the resulting probabilities together. The parameters are adjusted until this product, the likelihood, is maximized, which is when the line of best fit is found.
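That product of probabilities can be inspected directly on the fitted model. A minimal sketch, reusing lr, X_test and y_test from the walkthrough above; the log is taken because multiplying hundreds of probabilities underflows numerically:

import numpy as np

# Probability the model assigns to each test point's true label.
proba = lr.predict_proba(X_test)               # columns: P(y=0), P(y=1)
p_true = proba[np.arange(len(y_test)), y_test.to_numpy()]

# The likelihood of the test set is the product of these probabilities;
# summing logs computes the same quantity without underflow.
log_likelihood = np.sum(np.log(p_true))
print(log_likelihood)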

Last words

The maximum likelihood approach provides a consistent approach to parameter estimation with convenient mathematical and optimization properties. Through the practical implementation in this article, we saw how maximum likelihood estimation works and how it serves as the backbone of logistic regression for classification.
