Classification

Purpose:

The purpose of this part of the application is to identify the genre of any text the user types in. The user enters some text, and the application predicts the genre that set of words most likely belongs to.

Data:

I used the same movie dataset; given the input text, the task was to predict which genre it most likely belongs to.
One important cleaning step was needed before using the classifier. The dataset listed around 25 distinct genres, but on inspection I found that 8 of these were malformed genre names, leaving 17 valid genres. While looping through the genres, I added a condition to skip any label belonging to that list of 8 invalid names.
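That filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: the row layout and the `"genre"` key are assumptions.

```python
def filter_valid(rows, valid_genres):
    """Keep only rows whose genre label is in the cleaned genre set.

    rows: iterable of dicts with a "genre" key (assumed layout).
    valid_genres: the set of 17 valid genre names.
    """
    return [r for r in rows if r["genre"] in valid_genres]
```

Rows carrying any of the 8 malformed labels are simply dropped before training.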

Algorithm:

Naive Bayes Classifier:

The dataset consisted of textual data, which was transformed into counts of each token's occurrences per document; Naive Bayes is well suited to this representation. I implemented the Multinomial Naive Bayes algorithm to classify the text. Naive Bayes is based on Bayes' theorem.

We use multi-label classification because the entered text is not necessarily the exact plot of any one movie; it can belong to several genres. We calculate the probability of the text under each genre and return the top 5 results.

Assuming conditional independence of the tokens, the formula is:
prob(genre | token_1, token_2, …, token_n) ∝ prob(token_1 | genre) * prob(token_2 | genre) * … * prob(token_n | genre) * prob(genre)
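In practice that product is computed in log space to avoid floating-point underflow. A small sketch of scoring a token list against every genre and keeping the top 5; the dictionary layouts for `prior` and `cond_prob` are assumptions, not the project's actual data structures:

```python
import math

def score_genres(tokens, prior, cond_prob, top_k=5):
    """Rank genres for a token list using multinomial Naive Bayes.

    prior: {genre: prob(genre)}
    cond_prob: {genre: {token: prob(token | genre)}}
    Tokens unseen in a genre are skipped here; smoothing is
    discussed separately under Experiments.
    """
    scores = {}
    for genre in prior:
        log_score = math.log(prior[genre])
        for t in tokens:
            p = cond_prob[genre].get(t)
            if p:
                log_score += math.log(p)
        scores[genre] = log_score
    # Highest log-probability first; return the top 5 genres.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```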

STEPS:

  • I used the training set to create the model, holding out 20% of the total data for testing.
  • We compute the probability of each word in each genre, and the frequency of each genre across the documents.
  • The probability of a term belonging to a class can be calculated as:

prob(t | c) = T_ct / Σ_t' T_ct'

Here T_ct is the number of occurrences of token t in the training documents of genre c, and the denominator sums the counts of all tokens in that genre.

  • In the NB apply step, we multiply the prior probability of each class by the probabilities of the terms belonging to that class: prob(c | d) ∝ prob(c) * Π_{t in d} prob(t | c).
  • After computing the probabilities, we take the top 5 results and display them.
  • Finally, we call the test-evaluation function to measure accuracy for our own reference.
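The steps above can be sketched as a small training routine that derives the genre priors and the per-genre term probabilities from raw counts. The `(tokens, genre)` input format is an assumption for illustration:

```python
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial NB from (tokens, genre) pairs.

    Returns:
      prior: {genre: fraction of documents with that genre}
      cond:  {genre: {token: T_ct / sum_t' T_ct'}}  (unsmoothed)
    """
    genre_counts = Counter()
    token_counts = defaultdict(Counter)
    for tokens, genre in docs:
        genre_counts[genre] += 1
        token_counts[genre].update(tokens)

    n_docs = sum(genre_counts.values())
    prior = {g: c / n_docs for g, c in genre_counts.items()}

    cond = {}
    for g, counts in token_counts.items():
        total = sum(counts.values())  # sum over t' of T_ct'
        cond[g] = {t: c / total for t, c in counts.items()}
    return prior, cond
```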

Contributions:

  • Implemented the Naive Bayes algorithm from scratch, without using any library for it, and ran it on the training dataset to produce the results.
  • Measured the classification model's performance on the test dataset, for which I implemented the accuracy calculation and also printed a confusion matrix for my own use.
  • To increase computation speed, the probabilities of each term and genre were stored in a pickle file (.pkl), so the application runs faster after the first iteration.
  • Added charts to the display for better visual results.
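The pickle caching mentioned above might look like the following sketch; the file path and the `build` callback are illustrative names, not the project's actual ones:

```python
import os
import pickle

def load_or_build(path, build):
    """Load cached probabilities from a .pkl file, or build and cache them.

    build: a zero-argument callable that computes the model
    (e.g. the prior and conditional probability tables).
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # fast path after the first run
    model = build()
    with open(path, "wb") as f:
        pickle.dump(model, f)      # cache for subsequent runs
    return model
```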

Experiments:

  • Smoothing: Laplace smoothing played a key role in making the results more sensible. When a query not present in the data is entered, the classifier should predict low values for all genres rather than predicting as usual. I wasn't able to add this feature to my website because it did not function properly on pythonanywhere: I was getting a high value for one genre and near-zero values for the others. I tried it on my own system just to see the results, and it worked well there.
  • Performance Measure: To see how accurate the system was, I calculated the accuracy and also built the confusion matrix.
    However, the accuracy was low, so I varied the test split to 10, 20, and 30 percent of the total; the accuracies were still low, at 36.4%, 34.3%, and 35.6% respectively. These figures compare the single predicted genre against only one of the actual genres, but each movie in the database has multiple genres. Iterating over all of them increased the accuracy to 54.28% with a 20% test split.
  • I also tested performance using the sklearn library with a different classifier (SVM). It came out to roughly 52%, again matching on a single label only.

Challenges faced:

  • The main challenge was implementing smoothing in the application, since it eliminates confident predictions for invalid inputs.
  • Another major task was increasing the accuracy, for which I had to compare predictions against all of a movie's actual genres rather than a single one.
  • To save time on repeated runs, I had to store the prior and posterior probabilities in a pickle file and read them back while classifying.

References: