Text Search

Purpose:

The application will have a search bar that takes free text as input and returns related documents ranked by their TF-IDF scores. This is useful when someone is looking for a movie by one or more keywords.

Preprocessing of Data

My dataset consists of almost 50,000 movie entries and can be downloaded from Kaggle. It is a good dataset for this task because it has many columns and a huge number of movies to search through. Each movie's plot is given in the “overview” column, which contains 2-3 lines about most movies, making it easy to search. Other columns include the release date, number of weeks, languages, etc.

STEPS:

  • Converted the text to lower case
  • Removed stop words
  • Created unique tokens in the plot
  • Stored the frequency of each document
  • Stored the number of occurrences of each token in each document
  • To save time, I stored all of this in a pickle file that the application reads on each run rather than rebuilding the index.
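The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop-word list, function names, and file name are all assumptions (the real project may use a library list such as NLTK's).

```python
import pickle
import re
from collections import Counter

# Tiny stop-word list for illustration only (assumption; the real list is larger).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "it"}

def preprocess(text):
    """Lower-case the text, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_counts(overviews):
    """Return per-document token counts and document lengths."""
    doc_counts = {}   # doc_id -> Counter of token occurrences
    doc_lengths = {}  # doc_id -> total number of terms kept
    for doc_id, overview in overviews.items():
        tokens = preprocess(overview)
        doc_counts[doc_id] = Counter(tokens)
        doc_lengths[doc_id] = len(tokens)
    return doc_counts, doc_lengths

overviews = {1: "A space crew travels to the Moon.",
             2: "The crew of a ship is lost in space."}
counts, lengths = build_counts(overviews)

# Persist the result so later runs can load it instead of rebuilding.
with open("index.pkl", "wb") as f:
    pickle.dump((counts, lengths), f)
```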

Algorithm:

TF-IDF Algorithm:

TF-IDF is a very popular algorithm used in information retrieval.

  1. Term Frequency (tf): Term frequency is the number of occurrences of the term in the document, normalized by the document length:
    tf(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
  2. Document Frequency (df): Document frequency is the number of documents in which a term is present. df doesn't count how often a term occurs, only whether it is present.
  3. Inverse Document Frequency (IDF): IDF normalizes the total number of documents (N) in the corpus by the document frequency:
    IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

In short, IDF = log(N/df).

The TF-IDF score is a weighting scheme used to score the terms in a document based on tf and idf. Terms with higher scores are considered more important.

A bag of words is often used to represent a document as a collection of words, and it is very helpful while building the inverted index. An inverted index is a two-dimensional data structure in which terms form one dimension and posting lists form the other. The inverted index plays an important role in reducing search time.
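An inverted index of this kind can be sketched as below; the function name and data layout are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict

def build_inverted_index(doc_tokens):
    """Map each term to its posting list: {term: {doc_id: occurrences}}."""
    index = defaultdict(dict)
    for doc_id, tokens in doc_tokens.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

docs = {1: ["space", "crew", "moon"],
        2: ["crew", "ship", "space", "space"]}
index = build_inverted_index(docs)
# A query for "space" now touches only that term's posting list
# instead of scanning every document.
```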

TF-IDF score = tf * idf

After calculating the tf-idf weight of each term, I show the 15 documents with the highest scores as the result.
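Putting the formulas together, the ranking step could look like this minimal sketch (function and variable names are assumptions; the scoring sums tf * idf over the query terms, as the formulas above define them):

```python
import math
from collections import defaultdict

def rank(query_tokens, index, doc_lengths, top_k=15):
    """Score documents by the summed tf-idf of the query terms."""
    n_docs = len(doc_lengths)
    scores = defaultdict(float)
    for term in query_tokens:
        postings = index.get(term, {})  # {doc_id: occurrences}
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # IDF = log(N/df)
        for doc_id, occurrences in postings.items():
            tf = occurrences / doc_lengths[doc_id]
            scores[doc_id] += tf * idf
    # Return the top_k highest-scoring documents.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

index = {"space": {1: 1, 2: 2}, "moon": {1: 1}}
doc_lengths = {1: 3, 2: 4}
top = rank(["space", "moon"], index, doc_lengths)
# "space" occurs in every document, so its idf is 0 here;
# "moon" alone decides the ranking.
```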

Contributions:

  • Without any preprocessing the inverted index was very big, which led to high computational time. To avoid this, the preprocessing steps removed the stop words, and stemming and lemmatization increased the processing speed further.
  • Displayed the Tf-Idf separately for each token
  • The list of tokens from the search query was also preprocessed and kept for later highlighting of the searched results.

Experiments:

  • I tried implementing search for quoted text to match the exact words, which consumed most of my time in this phase. Since everything was tokenized during preprocessing, quoted text was difficult to search. I tried again by looking for the exact string in the movie plots, but it took over 10 minutes and still gave wrong output.
  • I tried running the application without storing the inverted index in a pickle file, which took almost 5 minutes per search.
  • Results were poor if lemmatization and stop-word removal were not done.
  • I also tried running the application on irrelevant inputs; there were no results for them.

Challenges Faced:

  • I had major difficulty storing the index file to disk. Storing it in a text file was not optimal, so after looking it up on the internet I found Python's pickle module. Pickle is used to serialize and de-serialize objects to disk, so I used it to dump and load the index and got correct tf-idf results faster.
  • I also faced a few deployment issues with PythonAnywhere and while building the Flask application.
  • Initially, we were only showing Tf*Idf for each result; changing it to store each token's Tf and IDF values took time.
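The dump-and-load pattern described in the first point above is short; a minimal sketch (file name and function names are illustrative assumptions) follows:

```python
import pickle

def save_index(index, path="inverted_index.pkl"):
    # Serialize the in-memory index so later runs can skip the rebuild.
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path="inverted_index.pkl"):
    # De-serialize the index back into the same Python object.
    with open(path, "rb") as f:
        return pickle.load(f)

save_index({"space": {1: 1}})
restored = load_index()
```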
