Finding which category an article comes from using NLP
This is my 3rd project as a data scientist. I am summarizing what I have learned and the process I go through when I tackle something entirely new.
- Problem statement
- Method
- Assemble model
- Evaluation
- Conclusion
Problem statement - The theme of this project is NLP. I was thinking about what problem I could solve with NLP and got an idea: I assumed that certain words are likely to be used in certain categories. For example, a given post categorized under 'movies' would contain words such as Ironman or Hollywood. I believed I could create a model that predicts a post's category from its body text, which I thought would contain such keywords. If my assumption is true, an SNS company could group users by their interests and improve the user experience by providing optimized advertisements or related events and services.
Here is my method for answering the problem statement and how I assembled a model. I collected posts from Reddit where the subreddit was movies. I needed another subreddit, so I chose one more or less at random, with the condition that it be something many people are interested in as a hobby. In the end, I collected 100 posts from each of two subreddits, movies and fitness, using the pushshift API.
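A minimal sketch of how that collection might look. I am assuming the standard pushshift submission endpoint with its `subreddit`, `size`, and `before` parameters; the exact request shape is my assumption, not something recorded in my notes:

```python
import time
import requests

URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_titles(subreddit, n_posts, pause=1.0):
    """Fetch post titles from a subreddit, paginating backwards in time."""
    posts, before = [], None
    while len(posts) < n_posts:
        params = {"subreddit": subreddit, "size": 100}
        if before is not None:
            params["before"] = before  # epoch timestamp of the oldest post so far
        batch = requests.get(URL, params=params).json()["data"]
        if not batch:
            break
        posts.extend(batch)
        before = batch[-1]["created_utc"]
        time.sleep(pause)  # interval between requests; the API errors without one
    return [p["title"] for p in posts[:n_posts]]

movies = fetch_titles("movies", 100)
fitness = fetch_titles("Fitness", 100)
```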
I set the target variable to 1 when a post comes from the movies category. Many Reddit posts did not contain body text, so I used only the title text. I transformed each post's title into separate words using CountVectorizer so that I could see how often each word appears in the posts and compare word frequencies across the categories. There must be keywords that are strongly tied to each category.
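In sklearn terms, that step might look like this sketch (variable names carry over from the fetch sketch above):

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = movies + fitness                   # title text only; bodies were often empty
y = [1] * len(movies) + [0] * len(fitness)  # target: 1 = movies, 0 = fitness

cvec = CountVectorizer(stop_words="english")
X = cvec.fit_transform(titles)              # sparse document-term matrix of word counts
print(X.shape)                              # (number of posts, size of vocabulary)
```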
Can you guess what the common words in the movies and fitness categories would be? I guessed Ironman, Frozen, or other famous movie titles for movies, and words like weight and diet for fitness. About half of that turned out to be correct, but I cannot say for sure, because the vocabulary ended up containing more than 6,000 words and I was not willing to check all of them. Here are the common words that showed up most often in each category.
- Movies: movie, movies, trailer, film, official, borat, scene, teaser
- Fitness: routine, gym, body, exercise, advice, workout, weight, muscle
Some words clearly showed up frequently in movies. I counted them (I also calculated TF-IDF values), and the values for the words above were much higher than for the others, so I believed I could build such a model.
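One way to produce such per-category counts, continuing the sketch above:

```python
import pandas as pd

counts = pd.DataFrame(X.toarray(), columns=cvec.get_feature_names_out())
is_movie = pd.Series(y) == 1

print(counts[is_movie].sum().sort_values(ascending=False).head(8))   # top movies words
print(counts[~is_movie].sum().sort_values(ascending=False).head(8))  # top fitness words
```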
I got my first results using LogisticRegression and KNN. The logistic regression model had high variance and KNN had high bias. I suspected the number of posts was too small; at the same time, the vocabulary contained more than 700 words, which gave the logistic regression model its high variance, so I wondered whether I should reduce the number of words to lower the variance. I had no idea whether 700 features were enough to train an NLP model. Honestly, I thought it was too many, since I had handled far fewer features in my previous projects.
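A sketch of that baseline comparison; the gap between training and test accuracy is what I am reading as variance here:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    model.fit(X_train, y_train)
    # a large train/test gap suggests high variance;
    # two similarly low scores suggest high bias
    print(type(model).__name__,
          model.score(X_train, y_train),
          model.score(X_test, y_test))
```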
I am a total beginner, so I decided to try whatever methods I could come up with.
I then tested bigrams and trigrams. The best estimators for bigrams were:
- Logistic regression: C=1, penalty=l2
- KNN: k=25, metric=euclidean
For trigrams, the best logistic regression estimator was the same as above, and for KNN it was k=5, metric=euclidean.
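The search itself is not shown in my notes, but a typical way to arrive at those best estimators is a GridSearchCV over the vectorizer's `ngram_range` together with the classifier's hyperparameters. A sketch for the logistic regression side, with parameter grids that are my guesses:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

titles_train, titles_test, yt_train, yt_test = train_test_split(
    titles, y, stratify=y, random_state=42)

pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("clf", LogisticRegression(solver="liblinear")),  # liblinear supports l1 and l2
])
grid = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],  # unigrams, bigrams, trigrams
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}
gs = GridSearchCV(pipe, grid, cv=5)
gs.fit(titles_train, yt_train)
print(gs.best_params_, gs.best_score_)
```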
Anyway, it does not matter, because the score did not improve. I had guessed bigrams would work better, since some movie titles are bigrams, such as Toy Story or Harry Potter. But I had completely forgotten about subtitles, and there were many movies whose titles had more than two words, like Pirates of the Caribbean or Harry Potter and the Chamber of Secrets. It is also possible that most of the hot movies during the period I gathered posts from had one-word titles.
The next method improved my model more than any other idea. It was very simple, but the most important trial: I just gathered more posts to feed the model, 10 times the original amount, which meant 1,000 posts for each subreddit. This improved things a lot, producing a higher-bias but much lower-variance model. I was actually already satisfied with the score at this point and could have finished the project. But somehow I wanted to push the score closer to 1 on both the training and test sets, so I started the next trial, which ended up making the score worse. I also calculated sensitivity and specificity.
By the way, getting 1,000 posts took a long time. I had to put an interval between requests, because the API returned an error if I tried to fetch without one.
Specificity was almost 1 and sensitivity was a little lower, so I tried to improve sensitivity. Since the model returns True when a post is likely to be from movies, I added more movies posts, thinking that if the model learned from more of them, it could pick up more of the patterns in that category. This improved the sensitivity of KNN a lot, but hurt every other score. It did not improve the sensitivity of the logistic regression model, and it worsened the accuracy too. This trial failed; I will not use the extra 500 movies articles. 1,000 posts from each subreddit ended up giving the better score.
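With movies as the positive class (label 1), sensitivity and specificity can be read off a confusion matrix. A small sketch, reusing the fitted model and test split from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
sensitivity = tp / (tp + fn)  # true positive rate: recall on movies posts
specificity = tn / (tn + fp)  # true negative rate: recall on fitness posts
print(sensitivity, specificity)
```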
I then applied a method I had learned recently: I used RandomForest on my data to see how different the result would be. I also tried TfidfVectorizer to compare against CountVectorizer. In the end, TfidfVectorizer worked better than CountVectorizer in my model. It increased the score to the point where it was almost impossible to improve further. RandomForest also worked very well, though not as well as LogisticRegression; its score was pretty close to 100%. With RandomForest, I did not see a difference between TfidfVectorizer and CountVectorizer.
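The four combinations can be compared with a small pipeline loop, something along these lines (again continuing from the earlier `titles` and `y`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for Vec in (CountVectorizer, TfidfVectorizer):
    for Clf in (LogisticRegression, RandomForestClassifier):
        pipe = make_pipeline(Vec(stop_words="english"), Clf())
        score = cross_val_score(pipe, titles, y, cv=5).mean()
        print(Vec.__name__, Clf.__name__, round(score, 3))
```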
I was happy with the result, but I suspected the model only works well for these particular subreddits, movies and fitness. These topics are pretty different, so it should not be hard to build a good model with the two of them. I wanted to see how well my approach works on another pair of subreddits, so I chose two new topics that are somewhat similar: highschool and university. I knew this choice would make my work harder, since they would have similar conversations compared to fitness and movies.
As I expected, that turned out to be true. Using the same setup, TfidfVectorizer with LogisticRegression, the score dropped by about 0.20. I had not expected much here anyway, but this time RandomForest beat LogisticRegression by about 0.10 in accuracy. Also, while TfidfVectorizer worked better for the movies and fitness subreddits, CountVectorizer did better for the new ones. It seems the appropriate model varies with the data, and at least so far it is not something I can guess in advance.
Conclusion - I was able to create a model that predicts with over 90% accuracy on both pairs of subreddits, so I can answer yes to my problem statement: it is possible to classify posts by their text, and you could group people by their interests and improve UX.
I know this result is not enough on its own. I used only 2 categories to build the model; if I added more categories, it would not work as well as in this experiment. To bring this to the real world, I would have to train a model on a practically endless number of hobby categories. Movies, fitness, drama, dance, tennis, hiking, carving, games… I do not want to count them all.