The state of the art in cyberbullying detection involves training a machine learning model using supervised learning. Most research focuses on feature engineering, i.e., finding features that can separate bullying comments from non-bullying comments. Finding good features is difficult.
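To make the idea of feature engineering concrete, the following is a minimal sketch of how a comment might be turned into a few hand-crafted numeric features before being passed to a supervised classifier. The word lists, the feature names, and the function `extract_features` are hypothetical choices made for illustration; they are not the features used in this work.

```python
import re

# Hypothetical, deliberately tiny word lists used only for illustration.
INSULT_WORDS = {"idiot", "stupid", "loser"}
SECOND_PERSON = {"you", "your", "u", "ur"}


def extract_features(comment: str) -> dict:
    """Map one comment to a small dictionary of hand-crafted features."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    letters = [c for c in comment if c.isalpha()]
    n_tokens = max(len(tokens), 1)    # avoid division by zero on empty input
    n_letters = max(len(letters), 1)
    return {
        # Share of tokens that appear in the insult list.
        "insult_ratio": sum(t in INSULT_WORDS for t in tokens) / n_tokens,
        # Share of tokens that address the reader directly.
        "second_person_ratio": sum(t in SECOND_PERSON for t in tokens) / n_tokens,
        # Share of letters written in upper case ("shouting").
        "uppercase_ratio": sum(c.isupper() for c in letters) / n_letters,
        # Raw count of exclamation marks.
        "exclamations": comment.count("!"),
    }


if __name__ == "__main__":
    print(extract_features("You are an IDIOT!!"))
```

Features of this kind depend heavily on the word lists and writing conventions they were designed around, which is one reason good features are hard to find.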
Features that work well for YouTube comments might not work for Twitter comments, because different social media platforms are likely to differ in vocabulary and expressions, in part because of restrictions on communication, different age groups, and users' interests.
The model performed well on the training dataset, scoring between 77% and 90%, but failed to generalize to the test dataset. This is a case of overfitting, or high variance, in which the model fits the training dataset closely but cannot classify the test dataset correctly. There are a number of possible reasons for such behavior:
1. Size of the dataset. Machine learning algorithms generally perform better on datasets containing a large number of samples, whereas the training dataset used here contains a very limited number of samples.
2. Differences in the dataset due to mixing social comments from two different social media platforms.
3. The dataset requires further data cleaning and preprocessing. Inspection of the normalized dataset after the preprocessing step shows that, although preprocessing did a good job of normalizing the dataset, many samples remain inconsistent. A large number of abusive words and insults are missing from the vocabulary because they are used in many different forms.
Apart from the abusive words, Unicode characters also remained in the preprocessed data. All these factors contributed greatly to the poor performance of the model.
4. Some comments were in foreign languages such as French and Spanish, whereas the model only learned to classify English comments.
5. Different kinds of features may be required, for example Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and predictive word embeddings such as Word2Vec and Doc2Vec features.
The conclusion of the experiments and of the overall work is that, of the methods evaluated, the support vector machine and the gradient boosting machine trained on the feature stack performed better than logistic regression and the random forest classifier in this particular case.
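The residual inconsistencies noted in the third reason above (leftover Unicode characters and abusive words written in many different forms) could be reduced by an additional normalization pass. The following is a minimal sketch using only the Python standard library. The function name `normalize_comment`, the substitution table `LEET_MAP`, and the specific rules are assumptions made for illustration and are not the preprocessing used in this work.

```python
import re
import unicodedata

# Hypothetical substitution table for common character obfuscations.
# Note: mapping digits to letters also rewrites genuine numbers.
LEET_MAP = str.maketrans({"@": "a", "$": "s", "0": "o", "1": "i", "3": "e"})


def normalize_comment(text: str) -> str:
    """Apply a lossy, heuristic normalization to one comment."""
    # Decompose accented and compatibility characters (NFKD), then drop
    # everything that cannot be encoded as ASCII (emoji, stray symbols).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lower-case and undo simple character substitutions.
    text = text.lower().translate(LEET_MAP)
    # Collapse runs of three or more identical characters to a single one.
    # This is a heuristic and can damage words with legitimate repeats.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Replace anything that is not a letter or whitespace with a space,
    # then squeeze repeated whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    for raw in ["You are STUUUPID!!! \u263a", "id10t", "l\u00f6$er"]:
        print(repr(raw), "->", repr(normalize_comment(raw)))
```

The ASCII folding step only strips accents and symbols. It does not address the foreign-language comments mentioned in the fourth reason, which would need separate language identification or filtering.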