The state of the art in cyberbullying detection involves training a machine
learning model using supervised learning. Research mostly focuses on feature
engineering, i.e., finding features that can separate bullying comments from
non-bullying comments. Finding good features is difficult.
Features that work well for YouTube comments might not work for Twitter
comments, because different social media platforms are likely to differ in
vocabulary and expressions, in part because of restrictions on communication,
different age groups, and user’s
It can be noted that the model performed well on the training dataset, with a
score between 77% and 90%, but failed to generalize to the test dataset. This is
a case of over-fitting, or high variance, in which the model tries its best to
fit the training dataset but cannot classify the test dataset correctly. There
are a number of possible reasons for such behavior:
Size of the dataset. Machine learning algorithms generally perform better when
trained on a large number of samples, whereas the training dataset used here
contains a very limited number of samples.
Differences in the dataset due to mixing comments from two different social
media platforms.
The dataset requires further data cleaning and preprocessing. Inspecting the
normalized dataset after the preprocessing step showed that, although the
preprocessing did a good job of normalizing the dataset, many samples remained
inconsistent. A large number of abusive words and insults were missing from the
vocabulary because they are used in many different forms. Apart from the
abusive words, Unicode characters remained in the preprocessed data. All these
factors contributed greatly to the poor performance of the model.
Some comments were written in foreign languages such as French and Spanish,
whereas the model only learned to classify English comments.
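The cleaning issues listed above (leftover Unicode characters and abusive words written in many different forms) can be illustrated with a minimal normalization sketch. It is not the preprocessing used in this work: the function name `normalize_comment`, the substitution table `LEET_MAP`, and the rule of collapsing three or more repeated letters are all assumptions made for illustration, using only the Python standard library.

```python
import re
import unicodedata

# Hypothetical substitution table for common character obfuscations;
# the original work does not specify one.
LEET_MAP = str.maketrans({"@": "a", "$": "s", "0": "o", "1": "i", "3": "e"})


def _deobfuscate(token: str) -> str:
    # Only rewrite tokens that contain at least one letter, so that plain
    # numbers such as "2013" are left untouched.
    if any(ch.isalpha() for ch in token):
        return token.translate(LEET_MAP)
    return token


def normalize_comment(text: str) -> str:
    """Return a lowercased, ASCII-only, lightly de-obfuscated comment."""
    # Decompose accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII (combining marks, emoji, symbols).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    tokens = [_deobfuscate(tok) for tok in ascii_only.lower().split()]
    joined = " ".join(tokens)
    # Collapse runs of three or more identical letters to a single letter.
    collapsed = re.sub(r"([a-z])\1{2,}", r"\1", joined)
    # Replace remaining punctuation with spaces, then squeeze whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", collapsed)
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(normalize_comment("Yoü are $tuuuupid 😡!!"))
    print(normalize_comment("id10t in 2013"))
```

A sketch like this maps several spellings of an insult onto one vocabulary entry, but it is deliberately crude: stripping to ASCII discards information in non-English comments, and the substitution table will miss many obfuscations.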
of different kinds of features, for example, Latent Dirichlet Allocation (LDA),
Latent Semantic Analysis (LSA), and predictive word embeddings such as Word2Vec
and Doc2Vec features.
The conclusion of the experiments and overall work is that, out of the methods
that were evaluated, the support vector machine and gradient boosting machine
trained on the feature stack performed better than the logistic regression and
random forest classifiers in this particular case.
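As a rough illustration of what a feature stack can look like, the sketch below concatenates the outputs of several feature extractors into one vector per comment. It assumes that "feature stack" means such a column-wise concatenation; the extractors `bag_of_words` and `handcrafted`, the tiny `VOCABULARY`, and the `INSULTS` word list are hypothetical and are not the features used in this work.

```python
from collections import Counter

# Hypothetical vocabulary and insult list, for illustration only.
VOCABULARY = ["you", "are", "stupid", "nice"]
INSULTS = {"stupid", "idiot"}


def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in the comment."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in VOCABULARY]


def handcrafted(tokens):
    """Two simple hand-made features: insult ratio and second-person flag."""
    n = len(tokens)
    insult_ratio = sum(t in INSULTS for t in tokens) / n if n else 0.0
    second_person = float(any(t in {"you", "your"} for t in tokens))
    return [insult_ratio, second_person]


def stack_features(comment, extractors):
    """Concatenate the outputs of all extractors into one feature vector."""
    tokens = comment.lower().split()
    stacked = []
    for extract in extractors:
        stacked.extend(extract(tokens))
    return stacked


if __name__ == "__main__":
    print(stack_features("You are stupid", [bag_of_words, handcrafted]))
```

The stacked vector can then be passed to any of the classifiers compared above. Keeping each extractor as a separate function makes it easy to add or remove a feature group and re-run the comparison.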