DEVELOPMENT OF CYBERBULLYING DATASET (CYTED): FLAMING CLASSIFICATION
Abstract
Cyberbullying is a widespread problem with significant psychological impact. Flaming, one of its forms, is common on social media platforms such as Twitter. Detecting flaming in the Malaysian context is difficult because reliable datasets are scarce, especially for the Malay language. This paper addresses that gap by developing a small dataset of flaming-related keywords in Malay and English. Its objectives are to extract relevant keywords from Twitter, to develop a flaming classification dataset, and to evaluate the dataset with several machine learning algorithms. A total of 3,600 samples (1,800 in Malay and 1,800 in English) were collected through keyword-based searches using the TweetHarvest tool. The data were preprocessed, features were extracted, and the samples were classified with Logistic Regression, Random Forest, and Support Vector Machine (SVM), evaluated using 10-fold cross-validation. Logistic Regression achieved the highest accuracy: 94% for Malay keywords and 95% for English keywords. The resulting dataset can serve as a basis for building a cyberbullying detection model.
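As an illustration of the 10-fold cross-validation protocol named in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names (`k_fold_indices`, `cross_val_accuracy`, `majority_baseline`) are hypothetical, the shuffling seed is arbitrary, and the majority-class learner is only a placeholder for the Logistic Regression, Random Forest, and SVM classifiers the paper evaluates.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the sample indices once, then build k (train, test) index splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k yields k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the held-out fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_val_accuracy(
    texts: Sequence[str],
    labels: Sequence[str],
    train_fn: Callable[[List[str], List[str]], Callable[[str], str]],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Mean accuracy over k folds; train_fn(texts, labels) must return a predict(text) function."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(texts), k, seed):
        predict = train_fn([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(texts[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def majority_baseline(train_texts: List[str], train_labels: List[str]) -> Callable[[str], str]:
    """Placeholder learner (an assumption): always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda text: most_common
```

Under this scheme, each language subset of 1,800 samples would be split into ten folds of 180, so every model is trained on 1,620 samples and tested on the remaining 180 in each round, and the reported accuracy is the mean over the ten rounds.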



