Teaching a Machine to Spot Hate Speech in Filipino Text

NLP for a language that doesn't play by the usual rules — mixing Filipino and English mid-sentence, and the challenges that creates for text classification.

I got into this project because of a question that had been bugging me: how well do NLP models work on Filipino text? Most natural language processing research focuses on English, Chinese, or a handful of European languages. Filipino — and especially Taglish, the Filipino-English mix that most Filipinos actually write in — is almost invisible in the research literature.

I wanted to find out what happens when you try standard text classification techniques on a language that switches between two grammars within a single sentence.

The Data

I used the Filipino-Text-Benchmarks dataset compiled by Cruz and Cheng (2020). It's one of the few large-scale labeled datasets for Filipino NLP, containing over 42,000 text samples across 6 different NLP tasks: sentiment analysis, hate speech detection, dengue-related tweet classification, and others.

42,000+
Labeled text samples across 6 Filipino NLP benchmark tasks

I focused primarily on the hate speech detection task, since that's where the practical stakes are highest. Philippine social media is extremely active — and extremely heated. The dataset labels tweets as hate speech or not, with subcategories for the type of hate (racial, political, gender-based, and so on).

The Taglish Problem

Here's what makes Filipino text hard for machines. A typical tweet might look like this: "Ang galing naman ng government natin, very nice talaga." That's a single sentence using Filipino syntax, Filipino words, and English words, all mixed together. This is called code-switching, and it's how the majority of Filipinos communicate online.

Standard NLP approaches struggle with this. English tokenizers don't know what to do with Filipino morphology. Filipino tokenizers can't handle the English fragments. And pre-trained language models, which are the backbone of modern NLP, were almost all trained on text that stays in one language at a time, even when the model as a whole covers many languages.

38.4%
Prevalence of Taglish code-switching in the dataset — over a third of all samples mix languages
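
A rough way to estimate a figure like this is to tag each token by language and count the samples that contain both. The sketch below is only an illustration of that idea, not necessarily how the number above was computed. The two word lists are tiny, hypothetical stand-ins for real lexicons, and the function names are made up for this example.

```python
import re

# Hypothetical, tiny word lists for illustration only.
# A real lexicon would be far larger and would need to handle shared words.
FILIPINO_WORDS = {"ang", "galing", "naman", "ng", "natin", "talaga"}
ENGLISH_WORDS = {"government", "very", "nice"}


def tag_languages(text):
    """Return the set of languages ("fil", "en") whose words appear in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    langs = set()
    for tok in tokens:
        if tok in FILIPINO_WORDS:
            langs.add("fil")
        if tok in ENGLISH_WORDS:
            langs.add("en")
    return langs


def is_code_switched(text):
    """A sample counts as code-switched if it has words from both lists."""
    return tag_languages(text) == {"fil", "en"}


def code_switch_rate(samples):
    """Fraction of samples that mix both languages."""
    if not samples:
        return 0.0
    return sum(is_code_switched(s) for s in samples) / len(samples)


samples = [
    "Ang galing naman ng government natin, very nice talaga.",
    "Ang galing talaga.",
    "Very nice government.",
    "Very nice naman.",
]
rate = code_switch_rate(samples)  # two of the four samples mix languages
```

A lexicon lookup like this will miscount borrowed words and names that appear in both languages, so any rate it produces is an approximation.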

I tested three approaches:

  1. TF-IDF with a linear SVM — the baseline. Simple bag-of-words with no language awareness. Surprisingly decent.
  2. Multilingual BERT (mBERT) — a pre-trained transformer that's seen Filipino text during training, though not a lot of it.
  3. Fine-tuned Filipino BERT — a model specifically pre-trained on Filipino web text, then fine-tuned on the hate speech task.
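
To make "bag-of-words with no language awareness" concrete, here is a minimal sketch of the TF-IDF half of the baseline in plain Python. It uses the textbook weighting tf * log(N / df); libraries such as scikit-learn apply a smoothed variant, so exact numbers would differ. The SVM step is left out, and the three example documents are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace split only: no stemming, no language detection.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented examples: English and Filipino tokens are treated identically.
docs = [
    "very nice talaga",
    "very bad talaga",
    "ang galing naman",
]
vectors = tfidf(docs)
```

In the first document, "nice" gets a higher weight than "very" or "talaga" because it appears in only one document. The weighting never needs to know which language a token belongs to, which is part of why a baseline like this can cope with mixed-language text at all.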

Results and Surprises

The fine-tuned Filipino BERT hit 87.2% accuracy on hate speech detection. That's solid, but the more interesting finding was where it failed.

The model struggled most with sarcasm and irony. Filipinos use a lot of "mock praise" in online arguments — saying something positive while clearly meaning the opposite. The model read these as positive sentiment or non-hate speech. This isn't unique to Filipino, but the Taglish mixing made it even harder to detect, since the sarcasm markers are often split across languages.

Another finding: hate speech in the dataset was heavily concentrated around political topics. Election-related discussions generated the most toxic content by a wide margin. Gender-based hate speech was the second most common category. The model's performance varied a lot between these categories: it was better at catching political hate speech (which tends to be more explicit) than gender-based attacks (which are often more coded).
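
When comparing categories like this, it helps to separate precision (how many flagged items were right) from recall (how many real cases were caught). The sketch below computes both per category. The labels are made up for illustration and are not results from this project; the category names are assumptions as well.

```python
def precision_recall(y_true, y_pred, category):
    """Precision and recall for one category, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration; these are not results from the project.
y_true = ["political", "political", "gender", "gender", "none", "none"]
y_pred = ["political", "political", "none", "gender", "none", "political"]

political = precision_recall(y_true, y_pred, "political")
gender = precision_recall(y_true, y_pred, "gender")
```

In this toy data the political category has perfect recall but one false alarm, while the gender category has no false alarms but misses half of the true cases. A model that is "better at catching" one category is showing higher recall there, which a single accuracy figure would hide.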

The sentiment analysis task revealed something else: the overall sentiment distribution in Filipino social media text skews negative. About 55% of the samples in the sentiment dataset were classified as negative, compared to 30% positive and 15% neutral. That matches what anyone who's spent time on Filipino Twitter would expect.

What I Learned

The biggest takeaway is that Filipino NLP needs more data. 42,000 samples is a good start, but English hate speech datasets have millions of labeled examples. As far as I can tell, the performance gap with English systems has less to do with model architecture than with training data volume.

The second thing is that code-switching is genuinely hard and there's no easy shortcut. You can't just translate the Filipino parts to English and run an English model. The meaning is often embedded in the language choice itself — switching to English for emphasis, switching to Filipino for intimacy or insult. A model that ignores that is missing real signal.

If I were to continue this work, I'd focus on building a larger Taglish-specific pre-training corpus. The Filipino BERT model was pre-trained mostly on formal Filipino text (news articles, Wikipedia). But social media text is overwhelmingly Taglish, and the gap between formal Filipino and actual online Filipino is huge.