How we autodetect spam using Google's Vertex AI


One of the downsides of our recent growth was an increase in spam. Since we all hate spam, and it was diluting the quality of content, we started cleaning it up manually. But we soon realised that this wouldn't scale, and we ended up wasting an hour or so every day cleaning spam. One day, during a casual conversation, my co-founder Fazle asked me if we could automate it somehow. Our manual efforts had generated a lot of historical data on spam classification. "Can you feed that data and train an ML model?", asked Fazle. It sounded like an excellent idea: use AI to automate the process and save us time.

I am a big fan of no-code tools and off-the-shelf APIs. After all, I want to spend my time solving business problems, not reinventing the wheel. That's why we migrated to MongoDB Atlas last December instead of hosting a DB on-premises, and why we chose Vercel instead of doing things manually on AWS or DigitalOcean. So, after some quick research, I decided to give Google's Vertex AI a try, since it seemed like the easiest option.

Google's Vertex AI has a lot of use cases, and one of them is text classification: once you have trained a model with sufficient data, it can accurately predict the label for a piece of text. That's exactly what we needed.

Preparing the dataset

The first step is preparing the data. We decided to extract 10K posts from our DB (5K spam and 5K normal) and create the dataset. Google expects your content in the .jsonl format, which simply means each line in the file is a JSON object. Here's what it looks like:

posts.jsonl

{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "Nhu cau su dung dich vu cho thue xe 29 cho tai Ha Noi", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "How to Get Online Assignment Help", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "Facebook Shops: A Sure Way To Grow Your Ecommerce Business", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "https://secretland.xyz", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "Recognize! 7 habits that will make acne sticky about the face", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "How to choose the right Payment Gateway?", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
{ "classificationAnnotation": { "displayName": "spam" }, "textContent": "How Can AI Make Us Smarter - Future of Education?", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
...

We are trying to detect spam based on the post title. The displayName property is our label; Vertex parses the file and uses it to categorise the dataset. After we uploaded the dataset, Vertex categorised the post titles as expected.

[Screenshot: the labelled dataset in the Vertex AI dashboard]
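
For reference, here's roughly how such a file could be generated with Node.js and the MongoDB driver. This is a sketch, not our actual export script — the posts collection and the title and isSpam fields are assumptions about the schema:

export-dataset.js

// A rough sketch, not our production script.
// Assumes a MongoDB `posts` collection with `title` and `isSpam` fields.
const fs = require('fs');
const { MongoClient } = require('mongodb');

async function exportDataset() {
  const client = await MongoClient.connect(process.env.MONGO_URI);
  const posts = client.db().collection('posts');
  const out = fs.createWriteStream('posts.jsonl');

  for (const label of ['spam', 'normal']) {
    const cursor = posts
      .find({ isSpam: label === 'spam' })
      .project({ title: 1 })
      .limit(5000);

    for await (const post of cursor) {
      // One JSON object per line, as Vertex expects
      out.write(
        JSON.stringify({
          classificationAnnotation: { displayName: label },
          textContent: post.title,
        }) + '\n'
      );
    }
  }

  out.end();
  await client.close();
}

exportDataset().catch(console.error);

Note that this sketch omits dataItemResourceLabels entirely: if you don't assign an ml_use value to each line, Vertex splits the data into training, validation, and test sets automatically.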

Train the model

The next step is training the model. It took ~6 hours for Vertex to train the model. After the training was done, we saw the following result:

[Screenshot: training results in the Vertex AI dashboard]

As you can see, 90% of the titles were used for training and 10% for testing the predicted labels. The accuracy was ~93%, which was good enough for our use case. So, we decided to deploy the model and test it out.

Deploy and test

Once you deploy the model, you get an HTTP endpoint to make requests against. You can call it from the terminal like this:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/endpoints/${ENDPOINT_ID}:predict -d '{
  "instances": [{
    "mimeType": "text/plain",
    "content": "YOUR_TEXT_CONTENT"
  }]
}'
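
The response contains every label the model knows about, with a confidence score for each. For a model like ours, it looks roughly like this (the values are illustrative):

{
  "predictions": [
    {
      "displayNames": ["spam", "normal"],
      "confidences": [0.97, 0.03]
    }
  ],
  "deployedModelId": "YOUR_DEPLOYED_MODEL_ID"
}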

If you just want to test, you can also do so from the Vertex dashboard. For serious applications, you might want to use Google's SDK. For example, we use Google's Node.js SDK to examine post titles every time a new post is published on our platform. Since your access token expires in an hour, you need to take care of the following to use the SDK from your Node.js app:

  • Create a service account and give it access to Vertex AI.

[Screenshot: creating a service account with Vertex AI access]

  • After creating it, download the JSON file containing your service account key.

  • Set process.env.GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file.

I struggled a bit here as I don't have much experience with Google services. If you are facing auth issues, make sure to follow the above steps. Once auth is set up, calling the endpoint from Node.js looks roughly like the sketch below.
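
This is a minimal sketch adapted from Google's published Node.js sample for text classification prediction — the PROJECT_ID and ENDPOINT_ID environment variables are placeholders, and the wrapper function is our own illustration, not Hashnode's production code:

classify-title.js

// Minimal sketch, adapted from Google's Vertex AI Node.js sample
// for text classification. PROJECT_ID / ENDPOINT_ID are placeholders.
const aiplatform = require('@google-cloud/aiplatform');

const { PredictionServiceClient } = aiplatform.v1;
const { instance, prediction } =
  aiplatform.protos.google.cloud.aiplatform.v1.schema.predict;
const { helpers } = aiplatform;

// The API endpoint must match the region the model is deployed in
const client = new PredictionServiceClient({
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
});

async function classifyTitle(title) {
  const endpoint = `projects/${process.env.PROJECT_ID}/locations/us-central1/endpoints/${process.env.ENDPOINT_ID}`;

  // Wrap the title in the schema the text classification endpoint expects
  const textInstance = new instance.TextClassificationPredictionInstance({
    content: title,
    mimeType: 'text/plain',
  });

  const [response] = await client.predict({
    endpoint,
    instances: [textInstance.toValue()],
    parameters: helpers.toValue({}),
  });

  // Each prediction carries every known label plus a confidence per label
  const result = prediction.ClassificationPredictionResult.fromValue(
    response.predictions[0]
  );

  return result.displayNames.map((name, i) => ({
    label: name,
    confidence: result.confidences[i],
  }));
}

// e.g. classifyTitle('How to Get Online Assignment Help').then(console.log);

The nice part is that the SDK picks up credentials from GOOGLE_APPLICATION_CREDENTIALS automatically, so there is no manual token refresh to deal with, and the same client instance can be reused across requests.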


The accuracy is amazing. It automatically detects spam content, hides it from Hashnode, and pings us on Discord. In the initial few days, we watched the Discord channel for anomalies. We did notice a few false positives, but after tuning the confidence level, it has been working pretty well.

[Screenshot: spam alerts in our Discord channel]
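
The tuning itself boils down to a confidence threshold. Here's a rough sketch of the kind of check involved — the 0.9 value and the hidePost / notifyDiscord helpers are hypothetical, just to illustrate the shape of the logic:

async function onPostPublished(post) {
  // classifyTitle() is the SDK helper sketched above
  const predictions = await classifyTitle(post.title);
  const spam = predictions.find((p) => p.label === 'spam');

  // Hypothetical threshold — raising it reduces false positives.
  // Posts below the threshold stay visible and are not flagged.
  if (spam && spam.confidence >= 0.9) {
    await hidePost(post);            // hypothetical: hide from feeds
    await notifyDiscord(post, spam); // hypothetical: ping the team
  }
}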


So, spam cleanup is pretty much on autopilot these days. It saves us time, and lets us channel our efforts where they matter most. This is one of the reasons I am such a big advocate of hosted APIs: it would have taken us a few days to implement this on our own, whereas with Vertex we went from preparing the dataset to getting accurate predictions within several hours.

Did you build something similar that helped you save hours of your time? I would love to hear about your experience. 🙌