Stuck In The Middleware

Stuck In The Middleware

ยท

7 min read

Featured on Hashnode

Welcome to the first article of the series - Debugging Days At Hashnode

In this series, we are going to talk about one of the important aspects of software engineering - Debugging and how it helps us solving series of weird bugs at Hashnode.

Setting Up Some Context

Not so long ago, Sandeep (Co-founder & CTO at Hashnode) shared this twitter thread ๐Ÿ‘‡

on how we leveraged Next JS Middleware to implement cache-busting via query params in order to serve our users light and dark mode blog pages with superior UX.

TL;DR - We made sure inside our middleware that we read the theme preferences from a cookie and append a query param to our outgoing request corresponding to our theme. This will help us return cached blog pages for each theme (light or dark).

Hashnode Community Revamp In Full Throttle

As most of you must be aware that hashnode.com went under a huge overhaul where one of the core components of this process was migrating to Vercel as was done for blogs the last year. We already were using Next JS but our deployment for the community platform was powered by AWS lightsail and we wanted to reap the same benefits of deploying to Vercel as we did for blogs. One of those benefits included the Next JS Middleware feature.

Before the public release, we had an internal beta release for our team on a Thursday to test the revamp in a production environment.

The D-Day

One day after the beta release, we came across a series of bugs where :

  • Hashnode feeds suddenly started showing duplicate posts as we scrolled.

  • The trending articles that load on the right side bar were same for 1week, 1month, 3months and 6months duration.

  • The Explore Tags page got stuck while loading tags for any category we select from the dropdown - week, month or all.

This was reproducible in our beta-production environment but not in our staging-development environment. That didn't make any sense at first glance. How can the same code be responsible for breaking the beta deployment but not the staging one ๐Ÿค”.

One thing that we noticed is that beta deployment had dark mode on and staging one had light mode.

Alright

We switched the light mode to dark mode on the staging environment as well and started seeing the same weirdness over there too and did the reverse for beta environment and everything was working smoothly.

What is happening

How can a UI toggle which just changes the theme presented to an end user have such a dramatic impact ? Well, let's dig deeper ๐Ÿง.

Identifying The Cause

We took the below series of steps in identifying the root cause of this behaviour :

1) Checking Our Own Code Changes

As we know from before, the deployment in beta environment happened one day before this incident. So naturally the first place to look for were the code changes that dealt with this. Since these were all related to setting up production Environment Variables, there were no changes that might have caused this.

2) PR Comparisons

We compared a new PR (Pull Request) post incident with one from day before. These PRs were not involved in any code change around themes as well as our beta deployments. The previous day PR worked flawlessly but the latest one didn't ๐Ÿ˜•. And nothing changed in between that could have caused this. Well nothing, our developers changed, to be accurate (you'll realise later).

3) Checking Local Development Setup

While we have already touched upon the behaviours in our staging and beta environments, it was also time to test the behaviour in our local development setup. Nothing broke there. That's another clue. Both staging and beta deployments are powered by Vercel. So it's across deployments something that changed.

Almost there

4) Why Data Inconsistency Though ?

Remember the bizarre duplication of data happening in dark mode ? It was acting as if we were not passing proper query params from the UI to our Rest APIs. For instance, let's consider the Hashnode feeds. We only showcase n number of articles which are server rendered on the feed when a user loads the page. As a user scrolls on the UI, we send a request to the relevant Rest API with the query params - ?page=1,?page=2 and so on to give you that infinite scroll based feed articles. Even when we inspected the network calls from the UI, it was observed that we were sending ?page=1,?page=2 query params and so on yet they somehow never made it to the backend. And our backend, by default, was returning the data for page = 0.

Similarly, the trending articles widget that you see in the right sidebar showed same data irrespective of the duration selected in dark mode. This is again because the query params never made it to our respective Rest APIs and they returned results for default duration = 6months.

Below you can see the contrasting behaviours for light and dark mode. These were obtained from the preview deployment url of a PR from that time :-

light-dark_hashnode_engineering.gif

5) Something Went Wrong In The Middle

Middleware is the single point where we do read our theme preferences and rewrite a new url by modifying the query params. Obviously, nothing got changed here by us and we already know the behaviour only replicates in latest Vercel deployments.

6) Looking Beyond

looking beyond - xmen cerebro gif

These were enough clues to zoom out and move away from Hashnode codebase and look around what changes Next JS team made in the past 24 hours. As we went through their commits, one of them, peaked our interest.

7) Inspecting New Changes

Here is what got changed in this commit, as stated by the team :-

In previous versions of Next.js, we were merging query parameters with the incoming request for rewrites happening in middlewares, to match the behavior of static rewrites declared in the config. This forced Next.js users to use empty query parameters values to delete keys.

We are changing this behavior to allow extra flexibility and a more streamlined experience for users. So from now on, query parameters will not be merged and thus the warning.

From the above excerpt, query parameters will not be merged was the most important phrasing for us. That's exactly what was happening. Our query params were not reaching the Rest APIs because we were losing them in our middleware but since we were only rewriting the incoming request url if the cookie had dark mode preferences enabled, this only happened for dark mode.

Again, to solve this, Next.js team stated this :-

If you are relying on the old behaviour, please add the query parameters manually to the rewritten URL. Using request.nextUrl would do that automatically for you.

8) Beta And Semver

Alright let's take a pause here. Even though we found what fundamentally changed in the middleware API and it does seems like the root cause of our problem, one thing that doesn't make sense is how can something which was committed to the canary release have impact on prior version deployments on Vercel ?

We were using Next JS 12.0.8 but this commit was made to 12.0.10 canary release. Out of curiosity, we explored the build logs from Vercel dashboard (although this can also be easily seen when running local dev setup). Suddenly, everything started making more sense after we read that :-

warn - using beta Middleware (not covered by semver)

That's it. When Next JS team made this specific change to middleware, it also started reflecting in our latest deployments since it's not governed by semver.

9) Fixing Our Code

Chimp coding

Let's see the middleware code from before :-

export default async function middleware(req: NextRequest) {
  if (typeof req.cookies.isDarkTheme === 'undefined') {
    return null;
  }
  return NextResponse.rewrite(`${req.nextUrl.pathname}?isDarkTheme=${req.cookies.isDarkTheme}`);
}

The above code was bound to fail with the latest deployment since Next JS team stated in the updated commit docs that you need to do query param based changes on the req.nextUrl object and then pass it to NextResponse.rewrite function. Otherwise, the query params are not merged. So except isDarkTheme, all the other query params that we were expecting to reach our Rest APIs got dropped off in the middleware.

Here is how the updated middleware code looked like :-

export default async function middleware(req: NextRequest) {
  if (typeof req.cookies.isDarkTheme === 'undefined') {
    return null;
  }
  const nextUrl = req.nextUrl;
  nextUrl.searchParams.append('isDarkTheme', req.cookies.isDarkTheme);
  return NextResponse.rewrite(nextUrl);
}

When this updated code got deployed, everything started working as intended across all environments.

Wrapping Up

So this is how we spent one of our Fridays on debugging weird bugs at Hashnode. Debugging which feels like detective work is certainly exciting โœจ. Hopefully you enjoyed reading this. We will be sharing more of such debugging stories at Hashnode as we move forward. Follow our official engineering blog to stay looped in ๐Ÿ˜‰.

Oh and btw, did you know you can create FREE engineering blogs for your organisation using the new team blogs feature by Hashnode ? Get to know more about it here :-