From Chaos to Clarity: Understanding GraphQL Error Codes with Stellate, AWS Lambda, API Gateway, and Apollo

GraphQL handles error codes a bit differently from REST APIs. We still get HTTP response codes such as:

  • 200 OK

  • 400 Bad Request

  • 500 Server Error

But an error may well have occurred even though we received a 200 status code.

This blog post is about how Hashnode uses error codes internally for debugging and understanding the system. It is also for you, my fellow Hashnode colleagues 🤗

This post is not a complete guide to GraphQL error metrics. Each framework handles errors a bit differently, and I don't consider myself an expert in each of them.

Hashnode's Architecture - Stellate, API Gateway, Lambda, Apollo

To understand how everything interacts, let me first introduce our internal Hashnode architecture.

Hashnode's API Infrastructure

We have a serverless-first mindset: wherever it makes sense, we use serverless services. We don't want to build and manage infrastructure ourselves.

Our API consists of the following services:

  • API Gateway with AWS Lambda as the main API

  • Apollo Server as the GraphQL server

  • Stellate CDN for edge caching, analytics, and error tracking

To cache our responses on the edge we use Stellate. Stellate gives your GraphQL API superpowers! Automatic alerts, analytics, rate limits, and much more.

An HTTP API Gateway sits in between. API Gateway integrates with Lambda and forwards each request to the Lambda function. Lambda creates the Apollo Handler (once per new container) and fulfills the request.

API Gateway and Lambda each have their own status code. This is where it gets interesting.

  • API Gateway Status Code: This is the status code that the caller of the API (i.e. Stellate) receives

  • Lambda Integration Status Code: This is the status code that API Gateway receives from Lambda

This is not necessarily the same status code! There is even a third response status code, which is the response code by Stellate. Most of the time (almost always) this mirrors the status code of the API Gateway.

Using Stellate also made us aware of all the different error codes our API consumers receive.

Different Status Codes by Stellate

That's the reason for this blog post. Let's dive a bit deeper into the errors GraphQL returns to you.

An Overview of HTTP Status Codes

A GraphQL request is still a regular HTTP call. You make a POST call to an endpoint and you receive an HTTP response. This response can have several status codes.

Let's first look over the status codes briefly, then focus on some detailed examples and see how Stellate, API Gateway, and Lambda react.

200 - OK, or Is It? 🤔

200 in HTTP means the request was successful. This is one of the confusing codes in GraphQL.

A 200 doesn't tell you that no error happened. It only means that the HTTP request itself was successful. You still need to check the actual response body to see whether an error was returned. If an error happened in your GraphQL API, the response body contains an errors array, and each entry carries a GraphQL error code (for example INTERNAL_SERVER_ERROR) in extensions.code.
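To make this concrete, here is a minimal sketch of such a check on the client side. The type names, the two helper functions, and the "UNKNOWN" placeholder are my own assumptions for illustration and not part of any library; only the errors array and extensions.code come from the response format described in this post.

```typescript
// Minimal shape of a GraphQL response body (assumed for this sketch).
interface GraphQLErrorEntry {
  message: string;
  extensions?: { code?: string };
}

interface GraphQLResponseBody {
  data?: unknown;
  errors?: GraphQLErrorEntry[];
}

// Collects the GraphQL error codes from a response body.
// Errors without an explicit code are reported as "UNKNOWN" (our own placeholder).
function extractErrorCodes(body: GraphQLResponseBody): string[] {
  return (body.errors ?? []).map((e) => e.extensions?.code ?? "UNKNOWN");
}

// A response only counts as successful if the HTTP status is 200
// AND the body carries no GraphQL errors.
function isReallyOk(status: number, body: GraphQLResponseBody): boolean {
  return status === 200 && extractErrorCodes(body).length === 0;
}
```

With a check like this, a 200 response that carries INTERNAL_SERVER_ERROR in its body is treated as a failure instead of slipping through.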

400 - Bad Request

GraphQL APIs are based on a schema. Our schema for a publication for example is the following:

type Publication implements Node {
    id: ID!
    title: String!
    ...
}

If you try to access a field that is not defined in this schema, you get the status code 400 together with the GraphQL error code GRAPHQL_VALIDATION_FAILED. In general, 400 responses are validation errors.

500 - Server Error

500 still means a server error, and server errors are still bad. They should not happen, so you need to check them. But be aware: understanding where the error actually occurred can be quite a challenge in the beginning.

500 is a server error, but what is our server in this architecture? Is it the Lambda handler, the Apollo Handler, or anything in between?

The Lambda function has two components:

  1. Lambda Function: This is everything that happens before the Apollo server starts, e.g. connecting to the DB, fetching secrets, etc.

  2. Apollo Handler: This is the Apollo server handling requests.

In our architecture, a 500 means that something is wrong with the Lambda Function itself, not with the Apollo Handler. We will see both cases in the example section.
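The split between the two components can be sketched as a toy model. This is not our real handler and it uses neither Apollo nor the AWS SDK: the function names, the FailurePoint type, and the response bodies are assumptions made up for illustration. It only mimics the behavior described in this post, where errors inside the GraphQL layer come back as 200 with an error code and errors before it come back as a plain 500.

```typescript
// Toy model of one invocation: setup code runs first, then the GraphQL layer.
type FailurePoint = "none" | "setup" | "resolver";

interface HttpResult {
  statusCode: number;
  body: string;
}

// Stand-in for the Apollo Handler: resolver errors are caught and reported
// in the response body, while the HTTP status stays 200.
function graphqlLayer(fail: boolean): HttpResult {
  try {
    if (fail) throw new Error("resolver failed");
    return { statusCode: 200, body: JSON.stringify({ data: { ok: true } }) };
  } catch (err) {
    const errors = [
      {
        message: (err as Error).message,
        extensions: { code: "INTERNAL_SERVER_ERROR" },
      },
    ];
    return { statusCode: 200, body: JSON.stringify({ errors }) };
  }
}

// Stand-in for the Lambda Function plus the gateway in front of it:
// an error thrown before the GraphQL layer runs surfaces as a plain 500.
function invoke(failAt: FailurePoint): HttpResult {
  try {
    if (failAt === "setup") throw new Error("could not connect to DB");
    return graphqlLayer(failAt === "resolver");
  } catch {
    return {
      statusCode: 500,
      body: JSON.stringify({ message: "Internal Server Error" }),
    };
  }
}
```

Calling invoke("resolver") yields a 200 with INTERNAL_SERVER_ERROR in the body, while invoke("setup") yields a 500 without any GraphQL error code.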

Real Scenarios - Let's see some Examples

Okay, so far the theory. I hope you are still with me. Let's now dig a bit deeper and understand some example scenarios.

Successful Request - The Happy Path 🤗

A user sends a correct query, for example, this one:

query {
    publication(host:"jannikwempe.hashnode.net") {
        id
        author {
            name
        }
    }
}

Stellate receives this query and forwards it to API Gateway. API Gateway creates the event and invokes the Lambda function. Lambda queries the database and sends the response back to API Gateway. This is how everything comes together.

All status codes are 200 and everything is fine.

Let's see the example in Postman:

We receive the response we expect with the status code we expect 👍🏽

Error in Apollo Handler - 200 NOT OK ❌

Now let's see an example of the mysterious 200 response with an error.

We simulate this by throwing an error from within the Apollo Handler. Remember, the Apollo Handler is not necessarily the same as the Lambda Handler.

In the Lambda Handler function we create the Apollo Handler like this:

const serverHandler: Handler<any, any> = (...args) => {
...
  return server.createHandler()
...
};

Every error thrown inside the server counts as an error within the Apollo Handler. To mock this behavior, I've added a throw new Error() somewhere in the code that queries a publication.

The result looks like this:

If we now run the same query, we see the following result:

For us, this was new. Even for an undefined error and a clear server error, the API responds with a 200 response code. The actual error code is in the response body and maps to the error code INTERNAL_SERVER_ERROR. You can customize this behavior, of course.

That is where Stellate does a great job. Without us having to write any extra logic, Stellate shows us the error in its Error Dashboard:

Learning from this scenario: You also need alerts for responses that come back with a 200 status code but carry a GraphQL error code such as INTERNAL_SERVER_ERROR.
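As a rough sketch of what such an alert rule could look like if you built it yourself: the LoggedResponse shape, both function names, and the rule "alert on any INTERNAL_SERVER_ERROR behind a 200" are assumptions for illustration. This is not how Stellate implements its alerts.

```typescript
// One logged response: the HTTP status plus any GraphQL error codes
// found in the response body (hypothetical log format).
interface LoggedResponse {
  status: number;
  errorCodes: string[];
}

// Counts GraphQL error codes that arrived with a 200 status, grouped by code.
function countHiddenErrors(log: LoggedResponse[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of log) {
    if (entry.status !== 200) continue; // only "hidden" errors are of interest here
    for (const code of entry.errorCodes) {
      counts.set(code, (counts.get(code) ?? 0) + 1);
    }
  }
  return counts;
}

// Hypothetical alert rule: fire as soon as any 200 response
// carried INTERNAL_SERVER_ERROR.
function shouldAlert(log: LoggedResponse[]): boolean {
  return (countHiddenErrors(log).get("INTERNAL_SERVER_ERROR") ?? 0) > 0;
}
```

In practice you would probably alert on a rate over a time window rather than on a single occurrence, but the core idea stays the same: the status code alone is not enough.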

Server Error, for Real -> 500

The next scenario we look at is a proper server error. This time not the Apollo Handler but the Lambda Function itself throws an undefined error.

The Lambda function is defined as everything that happens before the Apollo server is created. In our scenario, this is mainly connecting to our database and caching the connection in the Lambda context. We are doing this by using the amazing middy middleware library.

I introduced an error by throwing an undefined error in the middleware that connects to the DB.

Let's see what it looks like in Postman:

Ah, this time we get a proper 500 error code! Which makes sense, of course. The GraphQL server never started, so the error cannot be wrapped in a GraphQL response and surfaces just like it would in a normal REST API.

In Stellate it looks like this:

Learning from this scenario: A 500 error means there is something wrong with the underlying "infrastructure" which is your Lambda function. This makes debugging a whole lot easier.

GET Request on Stellate vs. directly on API Gateway

Our GraphQL API only serves data via the HTTP POST method. At our scale, we also see many people probing our internal APIs and trying to get through 🕵🏽

Since we'd like to understand how our system behaves in different scenarios we also looked at that one.

If you send a GET request to the Stellate endpoint you will get the following response: 204 - No Content

Interestingly enough, if you send the same request directly to the API Gateway you will get a 400 response code.

The most important point: requests like that shouldn't even be able to enter your API.

Learning from this scenario: Understand that some response codes behave differently on Stellate and on API GW.
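One place where such a guard could live is at the very top of the Lambda handler. The following is only a sketch under assumptions: the HttpApiEvent interface is a hand-written subset of the API Gateway HTTP API event (payload format 2.0), the function name is made up, and answering with 405 is my choice for this example, not what Stellate or API Gateway returned in the tests above. Blocking at the CDN or gateway level would stop these requests even earlier.

```typescript
// Hand-written subset of the API Gateway HTTP API event (payload format 2.0).
interface HttpApiEvent {
  requestContext: { http: { method: string } };
  body?: string;
}

interface GuardResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

// Returns an early 405 response for anything but POST,
// or null to let the request continue to the GraphQL server.
function rejectNonPost(event: HttpApiEvent): GuardResponse | null {
  const method = event.requestContext.http.method.toUpperCase();
  if (method === "POST") return null;
  return {
    statusCode: 405,
    headers: { Allow: "POST" },
    body: JSON.stringify({ message: `Method ${method} not allowed` }),
  };
}
```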

Validation Errors

Now we will look at some validation errors, i.e. 400 error codes. A validation error means that there is something wrong with your query or mutation. While you get 400 as a response code you will also get a GraphQL error code in your response.

Access Invalid Field

Let's start by trying to access a field that doesn't exist.

I try to send the following query to our API:

{
   publication(host:"jannikwempe.hashnode.net") {
       id
       bla
   }
}

This results in a 400 response code: validation failed.

We receive the error GRAPHQL_VALIDATION_FAILED. So far so good.

Type Check Fails

Something similar happens if you provide the wrong type. Instead of passing the username as a string, I've added the username as a number here.

This also results in a 400, with the following response:

{
    "errors": [
        {
            "message": "String cannot represent a non string value: 123",
            "extensions": {
                "code": "GRAPHQL_VALIDATION_FAILED"
            }
        }
    ]
}

400 but a server error

One error that took us a while to understand was the following one:

In Stellate we received lots of error messages with the response code 400. But the actual GraphQL error code was INTERNAL_SERVER_ERROR.

So what now? Is it a validation error because of the response code 400? Or an internal server error? But then why isn't it a 200 response like in the first examples?

Taking a closer look, we saw that none of these exceptions had a query attached. When we tried to reproduce it, we found that some people send empty queries to our API. This results in a server error with the response code 400.

Cases like these can be adjusted manually in Apollo. But by default, it behaves like that.

Learning from this scenario: The GraphQL error code is the most important piece of information, but it needs context. In this case, INTERNAL_SERVER_ERROR is an expected code and nothing is wrong with your server.
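A small triage function can capture this kind of knowledge. The sketch below is an assumption about how you might encode the rules from this post: the ErrorSample shape, the Verdict type, and the function name are hypothetical, and the rules are deliberately limited to the cases discussed here.

```typescript
type Verdict = "expected" | "investigate";

// One error as it might show up in an error dashboard (hypothetical shape).
interface ErrorSample {
  status: number;
  code: string;
  query?: string;
}

function triage(sample: ErrorSample): Verdict {
  const hasQuery = (sample.query ?? "").trim().length > 0;

  // Validation failures are caused by the client's query, not by our server.
  if (sample.code === "GRAPHQL_VALIDATION_FAILED") return "expected";

  // The pattern from this section: 400 + INTERNAL_SERVER_ERROR without a query
  // means somebody sent an empty query.
  if (sample.status === 400 && sample.code === "INTERNAL_SERVER_ERROR" && !hasQuery) {
    return "expected";
  }

  // Everything else deserves a closer look.
  return "investigate";
}
```

The same INTERNAL_SERVER_ERROR code ends up as "expected" for an empty query behind a 400, but as "investigate" when it arrives with a real query behind a 200.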

Playground deactivated - 403 vs. 400

One last error we have seen a lot at times is a 403 error from Stellate:

This error suggests that somebody is sending a GET request to our API. But when I tried to reproduce that, I got a 204 - No Content, as seen above. It took some time to understand that this error occurs when somebody tries to access a disabled GraphQL playground. When I try to access the Stellate Playground, I get a 403 Forbidden error.

But if I try to access the GraphQL Playground directly in Apollo and it is deactivated I get a 400 error.

This is nothing major. It is still important for us to understand the differences. With that, we are able to act on real 403 issues.

Summary

In summary, it is really important to understand the ins and outs of your API. Stellate makes our lives much easier by automatically parsing GraphQL error codes and showing the resulting response codes. In the end, everything is dependent on the implementation of your GraphQL server.

To act quickly on incidents or abnormal behavior, it is critical to understand what each error code and error message actually means. Many errors are expected, especially in cases like:

  • Unauthenticated access

  • Validation Errors

But there are also many cases where they are not expected.

We have created several alarms and, in special cases, also automated incident creation (👋🏽 Better Uptime) so that we are able to act quickly on all upcoming incidents.

Thanks for sticking with me, see you soon 👋🏽