A few days ago, a bug in the production environment left me struggling to breathe. First, it appeared at random: only some users encountered it, and it was impossible to reproduce locally. Second, the hard deadline was only a few days away. If I rolled back, I was afraid I would miss the deadline. If I didn't roll back, I had no idea how to fix it.
To make things more awkward, this was the first project I had worked on since joining the company, and I was in charge of the whole thing. I hadn't built up any connections yet, and the company's knowledge base wasn't very systematic. I ran into this problem just when I needed to establish my credibility through a project, so the pressure was immense.
What happened
The project required upgrading SSO from other authentication methods to OpenID. After the front end and back end were completed and released, some users suddenly reported being stuck in an infinite redirect loop, unable to complete authentication and enter the program. Strangely, the engineering team could not reproduce the problem, and only a portion of the users were affected. Even more bizarrely, the problem was intermittent.
From the scattered reports we collected, we derived the following clues:
Clue 1: Initially, I noticed that the issue occurred only on Windows machines, and always in Chrome. This puzzled me: surely Chrome shouldn't behave differently on Mac and Windows? After checking the release notes, I found that Chrome had indeed released a new version for Windows around that time!
Clue 2: We had seen the same infinite loop during earlier testing, when the load balancer had no stickiness cookie set and, in a multi-server environment, requests were not routed back to the server that held the session. So my attention shifted back to the load balancer: perhaps its stickiness cookie became invalid under certain conditions?
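To make the stickiness clue concrete, here is a minimal sketch of why a missing stickiness cookie can cause a login loop. It assumes each server keeps sessions only in its own memory; the `Backend` class, the server names, and the session id are all hypothetical and are not taken from our actual setup.

```python
import itertools

class Backend:
    """One server behind the load balancer, with sessions kept in local memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()

    def handle(self, session_id):
        # A session this server has never seen forces a new login.
        if session_id in self.sessions:
            return "serve"
        return "redirect_to_login"

def login_then_request(pick_backend, session_id="s1"):
    """Log in on one backend, then send the follow-up request."""
    pick_backend().sessions.add(session_id)    # login completes here
    return pick_backend().handle(session_id)   # next request may land elsewhere

# Without stickiness: plain round-robin over two backends.
rr = itertools.cycle([Backend("web-1"), Backend("web-2")])
print(login_then_request(lambda: next(rr)))     # redirect_to_login

# With stickiness: the client is pinned to the backend it first hit.
pinned = Backend("web-1")
print(login_then_request(lambda: pinned))       # serve
```

With round-robin routing, the follow-up request lands on a server that has never seen the session, so the user is sent back to log in again.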
Clue 3: Our solution used the mod_auth_openidc component on Apache. I found other teams using the same architecture, but their recently launched features had not run into similar problems. In the entire company, only my team and one other team were affected, so we wondered whether the problem was caused by configuration differences.
Finding another team with the same issue was a slight relief: at least we were not alone. But the problem was clearly not universal either, which still puzzled us. Because the failures were so sporadic, both teams confirmed that none of their engineers could reproduce the problem. We investigated from Wednesday until the following Monday, ruling out Chrome first, then AWS ELB, until only the Apache configuration remained.
Not being able to reproduce the problem left us stuck. Even if we found the correct configuration, we couldn't be sure the problem was solved. We couldn't keep pushing versions to production and using our users as guinea pigs, could we?
Having ruled out the other causes, we focused all our efforts on the Apache configuration. Unfortunately, none of us were experts in this area, so we went back to the mod_auth_openidc documentation and read it line by line to understand how the module works. Persistence paid off: once we understood the principles, we tried various ways of breaking the system and finally reproduced the problem. Once we could reproduce it, the solution followed quickly, because the cause turned out to be quite silly.
The problem
Before explaining the problem, let's understand how mod_auth_openidc works:
- When a client sends a request to the server, the server responds with a temporary session cookie, which we call a state cookie.
- The client carries this temporary state cookie while verifying its identity at the verification center (the OpenID provider). After verification, the client returns to the server, which checks the result against the state cookie.
- If the verification passes, the server deletes the state cookie and issues a persistent session cookie.
- The client uses this persistent session cookie to authenticate subsequent requests.
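The four steps above can be sketched as a toy state machine. This is only an illustration of the flow as described here, not mod_auth_openidc's actual implementation: the `ToyServer` class, the cookie names `state` and `session`, and the `authenticated_at_idp` flag are all assumptions made for the example.

```python
import secrets

class ToyServer:
    """Toy model of the state-cookie -> session-cookie handshake."""

    def __init__(self):
        self.pending_states = set()   # state cookies issued, not yet redeemed
        self.sessions = set()         # valid session cookie values

    def handle(self, cookies, authenticated_at_idp=False):
        """Return (action, cookies_after) for one request."""
        cookies = dict(cookies)
        # Step 4: a valid session cookie means the user is already logged in.
        if cookies.get("session") in self.sessions:
            return "serve", cookies
        # Steps 2-3: back from the verification center with a matching state cookie.
        state = cookies.get("state")
        if authenticated_at_idp and state in self.pending_states:
            self.pending_states.discard(state)
            cookies.pop("state")
            session = secrets.token_hex(8)
            self.sessions.add(session)
            cookies["session"] = session
            return "serve", cookies
        # Step 1: otherwise issue a state cookie and redirect to the verification center.
        state = secrets.token_hex(8)
        self.pending_states.add(state)
        cookies["state"] = state
        return "redirect_to_idp", cookies

server = ToyServer()
action, jar = server.handle({})                            # first visit
print(action)                                              # redirect_to_idp
action, jar = server.handle(jar, authenticated_at_idp=True)
print(action, sorted(jar))                                 # serve ['session']
```

In the healthy case the state cookie is swapped for a session cookie exactly once, and every later request is served directly.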
The problem lies in this persistent session cookie. After repeated checks, we confirmed that we could successfully obtain the session cookie from the server. However, even with this session cookie, the server still assigned us a state cookie and redirected us to the verification center page again and again. The result was a repeating cycle: generate a state cookie -> convert it to a session cookie -> generate a new state cookie -> overwrite the previous session cookie.
Now that the problem has been identified, what could be the cause?
We discovered that the server wasn't using the newly generated session cookie for verification at all! Instead, it was using a different session cookie, which caused the verification to fail. The problem can be reproduced as follows:
- Perform OpenID verification on product A (a.xxx.com) and obtain a session cookie. Product A sets the domain of this session cookie to the top-level domain xxx.com instead of a.xxx.com, so the browser also sends it to every other subdomain.
- Then attempt to verify on product B (b.xxx.com). An infinite loop occurs, because product B's own session cookie is scoped to the subdomain b.xxx.com and now has to compete with product A's cookie of the same name.
In simple terms, when two OpenID session cookies are present, the server picks the one set on the top-level domain for verification. Product A's cookie, wrongly scoped to the top-level domain, therefore shadows our correctly scoped cookie, and our product can never be verified.
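The collision can be modeled in a few lines. The sketch below rests on three assumptions: both products use the same cookie name (taken here to be the module's default, `mod_auth_openidc_session`), the browser lists same-path cookies oldest first as RFC 6265 recommends, and the server reads the first cookie with a matching name. The `Cookie` class and the helper functions are hypothetical, and the server's real selection logic may differ.

```python
from dataclasses import dataclass

@dataclass
class Cookie:
    name: str
    value: str
    domain: str       # domain the cookie is scoped to
    host_only: bool   # True: exact host only; False: host and its subdomains
    created: int      # creation order in the browser

def matches(cookie, host):
    """Simplified cookie domain matching."""
    if cookie.host_only:
        return host == cookie.domain
    return host == cookie.domain or host.endswith("." + cookie.domain)

def cookie_header(jar, host):
    """Cookies a browser would send to `host`, oldest first (paths assumed equal)."""
    sent = sorted((c for c in jar if matches(c, host)), key=lambda c: c.created)
    return [(c.name, c.value) for c in sent]

def first_match(header, name):
    """Assumed server behavior: read the first cookie with this name."""
    for key, value in header:
        if key == name:
            return value
    return None

NAME = "mod_auth_openidc_session"
jar = [
    Cookie(NAME, "session-from-A", "xxx.com", host_only=False, created=1),
    Cookie(NAME, "session-from-B", "b.xxx.com", host_only=True, created=2),
]
header = cookie_header(jar, "b.xxx.com")
print(len(header))                 # 2: both cookies reach product B
print(first_match(header, NAME))   # session-from-A
```

Product B receives both cookies, reads product A's value, fails to validate it, and starts the login flow again, even though its own valid cookie was sent in the same request.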
In other words, we could not reproduce the problem because we had never used the "problematic" product A and therefore never obtained the problematic cookie!
A solution emerged: simply give our session cookie its own name, so that it no longer matters which domain another product's cookie is set on.
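A minimal sketch of why renaming the cookie is enough, assuming the server reads the first cookie with a matching name. The name `product_b_session` is hypothetical. In mod_auth_openidc the session cookie name appears to be set with the `OIDCCookie` directive, which is worth confirming against the module's documentation before relying on it.

```python
def first_match(header, name):
    """Return the first cookie value with this name, or None."""
    for key, value in header:
        if key == name:
            return value
    return None

def header_seen_by_b(b_cookie_name):
    """Cookie pairs product B receives, with product A's wide cookie listed first."""
    return [
        ("mod_auth_openidc_session", "session-from-A"),  # Domain=xxx.com
        (b_cookie_name, "session-from-B"),               # host-only, b.xxx.com
    ]

# Shared name: B reads A's cookie and rejects it.
shared = "mod_auth_openidc_session"
print(first_match(header_seen_by_b(shared), shared))   # session-from-A

# Distinct name (hypothetical "product_b_session"): B reads its own cookie.
own = "product_b_session"
print(first_match(header_seen_by_b(own), own))         # session-from-B
```

Product A's cookie still reaches product B, but it is ignored because it no longer has the name product B looks for.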
The truth came out, and I was truly heartbroken.
Further Thoughts
From a technical standpoint, this is a very basic error, but it was also beyond our control. We restricted our cookies strictly to our own subdomain, but as soon as another team chose a broader domain setting, it caused problems. This issue affects not only our team but every team in the company using OpenID. We happened to hit it because our users were using the "problematic" product at the same time.
After fixing the issue, I also sent an email to the mod_auth_openidc maintainers, hoping that the problem could be resolved at its source. Prioritizing the main domain is not inherently wrong, but the cookie for the domain the request was sent to should be checked first, or at the very least the cookies of both the subdomain and the main domain should all be checked. Failing that, the documentation should at least give a strong warning and best practices for configuring the cookie domain.
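The "check all of them" suggestion could look roughly like the sketch below. It illustrates the proposal only and is not how mod_auth_openidc behaves. The `select_session` function, the `is_valid` callback, and the in-memory session store are hypothetical.

```python
def select_session(header, name, is_valid):
    """Return the first cookie value with this name that passes validation."""
    for key, value in header:
        if key == name and is_valid(value):
            return value
    return None

# Hypothetical store of sessions that product B itself issued.
sessions_known_to_b = {"session-from-B"}

header = [
    ("mod_auth_openidc_session", "session-from-A"),  # wide cookie from product A
    ("mod_auth_openidc_session", "session-from-B"),  # product B's own cookie
]
picked = select_session(header, "mod_auth_openidc_session",
                        lambda v: v in sessions_known_to_b)
print(picked)  # session-from-B
```

Instead of failing on the first candidate, the server skips cookies it cannot validate and only starts a new login when none of them is valid.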
From a technical standpoint, front-end development is incredibly complex and tedious. Mishandling even the smallest detail can sow the seeds of future problems. This particular case involved both front-end behavior and server configuration that would normally belong to DevOps, which is quite challenging for someone who isn't a specialist. The current trend is to merge Dev and Ops, but is that really a good thing? Without specialized expertise, how can configuration issues like this be avoided?
From a team perspective, I'm starting to understand why many engineers stay at a certain level for their whole careers. There is always a level where salary and responsibility are roughly balanced. As you move up, you take on more responsibility, but the salary doesn't necessarily increase proportionally. The higher you go, the more powerless you feel when you run into problems. You used to be able to rely on more senior people to help you solve them, but as you advance, fewer people can help you, while more and more people depend on you.
But if you don't move up the ladder, how can you achieve self-realization through decision-making?
Original articles on this site are licensed under the Attribution-NonCommercial-ShareAlike 4.0 License (CC BY-NC-SA 4.0). Please retain the following attribution when sharing or adapting:
Original author: Jake Tao; source: "A chronicle of the days and nights plagued by bizarre bugs"