About a week ago, a few tests for Sydent (an identity server for Matrix that I am working on) randomly started failing. This, of course, was an occasion for some consternation on my part. It’s a little bit disturbing when tests start to fail but you know you haven’t changed anything! However, in this case the failures turned out to be a blessing as I learned a great deal in the process of hunting down and rectifying the issue. In this blog post I am going to go through the process, in great detail, of how I fixed this error. Since the code for Sydent is open-source, you can even follow along in the code if you wish!
When I noticed the failing tests, the first order of business was to find out whether any recent changes in my code were causing the failures. To test this, I used git checkout to switch back to a commit from a week earlier, where I knew for a fact that the tests had been passing. I ran the tests on that commit and they still failed, which meant that whatever was broken, it wasn’t something I wrote. So I had to dig deeper.
I started by looking at exactly which tests were failing. To begin with, get_terms, the very first test in this file, was failing: https://github.com/matrix-org/matrix-is-tester/blob/main/matrix_is_tester/test_terms.py. Other tests further down relied on it as well. In the whole test suite, only the tests pertaining to the terms of service were failing, which made me confident that if I focused my attention there I could find the problem. After a little investigation it became obvious that all the other terms-related tests were failing because we couldn’t retrieve the terms needed to run them in the first place.
A deeper look into the code showed that ‘get_terms()’ itself hits the /matrix/identity/v2/terms endpoint (kind of; technically it depends on whether you are using the v1 or v2 API, but that’s beside the point). You can see this in the code here: https://github.com/matrix-org/matrix-is-tester/blob/abd522aa057df71d7f5534b7c480ecde1b1d96b8/matrix_is_tester/is_api.py#L169. This endpoint should return a copy of the terms, so it seemed that the issue was in the Sydent server itself. All of the other tests were passing, which meant that the server was, for the most part, working, so the issue had to be specific to this particular endpoint.
Venturing deeper into the Sydent code, I found that Sydent uses a terms servlet to handle these requests: https://github.com/matrix-org/sydent/blob/b3b3a270538f6abf8963d99872477e07bb9b0b99/sydent/http/servlets/termsservlet.py#L41. After running the tests and setting a few breakpoints I found that this servlet seemed to be loading and functioning fine, so that wasn’t the problem.
Next I checked inside the render_GET() function of the terms servlet, and found that it calls a function get_terms(), which is defined in the module terms: https://github.com/matrix-org/sydent/blob/b3b3a270538f6abf8963d99872477e07bb9b0b99/sydent/terms/terms.py#L98.
This is where I found the problem! On line 110 of the above file, this function assigns the return value of yaml.full_load(fp) to the local variable termsYaml, where fp is the file object returned by a call to open() on the local file where the terms are kept. However, this particular call was failing. Execution stopped with an error in the debugger telling me that the function ‘full_load()’ didn’t exist. A call to dir(yaml) confirmed this, as full_load wasn’t among the names it listed.
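The dir(yaml) check above can be turned into a small reusable diagnostic. What follows is only a rough sketch, not code from Sydent: the helper name inspect_attribute is made up for illustration, and the demo runs against the standard-library json module so that it works without PyYAML installed. It reports whether a module exposes a given name and, when the package metadata is available, which version is installed, since a function can be present in one release of a library and absent in another.

```python
import importlib
from importlib import metadata


def inspect_attribute(module_name, attr_name, dist_name=None):
    """Report whether a module exposes attr_name, plus its installed version.

    inspect_attribute is a hypothetical helper written for this post.
    dist_name is the name the package is installed under, which can differ
    from the import name (for example "PyYAML" versus "yaml").
    """
    module = importlib.import_module(module_name)
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        # Standard-library modules, for instance, have no package metadata.
        version = None
    return {
        "module": module_name,
        "version": version,
        "has_attr": hasattr(module, attr_name),
    }


if __name__ == "__main__":
    # Demo against a standard-library module so the sketch runs anywhere.
    print(inspect_attribute("json", "loads"))
    print(inspect_attribute("json", "full_load"))
```

For the situation in this post, the equivalent call would be inspect_attribute("yaml", "full_load", dist_name="PyYAML"), run inside the same environment the tests use.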
This was baffling to me. This code had been in use for quite a while, so how could this be possible? I felt like I had witnessed a glitch in the matrix. But I knew there had to be an explanation. I did some research and found that, according to our setup.py, we were using PyYAML to read and manipulate YAML files. A trip to PyYAML’s documentation https://pyyaml.org/wiki/PyYAMLDocumentation (which leaves much to be desired, tbh) confirmed for me that full_load() is not a part of the API. The documentation did let me know that the API is unstable, so maybe somehow it got removed? Truly a glitch in the matrix. I changed the call from yaml.full_load() to yaml.safe_load(), and voilà! The tests went green.
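One possible way to get a clearer error the next time a dependency’s API shifts is a small check that runs at startup and names whatever is missing. The sketch below is an assumption-laden illustration rather than anything Sydent actually does: require_callables is a hypothetical helper, and json stands in for yaml so the example runs with only the standard library.

```python
import json  # stand-in for the demo; for the case in this post it would be yaml


def require_callables(module, names):
    """Raise ImportError naming every expected function the module lacks.

    require_callables is a hypothetical helper, not part of Sydent or PyYAML.
    """
    missing = [name for name in names if not callable(getattr(module, name, None))]
    if missing:
        raise ImportError(
            f"{module.__name__} is missing expected functions: {', '.join(missing)}"
        )


# Both functions exist in json, so this passes silently.
require_callables(json, ["load", "loads"])

# json has no full_load, so this raises with a message naming it.
try:
    require_callables(json, ["load", "full_load"])
except ImportError as exc:
    message = str(exc)
    print(message)
```

A check like this doesn’t prevent the breakage, but it turns a puzzling failure deep inside a request handler into a message that points straight at the dependency.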
Before this experience, I had read phrases like “upstream API changes breaking CI”, but had never witnessed what that really meant. I knew what each of those words meant individually, and could even guess at what they meant together, but working through a problem like this shone a different light on the phrase. So much of programming seems to be not only knowing the concepts, but also having a working knowledge of the contexts in which you might encounter them and understanding how they might impact your code.
This experience taught me a few things: the value of stable APIs, the value of communication with regard to releases and changes in APIs or programs that people might rely on, and the power of following the thread of execution back through a program. It is one thing to have a general idea of how code works, and another thing entirely to walk through the code and watch how the sausage is made, so to speak. I now have a much better idea of how HTTP requests are handled in Sydent, which is very cool. I followed the path from a URL back into the server and saw how the code worked to serve the requested file, and it was very enlightening.