Realpath Cache in Depth - or - Fixing Neos Deployment Instability

On one of our sites where we employ "classical" deployment (without Docker containers), we ran into weird deployment-related problems: Sometimes the server kept delivering the old markup instead of the updated one. This story is a deep-dive into what we found, what we learned and how we finally fixed it. In the end, it's just a tiny NGINX configuration change ;-)

Our Deployment works by Symlinking

To allow a smooth zero-downtime deployment, we use the method popularized by Capistrano (and implemented in many deployment tools such as TYPO3 Surf, or Ansistrano which we use here): We build up a copy of the website code, and when everything is ready and all caches are filled, we switch a "current" symlink to the new version for an atomic switch. Thus, the folder structure looks roughly like this:

/deploymentRoot
    /current --> symlink to ./releases/3 (the current release)
    /releases
        /1
        /2
        /3
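As a minimal sketch (in PHP, with illustrative paths - in practice, the deployment tool performs this step for us), the atomic switch boils down to creating the new symlink under a temporary name and then renaming it over "current", which the filesystem performs atomically:

<?php
// Sketch only: atomic switch of the "current" symlink (illustrative paths,
// not our actual deployment code - Ansistrano & co. handle this for us).
$newRelease = '/deploymentRoot/releases/3';
$tmpLink    = '/deploymentRoot/current_tmp';

symlink($newRelease, $tmpLink);               // create the new symlink under a temporary name
rename($tmpLink, '/deploymentRoot/current');  // rename() replaces the old symlink atomically on Linux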

The application is not fully self-contained in the folders above: Additionally, we use Redis for caching - with the OptimizedRedisCacheBackend for Neos.

Problem 1: Wrong page cache content in Redis

Symptoms: After a deployment, the page sometimes still showed the old markup instead of the new markup. The problem did not go away after waiting for a while; instead, the content cache (in Redis) had to be flushed manually.

Problem Analysis: In detail, our deployment worked in the following steps:

  1. transfer code into new release folder
  2. clear caches
  3. warmup caches
  4. migrate database
  5. switch current symlink

Now, clearing the caches cleared all caches, including the content cache inside Redis. If a user hit the website after the cache was flushed but before the current symlink was switched (i.e. after step 2 and before step 5), the cache would be filled again with the old content.

Solution Idea 1: Move Clear-Cache after the switch

Our first idea after analyzing the above scenario was simply to move the clear-cache step after the switch. While this fixed most of the issues, it did not fix the problem in all scenarios. To understand this, we have to look more closely at the moment the switch happens. Let's assume the following happens:

  1. a user requests the page right before the switch happens. The page starts being generated by the OLD codebase. Let's assume the process stalls here for a bit (e.g. because generation takes longer).
  2. we switch the current symlink, and we clear the cache.
  3. The request from 1. continues working, and writes the old state to the shared cache.

Boom - our cache persistently contains the wrong state. It's very hard to fully mitigate this problem because we cannot easily know how long a request may stall (and we do not want to rely on PHP's max_execution_time or things like that).

So we need to come up with something else.

Solution Idea 2: Segment the cache based on the installation folder

During investigation, we stumbled upon this commit, where Flow's Application Identifier (which I did not know about before btw - super handy!) is used as a prefix for cache identifiers. In our case, the Application Identifier changes for every release, because the base path for our Flow installation changes with every release.

We updated the OptimizedRedisCacheBackend to use the same mechanism. This way, each release uses a new Application Identifier and thus writes to a separate part of the cache.
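To illustrate the idea (this is not the actual Flow / OptimizedRedisCacheBackend code - the names and the md5-based identifier are simplified assumptions): every cache entry identifier is prefixed with an identifier derived from the installation path, so each release reads and writes only its own segment of the shared Redis cache.

<?php
// Simplified sketch of per-release cache segmentation (NOT the real Flow code):
// the application identifier is derived from the resolved installation path,
// so it changes whenever the release folder changes.
$applicationIdentifier = md5(realpath('/deploymentRoot/current')); // e.g. hash of ".../releases/3"

function buildRedisKey(string $applicationIdentifier, string $entryIdentifier): string
{
    // old and new releases use different prefixes and never read each other's entries
    return $applicationIdentifier . ':' . $entryIdentifier;
}

echo buildRedisKey($applicationIdentifier, 'Neos_Fusion_Content_somePage'), PHP_EOL;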

In our error scenario from above, the old request would still write old cache entries - but these are never read again, so that's not a problem anymore.

Problem Solved. Or so we thought ;-)

Problem 2: The Problem is Back. Somehow. A bit.

After deploying to production for some more time, we noticed that we sometimes still received the old markup (together with our new CSS, which did not match the old markup anymore).

We again checked the Redis cache and quickly saw that everything in there was working correctly: The cache was properly segmented into an "old" and a "new" part, and it contained the correct markup.

How could it happen, then, that we still saw the wrong markup? We needed more data to understand this, so we wrote a script which hammered the server with requests in an endless loop, allowing us to understand the "switch-point" in more detail:

#!/bin/bash
rm -Rf work; mkdir work; cd work;
export i=$((100000))
while true; do
    curl -o $i-`date '+%s'`.html https://server-experiencing-this-problem/
    i=$((i+1))
done

# the grep below shows us the currently used CSS cache bust, and the markup following the string "three-column"
# grep -E -o '(.{0,3}App\.css\?bust.{0,4}|three-column.{0,14})' work/*.html

After running this script around the switch-point, we were able to see which CURL requests got the old page returned, and which requests got the new page returned. On the following screenshot, you see the results: The red arrow displays the switchpoint, and all the yellow markers on the right side show the old website state. You can ignore the actual console output itself.

That was surprising! After the switch, quite a few requests still returned the old website. Over time, the old website responses became fewer and fewer.

At this point, we did not have a clear understanding of what was going on. We just knew it was a server-side issue, so we considered four scenarios:

  1. an issue involving the PHP Opcache
  2. an issue with PHP-FPM
  3. an issue with nginx
  4. or a combination of the above.

We now tested these scenarios one-by-one, always rolling back the website state and deploying the same version.

When we disabled the opcache, the results looked just like above - with no visible change. So we knew at least that the opcache did not contribute significantly to the problem.

To test whether PHP-FPM might be the issue, we hypothesized that some state might remain in main memory after the end of a PHP request and somehow influence the next request. We had no clue yet what this state could be, but we simply restarted the PHP-FPM processes manually right after the switch-point. And voilà: this is what we saw - the first arrow depicts the switch and the second arrow shows when PHP-FPM was restarted.

Now, we were hooked: Restarting PHP actually fixed the issue? That totally clashed with our mental model of how PHP behaves (stateless, etc...).

We then checked phpinfo(), which shows all PHP options and installed extensions, to figure out whether an extension or some configuration option was causing this behavior. After looking closely through the list (we did not have many extensions installed), we stumbled upon the PHP realpath cache. After some googling, we found this article by the fine folks at Tideways, which hints that the realpath cache might cause issues when deploying with symlinks (which we did in this project).

Background: What is the Realpath Cache?

The realpath cache is a per-process PHP cache which remembers, for a given path, what its absolute path is. It is used for every relative file operation - every include, require, fopen, ... - to improve performance. The realpath cache can also be flushed manually, but only for the currently running process (and not for all processes in the worker pool).

In our case, this can lead to problems if the cache contains entries such as "current/SomeFile.php => releases/2/SomeFile.php": If a new release appears ("releases/3/SomeFile.php"), only the realpath cache of the currently running PHP process is flushed. Thus, that process correctly contains the cache entry "current/SomeFile.php => releases/3/SomeFile.php", while all other PHP processes still contain the cache entry pointing towards "releases/2/SomeFile.php".
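You can observe this per-process behavior yourself with a few lines of PHP (the paths below are illustrative assumptions):

<?php
// Small experiment to inspect the realpath cache of the current PHP process.
var_dump(realpath('/deploymentRoot/current/SomeFile.php')); // e.g. "/deploymentRoot/releases/2/SomeFile.php"

// realpath_cache_get() returns the cached resolutions of THIS process only,
// including the resolved path and the remaining TTL of each entry.
print_r(realpath_cache_get());

// clearstatcache(true) also flushes the realpath cache - but again only for the
// process executing this call, not for the other workers in the PHP-FPM pool.
clearstatcache(true);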

The individual realpath cache TTL of each worker process thus explains the behavior we saw above: After the deployment, the old content is still delivered for some requests, then for fewer and fewer, until the problem fades away once the stale entries have expired in (or been flushed from) all worker processes.

How to fix the realpath cache problem?

We could simply have disabled the realpath cache and called it a day, but we felt this would be an inelegant solution with unknown performance implications. Luckily, via our Google query from above, we also found this mail thread on php-internals from 2015, where Sebastian Bergmann (PHPUnit) discusses the exact same issue with Rasmus Lerdorf (the creator of PHP) - and where Rasmus points out the solution of using $realpath_root in the nginx configuration instead of $document_root.

This means our NGINX configuration changed as follows:

# OLD:
location ~ \.php$ {
    include /usr/local/etc/nginx/fastcgi_params;
    try_files $uri =404;
    fastcgi_pass unix:/var/run/php-fpm/php-fpm.socket;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

# NEW:
location ~ \.php$ {
    include /usr/local/etc/nginx/fastcgi_params;
    try_files $uri =404;
    fastcgi_pass unix:/var/run/php-fpm/php-fpm.socket;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
}

How does nginx $realpath_root fix the issue?

Background: PHP-FPM opens a socket and spawns a few worker processes - and nginx then passes requests on to PHP-FPM via the FastCGI protocol, along with some configuration (fastcgi_param). These fastcgi_params are a bit of black magic (at least for me - I usually copy/paste them, and I don't know of a clear reference on the web). The SCRIPT_FILENAME parameter points to the physical location of the script on disk.

If we use $document_root$fastcgi_script_name (old), it points e.g. to /var/www/current/MyScript.php, whereas with the new configuration via $realpath_root$fastcgi_script_name, nginx itself resolves the symlink and tells PHP to execute the script at /var/www/releases/3/MyScript.php.
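A simple way to see the difference is a tiny check script dropped into the document root (a hypothetical helper, e.g. check.php): with $document_root, SCRIPT_FILENAME still contains the "current" symlink, while with $realpath_root it already points into the resolved releases folder.

<?php
// check.php - hypothetical helper to inspect what nginx passes to PHP-FPM
header('Content-Type: text/plain');
echo 'SCRIPT_FILENAME:    ', $_SERVER['SCRIPT_FILENAME'] ?? '(not set)', PHP_EOL;
echo 'DOCUMENT_ROOT:      ', $_SERVER['DOCUMENT_ROOT'] ?? '(not set)', PHP_EOL;
echo 'realpath(__FILE__): ', realpath(__FILE__), PHP_EOL; // always the resolved releases/<n> path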

This means that the realpath cache ends up with completely different cache entries for every release, because the file which invokes everything else already lives at its resolved, release-specific location. A very elegant solution to this problem - thanks Sebastian and Rasmus for discussing it on the php-internals list :-)

Another related fix: disabling opcache.revalidate_path

Before this change went live, we had forced the opcache to revalidate all paths (stemming from an earlier problem where the PHP opcache was the culprit). It turns out that with the nginx fix from above, we can disable opcache.revalidate_path again, because - once more - every release has a different base path, leading to non-overlapping opcache entries. We could thus disable the old fix by commenting it all out:

; !!! we do not need this fix anymore, because we now use $realpath_root in nginx PHP-FPM params.
; [opcache]
; opcache.revalidate_freq = 0
; opcache.revalidate_path = On

Closing Thoughts

That was quite a long ride - we learned a lot along the way and now understand the PHP runtime a bit better again. I hope you enjoyed the read - I've written down what we learned so that I'll remember it myself, and hopefully I'll stumble upon this blog post if I ever run into this problem again :-)