Understanding Unknown Source Code

Sebastian Kurfürst, 19.09.2019

In our daily work, we are often confronted with big piles of unknown source code: We use big frameworks where we usually know some surface area, but never all the nitty-gritty details across all layers. Sometimes, we see projects written by other companies or co-workers, and we need to navigate these as well. Last, when we did not work in a project for some months, and start touching the code again, we may have forgotten many intricate details – so the approach outlined here fits there as well.

In this post, I outline our method on how to read, understand and navigate source code in unknown, big projects.

tl;dr: Locate the place in the code where the bug/issue surfaces. Verify your hypothesis by putting in break/die/echo/debug statements. Once you are sure you are in the right place, continue by looking at the surrounding code, and try to understand the how and why.

tl;dr (2): At the bottom of this article, you'll find a video of the workshop in German; and also a practical example.

We need to answer a specific question or fix a bug.

There are two distinct goals you often have when checking out foreign source code:

I need to answer a specific question or fix a bug.
I want to understand the architecture of foreign software.

Let us start by seeking the answer to a specific question. At the end of this post, we outline additional strategies for understanding the architecture of foreign software.

So, let's get started.

1. Ensure you have the source code locally

The source code which contains the answer to your question should be available locally on your disk, ready for easy searching. While this sounds trivial, big projects are often composed of many packages, and many Git repositories as well. I usually do the following things if finding the source code is not obvious:

Try to find the GitHub organization of the project in question.
Check out the pinned repositories; they are usually the most important ones.
Also, have a look at the GitHub Stars - they also are a good indicator of importance.

Then, after I found the source code, I usually do the following steps:

Figure out in which language the source code is written by having a rough look at the repository.
Clone the source code.
Install the right IntelliJ IDEA plugin. This is not strictly necessary, but to me it often helps to load the code in a proper IDE with code insight.

2. Finding an entry point

Now, I try to find some entry point related to my question – usually by fulltext-searching the codebase. I often either use grep or the built-in search of IDEA. The following tips help to find entry points:

Search for static text you see in the UI, e.g. a menu entry name, a CSS class, or some console output string. You will often find these strings in a localization file. There, you can usually find an internal identifier – and when you search for that one in the code base, you usually get some source code match.
Look out for stack traces: In case of an error, check the output, the browser console (if applicable), and the log files for stack traces. These contain extremely valuable information, usually including the method, file and line where the error occurred. Be sure to learn how to read stack traces.
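The two-step search from the first tip can be sketched in a few lines of Python. This is only an illustration of what grep or the IDE search does for you: the project layout, the string "Export as PDF" and the identifier `menu.export.label` are all made up for the example.

```python
import tempfile
from pathlib import Path


def search_tree(root, needle):
    """Yield (path, line_number, line) for every line under root containing needle."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()


def build_demo_project(root):
    """Create a tiny, made-up project: one localization file, one source file."""
    root = Path(root)
    (root / "locale").mkdir()
    (root / "src").mkdir()
    (root / "locale" / "en.json").write_text(
        '{\n  "menu.export.label": "Export as PDF"\n}\n', encoding="utf-8"
    )
    (root / "src" / "menu.py").write_text(
        'label = translate("menu.export.label")\n', encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_project(tmp)
    # Step 1: search for the text seen in the UI; this leads to the localization file.
    for path, number, line in search_tree(tmp, "Export as PDF"):
        print(path.relative_to(tmp), number, line)
    # Step 2: search for the internal identifier; this leads to the source code.
    for path, number, line in search_tree(tmp, "menu.export.label"):
        print(path.relative_to(tmp), number, line)
```

In a real project, `grep -rn` or the IDE's "Find in Files" is of course the faster tool; the point is the order of the two searches.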

Now, we hopefully have an assumption that the code we found is related to the thing we want to find.

Stack Traces contain extremely valuable information. Be sure to learn how to read them!
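As a small illustration of what a stack trace tells you, here is a sketch in Python using the standard `traceback` module. The functions `start_app` and `load_config` are hypothetical. Note that Python prints the most recent call last, whereas for example Java prints it first, so check the convention of your platform.

```python
import traceback


def load_config(settings):
    return settings["database"]["host"]  # raises KeyError if a key is missing


def start_app(settings):
    return load_config(settings)


def innermost_frame(error):
    """Return (function name, line number) of the frame where the error was raised."""
    frames = traceback.extract_tb(error.__traceback__)
    last = frames[-1]  # Python lists the most recent call last
    return last.name, last.lineno


try:
    start_app({"database": {}})
except KeyError as error:
    traceback.print_exc()  # the full trace: the call chain that led to the error
    print("raised in:", innermost_frame(error))
```

The innermost frame is where the error surfaced; the frames before it show how the program got there, which is often the more interesting part.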

3. Experiment: Is this the right entry point?

After having found an entry point, be sure to validate this assumption - so try to ensure that the program is really executing the code you are seeing at the position you are expecting.

I have described my approach for these experiments in another blog post.

This depends a little on the language/platform the application is written in:

For compiled languages like Java, it is usually easiest to place a breakpoint in the debugger.
For interpreted languages like PHP or JavaScript, it might be easier to insert logging code (console.log, var_dump, ...) and/or exit() statements in the code.

In all cases, check how often the method you are modifying/debugging is called! I often had the assumption that the code in question would be called exactly once, then I started debugging (and wondering) for quite a while; and finally I figured out that the code was called 5 times, where the 3rd call is the one I would be looking for.
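One cheap way to check the call count is to number the calls in the logging output. The following is a minimal sketch in Python; `render_node` is a hypothetical stand-in for the method under investigation, and the decorator is a temporary debugging aid, not something to commit.

```python
import functools


def count_calls(func):
    """Wrap func so that every call is counted and numbered in the output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        print(f"{func.__name__} call #{wrapper.calls}: args={args} kwargs={kwargs}")
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def render_node(name):  # hypothetical method we are debugging
    return f"<{name}>"


for node in ["header", "menu", "content"]:
    render_node(node)

print("total calls:", render_node.calls)
```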

For code which is called many times, I often add conditional breakpoints or logging inside custom if-statements, to let me focus more on a specific code path.
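Such a custom if-statement could look like the sketch below, here in Python. The function `apply_discount` and the order data are invented for the example; the condition simply narrows the output down to the one call I care about.

```python
def apply_discount(order):
    """Hypothetical core function that is called for every order."""
    # Temporary debugging aid: report only the case under investigation.
    if order["customer"] == "acme" and order["total"] > 1000:
        print("DEBUG apply_discount:", order)
        # breakpoint()  # uncomment to stop in the debugger for this case only
    if order["total"] > 1000:
        return order["total"] - 100
    return order["total"]


orders = [
    {"customer": "acme", "total": 200},
    {"customer": "globex", "total": 5000},
    {"customer": "acme", "total": 1500},
]
results = [apply_discount(order) for order in orders]
print(results)
```

Only the third order triggers the debug output, although the function runs three times.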

Side note: keeping an overview

Before we continue with the next steps, here are a few techniques which help me keep an overview while navigating bigger, unknown code bases:

write down notes or small sketches on a piece of paper
use the bookmark function in the IDE (for IntelliJ IDEA it is called "Favourites")
add comments in the source code with notes or TODOs

Now, let us continue finding the relevant code!

4. Navigating to the core of the problem

After having found and validated our entry point into the problem, we usually have to go a few steps (through different classes/files/methods) until we have found the core of our problem.

In many cases, we are looking for a single algorithm or colocated code, in the same class or file.

To get an overview of the class at hand (i.e. our entrypoint at the start), the following techniques are helpful:

Skim through the file, to get a rough understanding. Do not read every line!
What is the public API of this class/file? (i.e. what is marked public or exported?) Sometimes, a long class has only very small parts which are public, and which serve as the entry points to the inner code – this helps to reduce complexity.
Try to find "Why" comments in the class/file, which explain the purpose of the class and the relation to other classes.
If you do not understand certain syntax of the programming language, look that up – it is absolutely crucial that you are not hindered by not being able to understand what is written there.
Search for related code (again using grep, or the IDE's "find usages of this method" and "go to implementation" tools). Then start again - until you have found the core you are looking for.
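To illustrate the question about the public API: in a language with runtime introspection, you can list the public surface of a class instead of scrolling through it. Here is a sketch in Python, where "public" is only a naming convention (no leading underscore); `ReportBuilder` is a hypothetical class standing in for a long, unknown one.

```python
import inspect


class ReportBuilder:  # hypothetical class standing in for a long, unknown one
    def build(self, rows):
        return self._render(self._normalize(rows))

    def _normalize(self, rows):
        return [str(row).strip() for row in rows]

    def _render(self, rows):
        return "\n".join(rows)


def public_api(cls):
    """Return names and signatures of all methods not marked private by convention."""
    return {
        name: str(inspect.signature(member))
        for name, member in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    }


print(public_api(ReportBuilder))
```

Here, only `build` remains, which tells me where to start reading. In an IDE, the structure view filtered by visibility gives a similar overview.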

When navigating through the code, do small experiments using output or the debugger, as explained in step 3. Use them to validate or falsify your assumptions.

Accept code naming, do not question it

While navigating through foreign code, try to accept names of classes / methods / variables, instead of questioning them ("Why is this named PolicyJoinPointAspect?") or putting too much weight in the names ("It is called FooService, so it must be stateless.").

Why? Let me explain that with a thought experiment. When I personally write some code, I usually have a pretty clear understanding what I need to touch, how the algorithm will roughly need to work, and how I can achieve my goal. I personally sometimes even start with names such as foo or bar for variables and methods; because naming things is hard, and I need to get the core thought from my brain into the program – so thinking about the "best" name can distract me during the creative process of finding a good solution.

Then, when the solution is working, sometimes I forget to "polish" the code, think about better names, or I simply use the first names which get into my mind which seem OKish – but often they are far from good. Naming things is hard – and requires lots of discipline to pull through. It is often easier to come up with some custom naming scheme or nomenclature, instead of trying to figure out which established design pattern has actually been implemented.

Thus, my advice is to largely ignore names when reading foreign code – concentrate on the functionality and relationships to other files/classes instead.

Naming things is hard – and requires lots of discipline. When reading code, do not over-emphasize (method|class|file|variable) names.

Do not try to understand everything, but focus on the broad overview.

Remember: We are trying to find a specific algorithm or part of the code. While on the lookout, I usually do not try to understand everything I see – but I just try to get the rough gist. This helps to keep me focused on my goal.

On the other hand, once I have found the core of the problem I am looking for, I scrutinize the code in detail (explained later on).

Meta: Mindset for analyzing source code

When navigating in code unknown to us, or code bases we have not seen for a longer time, having a specific mindset helps to find your way around quicker. I suggest that you specifically try to distinguish between the following two topics:

Facts, which you have seen and measured (through experiments, as explained above)
Assumptions which you deduced from the facts.

Especially when relying on assumptions, be aware that they are just that – unproven assumptions which can send you on the wrong track.

5. Review multiple examples

Sometimes, it is helpful to look for multiple examples which exercise the core code of your problem – such that you can start to identify similarities or differences. I usually try to find code patterns, and ideally find a representative example which I can then start to adapt to my use case.

6. Build a mental model for the problem core

Now that you have identified the core code parts relevant to your problem, it is time to deep-dive into them and build a more detailed mental model. The goal is to figure out what the original creator might have thought when writing the code. The following questions help me for this:

What parent classes and sub-classes exist?
Do I recognize any well-known design patterns inside the class or in its usages?
Where roughly does the file reside in the package structure?
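The first question can often be answered by the IDE's type hierarchy view. Where the language allows introspection, a few lines of code do the same; the following Python sketch uses a made-up `Storage` hierarchy to show the idea.

```python
class Storage:  # hypothetical class hierarchy for illustration
    pass


class FileStorage(Storage):
    pass


class MemoryStorage(Storage):
    pass


class CachedFileStorage(FileStorage):
    pass


def parents(cls):
    """Return the names of all ancestors, nearest first (method resolution order)."""
    return [base.__name__ for base in cls.__mro__[1:]]


def all_subclasses(cls):
    """Return the names of all direct and indirect subclasses loaded so far."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub.__name__)
        found.extend(all_subclasses(sub))
    return found


print(parents(CachedFileStorage))
print(all_subclasses(Storage))
```

Keep in mind that `__subclasses__()` only knows classes which have already been imported, so the result is a fact about the running program, not about the whole code base.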

I usually do not ask the question "How would I do it?", because chances are very high that the original author would have done things differently. Very likely, the author had a different or more sophisticated mental model compared to my own. Whenever I asked this question, it often led me into a dead end.

7. If possible, modify the code in the core

To fix the issue or answer the specific question, I usually try to modify the code directly in the core algorithm of the system – pretty much like: "In my case, the code I see here is not fully matching my special condition; instead the condition here needs an additional case..."

The following approaches help there:

For interpreted languages, changes are directly picked up; so you can simply modify the code.
For Java, Hot Code Reloading of IntelliJ helps modifying code at run-time.
You can also execute code while being in a breakpoint. This way, it is even possible to modify state, and then continue executing the program.

Then, I test around whether the change actually fixes the issue; and also whether it has unintended side-effects or consequences (like newly introduced bugs). We again try to do this directly in the original source code to keep the feedback loop tight.

If you work with Docker containers, check that the core code you are modifying is also mounted into the container, so that your code changes are picked up. I have had great experiences with Docker WS for VS Code.

8. Extract the modification into your own addon/extension/package.

Now, we know that a certain change in the core solves our issue – but we do not want to patch the core application. Instead, we need to find a way to extract our change into a re-usable package. For this, the core question is: "How does the extensibility system work?" The following tips help to figure this out:

Check out the documentation; usually extensibility is somehow covered. Try to understand whether the system provides planned or unplanned extensibility (or both). In case you have the choice, prefer planned extensibility over unplanned extensibility, because it is better supported and more long-lasting.
Watch out for common extensibility terms like "Plugin", "Extension", "Module", "Slot", "Factory", or "Aspect".
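To make the difference between planned and unplanned extensibility concrete, here is a sketch in Python. The `Exporter` class and its `register_format` hook are invented for this example and do not belong to any real framework: the registry is the planned extension point, while replacing a method from the outside (monkey patching) is the unplanned one.

```python
class Exporter:
    """Stand-in for a core class offering a planned extension point (a format registry)."""

    _formats = {}

    @classmethod
    def register_format(cls, name, func):
        cls._formats[name] = func

    def export(self, name, rows):
        return self._formats[name](rows)


# Shipped with the (hypothetical) core.
Exporter.register_format("csv", lambda rows: ",".join(rows))


# Planned extensibility: our own package adds behaviour through the documented hook.
def to_tsv(rows):
    return "\t".join(rows)


Exporter.register_format("tsv", to_tsv)

# Unplanned extensibility: replacing a core method from the outside.
# It works, but may break whenever the core changes its internals.
_original_export = Exporter.export


def export_with_fallback(self, name, rows):
    if name not in self._formats:
        name = "csv"  # fall back instead of raising a KeyError
    return _original_export(self, name, rows)


Exporter.export = export_with_fallback

print(Exporter().export("tsv", ["a", "b"]))
print(Exporter().export("xml", ["a", "b"]))
```

The second variant depends on the private `_formats` attribute and on the exact signature of `export`, which is why I would only use it when no planned hook exists.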

Finally, extract the core modification into your own package. Do not forget to document it properly :)

Understanding high-level architecture

Understanding the big picture of how an application is built usually requires a different approach than trying to answer a specific question. Out of curiosity, I am sometimes interested in how some bigger applications are structured – on the lookout for inspiration.

For understanding high-level architecture, I usually try to answer the following questions:

How does the system start up? Do I recognize any frameworks at this point?
Look for HTTP endpoints like Controllers or Routes.
Try to skim the package structure from the outside inwards.
Try to figure out the rough layers of the application and whether it follows a pattern like MVC.
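Skimming the package structure from the outside inwards can be supported by a depth-limited directory listing. The following Python sketch prints only the upper levels of a tree; the `app/controllers`, `app/models` and `app/views` layout is a made-up example of an MVC-style project.

```python
import tempfile
from pathlib import Path


def outline(root, max_depth=2):
    """Return the directories below root, indented by depth, down to max_depth levels."""
    lines = []

    def walk(directory, depth):
        if depth > max_depth:
            return
        for child in sorted(p for p in directory.iterdir() if p.is_dir()):
            if child.name.startswith("."):
                continue  # skip .git and similar
            lines.append("  " * (depth - 1) + child.name + "/")
            walk(child, depth + 1)

    walk(Path(root), 1)
    return lines


with tempfile.TemporaryDirectory() as tmp:
    for sub in ["app/controllers", "app/models", "app/views/partials", ".git"]:
        (Path(tmp) / sub).mkdir(parents=True)
    print("\n".join(outline(tmp, max_depth=2)))
```

Increasing `max_depth` step by step mirrors the "outside inwards" reading order: first the top-level packages, then one layer deeper only where it looks interesting.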

Then, I usually poke around at specific points using the method described in this post – this way, you get both a coarse-grained understanding of the application, as well as a more detailed view in some specific areas.

Thanks!

Thanks for making it this far. The approach described here possibly has some blind spots – so please share your feedback with us on how you navigate unknown code!

Looking forward to great discussions, and all the best,
Sebastian

Unfortunately, only my voice can be heard in the video below, and not the comments of my teammates. I hope it is still possible to follow along.

From 44:40 min onward, I demonstrate the described concept again using a practical example: the "Scroll from Source" functionality in IntelliJ / PHPStorm.