In this follow-up post, we take a closer look at one particular problem our Virus Analysts commonly face: finding large quantities of similar or identical samples. We lay out how we use our graph database to tackle this problem and support our analysts.
From a threat actor’s perspective, obfuscating malicious software is a crucial requirement for having a chance at bypassing static detection mechanisms. On the one hand, obfuscation ensures that analyzing the malware and extracting features from it is not trivial (depending on the complexity of the obfuscation method used). On the other hand, it can additionally ensure that two functionally equal (or even identical) samples have divergent representations on the file system. In this blog post we focus on the latter aspect, and we demonstrate how we use our graph database to partially mitigate the effects of obfuscation.
When malware (or any other software) evolves over time, its appearance on the file system obviously changes from version to version, while a common core functionality is kept. Taking additionally into account that functionally identical samples can have representations that differ from one another, a challenging problem for our Virus Analysts is to find a large number of (and ideally all) samples belonging to a malware family. This is our term for a set of samples sharing a substantial common core of functionality. We benefit directly from solving this problem, as it enables our analysts to write family-specific signatures matching a large quantity of (ideally all) members of a family. Thereby we implicitly improve our detection capabilities.
In the initial post of this series we introduced our feature extraction and enrichment infrastructure, which automatically processes roughly 500,000 samples per day from arbitrary input sources we collect. For each sample, our infrastructure associates the sample with its features and its exhibited behavior. Finally, this association is mapped into our graph database under a particular schema. Let’s have a look at Figure 1. It shows a simplified version of the resulting association in the graph.
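To make the following examples concrete, here is a minimal sketch of such a mapping in the Gremlin console, using TinkerGraph as a stand-in for the actual backend. The labels `sample` and `feature`, the edge label `has_feature` and the property keys are purely illustrative and not our production schema:

```groovy
// Toy graph resembling Figure 1: one sample vertex connected to its
// feature vertices A, B, C and D (labels and keys are assumptions).
graph = TinkerGraph.open()
g = graph.traversal()

s1 = g.addV('sample').property('id', 1).next()
['A', 'B', 'C', 'D'].each { name ->
    f = g.addV('feature').property('name', name).next()
    g.addE('has_feature').from(s1).to(f).iterate()
}
```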
The vertices with the different shades of red represent arbitrary features (i.e. static and dynamic features). The different red tones additionally suggest that there is some sort of categorization among them. In this example, C might represent a static feature category (like an import hash), while A and D (both with the same red tone) might represent some sort of dynamic feature category (like resolved hostnames). If we now recall from the previous blog post that we ensure the uniqueness of features in the graph, we can deduce that samples with any degree of equality have at least one common neighbor in the graph, as depicted in Figure 2.
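Such uniqueness is typically ensured at insertion time. A minimal sketch, again with assumed labels, is the well-known fold()/coalesce() “get or create” pattern from the TinkerPop recipes:

```groovy
// Reuse the feature vertex for google.com if it already exists, create it
// otherwise. This guarantees that every sample resolving google.com ends up
// connected to the very same vertex.
feature = g.V().has('feature', 'name', 'google.com')
    .fold()
    .coalesce(
        unfold(),                                         // existing vertex
        addV('feature').property('name', 'google.com'))   // or create it once
    .next()
```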
To stay consistent with the example above, we again assume that the red vertices represent the action of resolving hostnames; here, the hostname can for example be google.com. Both samples exhibit this behavior, hence they both have an edge to the vertex in the middle. Therefore, there is a path between both samples in the graph. And basically this observation is all we need. Starting from one sample, we can find other samples that are equal or even identical by examining its features (neighbors in the graph) and then traversing onwards to other samples. If there is only one common feature (and the hostname google.com is very well known and likely to be resolved), the conclusion that samples 6 and 8 are equal is not very precise. But with an increasing number of common features, the precision of the conclusion increases too. This effectively enables us to compute different degrees of equality. Ultimately, given that a certain amount of features is present in the graph, and samples have either a very high number of these features or even all of them in common, we can safely assume that they belong to the same family. Let’s re-exercise our observations in another example.
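Expressed as a traversal over the toy schema from above, the idea might look like the following sketch: starting at a sample, hop out to its features and back in to every other sample, then count how often each of them is reached. The counts are exactly the numbers of shared features:

```groovy
// Rank all other samples by the number of features they share with sample 6
// (assumes a populated graph with the illustrative schema from above).
g.V().has('sample', 'id', 6).as('self')
    .out('has_feature')                    // all features of the start sample
    .in('has_feature')                     // all samples exhibiting them
    .where(neq('self'))                    // except the start sample itself
    .groupCount().by('id')                 // shared-feature count per sample
    .order(local).by(values, desc)         // most similar samples first
```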
The example depicted in Figure 3 shows that samples 1 and 3 are likely to be identical, because they have all features in common. In addition, we can state that sample 4 has a higher degree of equality (with 1 and 3) than sample 2. However, it looks like sample 4 is not equal enough to conclude that it belongs to the family of which 1 and 3 are members.
For the technical realization, we simply refer to this section of the Apache TinkerPop documentation, thereby also pointing out that the underlying concept is applicable in other domains as well. The setting in the documentation is a product recommendation system for an online shop. For our purposes we adapted the recipe to output the customers that bought the same products, rather than the products themselves. In our domain, the customers correspond to samples, and the products they bought correspond to sample features.
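Such an adaptation could look roughly like the following sketch (labels and keys are again illustrative, not our actual traversal). Instead of collecting the products, we keep the start sample’s features in a side-effect and rank the other samples by how many of these features they exhibit as well:

```groovy
// For a given start sample, list the other samples together with the
// number of features they have in common with it, highest overlap first.
g.V().has('sample', 'id', 1).as('self')
    .out('has_feature').aggregate('own')   // remember the start sample's features
    .in('has_feature')                     // samples sharing at least one of them
    .where(neq('self')).dedup()
    .project('sample', 'sharedFeatures')
        .by(values('id'))
        .by(out('has_feature').where(within('own')).count())
    .order().by(select('sharedFeatures'), desc)
```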
Yet to prove our point, and for fun, let’s introduce another domain. If you are planning on kicking off an online dating service (yes, we know… everyone does), all you have to do is take our example and replace the malware with humans and the malware features with human features, like characteristics, interests and hobbies. All that is left to do is to group pairs together based on the highest number of common features in the graph. If you now think that we might have discovered the underlying fabric of “good” and/or “successful” relationships, recall that we are just computer scientists. Let us know though if you try it out.
Finally, we share our experiences when it comes to the technical realization. When there are lots of different features in your graph, you might be tempted to introduce some kind of structure (schema). This is indicated by the different shades of red throughout our examples. We will not make a case for how to design your schema, since this largely depends on your use cases. However, from our experience we strongly suggest being consistent, and avoiding the result in Figure 4 at all costs.
For simplicity, we reuse the example from Figure 1 with its feature types. Now there is a vertex between the sample and the resolved hostnames, serving as a grouping mechanism. Feature B is still linked directly to sample 1. And in contrast to Figure 1, feature C is now represented as a property of the sample vertex itself. If you now try to re-exercise our informal description of finding equal samples, you will see that it is not so easy anymore, for two reasons. On the one hand, you have to take into account that the lengths of the paths from a sample to its features differ: you can reach feature B from sample 1 with only one hop, while features A and D require two hops. On the other hand, sample 1 stores a feature as a property, which requires you to access and compare that property for all samples.
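To illustrate the pain, here is a sketch of what merely collecting the features of sample 1 looks like under the schema of Figure 4 (labels assumed): instead of one uniform hop, you need a union over paths of different lengths plus a property access, and the results do not even have a uniform type anymore:

```groovy
// Collecting the features of sample 1 in the inconsistent schema of Figure 4.
g.V().has('sample', 'id', 1)
    .union(
        out('has_feature'),                  // one hop:  feature B
        out('groups').out('has_feature'),    // two hops: features A and D
        values('feature_c'))                 // a property instead of a vertex
```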
Nevertheless, finding similar samples with this schema is still possible, just more difficult. With a consistent schema it is much easier and more intuitive, and it has a positive impact on the general usability when you regularly interact with your graph.
If you decide to employ feature groups, we suggest repurposing them as an indicator of equality. The basic principle of this approach is depicted in Figure 5. In it, the vertices in the middle no longer serve the single purpose of grouping the features on the right-hand side, but also aggregate them. We now interpret them as the representation of the feature combination they aggregate, and therefore refer to them as aggregators. By ensuring that they are unique and reused, just like any other entity in the graph, we can deduce equality more efficiently: there is no longer any need to traverse to the single features. If two samples have a common aggregator vertex, they also have all the features that the aggregator represents in common. Applied to the example in Figure 5, this means we can deduce that samples 7 and 11 resolve the same hostnames with only one hop to RH{A,D}.
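As a sketch, the equality check from above then collapses into a single hop over the aggregator (the edge label is again an assumption):

```groovy
// Find all samples resolving exactly the same hostnames as sample 7,
// via the shared aggregator vertex (RH{A,D} in Figure 5).
g.V().has('sample', 'id', 7).as('self')
    .out('aggregates')                     // one hop to the aggregator
    .in('aggregates')                      // one hop back to all its samples
    .where(neq('self'))
    .values('id')
```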
The advantages of this approach become more obvious when you drastically increase the number of features on the right-hand side. The complexity of checking whether equality is present remains that of checking whether two samples have an aggregator in common. The technical realization of this equality check is therefore easier, too; it is intuitive and user friendly. The pitfall, however, lies in ensuring that the aggregators meet the described requirements, which comes with some effort. So it boils down to trading off that effort against more efficiency and better usability when getting data out of your graph.
In this blog post we showed how we utilize our graph database, in which samples with common features reside in the same neighborhood and are connected with each other through their common features. We exploit this relationship by finding samples that are equal or even identical to any given start sample. Implicitly, this improves our detection capabilities.
And from a philosophical point of view, we posed the question of whether true love is merely a collection of common vertices in a graph.