Automatic sandbox services should not be treated like antivirus scanners that determine whether a sample is malicious. That is not their intended use, and they perform poorly in that role. Unfortunately, by providing an "overall score" or a "verdict", they suggest exactly that use.
From a malware analyst's standpoint, a sandbox verdict does not provide any definitive insight into a file's maliciousness. Sandbox systems often lack context and label behaviors as "malware behavior" even though those behaviors also occur in legitimate files. They can afford this excessive confidence because, unlike antivirus software, they face no serious consequences for being wrong: in the worst case, a false positive in an antivirus product leaves inoperable systems all over the world.
A more accurate term than "malware behavior" would be "interesting indicators". For example, game executables are sometimes flagged as keyloggers because they need to read the keystrokes that control the characters in the game. For the analyst, the question is always: is this expected behavior for this kind of application, or is it out of the norm?
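To see why a sandbox cannot tell the two apart, consider this minimal sketch (Windows, C) of a game-style input loop. The key bindings and "game logic" are made up for illustration, but the API call is the very one a user-mode keylogger would poll:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Poll the keyboard roughly 60 times per second, as a game loop would. */
    for (;;) {
        /* Move the player character, or log the key? The call is identical. */
        if (GetAsyncKeyState('W') & 0x8000)
            puts("move forward");
        if (GetAsyncKeyState('S') & 0x8000)
            puts("move backward");
        if (GetAsyncKeyState(VK_ESCAPE) & 0x8000)
            break;
        Sleep(16);
    }
    return 0;
}
```

A behavioral trace only records the GetAsyncKeyState calls; the intent behind them is invisible to the sandbox.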
Many conclusions in sandbox report summaries are also of limited value. What’s truly important is understanding why those conclusions were drawn—what data underlies them. Using the game example again, the conclusion "logs keystrokes" can be drawn from several indicators. To name a few:
contains strings that are typically found in key logs, e.g., "[ENTER]"
has imports that might be used to monitor keystrokes
the sample hooked APIs related to recognizing keystrokes during the sandbox run
the sample has code patterns of known keylogger malware
the sample is detected by antivirus scanners as a keylogger
the sample has a debug path that contains the word "keylogger"
the sample dropped a file containing keystrokes during the sandbox run
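To make the "hooked APIs" indicator from the list above concrete, here is a minimal sketch (Windows, C) of a low-level keyboard hook. A sandbox that observes a call like SetWindowsHookEx with WH_KEYBOARD_LL will typically report it, even though accessibility tools and hotkey managers install the exact same hook:

```c
#include <windows.h>
#include <stdio.h>

/* Callback invoked by Windows for every low-level keyboard event. */
static LRESULT CALLBACK KeyboardProc(int nCode, WPARAM wParam, LPARAM lParam)
{
    if (nCode == HC_ACTION && wParam == WM_KEYDOWN) {
        const KBDLLHOOKSTRUCT *kb = (const KBDLLHOOKSTRUCT *)lParam;
        printf("key down: vk=0x%02lx\n", kb->vkCode); /* the "keystroke log" */
    }
    return CallNextHookEx(NULL, nCode, wParam, lParam);
}

int main(void)
{
    /* This single call is what shows up as "hooked keyboard APIs". */
    HHOOK hook = SetWindowsHookEx(WH_KEYBOARD_LL, KeyboardProc,
                                  GetModuleHandle(NULL), 0);
    if (hook == NULL)
        return 1;

    /* A message loop is required for the hook callback to fire. */
    MSG msg;
    while (GetMessage(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    UnhookWindowsHookEx(hook);
    return 0;
}
```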
Each of the indicators above might have been responsible for the conclusion. But for me as a malware researcher, it makes a difference whether the sample actually dropped a text file with keystrokes or merely contains imports in its header that may never be used. For that reason, I skip the conclusions and go straight to the facts when reading sandbox reports.
Some examples that commonly lead to incorrect verdicts in automatic sandbox systems include:
Programs that do not run because they require prior setup or are missing other prerequisites on the sandbox system — sandbox systems cannot extract much useful information in these cases.
Malware that does not run due to anti-sandbox techniques and therefore appears clean.
Computer repair software that deletes, queries, and modifies certain registry keys and files, which may look suspicious when taken out of context.
Backup software, because its behavior resembles ransomware — it modifies a large number of files and often saves them under different names.
Clean programs that detect virtual machines, as the presence of anti-VM techniques is often enough to result in a malware verdict (see the CPUID sketch after this list).
Software that employs protection mechanisms, which may be interpreted as malware evasion or hiding techniques.
Programs that uninstall antivirus software, as they can be mistaken for malware that disables antivirus protection.
Potentially unwanted programs (PUPs) — sandbox systems often do not differentiate between PUPs and malware, flagging both the same way.
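To illustrate the anti-VM point from the list above, here is one widely documented check in C (x86, GCC/Clang): reading the CPUID "hypervisor present" bit. DRM and licensing code uses checks like this just as malware does, which is why their mere presence says little about maliciousness:

```c
#include <stdio.h>
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1 returns feature bits. ECX bit 31 is reserved on physical
     * hardware and set by hypervisors to announce their presence. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 31)))
        puts("hypervisor bit set: probably a virtual machine");
    else
        puts("hypervisor bit clear: probably bare metal");
    return 0;
}
```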
Automatic sandbox systems frequently rely on antivirus detection rates as a major factor in their overall score, even though they often assess the same characteristics that influenced the antivirus scanners' decisions in the first place.
This is not easily adjustable because antivirus systems are black boxes: sandbox systems cannot tell when they are assigning too much weight to certain behaviors or indicators because the antivirus scanners already used them, nor can they fine-tune for this case. The antivirus rate is more of a workaround to account for otherwise undetected samples.
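A toy calculation (hypothetical weights and numbers, not taken from any real product) shows the double-counting problem: if the antivirus engines flag a sample chiefly because of the same suspicious import the sandbox already scored, that one indicator enters the final score twice.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical sandbox scoring, for illustration only. */
    double suspicious_import_weight = 0.30; /* sandbox's own indicator    */
    double av_detection_rate        = 0.60; /* share of engines that flag */
    double av_rate_weight           = 0.50; /* weight of the AV rate      */

    /* If the AV engines detected the sample because of that same
     * suspicious import, the indicator is effectively counted twice. */
    double score = suspicious_import_weight
                 + av_rate_weight * av_detection_rate;

    printf("combined score: %.2f (one indicator, counted twice)\n", score);
    return 0;
}
```

Because the engines' inner workings are opaque, the sandbox has no way to subtract this overlap from its score.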
What these sandbox systems excel at is triage: compiling a list of indicators and key points worthy of further analysis. However, they are tools for experts, unlike antivirus scanners, which are the primary method for non-experts to determine maliciousness.
I wish automatic sandbox systems would stop pretending that their score has any real significance and would tailor their products towards the experts who need them, instead of marketing themselves as substitutes for antivirus scanners. However, attracting a broader audience of non-experts makes sense for sandbox vendors from a business standpoint, so this is unlikely to change.
A more realistic wish for sandbox vendors is an option to disable the scores, verdicts, and conclusions. The big red warning colors and indicators flagged as "malicious" introduce bias in malware analysts, particularly those who are new to the field. But even for more experienced analysts, the effect should not be underestimated.
Because I cannot turn it off, I actively try to counteract this bias by challenging myself to find proof for the opposite verdict. Sometimes I internally translate “malicious” to “found interesting indicators” and “clean” to “found nothing interesting”, because this seems more accurate in my experience. But I do not know whether any of that really helps or just biases me in a different direction.