Using AI to detect malicious documents and scripts

With malicious attacks constantly on the rise, more and more cybercriminals are using social engineering via email and other communication channels as their primary attack vector. Very often, deceptive emails contain malicious attachments which, when opened, trigger the execution of malware or the exploitation of a vulnerability in order to perform malicious actions. A malicious document can also be downloaded via a link that a user receives over these channels.

According to Acronis experts, as of the middle of 2022, weaponized or malicious documents like Microsoft Office or Adobe PDF files were used in approximately one third of all attacks. Other industry players have seen a similar picture. Malware is typically embedded in a file, although its execution chain can vary. For example, a vulnerability is initially exploited, privileges are escalated, and the malware is then downloaded and executed. Office and PDF files have been used in cyberattacks for years, as they can contain embedded macros, shellcode, JavaScript, and even whole files. However, the problem has worsened in recent years, as hackers now use search engine optimization (SEO) to rank malicious files, especially PDFs, higher in search engine results.

Another, similar problem involves malicious scripts executed by legitimate tools like PowerShell. The number of attacks using malicious scripts, as with malicious documents, has grown steadily from year to year, and Acronis experts have seen the number of such attacks nearly double over the past two years. The goal with malicious scripts is the same: infiltrate the system, escalate privileges, download the payload, and then execute it.

For cybersecurity companies, both issues detailed above are essentially the same problem, because they require distinguishing between good and bad documents or scripts. The good news is that this problem can be solved much more effectively with machine learning (ML) and artificial intelligence (AI), or, as we say at Acronis, with machine intelligence (MI).

Use of machine intelligence for threat detection

Acronis started to use machine learning back in 2017, when the company introduced its Active Protection anti-ransomware technology. Soon after that, we created a complete AI-based static detection engine, which is currently used in the Acronis flagship product, Acronis Cyber Protect.

This engine is constantly updated, and with every new machine learning model introduced, it gets better in terms of performance and detection rate. For example, in the beginning, it was used to analyze stack traces of executed processes, and later, it started to analyze whole files and libraries to catch malicious ones. Now it can analyze strings extracted from executables and process images. Adding the frequency of a few hundred selected words in the extracted strings as additional features has led to a more than 3% improvement in detection rate!
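The word-frequency features described above can be illustrated with a minimal Python sketch. It is not the Acronis implementation: the vocabulary, the minimum string length, and the function names are all assumptions made for illustration, since the article does not list the selected words or describe how the strings are extracted.

```python
import re
from collections import Counter

# Hypothetical vocabulary. The article mentions "a few hundred selected words"
# but does not list them, so these five are placeholders.
VOCABULARY = ["http", "cmd", "powershell", "temp", "registry"]


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def word_frequency_features(data: bytes) -> list[float]:
    """Return the relative frequency of each vocabulary word in the strings."""
    tokens: list[str] = []
    for text in extract_strings(data):
        # Split each extracted string into lowercase alphabetic tokens.
        tokens.extend(re.findall(r"[a-z]+", text.lower()))
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[word] / total for word in VOCABULARY]


sample = b"\x00\x01cmd.exe /c powershell\x00\xffab\x00http://example.test/temp\x00"
print(extract_strings(sample))
print(word_frequency_features(sample))
```

A vector like this would be appended to the other features the model already consumes. On its own it says nothing about whether a file is malicious.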

As you can likely imagine, a similar approach can be used to detect malicious documents and scripts. Let’s start with Microsoft Office documents. Microsoft recently announced, and has slowly started to roll out, a change that disables macros by default in documents that come from the internet. Unfortunately, this can’t solve all of the problems. The change relies on the mark of the web (MOTW) attribute, which Windows adds to files from an untrusted location, such as the internet or a restricted zone: for example, browser downloads or email attachments. The catch is that the attribute only applies to files saved on an NTFS file system, not to files saved to FAT32-formatted devices. And what happens if the file comes from a legitimate office email? Nor will this necessarily work in an environment where a job relies heavily on macros, such as accounting. Nevertheless, an environment with such a policy enabled is more secure, and some typical phishing attacks, like the ones used in Emotet campaigns, can be successfully prevented. But as we’ve already said, it’s unfortunately not enough, and it will raise other complications for many businesses. That’s why Acronis AI specialists have enhanced our detection engine to spot malicious documents, thanks to the power of machine learning.

Which macro scripts could be perceived as malicious? Those that do one or more of the following:

Creates processes
Executes scripts in PowerShell, VBA, etc.
Downloads files from remote servers
Embeds itself in other Office files or Office template files
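As a rough illustration of the four behaviors listed above, the following Python sketch flags them in macro source code with simple pattern matching. The indicator patterns and the function name are assumptions chosen for the example, not the rules Acronis uses, and a keyword match alone does not prove malicious intent.

```python
import re

# Hypothetical indicator patterns, one per behavior from the list above.
INDICATORS = {
    "creates_process": r"\bShell\b|WScript\.Shell|Win32_Process",
    "executes_script": r"powershell|wscript|cscript",
    "downloads_file": r"URLDownloadToFile|XMLHTTP|https?://",
    "self_propagates": r"NormalTemplate|VBProject|\.dotm\b",
}


def flag_macro_behaviors(vba_source: str) -> dict[str, bool]:
    """Report which suspicious behaviors the macro source appears to contain."""
    return {
        name: re.search(pattern, vba_source, re.IGNORECASE) is not None
        for name, pattern in INDICATORS.items()
    }


suspicious = 'Sub AutoOpen()\n  Shell "powershell -nop -w hidden"\nEnd Sub'
benign = 'Sub Hello()\n  MsgBox "Hi"\nEnd Sub'
print(flag_macro_behaviors(suspicious))
print(flag_macro_behaviors(benign))
```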

But of course, real detection is not that simple, as you need to factor in a lot of other parameters. For instance, the Acronis machine learning model checks the following attributes of DOCX files to arrive at a verdict:

Various text and VBA function features
Ratio features like comments, code, etc.
The entropy of a macro itself, its code and comments
Any obfuscation in place and what is obfuscated
Known indicators of compromise (IoC) parameters like URLs, executables, etc.
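Some of the attributes listed above, namely entropy, ratio features, and URL indicators, can be computed in a few lines of Python. This is a simplified sketch under stated assumptions: the feature names are invented for the example, comments are detected only as lines starting with an apostrophe (VBA `Rem` comments and trailing comments are ignored), and nothing here reflects the actual Acronis feature set.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def macro_features(vba_source: str) -> dict[str, float]:
    """Compute a few illustrative features from macro source code."""
    lines = [ln.strip() for ln in vba_source.splitlines() if ln.strip()]
    # Simplification: only full-line apostrophe comments count as comments.
    comments = [ln for ln in lines if ln.startswith("'")]
    code = [ln for ln in lines if not ln.startswith("'")]
    return {
        "entropy_all": shannon_entropy(vba_source),
        "entropy_code": shannon_entropy("\n".join(code)),
        "entropy_comments": shannon_entropy("\n".join(comments)),
        "comment_ratio": len(comments) / max(len(lines), 1),
        "url_count": float(len(re.findall(r"https?://\S+", vba_source))),
    }


sample = "' comment\nx = 1\ny = \"http://example.test/a\"\nz = 2\n"
print(macro_features(sample))
```

High entropy in code or comments is often a hint of obfuscation or encoded payloads, which is why entropy is usually measured separately for each part rather than once for the whole macro.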

This is not a full list, of course, but it helps explain that many attributes are analyzed on a large dataset, which is constantly being revised and updated. As a result, we are achieving an excellent detection rate with a model size of less than 1 MB compressed. Without AI and ML, achieving such results is close to impossible. You should also keep in mind that this is just one part of robust, multilayered protection, which is only triggered if the threat has not already been detected by other technologies, such as email security scanning engines, sandboxes, or URL filtering.

A very similar approach was recently initiated by Acronis experts to detect malicious AutoIt scripts, which are very often used in service provider environments. With a tiny model of around 0.6 MB, we are already able to provide a 92% detection rate and, as with the DOCX model, it is constantly being improved.

Eliminating the weaponized PDF threat

Apart from Microsoft Office Word documents, the Adobe Portable Document Format, or PDF, is a very popular tool for cybercriminals to compromise a system or plant malware on a user’s machine. Based on the PostScript language, a PDF can contain a lot of information, including text, hyperlinks, multimedia, images, attachments, and metadata, which makes it a very powerful format. The PDF format has an “actions” feature that allows opening a web link or file, running JavaScript code, and many other operations which, as you can imagine, can be performed with malicious intent.

PDF documents can be viewed with browsers and a variety of reading software, all of which may have, or already do have, vulnerabilities that threat actors can exploit. These include arbitrary code execution, buffer overflow, memory corruption, out-of-bounds read, and many others. Currently, there are hundreds of CVEs for PDF readers, with almost 300 known vulnerabilities for Adobe Acrobat Reader alone. Security researchers and threat actors are finding new PDF-related exploits practically every day.

Here is an example, CVE-2021-28550, a vulnerability that has been seen exploited in the wild: Acrobat Reader DC versions 2021.001.20150 (and earlier), 2020.001.30020 (and earlier), and 2017.011.30194 (and earlier) are affected by a use-after-free vulnerability. An unauthenticated attacker could leverage this vulnerability to achieve arbitrary code execution in the context of the current user. Exploitation of this issue requires user interaction, in that a victim must open a malicious file.

Acronis’ machine learning-based malicious PDF detection model, as with other file types described above, checks a variety of parameters to arrive at the correct verdict:

Entropy
Total character count
Special keyword counts
Number of lines, special assignment lines
Etc.
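A few of the parameters listed above can be sketched in Python as follows. The keyword list is an assumption based on PDF name objects commonly associated with active content, and the feature names are invented for the example. The "special assignment lines" feature is left out because the article does not define it. This is an illustration only, not the Acronis model's feature extractor.

```python
import math
from collections import Counter

# Hypothetical keyword list: PDF names often associated with active content.
PDF_KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/EmbeddedFile"]


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def pdf_features(data: bytes) -> dict[str, float]:
    """Compute entropy, size, line count, and keyword counts for raw PDF bytes."""
    features = {
        "entropy": byte_entropy(data),
        "char_count": float(len(data)),
        "line_count": float(len(data.splitlines())),
    }
    for keyword in PDF_KEYWORDS:
        # Feature name drops the leading slash, e.g. "kw_OpenAction".
        features["kw_" + keyword[1:].decode("ascii")] = float(data.count(keyword))
    return features


sample = (
    b"%PDF-1.7\n1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
print(pdf_features(sample))
```

Note that this counts keywords in the raw bytes only. Real PDFs often hide such names inside compressed streams or behind hex-escaped names, so a production extractor would need to decode the file first.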

As a result, the model achieves a detection rate that is just as high as those described above.

Human cybersecurity awareness training can help significantly

Of course, integrating this type of detection model into a good cyber protection solution is essential. However, overall company and individual security postures can benefit greatly if users are aware of threats and the way they present themselves, and react properly — as in most cases, engaging with a malicious document or script requires user input. And if there is no such input, there will be no threat.

This awareness training requires adhering to basic, well-known security community rules:

Always check where an email or link came from. Do you know this person? Are you expecting this file? Be sure to inspect the real email address it is coming from, not just the name or alias.
If you have already clicked on the file and it mentions some kind of incompatibility, asks you to enable something, or displays links, a fake CAPTCHA, or something similar, close it right away, delete it, and inform a security team immediately.
If you’ve clicked on the file but see nothing opening after that, this is also a bad sign. If a file was opened and it’s not what you expected, you may already have fallen victim and should inform a security team immediately.

So, in summary, the best rule is to not open attachments and not click on links right away. Double-check everything and if you’re still not sure, talk to a security team. Stay vigilant and stay safe.
