Detection of HTTPS malware traffic without decryption

Master Thesis


Permanent link to this Item
Journal Title
Link to Journal
Journal ISSN
Volume Title
Each year the world's dependency on the internet grows, especially its functionality relating to critical infrastructure and social connections. More than 80% of internet traffic is encrypted using Transport Layer Security (TLS) protocol, and it is predicted that this number will increase [8]. However, threat actors are also increasingly using the TLS protocol to hide malicious activities such as Command and Control, loading malware into a network, and exfiltration of sensitive data. The use of TLS by threat actors poses a challenge to security professionals as traditional techniques used in the detection of HTTP malware cannot be applied in detecting Hypertext Transfer Protocol Secure (HTTPS) encrypted malware. To manage this, companies are using a traditional method called Transport Layer Security Inspection (TLSI), which involves decrypting packets to do full packet inspection. TLSI is expensive in computational performance and complexity, and over and above all, it violates the users' privacy. Researchers from Cisco have proposed that it is possible to identify malicious encrypted traffic by techniques other than TLSI and that the unencrypted TLS handshake messages, certificates, and flow metadata of malicious traffic are distinct from benign. These differences can be effectively used in machine learning to classify malicious and benign encrypted traffic [35]. This dissertation aims to assess the feasibility and effectiveness of the proposed alternative to TLSI. We sourced thousands of malware and benign flows and then used the Cisco tool called Joy to extract the features from the unencrypted TLS handshake messages, certificates, and flow metadata. To understand the characteristic behaviour between malicious and benign flows, we did a data exploration, summarized the unique values of the features from our datasets, and compared them with the feature values from the Cisco datasets used in the research paper [35]. We then selected features that had the most differentiating power in our dataset. The selected features were inputs into the two supervised classifiers: logistic regression and random forest. The classifiers were trained and tested on the offline datasets of benign and malware features, and we observed that the random forest performed better with an average accuracy of 98.92%. We concluded that it is viable and effective to use alternative techniques to detect HTTPS malware without TLSI.