TY - GEN
T1 - Towards the Detection of Malicious URL and Domain Names Using Machine Learning
AU - Ghalati, Nastaran Farhadi
AU - Ghalaty, Nahid Farhady
AU - Barata, José
N1 - Funding Information:
info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UID%2FEEA%2F00066%2F2019/PT#
info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F00066%2F2020/PT#
funding PRDC/EEI-AUT/32410/2017.
PY - 2020
Y1 - 2020
N2 - Malicious Uniform Resource Locators (URLs) are an important problem in web search and mining. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and try to lure uneducated users into clicking on such links or downloading malware, which will result in the exfiltration of critical data. Traditional techniques for detecting such URLs rely on blacklists and rule-based methods. The main disadvantage of such methods is that they are not resistant to 0-day attacks, meaning that there will be at least one victim for each URL before the blacklist entry is created. Other techniques use a sandbox to test URLs before they are clicked in the production or main environment. Such methods have two main drawbacks: the cost of sandboxing and the non-real-time response caused by the approval process in the test environment. In this paper, we propose a method that exploits semantic features in both domain names and URLs. The method is adaptive, meaning that the model can change dynamically based on new feedback received on 0-day attacks. We extract features from all sections of a URL separately. We then apply three machine learning methods to three different sets of data. We provide an analysis of the most efficient value of N for applying N-grams to the domain names. The results show that Random Forest has the highest accuracy, over 96%, and at the same time provides more interpretability as well as performance benefits.
AB - Malicious Uniform Resource Locators (URLs) are an important problem in web search and mining. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and try to lure uneducated users into clicking on such links or downloading malware, which will result in the exfiltration of critical data. Traditional techniques for detecting such URLs rely on blacklists and rule-based methods. The main disadvantage of such methods is that they are not resistant to 0-day attacks, meaning that there will be at least one victim for each URL before the blacklist entry is created. Other techniques use a sandbox to test URLs before they are clicked in the production or main environment. Such methods have two main drawbacks: the cost of sandboxing and the non-real-time response caused by the approval process in the test environment. In this paper, we propose a method that exploits semantic features in both domain names and URLs. The method is adaptive, meaning that the model can change dynamically based on new feedback received on 0-day attacks. We extract features from all sections of a URL separately. We then apply three machine learning methods to three different sets of data. We provide an analysis of the most efficient value of N for applying N-grams to the domain names. The results show that Random Forest has the highest accuracy, over 96%, and at the same time provides more interpretability as well as performance benefits.
KW - Cyber-security
KW - Machine learning
KW - URL classification
UR - http://www.scopus.com/inward/record.url?scp=85084807860&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-45124-0_10
DO - 10.1007/978-3-030-45124-0_10
M3 - Conference contribution
AN - SCOPUS:85084807860
SN - 978-3-030-45123-3
T3 - IFIP Advances in Information and Communication Technology
SP - 109
EP - 117
BT - Technological Innovation for Life Improvement - 11th IFIP WG 5.5/SOCOLNET Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2020, Proceedings
A2 - Camarinha-Matos, Luis M.
A2 - Farhadi, Nastaran
A2 - Lopes, Fábio
A2 - Pereira, Helena
PB - Springer
CY - Cham
T2 - 11th Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2020
Y2 - 1 July 2020 through 3 July 2020
ER -