Towards the Detection of Malicious URL and Domain Names Using Machine Learning

Nastaran Farhadi Ghalati, Nahid Farhady Ghalaty, José Barata

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Detecting malicious Uniform Resource Locators (URLs) is an important problem in web search and mining. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and try to lure uneducated users into clicking on such links or downloading malware, which can result in the exfiltration of critical data. Traditional techniques for detecting such URLs rely on blacklists and rule-based methods. The main disadvantage of these techniques is that they are not resistant to 0-day attacks, meaning that there will be at least one victim for each URL before it is added to a blacklist. Other techniques test URLs in a sandbox before they are opened in the production or main environment. Such methods have two main drawbacks: the cost of sandboxing and the non-real-time response caused by the approval process in the test environment. In this paper, we propose a method that exploits semantic features in both domain names and URLs. The method is adaptive, meaning that the model can change dynamically based on new feedback received about 0-day attacks. We extract features from each section of a URL separately. We then apply three machine learning methods to three different data sets. We also analyze which value of N is most effective when applying N-grams to the domain names. The results show that Random Forest achieves the highest accuracy, over 96%, while also providing more interpretability as well as performance benefits.
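The feature-extraction idea in the abstract (treating each section of a URL separately and applying N-grams to the domain name) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `url_sections` and `char_ngrams`, the choice of sections, the example URL, and the use of character-level N-grams are all assumptions made for illustration, and no classifier is shown.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_sections(url: str) -> dict:
    """Split a URL into sections so each one can be featurized separately.

    The set of sections here is an assumption, not taken from the paper.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams; empty if text is shorter than n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical example URL, used only to show the shape of the output.
sections = url_sections("http://example.com/login.php?user=1")
domain_bigrams = char_ngrams(sections["domain"], 2)
```

Counts such as `domain_bigrams` could then be turned into a fixed-length feature vector and passed to a classifier such as a random forest; comparing different values of `n` corresponds to the kind of analysis of N that the abstract mentions.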

Original language: English
Title of host publication: Technological Innovation for Life Improvement - 11th IFIP WG 5.5/SOCOLNET Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2020, Proceedings
Editors: Luis M. Camarinha-Matos, Nastaran Farhadi, Fábio Lopes, Helena Pereira
Place of Publication: Cham
Publisher: Springer
Pages: 109-117
Number of pages: 9
ISBN (Electronic): 978-3-030-45124-0
ISBN (Print): 978-3-030-45123-3
DOIs
Publication status: Published - 2020
Event: 11th Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2020 - Costa de Caparica, Portugal
Duration: 1 Jul 2020 to 3 Jul 2020

Publication series

Name: IFIP Advances in Information and Communication Technology
Publisher: Springer
Volume: 577
ISSN (Print): 1868-4238
ISSN (Electronic): 1868-422X

Conference

Conference: 11th Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2020
Country: Portugal
City: Costa de Caparica
Period: 1/07/20 to 3/07/20

Keywords

  • Cyber-security
  • Machine learning
  • URL classification
