Openly accessible

P-gram: positional N-gram for the clustering of machine-generated messages

Jiang, Jiaojiao, Versteeg, Steve, Han, Jun, Hossain, Md Arafat, Schneider, Jean-Guy, Leckie, Christopher and Farahmandpour, Zeinab 2019, P-gram: positional N-gram for the clustering of machine-generated messages, IEEE access, vol. 7, pp. 88504-88516, doi: 10.1109/ACCESS.2019.2924928.

Attached Files
Name farahmandpour-pgrampositional-2019.pdf
Description Published version
MIMEType application/pdf
Size 1.52MB
Downloads 3

Title P-gram: positional N-gram for the clustering of machine-generated messages
Author(s) Jiang, Jiaojiao
Versteeg, Steve
Han, Jun
Hossain, Md Arafat
Schneider, Jean-Guy (ORCID iD: orcid.org/0000-0002-9827-5496)
Leckie, Christopher
Farahmandpour, Zeinab
Journal name IEEE access
Volume number 7
Start page 88504
End page 88516
Total pages 13
Publisher Institute of Electrical and Electronics Engineers
Place of publication Piscataway, N.J.
Publication date 2019
ISSN 2169-3536
Keyword(s) Machine-generated messages
Positional n-gram
Clustering
Summary An IT system generates messages for other systems or users to consume, through direct interaction or as system logs. Automatically identifying the types of these machine-generated messages has many applications, such as intrusion detection and system behavior discovery. Among the various heuristic methods for automatically identifying message types, clustering methods based on keyword extraction have been quite effective. However, these methods still suffer from keyword misidentification problems, i.e., some keyword occurrences are wrongly identified as payload and some strings in the payload are wrongly identified as keyword occurrences, leading to the misidentification of message types. In this paper, we propose a new machine language processing (MLP) approach, called P-gram, specifically designed for identifying keywords in, and subsequently clustering, machine-generated messages. First, we introduce a novel concept and technique, the positional n-gram, for message keyword extraction. By associating the position as meta-data with each n-gram, we can more accurately discern which n-grams are keywords of a message and which n-grams are parts of the payload information. Then, the positional keywords are used as features to cluster the messages, and an entropy-based positional weighting method is devised to measure the importance or weight of the positional keywords to each message. Finally, a general centroid clustering method, K-Medoids, is used to leverage the importance of the keywords and cluster messages into groups reflecting their types. We evaluate our method on a range of machine-generated (text and binary) messages from real-world systems and show that our method achieves higher accuracy than the current state-of-the-art tools.
Language eng
DOI 10.1109/ACCESS.2019.2924928
Indigenous content off
HERDC Research category C1 Refereed article in a scholarly journal
Copyright notice ©2019, IEEE
Free to Read? Yes
Use Rights Creative Commons Attribution licence
Persistent URL http://hdl.handle.net/10536/DRO/DU:30128559
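
The positional n-gram idea described in the Summary can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of character n-grams, the support threshold, and the plain Shannon entropy per position are all assumptions made for illustration, and the sample messages are invented. The paper's actual keyword selection and entropy-based weighting may differ.

```python
from __future__ import annotations

import math
from collections import Counter


def positional_ngrams(message: str, n: int = 3) -> list[tuple[int, str]]:
    """Pair every character n-gram with its start offset (hypothetical helper)."""
    return [(i, message[i:i + n]) for i in range(len(message) - n + 1)]


def frequent_positional_ngrams(messages: list[str], n: int = 3,
                               min_support: float = 0.5) -> set[tuple[int, str]]:
    """Keep (offset, n-gram) pairs seen in at least min_support of the messages.

    Assumption: frequent pairs are treated as keyword candidates here.
    """
    counts: Counter = Counter()
    for msg in messages:
        # Use a set so each message contributes at most once per pair.
        counts.update(set(positional_ngrams(msg, n)))
    threshold = min_support * len(messages)
    return {pair for pair, count in counts.items() if count >= threshold}


def position_entropy(messages: list[str], position: int, n: int = 3) -> float:
    """Shannon entropy (bits) of the n-grams observed at one offset.

    Assumption: low entropy hints at a keyword-like position and high
    entropy at payload; this is a stand-in for the paper's weighting.
    """
    grams = Counter(m[position:position + n]
                    for m in messages if len(m) >= position + n)
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())


# Invented sample messages, for illustration only.
messages = [
    "GET /index.html",
    "GET /about.html",
    "GET /login.html",
    "PUT /index.html",
]
keywords = frequent_positional_ngrams(messages, n=3, min_support=0.5)
print((0, "GET") in keywords, (0, "PUT") in keywords)
print(position_entropy(messages, 0) < position_entropy(messages, 5))
```

In this toy data the n-gram at offset 0 varies less than the one at offset 5, which matches the intuition that a method name behaves like a keyword while a path behaves like payload. Clustering the resulting weighted features with K-Medoids, as the Summary describes, is left out of the sketch.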

Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Every reasonable effort has been made to ensure that permission has been obtained for items included in DRO. If you believe that your rights have been infringed by this repository, please contact drosupport@deakin.edu.au.

Created: Mon, 05 Aug 2019, 10:54:27 EST
