As soon as someone mentions “log analysis”, we picture a developer going through thousands of lines of logs to figure out a problem. Does it always have to be like this? This article discusses what AI/ML can do to help us with log analysis.
Need for Automated Log Analysis
In large-scale systems, the seemingly obvious way of analyzing logs does not scale. Imagine a broadband network managed by an operator like Comcast, with hundreds of WiFi access points, routers/switches and 4G/5G small cells from multiple equipment providers, say Commscope, Aruba or CISCO. If we collect logs at multiple nodes, gigabytes of data are created every minute. The issues are mostly hidden; they may not be as obvious as a crash. A problem may have occurred and gone away without anyone detecting it, apart from the complaints received by the network operator. These systems are built by hundreds or thousands of developers, so they are difficult for a single person to analyze. They pull in modules from various third parties and make extensive use of open source. On top of that, parts of the system are on continuous upgrade cycles. So there is a clearly established need for automated log analysis in large-scale networks.
Mapping Log Analysis problem to AI/ML problem
Machine learning approaches problems in two ways: supervised or unsupervised.
Supervised learning is applicable if we have a labelled data set, i.e. input data for which we know the label (or value). With this data we can train the model. Once trained, the model can take new input and predict its label (or value). Unsupervised learning means we do not have labelled data sets. The model groups the data into classes on its own. When new data arrives, the model correlates it with the existing classes and puts it into one of them.
For log analysis, we are basically looking for anomalies in the log: something that is not normally expected. We may or may not have labelled data sets, and we need to pick supervised or unsupervised anomaly detection algorithms accordingly. For supervised algorithms, we need data sets in which each sample is labelled “Normal” or “Anomaly”. For unsupervised algorithms, we configure the model for two classes only, “Normal” and “Anomaly”.
A combined approach that uses both works well for the broadband use case. For clearly anomalous behaviour we can use supervised methods, and where creating an exhaustive labelled data set is not possible we can fall back to unsupervised ones. These algorithms already exist, and there are open source implementations as well (see Resources).
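To make the supervised case concrete, here is a minimal sketch in Python. It assumes each log snippet has already been turned into a numeric vector (how to do that is covered later in this article) and uses a nearest-centroid classifier, picked here only because it is short. It is an illustrative stand-in, not the algorithm from the referenced paper, and the sample vectors are made up.

```python
import math

def centroid(vectors):
    """Average the vectors column by column."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train(labelled):
    """Build one centroid per label from (vector, label) pairs."""
    by_label = {}
    for vector, label in labelled:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vs) for label, vs in by_label.items()}

def predict(model, vector):
    """Return the label whose centroid is closest to the vector."""
    return min(model, key=lambda label: math.dist(model[label], vector))

# Hypothetical labelled training data (values invented for illustration).
training = [
    ([0, 2, 0, 1], "Normal"),
    ([1, 2, 0, 1], "Normal"),
    ([5, 0, 4, 0], "Anomaly"),
    ([6, 1, 5, 0], "Anomaly"),
]
model = train(training)
label_a = predict(model, [0, 1, 0, 1])
label_b = predict(model, [5, 1, 4, 0])
```

In practice one would reach for an established library rather than hand-rolled code; the point is only the shape of the workflow: labelled vectors in, a trained model out, then predictions on new vectors.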
Mapping Logs to AI/ML input
We have a continuous stream of online logs coming from various nodes. The only way to make a data set out of them is to time-slice them into smaller log snippets and then convert each snippet into a data sample.
However, the logs are distributed: they come from switches, routers, SysLogs, Pcaps and so on. Will we need a different model for each kind of log? The answer is no. The logs should be given to a single model, as only then can the correlation between different logs be harnessed.
The logs are unstructured text. Can we use NLP (Natural Language Processing) models to extract data sets from them? The answer is again no. For NLP models, the text is preprocessed to get features such as how many times a word is repeated and which word follows which. There are pretrained models that can do this and that have been trained on the entire Wikipedia text. But these cannot be used for logs, as logs have technical context and are not natural language.
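As a rough sketch of time-slicing, the Python snippet below groups log records into fixed, non-overlapping windows. The record format (a timestamp in seconds plus the raw line), the function name `slice_by_window` and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def slice_by_window(records, window_seconds):
    """Group (timestamp_seconds, line) records into non-overlapping windows."""
    snippets = defaultdict(list)
    for ts, line in records:
        # Integer division maps each timestamp to the index of its window.
        window_index = int(ts // window_seconds)
        snippets[window_index].append(line)
    return dict(snippets)

# Hypothetical log records: (seconds since start, raw log line).
records = [
    (0.5, "link up on port 1"),
    (12.0, "dhcp offer sent"),
    (61.2, "auth failed for client"),
    (75.9, "link down on port 1"),
]
snippets = slice_by_window(records, window_seconds=60)
```

Each value in `snippets` is one log snippet, ready to be converted into a data sample.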
Since logs have an underlying structure, we can view the log snippets as a series of predefined events. This way we can retain the information in each log. It also helps aggregate different kinds of logs, as we can consider them logs having different sets of events. The model will train by understanding what events are happening in a given time window and can then detect anomalies.
Constructing AI/ML Training data set from Logs
AI/ML works on vectors and matrices of numbers, and on additions and multiplications of those numbers. We cannot feed events directly to the model; they need to be converted into numbers. (Gradient descent and logistic regression work by finding derivatives. Deep learning is matrix multiplication, and lots of it. Decision trees and random forests partition the data on numbers.)
For computer vision and image processing use cases, these numbers are the RGB value of each pixel in the image. For tabular data, the text is converted into numbers by assigning ordered or unordered series.
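For the tabular case, one common reading of “ordered or unordered series” is ordinal encoding for values with a natural order and one-hot encoding for values without one. That reading is an assumption, and the category names below are invented for illustration.

```python
def ordinal_encode(value, ordered_levels):
    """Map a value to its position in an ordered list of levels."""
    return ordered_levels.index(value)

def one_hot_encode(value, categories):
    """Map a value to a 0/1 vector with a single 1 at its category."""
    return [1 if value == c else 0 for c in categories]

# Ordered: severity levels have a natural ranking.
severity = ordinal_encode("warning", ["info", "warning", "error"])

# Unordered: protocol names have no ranking, so each gets its own slot.
protocol = one_hot_encode("DHCP", ["DNS", "DHCP", "RADIUS"])
```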
One option is to associate each event with an identifier number and give vectors of these identifiers, along with a timestamp, to the model. However, synchronizing and aggregating these will be an issue once we start getting vectors from each node. Also, one event may happen multiple times in a snippet, so handling these vectors will become complex.
So a better method is to collate vectors from each node for a given time slice and then go with the count of each kind of event in a master vector.
We explain the approach in detail below. It is derived from a popular paper; for more details, please refer to https://jiemingzhu.github.io/pub/slhe_issre2016.pdf
 Log Collection – In broadband systems, we have multiple sources of logs (SysLog, air captures, wired captures, cloud logs, and network element logs, i.e. logs from switches, routers and access points). We first need to be able to gather logs from each of the sources.
We need to make an exhaustive list of all sources as:
[S1, S2, S3.. Sn]
 Event Definition – For each source, we need to come up with predefined event types. In the networking world, the event types in the logs can broadly be defined as follows:
 Protocol message
 Errors/Alerts
 Each Type is one event type
 Layer
 Management
 Each Type is one event type
 Control
 Each Type is one event type
 Data
 Each Type is one Event type
 State Change
 Error Alerts
 Each Type is one event type
 Module
 Each Critical Log Template is an event type
 Each State Transitions is an event type
 Errors/Alerts
 Each type of Error/Alert is one event type
 Error Alerts
 Management
 Errors/Alerts
Each Leaf node corresponds to a different event
With this analysis, for each source, we come up with a list of events, as follows
[S1E1, S1E2, S1E3,.. S1Em,
S2E1, S2E2, S2E3,.. S2En,
… ,
SnE1, SnE2, SnE3,.. SnEp]
 Log to Event conversion – Each line of the time-series log has a constant part and a variable part. The constant part is what we are interested in. Variable parts, such as IP addresses or source and destination, need to be ignored. We parse the logs for the constant parts to check whether a line contains an event, and record only the event. A log snippet taken over a window of time will then look something like this for one source:
[T1, E2
T2, Nil
T3, E2
T4, E4]
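A minimal Python sketch of this conversion, using regular expressions to match the constant part of each line while skipping over the variable part. The templates, the event IDs assigned to them and the sample log lines are hypothetical; real templates would come from the event definition step for each source.

```python
import re

# Hypothetical templates: constant text with \S+ standing in for the variable part.
EVENT_TEMPLATES = {
    "E1": re.compile(r"^Link up on port \S+$"),
    "E2": re.compile(r"^DHCP request from \S+$"),
    "E4": re.compile(r"^Authentication failed for \S+$"),
}

def to_event(message):
    """Return the event ID whose template matches the message, else None."""
    for event_id, pattern in EVENT_TEMPLATES.items():
        if pattern.match(message):
            return event_id
    return None

# Hypothetical log snippet: (timestamp label, raw message).
log = [
    ("T1", "DHCP request from 10.0.0.7"),
    ("T2", "Heartbeat ok"),
    ("T3", "DHCP request from 10.0.0.9"),
    ("T4", "Authentication failed for aa:bb:cc:dd:ee:ff"),
]
parsed = [(ts, to_event(msg)) for ts, msg in log]
```

Lines that match no template are recorded as `None`, which plays the role of Nil in the example above.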
 Frequency transform – Invert the parsed log to find the event frequency, i.e. how many times each event happened in a given time window. If the window runs from Time 1 to Time 4, the counts are:
Event  Window 1
E1     0
E2     2
E3     0
E4     1
Over multiple time slices, it will look like this:
Event  Window 1  Window 2  Window 3  Window 4  Window 5
E1     0         1         0         1         1
E2     2         0         1         0         1
E3     0         2         2         1         1
E4     1         1         1         2         1
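The frequency transform for a single window can be sketched in Python with `collections.Counter`. The input reuses the parsed snippet from the earlier example (T1 to T4), with `None` standing in for Nil; the function name `count_events` is an assumption.

```python
from collections import Counter

# Fixed event order so every window yields a vector of the same shape.
EVENTS = ["E1", "E2", "E3", "E4"]

def count_events(parsed_window, events=EVENTS):
    """Count how often each event occurs in one parsed window."""
    counts = Counter(event for _, event in parsed_window if event is not None)
    # Counter returns 0 for events that never occurred in this window.
    return [counts[e] for e in events]

window_1 = [("T1", "E2"), ("T2", None), ("T3", "E2"), ("T4", "E4")]
vector = count_events(window_1)
```

Here `vector` is `[0, 2, 0, 1]`, the Window 1 column of the table. Repeating this for each window gives the remaining columns.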
The window can be fixed using time intervals. Windows can be non-overlapping or sliding. Sliding windows can give better results, but may be more computationally intensive.
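The difference between the two window types can be sketched in Python as follows. The function name `make_windows` and its parameters are illustrative assumptions: a step equal to the window size gives non-overlapping windows, and a smaller step gives sliding ones.

```python
def make_windows(start, end, size, step):
    """Return (window_start, window_end) pairs that fit inside [start, end]."""
    windows = []
    t = start
    while t + size <= end:
        windows.append((t, t + size))
        t += step
    return windows

# Non-overlapping: each window starts where the previous one ends.
fixed = make_windows(0, 60, size=20, step=20)

# Sliding: each window starts 10 seconds after the previous one and overlaps it.
sliding = make_windows(0, 60, size=20, step=10)
```

Over the same 60 seconds, the sliding configuration produces five windows against three for the non-overlapping one, which is where the extra computation comes from.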
For balancing computation load, it is advisable to do edge compute i.e. derive the Event Count Matrix separately from each source.
 Event Frequency Matrix – Once the event count matrix has been fetched from each source, the matrices should all be combined in a central place before being fed to the ML system.
Event  Time Stamp1  Time Stamp2  Time Stamp3  Time Stamp4  Time Stamp5
S1E1   0            1            0            1            1
S1Em   1            1            1            2            1
S2E1   1            0            0            1            1
S2En   1            1            1            1            2
SnE1   0            1            0            1            1
SnEp   1            1            1            2            1
The block of counts in this table is the final matrix that is input to the ML system. Each window is fed with its timestamp, so each column becomes a time-series input vector. A set of these vectors makes a data set. So now, finally, we have the data set for log analysis!
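A small Python sketch of the combining step, assuming every source has already produced an event count matrix (events as rows, windows as columns) over the same time windows. The function name `combine_sources`, the data layout and the sample values are assumptions for illustration.

```python
def combine_sources(per_source):
    """Stack per-source count matrices into one master matrix with labelled rows."""
    row_labels, matrix = [], []
    for source, (events, counts) in per_source.items():
        for event, row in zip(events, counts):
            row_labels.append(f"{source}{event}")
            matrix.append(row)
    return row_labels, matrix

# Hypothetical input: source -> (event names, counts per event per window).
per_source = {
    "S1": (["E1", "E2"], [[0, 1, 0], [1, 1, 1]]),
    "S2": (["E1", "E2"], [[1, 0, 0], [1, 1, 2]]),
}
labels, master = combine_sources(per_source)

# Transpose so that each window becomes one timestamped input vector.
timestamps = ["TimeStamp1", "TimeStamp2", "TimeStamp3"]
dataset = [(ts, list(col)) for ts, col in zip(timestamps, zip(*master))]
```

Each entry of `dataset` pairs a timestamp with the counts of every event from every source in that window, which is the form an anomaly detection model would consume.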
Resources
[1] AI/ML Theory: Machine Learning by Stanford University
[2] Applied AI/ML Tutorial: Deep Learning For Coders—36 hours of lessons for free
[3] Log Analysis AI/ML Research Paper: Experience Report: System Log Analysis for Anomaly Detection
[4] LogPai/Loganaly (logpai/loglizer: A log analysis toolkit for automated anomaly detection [ISSRE’16])
[5] AICoE/LAD (AICoE/loganomalydetector: Log Anomaly Detection – Machine learning to detect abnormal events logs)