Log Analysis using Artificial Intelligence/Machine Learning [AI/ML] for Broadband
Whenever we hear about “log analysis”, we picture a developer going through thousands of lines of logs to figure out a problem. Does it always have to be like this? Our topic of discussion is what Artificial Intelligence/Machine Learning [AI/ML] can do to help us with log analysis.
Need for Automated Log Analysis
In large-scale systems, the seemingly obvious way of doing log analysis does not scale. Consider a broadband network managed by an operator like Comcast, with hundreds of Wi-Fi access points, routers/switches and 4G/5G small cells from multiple equipment providers, say CommScope, Aruba or Cisco. With logs collected at multiple nodes, gigabytes of data are created every minute. The possible issues are hidden; they may not be something as obvious as a crash. It may be a problem that occurred and went away undetected, noticed only because the network operator received several complaints. These systems are developed by hundreds or thousands of developers, so they are difficult for a single person to analyze. They pull in modules from various third parties and make extensive use of open source. And parts of the system are on continuous upgrade cycles. So there is a clearly established need for automated log analysis in large-scale networks using smart log-analysis techniques.
Mapping Log Analysis problem to Artificial Intelligence/Machine Learning [AI/ML] problem
Machine learning sees problems in one of two ways: supervised or unsupervised.
Supervised learning applies when we have a labelled data set, i.e. input data for which we know the label (or value). With this data, we can train the model. After training, the model can take new input and predict its label (or value).
Unsupervised learning means we do not have labelled data sets. The model classifies the data into different classes on its own. When new data arrives, it finds the correlation with the existing classes and puts the data into one of them.
For log analysis, we are essentially looking for anomalies in the log: something that is not normally expected. We may or may not have labelled data sets, and accordingly we need to pick supervised or unsupervised learning.
Log analysis therefore maps to anomaly-detection algorithms. For supervised algorithms, we will have data sets where each sample is labelled “Normal” or “Anomaly”. For unsupervised algorithms, we need to configure the model for two classes only, “Normal” or “Anomaly”.
A combined approach is a good fit for the broadband use case, where both can be used: for clearly anomalous behaviour we can use supervised methods, and where creating an exhaustive labelled data set is not possible, we can fall back to unsupervised methods.
These algorithms already exist, and there are open-source implementations as well (see Resources).
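As a minimal sketch of how the two modes sit side by side, assuming scikit-learn, with toy event-count vectors standing in for the features we construct later in this article:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

# Toy labelled set: each row is an event-count vector for one time window.
X_train = np.array([[0, 2, 0, 1],
                    [1, 0, 2, 1],
                    [1, 1, 1, 0],
                    [9, 0, 7, 0]])   # last row: clearly anomalous counts
y_train = np.array([0, 0, 0, 1])     # 0 = "Normal", 1 = "Anomaly"
X_new = np.array([[1, 1, 1, 1],
                  [8, 0, 6, 0]])

# Supervised: learn from the labelled windows, predict labels for new ones.
clf = LogisticRegression().fit(X_train, y_train)
print("supervised:", clf.predict(X_new))      # 0 = normal, 1 = anomaly

# Unsupervised fallback: no labels needed; the model separates the data
# into inliers ("Normal") and outliers ("Anomaly") on its own.
iso = IsolationForest(contamination=0.25, random_state=0).fit(X_train)
print("unsupervised:", iso.predict(X_new))    # +1 = normal, -1 = anomaly
```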
Mapping Logs to Artificial Intelligence/Machine Learning [AI/ML] input
There is a continuous stream of logs coming in from various nodes. The only way to make a data set is to time-slice them into smaller log snippets, and then convert each snippet into a data point.
Now, the logs are distributed, coming from switches, routers, syslogs, pcaps and others. Do we need a different model for each kind of log? No. The logs have to be given to a single model, as only then can the correlation between the different logs be harnessed.
The logs are unstructured text, so can we use Natural Language Processing (NLP) models to extract data sets from them? The answer is again “no”. For NLP models, the text is preprocessed into features such as how often a word is repeated, which words follow one another, and so on. There are pre-trained models which can do this and have been trained over the entire Wikipedia text! But these cannot be used for logs, as logs carry technical context, not natural language.
Since logs have an underlying structure, we can view the log snippets as a series of predefined events. This way we retain the information in each log. It also helps us aggregate different kinds of logs, as we can treat each kind of log as having its own set of events. The model is trained on the events happening in a given time window and can then detect anomalies.
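As a toy illustration of this view (the event names and log lines below are invented for the example), a raw snippet reduces to a series of predefined events:

```python
# Invented event vocabulary for one source: constant substring -> event id.
SYSLOG_EVENTS = {
    "DHCPDISCOVER": "E1",
    "DHCPOFFER":    "E2",
    "link down":    "E3",
}

snippet = [
    "Jan 10 10:01:02 dhcpd: DHCPDISCOVER from aa:bb:cc:dd:ee:ff",
    "Jan 10 10:01:02 dhcpd: DHCPOFFER on 10.0.0.23",
    "Jan 10 10:01:05 kernel: eth0 link down",
]

# The snippet reduces to a series of predefined events; free text and
# variable fields (MACs, IPs) drop away.
events = [eid for line in snippet
          for key, eid in SYSLOG_EVENTS.items() if key in line]
print(events)  # ['E1', 'E2', 'E3']
```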
Constructing Artificial Intelligence/Machine Learning [AI/ML] Training data set from Logs
Artificial Intelligence/Machine Learning [AI/ML] works on vectors/matrices of numbers, and on additions and multiplications of those numbers. We cannot feed the events directly to the model; they need to be converted into numbers. (Gradient descent and logistic regression work by finding derivatives. Deep learning is matrix multiplication, and lots of it. Decision trees and random forests partition the data on numbers.)
For computer vision and image-processing use cases, these numbers are the RGB values of each pixel in the image. For tabular data, text is converted into numbers by assigning ordered or unordered numeric codes.
One option is to associate each event with an identifier number and give the model vectors of these identifiers, along with a timestamp. However, synchronizing/aggregating these will be an issue once we start getting such vectors from every node. Also, one event may happen multiple times in a snippet, so handling these vectors becomes complex.
A better method is to collate the events from each node for a given time slice and record the count of each kind of event in a master vector.
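A small sketch contrasting the two representations (the event identifiers are hypothetical):

```python
from collections import Counter

# Events parsed from one node in one time slice.
slice_events = ["E2", "E2", "E4"]

# Option 1: a vector of raw event identifiers. Its length varies with
# the slice, and repeated events make cross-node aggregation awkward.
id_vector = [2, 2, 4]

# Option 2 (preferred): a fixed-length count per event type. Every node
# and every time slice now yields a vector of the same shape.
EVENT_TYPES = ["E1", "E2", "E3", "E4"]
counts = Counter(slice_events)
count_vector = [counts.get(e, 0) for e in EVENT_TYPES]
print(count_vector)  # [0, 2, 0, 1]
```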
We explain the approach below in detail. It is derived from this popular paper; for more details, please refer to https://jiemingzhu.github.io/pub/slhe_issre2016.pdf.
1. Log Collection – In broadband systems, we have multiple sources of logs (syslog, air captures, wired captures, cloud logs, and network-element logs from switches/routers/access points). We first need to be able to gather the logs from each of these sources.
We need to make an exhaustive list of all the sources, as follows:
[S1, S2, S3.. Sn]
2. Event Definition – For each source, we need to come up with predefined event types. In the networking world, event types in the logs can broadly be defined as follows (each leaf node corresponds to a different event):
- Protocol message
  - Errors/Alerts – each type is one event type
  - Layer
    - Management – each type is one event type
    - Control – each type is one event type
    - Data – each type is one event type
- State Change
  - Errors/Alerts – each type is one event type
- Module
  - Each critical log template is an event type
  - Each state transition is an event type
  - Errors/Alerts – each type of error/alert is one event type
With this analysis, for each source, we come up with a list of events, as follows:
[S1E1, S1E2, S1E3,.. S1Em,
S2E1, S2E2, S2E3,.. S2En,
… ,
SnE1, SnE2, SnE3,.. SnEp]
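In code, this event list can live in a simple registry. A minimal sketch, with invented source and event names following the S1E1…SnEp convention above:

```python
# Hypothetical event registry: one entry per (source, event-type) pair.
# A real deployment would derive this from the event-definition exercise
# in step 2 above.
EVENT_REGISTRY = {
    ("syslog", "dhcp_discover"):  "S1E1",
    ("syslog", "dhcp_offer"):     "S1E2",
    ("pcap",   "auth_request"):   "S2E1",
    ("pcap",   "assoc_response"): "S2E2",
    ("ap_log", "reboot"):         "S3E1",
}

# The ordered list of all event ids fixes the layout of every vector.
ALL_EVENTS = sorted(EVENT_REGISTRY.values())
print(ALL_EVENTS)  # ['S1E1', 'S1E2', 'S2E1', 'S2E2', 'S3E1']
```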
3. Log to Event conversion – Each line of the time-series log has a constant part and a variable part. The constant part is what we are interested in; variable parts such as IP addresses and source/destination fields need to be ignored. We need to parse the logs for the constant parts, check whether each line maps to an event, and record only the event. A log snippet taken over a window of time will then look something like this for one source:
[T1, E2
T2, Nil
T3, E2
T4, E4]
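A minimal parsing sketch, assuming regex templates for the constant parts (the templates and log lines are invented for illustration):

```python
import re

# Hypothetical templates: each regex matches the constant part of a line;
# variable parts (addresses, interface numbers) are wildcarded away.
TEMPLATES = [
    (re.compile(r"DHCPOFFER on \d+\.\d+\.\d+\.\d+"), "E2"),
    (re.compile(r"eth\d+ link down"), "E4"),
]

def line_to_event(line):
    """Return the event id for a log line, or None ('Nil') if nothing matches."""
    for pattern, event_id in TEMPLATES:
        if pattern.search(line):
            return event_id
    return None

log = [
    ("T1", "dhcpd: DHCPOFFER on 10.0.0.23"),
    ("T2", "cron: session opened for user root"),
    ("T3", "dhcpd: DHCPOFFER on 10.0.0.57"),
    ("T4", "kernel: eth0 link down"),
]
print([(t, line_to_event(msg)) for t, msg in log])
# [('T1', 'E2'), ('T2', None), ('T3', 'E2'), ('T4', 'E4')]
```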
4. Frequency transform – Invert the parsed log to find event frequencies: in a given time window, how many times did each event happen? So if the window goes from T1 to T4:
|    | Window 1 |
|----|----------|
| E1 | 0 |
| E2 | 2 |
| E3 | 0 |
| E4 | 1 |
Going over multiple time slices, it will look like this:
|    | Window 1 | Window 2 | Window 3 | Window 4 | Window 5 |
|----|----------|----------|----------|----------|----------|
| E1 | 0 | 1 | 0 | 1 | 1 |
| E2 | 2 | 0 | 1 | 0 | 1 |
| E3 | 0 | 2 | 2 | 1 | 1 |
| E4 | 1 | 1 | 1 | 2 | 1 |
The windows can be fixed timer intervals, and they can be non-overlapping or sliding. Sliding windows can give better results but may be more computationally intensive.
To balance the computation load, it is advisable to do edge compute, i.e. derive the event-count matrix separately at each source.
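A short sketch of the frequency transform as it could run at each source, covering both non-overlapping and sliding windows (timestamps are simplified to seconds):

```python
from collections import Counter

# (timestamp-in-seconds, event id) pairs from the parsing step, one source.
parsed = [(1, "E2"), (3, "E2"), (4, "E4"), (6, "E1"), (7, "E3"), (8, "E3")]
EVENT_TYPES = ["E1", "E2", "E3", "E4"]

def count_windows(events, window, step):
    """One event-count vector per window. step == window gives fixed,
    non-overlapping windows; step < window gives sliding windows."""
    end = max(t for t, _ in events)
    rows, start = [], 0
    while start <= end:
        c = Counter(e for t, e in events if start <= t < start + window)
        rows.append([c.get(e, 0) for e in EVENT_TYPES])
        start += step
    return rows

print(count_windows(parsed, window=4, step=4))  # non-overlapping
print(count_windows(parsed, window=4, step=2))  # sliding: more rows, more compute
```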
5. Event Frequency Matrix – Once the event-count matrix has been fetched from each source, the matrices should all be combined at a central place before being fed to the ML world.
|      | Timestamp 1 | Timestamp 2 | Timestamp 3 | Timestamp 4 | Timestamp 5 |
|------|-------------|-------------|-------------|-------------|-------------|
| S1E1 | 0 | 1 | 0 | 1 | 1 |
| S1Em | 1 | 1 | 1 | 2 | 1 |
| S2E1 | 1 | 0 | 0 | 1 | 1 |
| S2En | 1 | 1 | 1 | 1 | 2 |
| SnE1 | 0 | 1 | 0 | 1 | 1 |
| SnEp | 1 | 1 | 1 | 2 | 1 |
This event frequency matrix is the final input to the ML system. Each window carries a timestamp, so it becomes a time-series input vector, and a set of these vectors makes a data set. So finally we have the data set for log analysis!
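A small sketch of the central aggregation step, assuming pandas and two hypothetical sources:

```python
import pandas as pd

# Hypothetical per-source event-count matrices (rows = events,
# columns = timestamped windows), as computed at the edge.
s1 = pd.DataFrame([[0, 1, 0], [1, 1, 1]],
                  index=["S1E1", "S1Em"], columns=["T1", "T2", "T3"])
s2 = pd.DataFrame([[1, 0, 0], [1, 1, 2]],
                  index=["S2E1", "S2En"], columns=["T1", "T2", "T3"])

# Central aggregation: stack the per-source matrices into the master matrix.
master = pd.concat([s1, s2])
print(master)

# Each timestamped column is one input vector; transposed, the matrix
# becomes the time-series data set fed to the model.
dataset = master.T.to_numpy()
```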
Resources
[1] AI/ML theory: Machine Learning by Stanford University
[2] Applied AI/ML tutorial: Deep Learning for Coders – 36 hours of lessons for free
[3] Log analysis research paper: Experience Report: System Log Analysis for Anomaly Detection
[4] logpai/loglizer: A log analysis toolkit for automated anomaly detection [ISSRE’16]
[5] AICoE/log-anomaly-detector: Log anomaly detection – machine learning to detect abnormal event logs