Southern Utah University Senior CAPSTONE Project (2020)
Senior Year Capstone Project (SUU 2020): a visualization and plotting engine for detecting anomalies in historic and live data.
GuardPlot is a terminal-based, Python 3 log data analyzer. It creates databases, analyzes log data, displays visuals, makes a best guess at what went wrong, and keeps a log of hosts, anomaly severity, and the time at which each anomaly occurred.
This software is currently designed only for Linux systems, specifically Debian, but it can also be run in Docker using the included docker-compose file.
Clone or fork tybayn/guardplot to your local machine. Extract the files if necessary.
If not using Docker, use pip to install all of the libraries listed in 'requirements.txt'.
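For example, from the directory containing requirements.txt:

```
pip3 install -r requirements.txt
```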
Ensure that the following file structure is maintained and files are created:
```
./
└── guardplot
    ├── data
    │   └── anomReasons.csv
    ├── dbbuilder
    │   └── dbbuilder.py
    ├── example_data
    ├── utils
    │   ├── __init__.py
    │   ├── generate_data.py
    │   ├── gsclasses.py
    │   └── gslink.py
    ├── config.ini
    ├── DockerFile
    ├── guardplot.py
    ├── requirements.txt
    └── docker-compose.yaml
```
The anomReasons.csv file needs to exist for the program to have enough data to run. Below are the default contents; once reasons are determined for each severity level, they can be added to the file.

```
sevh3,dos attack,ddos attack
sevh2,dos attack,ddos attack,network is being probed
sevh1,network is being probed
sevl1,unknown reason
sevl2,unknown reason
sevl3,unknown reason
sev0,network is being held,recording host system error
glob,recording host system error
nores,system is being held,system is currently off
```
In the guardplot/ directory is the config.ini file that contains the following lines (the attributes are set to defaults):
```ini
[database]
import = data/example_data
database = data/HostData.db

[anomalies]
logs = data/anomLogs.csv
reasons = data/anomReasons.csv
```

These entries set the file locations used by the program; change them if your data or database files are stored elsewhere.
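For reference, here is a minimal sketch of how these settings can be read with Python's standard configparser module; the variable names are illustrative and not necessarily those used in guardplot.py:

```python
import configparser

# Read the configuration file in the guardplot/ directory
config = configparser.ConfigParser()
config.read("config.ini")

# Paths the rest of the program relies on
import_dir = config["database"]["import"]        # e.g. data/example_data
database_path = config["database"]["database"]   # e.g. data/HostData.db
anom_logs = config["anomalies"]["logs"]          # anomaly log output
anom_reasons = config["anomalies"]["reasons"]    # severity-to-reason lookup

print(import_dir, database_path, anom_logs, anom_reasons)
```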
The code is set up to read through all directories whose names use the date format "YYYY-mm-dd", with each directory containing a series of JSON files holding logs for times throughout that date. The dbbuilder.py software reads in the JSON-formatted data, preprocesses it, and saves it into a SQLite3 database.
The database reader expects the following JSON structure for each event:

```json
[
  {"host": "host_name_1", "events": 5, "date": "2022-10-03", "epoch": 1664820000},
  {"host": "host_name_1", "events": 1, "date": "2022-10-03", "epoch": 1664823600},
  ...
]
```
Bring up the project by running the following command in the terminal in the root of the project:
sudo -E docker-compose up --build
During the build process, example data simulating 90 days of events for 30 hosts is loaded into 'example_data', and the SQLite3 database is populated with that example data. All of the needed Python libraries (listed in requirements.txt) are also installed into the image.
Once the container is running, exec into the "guardplot" container:
sudo docker exec -it guardplot bash
Once the database is created, populated, and all data analyzed, you can start your analysis of the data using the guardplot.py software. Usage is shown below:

```
usage: GuardPlot.py [-h] [-a] [-g] [-i IP [-o | -d DATE | -l -e EVENTS [-p PREVIOUSEVENTS] -t TIME]]

optional arguments:
  -h, --help            show this help message and exit
  -a, --anomaly-view    displays details about anomalies rather than a plot
  -g, --global          performs global anomaly detection
  -i IP, --ip IP        evaluate the given ip address
  -o, --overall         if ip is provided, view daily stats from last 33 date entries
  -d DATE, --date DATE  if ip is provided, evaluate the given date
  -l, --live            if ip is provided, performs live anomaly detection, event and time required
  -e EVENTS, --events EVENTS
                        when in live mode, evaluate the live data point
  -p PREVIOUSEVENTS, --previous-events PREVIOUSEVENTS
                        when in live mode, evaluate the data point compared to previous data point
  -t EPOCH, --time EPOCH
                        when in live mode, provide an epoch for comparison
```
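A few hedged example invocations (the IP address, date, and counts below are placeholders; substitute hosts that actually exist in your database):

```
# Daily stats for one host over the last 33 date entries
python3 guardplot.py -i 10.0.0.5 -o

# Evaluate one host on a specific date
python3 guardplot.py -i 10.0.0.5 -d 2022-10-03

# Live check of a single data point against history
python3 guardplot.py -i 10.0.0.5 -l -e 12 -p 9 -t 1664823600
```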
Once you are ready to load your own data, place your correctly formatted files within the project and update 'config.ini' to reflect the location of your data, then rerun dbbuilder.py.
If you want to preload your own data into the database, place your data files in the correct location and update the path in 'config.ini'. Once the files are in place, use the dbbuilder.py program to rebuild the database:

```
usage: dbbuilder.py [-a] [-s] [-h]

optional arguments:
  -h, --help         show this help message and exit
  -a, --all-reload   reload all entries in the database
  -s, --stat-reload  reload all stats in the database
```
This will construct a database with the following tables:
NOTE: Using the command without any parameters will load any data not already included from the JSON resource and update the tables (the OVERALLSTATS table is completely recalculated).
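Example invocations, assuming the command is run from the directory containing dbbuilder.py:

```
# Incremental update: load only data not yet in the database
python3 dbbuilder.py

# Full rebuild of all entries and statistics
python3 dbbuilder.py -a
```

The conversion of JSON into SQLite3 is incredibly fast; however, the analysis performed on every data point is not. On average, expect this process to take 6-10 hours the very first time it runs. After that the database is simply updated and shouldn't take long if done routinely (recommended every time logs are received).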
This next section contains the documents and research done for the project as required by the University (literally, all of it). So unless you enjoy walls of text and what essentially amounts to journal entries, you are good to leave at this point!
For my project I will be completing a Python Application for GuardSight, Inc. This application is to look over historical data of hosts and be able to find anomalies in the data such as hosts that have gone down, hosts that are reporting higher amounts of attacks against them as compared to their average report in the past, hosts that fail to report as often as they have historically, and any other anomalies that are detected in the data when compared to a host’s historical averages.
The application will also need to be able to scan data live when a report comes in and compare it to historical values and alert if a particular anomaly has been determined to exist on the reporting host. This alert can then trigger an investigation or action for that host.
This application should run in a terminal so that it can be used on a server directly through SSH or another remote connection. Eventually this application should have the ability to create visuals, graphical and/or terminal based, to view the trends in hosts over time.
GuardSight, Inc. has described the project as using historical data to determine inconsistencies and anomalies in the data a particular host reports, then using that knowledge to write an application that can detect those same inconsistencies and anomalies for hosts live.
A portion of this project will require the research of what kinds of network events will cause what types of anomalous data from hosts. Using this will allow the software to make a best guess of what is happening when a host responds in an unexpected way. With this knowledge, either an application or a team of people may be able to respond faster with a more correct response. For example, a server restart or a network attack may return similar anomalous data for the host, but we don’t need to send an alert if a server is restarting.
In order to complete this project, I will need to research Data Analysis and algorithms for distance, frequency, and time. These algorithms will form the base for this project and my understanding of how they work and their application is vital to my ability to complete this project effectively. I will also need to research and refresh on statistical analysis in order to determine the standard deviations of a set of historical data, and how to adjust those values live as historical values are added to over time.
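For the reader, here is a minimal sketch of the kind of check described above: flag a new reading as anomalous when it falls more than a few standard deviations from a host's historical mean, then fold the reading back into the history. This is only an illustration of the approach, not the project's actual code.

```python
import statistics

def is_anomalous(history, new_value, threshold=3.0):
    """Return True if new_value is more than `threshold` standard
    deviations away from the mean of the historical event counts."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) / stdev > threshold

# Hypothetical historical event counts for one host
history = [5, 7, 6, 5, 8, 6, 7, 5, 6, 7]
reading = 42

if is_anomalous(history, reading):
    print("anomaly: reading deviates from the historical baseline")

history.append(reading)  # historical values are added to over time
```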
This application will be written in Python 3. Having never used the language, I will need to learn how to develop and run applications in Python for both Linux and Windows operating systems. I will be spending the time to research the basics of Python, as well as the Data Science and Data Visualization applications of Python.
This program will challenge me to try multiple fields that I have not used before, such as Data Science and Python. But it will also utilize skills I have learned during my time in school, such as advanced algorithms, object-oriented programming, databases/JSON, and patterns-based programming, to name a few.
Overall, this project is to provide a way for anomalous data to be detected from historic data and use that historical data to determine those same anomalies on hosts as they send in data.
Jan 19 - Jan 25
Jan 26 - Feb 1
Feb 2 - Feb 8
Feb 9 - Feb 15
Feb 16 - Feb 22
Feb 23 - Feb 29
Mar 1 - Mar 7
Mar 8 - Mar 14
Mar 15 - Mar 21
Mar 22 - Mar 28
Mar 29 - Apr 4
Apr 5 - Apr 11
The above depicts the expected timeline and design requirements. The actual timeline and design documents are outlined in the following sections.
One of the major segments of the project is to analyze the data that was given to me by GuardSight and find ways to detect anomalies in the data, whatever those may be. A helpful resource is the article '5 Ways to Detect Outliers/Anomalies That Every Data Scientist Should Know (Python Code)', which discusses five different ways to detect anomalies and outliers in data, what they could possibly mean, and how to implement each type of analysis using Python code. This will be a great asset once I start implementing ways to detect anomalies and outliers in the data given to me, and it is a nice plus that the article has example Python code that I can use as a base for how I want to implement my detection.
Badr, W. (2019, April 12). 5 Ways to Detect Outliers That Every Data Scientist Should Know (Python Code). Retrieved from https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623
One of the best ways to understand the data is to view it graphically. While I was not able to find software that I liked entirely (and will likely have to make my own), I found a resource with a lot of ideas and components that can serve as inspiration. The Python library "visidata" contains dynamic, colored graphs and tools for the Linux terminal that I could use as a model for building an ideal tool for the data at hand.
Pwanson, S. (n.d.). VisiData. Retrieved from https://www.visidata.org/
I am currently taking a Big Data analytics class at SUU. The required reading for this class includes many valuable lessons and resources on how to analyze large amounts of data. I have already been using some of the ideas in the book to configure temporary databases and decide how they should be maintained. I also plan to use this resource when I start working on statistical analysis, as it gives a good overview of different kinds of analysis and how to apply them.
Sharda, R., Delen, D., & Turban, E. (2018). Business intelligence, analytics, and data science: A managerial perspective (4th ed.). Boston: Pearson.
After meeting again with GuardSight, it was determined that we would need to create an alternate database structure to hold the data so it can be read in faster through the use of Python rather than Linux commands. We agreed that SQLite would be the best option. I will be using the website called “Tutorials Point” to learn how to install SQLite3 onto Linux and how to address the database from Python.
SQLite - Installation. (n.d.). Retrieved February 24, 2020, from https://www.tutorialspoint.com/sqlite/sqlite_installation.htm
The entirety of my project is written in Python, which is a language that I have had very little experience with. As such I have needed to be able to learn the basics of the language, as well as a few other tidbits of knowledge in order to program effectively. The majority of my python learning has come from the ‘Automate the boring stuff with Python’ book by Sweigart. I have been using the lessons taught in this book to create and try different python projects, and ultimately applying it to create the analysis software for my capstone project.
Sweigart, A. (2015). Automate the boring stuff with Python: practical programming for total beginners. San Francisco, CA: No Starch Press, Inc.
The above image shows what the plotter engine produces when looking at a single day's data.
The above image shows the idea and execution of live anomalies being detected using command line arguments.
The above image shows the anomaly detail view of events that occur in the data.
My first thoughts are that this project is very interesting and I am learning a lot about Python, data management, temporary files, multi-threading, read-in times, and statistical analysis. I have been able to apply what I have learned so far, and I am making a lot of progress. I am enjoying the project, but I feel there are small things here and there that are causing some issues with the end product. The first is that reading in the data in its current form is incredibly slow, and Python is not great at reading in that much data at once. I have been able to get it to work quickly on Linux using a system call out to grep, but with the need to be multiplatform I need to find a way to do it solely in Python.
I have been doing some research on how to do this, and without downloading large packages to handle it, I think I will need to create a middle man / temp database that can store certain amounts of data for when the data is needed quickly. Otherwise I suppose it is fine if the program takes some time.
I also feel I need to spend some more time with the client asking about the data set. I feel I understand the dataset pretty well, but there is a single detail I am not sure of: whether a host is reporting all events since the last boot, or all events since the last report. This clarification will alter how I continue with the project and how I analyze the data.
Overall, the project is going extremely well and I haven’t run into many roadblocks other than those listed above. I am on track to finish what I need to by the timeline I have set for myself. As far as improvements, I need to start checking in with GuardSight, Inc. frequently not only for testing, but to make sure the software will work with the system in the way they are envisioning it, just so I don’t get super far along and then realize it won’t work. This hasn’t been an issue yet since a lot of my time thus far is just analyzing historical data and writing basic scripts. But starting here in the next two weeks I will need to start that weekly communication.
2020-01-22
2020-01-29
2020-01-30
2020-02-08
2020-02-11
2020-02-17
2020-02-18
2020-02-19
2020-02-20
2020-03-11
2020-03-16
2020-03-17
2020-03-18
2020-03-19
2020-03-28
2020-04-01
The actual code and software project are located on my GitHub account: tybayn/guardplot.
The goal was to write software that could go through historical logs, find inconsistencies and anomalies in the data, and report on those. This project was quite difficult; it pushed me to many different limits and forced me to go beyond them. Almost the entire project was made of new material for me. I had to learn Python 3, the Linux command line, big data analysis, and data science. But it also utilized skills I have learned during my time in school, such as advanced algorithms, object-oriented programming, databases/JSON, and patterns-based programming, to name a few.
Overall, I really enjoyed this project. I appreciate how far the project pushed me and my skills and forced me to learn more. I have learned many valuable lessons and skills that I would not have been able to learn before. This project was the perfect display of everything I have learned during my time in school, and I feel that it truly is the Capstone of my degree.
Initial Release