GuardPlot

Anomaly Detection and Plotting Engine

Southern Utah University Senior Capstone Project (2020)


  • Created: 2020-04-01
  • Updated: 2022-10-03
  • License: MIT

Senior Year Capstone Project (SUU 2020): a visualization and plotting engine to detect and visualize anomalies in historical and live data.

GuardPlot is a terminal-based Python 3 log data analyzer. It creates databases, analyzes log data, displays visuals, guesses at what went wrong, and keeps a log of hosts, anomaly severity, and the time at which each anomaly occurred.

This software is currently designed only for Linux systems (specifically Debian), but it can be run in Docker using the included docker-compose file.


Getting Started

Installation

Clone or fork tybayn/guardplot to your local machine. Extract the files if necessary.

If not using Docker, use pip to install all of the libraries listed in 'requirements.txt'.
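
For example, from the guardplot/ directory:

pip3 install -r requirements.txt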


Prerequisites

File Structure

Ensure that the following file structure is maintained and files are created:

./
|
└──guardplot
|  |
|  └──data
|  |  |  anomReasons.csv
|  |
|  └──dbbuilder
|  |  |  dbbuilder.py
|  |
|  └──example_data
|  |
|  └──utils
|  |  |  __init__.py
|  |  |  generate_data.py
|  |  |  gsclasses.py
|  |  |  gslink.py
|  |
|  |  config.ini
|  |  DockerFile
|  |  guardplot.py
|  |  requirements.txt
|
|  docker-compose.yaml
          

anomReasons.csv

The anomReasons.csv file needs to exist for the program to have enough data to run. Below are the default contents; once values are determined for each severity level, they can be added to the file.


sevh3,dos attack,ddos attack
sevh2,dos attack,ddos attack,network is being probed
sevh1,network is being probed
sevl1,unknown reason
sevl2,unknown reason
sevl3,unknown reason
sev0,network is being held,recording host system error
glob,recording host system error
nores,system is being held,system is currently off
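
Each row maps a severity code (the first column) to one or more possible reasons. A minimal sketch of how such a file could be loaded into a lookup table, purely for illustration (this is not the actual GuardPlot parsing code):

import csv

def load_reasons(path="data/anomReasons.csv"):
    """Map each severity code to its list of possible reasons."""
    reasons = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:                      # skip any blank lines
                reasons[row[0]] = row[1:]
    return reasons

# Example: load_reasons().get("sevh3") -> ['dos attack', 'ddos attack']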
          

Configuration File

In the guardplot/ directory is the config.ini file that contains the following lines (the attributes are set to defaults):


[database]
import = data/example_data
database = data/HostData.db

[anomalies]
logs = data/anomLogs.csv
reasons = data/anomReasons.csv
          

These settings define the file locations used by the program; changing them changes where the program looks for (or creates) each file. A short example of reading these values follows the list below.

  • import: The directory that contains the json files that need to be read into the database
  • database: The location of the database file used by the software
  • logs: The location of the anomaly logs recorded by the software
  • reasons: The best guess of the software as to the reasons behind anomalies
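
A minimal sketch using Python's built-in configparser (illustrative only, not the actual GuardPlot code):

import configparser

config = configparser.ConfigParser()
config.read("config.ini")

import_dir   = config["database"]["import"]     # e.g. data/example_data
db_path      = config["database"]["database"]   # e.g. data/HostData.db
logs_path    = config["anomalies"]["logs"]      # e.g. data/anomLogs.csv
reasons_path = config["anomalies"]["reasons"]   # e.g. data/anomReasons.csv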

Input Data

The code is set up to read through all directories named with the date format "YYYY-mm-dd", each containing a series of JSON files that hold logs for times throughout that date. The dbbuilder.py software reads in the JSON-formatted data, preprocesses it, and saves it into a SQLite3 database.

The database reader expects the following JSON structure for each event:


[
  {"host":"host_name_1","events":5,"date":"2022-10-03","epoch":1664820000},
  {"host":"host_name_1","events":1,"date":"2022-10-03","epoch":1664823600},
  ...
]
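
A rough sketch of how data in this layout could be walked and inserted into a SQLite table (the LOGS table here simply mirrors the input columns; dbbuilder.py additionally preprocesses the data and computes statistics, so treat this as illustration only):

import json
import os
import re
import sqlite3

DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def load_logs(import_dir, db_path):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS LOGS (host TEXT, events INTEGER, date TEXT, epoch INTEGER)")
    for day in sorted(os.listdir(import_dir)):
        day_path = os.path.join(import_dir, day)
        if not (DATE_DIR.match(day) and os.path.isdir(day_path)):
            continue                              # only read directories named YYYY-mm-dd
        for name in sorted(os.listdir(day_path)):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(day_path, name)) as f:
                entries = json.load(f)
            con.executemany(
                "INSERT INTO LOGS VALUES (?, ?, ?, ?)",
                [(e["host"], e["events"], e["date"], e["epoch"]) for e in entries],
            )
    con.commit()
    con.close()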
          

Running in Docker

Starting the Docker Container

Bring up the project by running the following command in the terminal in the root of the project:

sudo -E docker-compose up --build

During the build process, example data simulating 90 days of events for 30 hosts is loaded into 'example_data' and the SQLite3 database is loaded with that example data. All of the needed Python libraries (listed in requirements.txt) are also installed into the image.

Once the container is running, exec into the "guardplot" container:

sudo docker exec -it guardplot bash

Using Guardplot

Once the database is created and populated and all data is analyzed, you can start your analysis of the data using the guardplot.py software. The usage is below:


usage: GuardPlot.py [-h] [-a] [-g] [-i IP [-o | -d DATE | -l -e EVENTS [-p PREVIOUSEVENTS] -t TIME]]

optional arguments:
  -h, --help            show this help message and exit
  -a, --anomaly-view    displays details about anomalies rather than a plot
  -g, --global          performs global anomaly detection
  -i IP, --ip IP        evaluate the given ip address
  -o, --overall         if ip is provided, view daily stats from last 33 date entries
  -d DATE, --date DATE  if ip is provided, evaluate the given date
  -l, --live            if ip is provided, performs live anomaly detection, event and time required
  -e EVENTS, --events EVENTS
                        when in live mode, evaluate the live data point
  -p PREVIOUSEVENTS, --previous-events PREVIOUSEVENTS
                        when in live mode, evaluate the data point compared to previous data point
  -t EPOCH, --time EPOCH
                        when in live mode, provide an epoch for comparison
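
For example, from the guardplot/ directory (the IP address, event count, and epoch below are placeholders):

python3 guardplot.py -i 10.0.0.5 -o

python3 guardplot.py -i 10.0.0.5 -l -e 42 -t 1664823600

The first command plots daily stats for the given host; the second evaluates a single live data point against that host's history.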
          

Adding Data

Once you are ready to load your own data, place your correctly formatted files within the project and update 'config.ini' to reflect the location of your data. Then rerun 'dbbuilder.py'.


Building the Database

In the case that you want to preload your own data into the database, place your data files into the correct location and change the path in 'config.ini'. Once the files are in place, use the dbbuilder.py program to rebuild the database:


usage: dbbuilder.py [-a] [-s] [-h]

optional arguments:
  -h, --help         show this help message and exit
  -a, --all-reload   reload all entries in the database
  -s, --stat-reload  reload all stats in the database
          

This will construct a database with the following tables:

  • LOGS: The raw data from the Json
  • INDVSTATS: Contains the analysis and stats of each log entry
  • DAILYSTATS: Contains the analysis and stats for each day of each host
  • OVERALLSTATS: Contains the analysis and stats for each host for all time

NOTE: Using the command without any parameters will load any data not already included from the JSON source and update the tables (the OVERALLSTATS table is completely recalculated).

  • Using the [-a] parameter will drop all tables and recreate the database.
  • Using the [-s] parameter will drop all stats entries in the database and recalculate them.
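
For example, an incremental update and a full rebuild look like this (run from wherever the paths in 'config.ini' resolve correctly; dbbuilder.py lives in guardplot/dbbuilder/):

python3 dbbuilder.py

python3 dbbuilder.py -a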

The conversion of JSON into SQLite3 is incredibly fast; however, the actual analysis performed on each and every data point is not. On average, expect this process to take 6-10 hours to complete on the very first use. After that, the database is simply updated and shouldn't take long if done routinely (recommended every time logs are received).
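
After a build, the result can be spot-checked directly with the sqlite3 command-line tool, if it is installed; for example (table names as listed above, database path as set in config.ini):

sqlite3 data/HostData.db "SELECT COUNT(*) FROM LOGS;"

sqlite3 data/HostData.db "SELECT * FROM DAILYSTATS LIMIT 5;"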


About the Project

This next section contains the documents and research for the project that were required by the University (literally, all of it). So unless you enjoy walls of text and what essentially amounts to journal entries, you are good to leave at this point!

Project Description

For my project I will be completing a Python Application for GuardSight, Inc. This application is to look over historical data of hosts and be able to find anomalies in the data such as hosts that have gone down, hosts that are reporting higher amounts of attacks against them as compared to their average report in the past, hosts that fail to report as often as they have historically, and any other anomalies that are detected in the data when compared to a host’s historical averages.

The application will also need to be able to scan data live when a report comes in and compare it to historical values and alert if a particular anomaly has been determined to exist on the reporting host. This alert can then trigger an investigation or action for that host.

This application should run in a terminal so that it can be used on a server directly through SSH or a remote connection. Eventually this application should have the ability to create visuals, graphical and/or terminal-based, to view the trends in hosts over time.

GuardSight, Inc. has described the project as using historical data to determine inconsistencies and anomalies in the data a particular host reports, then using that knowledge to write an application that can detect those same inconsistencies and anomalies for hosts live.

A portion of this project will require the research of what kinds of network events will cause what types of anomalous data from hosts. Using this will allow the software to make a best guess of what is happening when a host responds in an unexpected way. With this knowledge, either an application or a team of people may be able to respond faster with a more correct response. For example, a server restart or a network attack may return similar anomalous data for the host, but we don’t need to send an alert if a server is restarting.

In order to complete this project, I will need to research Data Analysis and algorithms for distance, frequency, and time. These algorithms will form the base for this project and my understanding of how they work and their application is vital to my ability to complete this project effectively. I will also need to research and refresh on statistical analysis in order to determine the standard deviations of a set of historical data, and how to adjust those values live as historical values are added to over time.

This application will be written in Python 3. Having never used the language, I will need to learn how to develop and run applications in Python for both Linux and Windows operating systems. I will be spending the time to research the basics of Python, as well as the Data Science and Data Visualization applications of Python.

This program will challenge me to try multiple fields that I have not used before, such as Data Science and Python. But it will also utilize skills I have learned during my time in school such as advanced algorithms, object-oriented programming, databases/JSON, and patterns-based programming, to name a few.

Objectives:

Overall, this project is to provide a way for anomalous data to be detected from historic data and use that historical data to determine those same anomalies on hosts as they send in data.

  • Main Objectives:
    • Learn to develop and run Python 3 applications in Linux via the terminal since the server may not have a graphical desktop (remote connection may or may not be terminal based)
    • Analyze the data set to find historical anomalies in the data
    • Research how best to compare live data to historical data quickly without the need to read through all the databases every time a report comes in
    • Research and associate what kinds of network events may cause anomalies in host responses and find a way to differentiate between them
    • Research the strengths and usages of potential algorithms for analyzing frequency and time within the data
    • Research the applicable topics of statistical analysis that will be used in analyzing the data
    • Create a Python application that can scan the data and report those anomalies from the historical data along with possible causes
  • Secondary Objectives:
    • Create a Python application that can scan most recent host reports, compare them to historical reports, and report anomalies
    • Allow the Python program to be called from the software that controls the data. (This means implementing command line args)
    • Have the Python application log a history of anomalies
  • Tertiary Objectives:
    • Allow the Python application to create and show terminal friendly graphical representations of the data
    • Allow the Python application to create and save graphical representations of the data (not necessarily terminal or ssh friendly)
    • Create a live visual tracker in Python3 (for Linux with graphical desktop) to see the trends in reports and live reports for a given host of interest

Estimated Timeline:

Jan 19 - Jan 25

  • Begin learning Python3 syntax:
    • Standard syntax, loops, defs, dictionaries, file reading, json files
  • Become familiar with the data set, what it includes, and how best to analyze it
  • Research statistical Python libraries
  • Create example Python projects to test out what I have learned

Jan 26 - Feb 1

  • Continue learning Python3:
    • Multithreading, automatic start, command line arguments, piping, storing data temporarily for fast access.
  • Research fast and efficient ways to read in 2GB+ of json data in Linux, is Python the fastest?
  • Create example Python projects to test what I have learned

Feb 2 - Feb 8

  • Begin structure of data-read-in code; it must not take longer than 10 seconds to read in data the first time, and once read in, the data must be stored in local fast-access files (.tjson)
  • Program at this point will accept an IPv4 address, gather the data from the json database, and display the average number of events and the standard deviation for that host

Feb 9 - Feb 15

  • Feb 14 is the last day for any major changes to the project
  • Continue working on speed for reading in data (consider new storage method) if needed
  • Research statistical analysis and learning that can be applied to this data
  • Create example Python projects to test statistical analysis

Feb 16 - Feb 22

  • Apply statistical analysis code to main program
  • Research types of network events and how they affect network devices
  • Program at this point will report any anomalies in a given host's historical data

Feb 23 - Feb 29

  • Continue research on causes for anomalous data, including power loss or down time
  • Associate normal activity and time of day for better predictions
  • Begin adding anomaly detection and possible cause to main program

Mar 1 - Mar 7

  • Continue anomaly detection code if needed
  • Allow pipe or command line input of a json entry to begin live data comparison and live detection
  • Add new data to temp data for faster and better history retrieval
  • Research storing history of anomalies for a host
  • Program at this point will be able to report specific anomalies in a given host's historical data

Mar 8 - Mar 14

  • Apply creating history of anomalous data
  • Finalize application of reading in live data
  • Begin application of updating temp data (how and when)
  • Program at this point will be able to report anomalies of live data as compared to historical data, the possible causes, and create a history of that host's anomaly history

Mar 15 - Mar 21

  • Spring break
  • Will use this time to catch up if needed

Mar 22 - Mar 28

  • Begin porting application to Windows, ensure that json data can still be read in in a new environment
  • Research graphical and terminal visualization of data
    • Graphical for Windows
    • Terminal based for Linux
  • Create example Python projects to test visualization
  • Program at this point will work on both Linux and Windows

Mar 29 - Apr 4

  • Ensure the program works with the GuardSight, Inc. system (mainly folder and file paths)
  • Apply visualization to main program
  • Start reports
    • Algorithm and process report for GuardSight, Inc. (private)
    • Standard Capstone project report for Southern Utah University
  • Program at this point will be complete

Apr 5 - Apr 11

  • Deliver code and report to GuardSight, Inc.
  • Help setup program at GuardSight, Inc. if necessary
  • Capstone project complete

The above depicts the expected timeline and design requirements. The actual timeline and design documents are outlined in the following sections.


Design Docs

Basic Research

One of the more major segments of the project is to analyze the data that was given to me by GuardSight and find ways to detect anomalies in the data, whatever those may be. The article entitled ‘5 Ways to Detect Outliers/Anomalies That Every Data Scientist Should Know (Python Code)’ discusses five different ways to detect anomalies and outliers in data, what they could possibly mean, and how to implement each type of analysis using Python code. This will be a great asset to my project once I start needing to implement ways to detect anomalies and outliers in the data given to me, and it is a nice plus that the article has example Python code that can give me a base for how I want to implement my detection.

Badr, W. (2019, April 12). 5 Ways to Detect Outliers That Every Data Scientist Should Know (Python Code). Retrieved from https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623

One of the best ways to understand the data is to view it graphically. While I was not able to find software that I liked entirely and will likely have to make my own, I found a resource that has a lot of ideas and parts that can be used as inspiration. The Python library “visidata” contains lots of dynamic and colored graphs and tools for the Linux terminal that I could draw on to build ideal software for the data at hand.

Pwanson, S. (n.d.). VisiData. Retrieved from https://www.visidata.org/

I am currently taking a Big Data analytics class at SUU. In this class the required reading includes many valuable lessons and resources on how to analyze large amounts of data. I have already been using some of the ideas in the book to configure temporary databases and decide how they should be maintained. I also plan to use this resource when I start working on statistical analysis. It gives a good overview of different kinds of analysis and how to apply them.

Sharda, R., Delen, D., & Turban, E. (2018). Business intelligence, analytics, and data science: A managerial perspective (4th ed.). Boston: Pearson.

After meeting again with GuardSight, it was determined that we would need to create an alternate database structure to hold the data so it can be read in faster through the use of Python rather than Linux commands. We agreed that SQLite would be the best option. I will be using the website called “Tutorials Point” to learn how to install SQLite3 onto Linux and how to address the database from Python.

SQLite - Installation. (n.d.). Retrieved February 24, 2020, from https://www.tutorialspoint.com/sqlite/sqlite_installation.htm

The entirety of my project is written in Python, which is a language that I have had very little experience with. As such, I have needed to learn the basics of the language, as well as a few other tidbits of knowledge, in order to program effectively. The majority of my Python learning has come from the ‘Automate the Boring Stuff with Python’ book by Sweigart. I have been using the lessons taught in this book to create and try different Python projects, and ultimately applying them to create the analysis software for my capstone project.

Sweigart, A. (2015). Automate the boring stuff with Python: practical programming for total beginners. San Francisco, CA: No Starch Press, Inc.

Plotter Design:

The above image shows what the plotter engine produces when looking at a single day's data.

The above image shows the idea and execution of live anomalies being detected using command line arguments.

The above image shows the anomaly detail view of events that occur in the data.

Initial Thoughts:

My first thoughts are that this project is very interesting and I am learning a lot about Python, data management, temporary files, multi-threading, read-in times, and statistical analysis. I have been able to apply what I have learned so far, and I am making a lot of progress. I am enjoying the project, but I feel there are small things here and there that are causing some issues with the end product. The first is that reading in data in its current form is incredibly slow, and Python is not great at reading in that much data at once. I have been able to get it to work super quickly in Linux using a system call out to grep, but with the need to be multiplatform I am needing to find a way to do it solely in Python.

I have been doing some research on how to do this, and without downloading large packages to handle it, I think I will need to create a middle man / temp database that can store certain amounts of data for when the data is needed quickly. Otherwise I suppose it is fine if the program takes some time.

I also feel I need to spend some more time with the client asking about the data set. I feel I understand the dataset pretty well, but there is a single detail I am not sure about. I’m not sure if a host is reporting all events since the last boot, or if it is reporting all events since the last report. This clarification will alter how I continue with the project and how I analyze the data.

Overall, the project is going extremely well and I haven’t run into many roadblocks other than those listed above. I am on track to finish what I need to by the timeline I have set for myself. As far as improvements, I need to start checking in with GuardSight, Inc. frequently not only for testing, but to make sure the software will work with the system in the way they are envisioning it, just so I don’t get super far along and then realize it won’t work. This hasn’t been an issue yet since a lot of my time thus far is just analyzing historical data and writing basic scripts. But starting here in the next two weeks I will need to start that weekly communication.


Timeline and Updates

2020-01-22

  • I started learning Python3 for Linux, wrote example code to learn the basics, researched the fastest way to retrieve data from a JSON database (grep is infinitely faster than Python), started writing main program segments to test parts, researched mean and standard deviation for a sample of Python data, and learned to read JSON into Python as a dictionary array. In the process of testing JSON loading, I found that reading all data in via Python and then searching for the host is ill advised: it takes close to an hour to read in the data, and most times the server would crash. I found that using Linux's 'grep' to locate and print the data into a temp database file, and then reading that temp file in, is much faster, usually taking no more than 10-20 seconds. To reduce time in the future, if a tempDB (.tJson) file is not found for a host, the software will run a Linux shell command from Python to create the file and then read that file in. If the file already exists, then it just reads the .tJson file instead. This implies that the software will have a full search mode and a quick search mode, where the full search will recreate the .tJson file to update it, while the quick mode will simply read the existing .tJson if it exists (a rough sketch of this approach follows).
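
A rough sketch of the grep-then-read idea described above; the grep pattern, directory layout, and .tJson line format here are assumptions for illustration, not the project's actual code:

import json
import os
import subprocess

def get_host_data(host, db_dir="data", tmp_path=None):
    """Quick mode: reuse the host's .tJson temp file if it exists; otherwise build it with grep first."""
    tmp_path = tmp_path or f"{host}.tJson"
    if not os.path.exists(tmp_path):
        # Full mode: pull every matching line out of the raw JSON logs with a fixed-string grep
        with open(tmp_path, "w") as out:
            subprocess.run(["grep", "-rhF", f'"host":"{host}"', db_dir], stdout=out, check=False)
    with open(tmp_path) as f:
        return [json.loads(ln.strip().rstrip(","))
                for ln in f if ln.strip().startswith("{")]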

2020-01-29

  • I started researching different ways to read in JSON data. There were a lot of options that included external libraries, but in the interest of using the bare bones of Python I decided to stick with the built-in libraries. I tried multiple approaches to get the data read in, but ultimately decided on the following:
    • The read-in needs to be multithreaded
    • The read-in needs to have each thread focus on one date file
    • The read-in needs to be in Python and not a system call to grep; this way it can work cross-system
    • JSON needs to be streamed
  • Following this idea, I researched how to construct threads within Python. I ultimately created a Python class that extends the Thread class so that I can pull the JSON data object from the thread once it joins back and place it in the central JSON data variable (a minimal sketch of this pattern follows this entry). I also think that only pulling 30 days of data would be beneficial for the following reasons:
    • Reading the full Json database in Python takes 33 seconds while grep takes 5 seconds
    • Trends change over time, if a host is more active in the last month, then we want to consider that as normal
    • From my research into Big Data analysis (my Big Data class), the temporary DB for hosts is a good idea, and only holding 30 days worth of data per host will save space over time, and then every 7 days reconstruct those temp files so that it is always up to date
  • This doesn't mean that there shouldn't be the ability to read all the data in, but that needs to be reserved for the non-live version of the app since it will take a few minutes per host to read and analyze all of its data, especially as the amount of data grows continuously.
  • I also spent the time to continue learning Python. I learned how to create classes, how to extend classes, how to use date times, how to analyze directories, the proper form for constants, and how to set timers. I used the timer functions to create a way to see how long it takes to read in data. To read in the full DB data for a host via grep is 5.18 s, to read the full DB in for a host via Python is 33.01 s, and reading in 30 days worth of data for a host via Python is 5.33 s.
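
A minimal sketch of the Thread-subclass pattern described above (the file names are placeholders; the real reader walks the dated directories and merges many more files):

import json
import threading

class JsonReadThread(threading.Thread):
    """Read and parse one day's JSON file, keeping the result available after join()."""
    def __init__(self, path):
        super().__init__()
        self.path = path
        self.data = []

    def run(self):
        with open(self.path) as f:
            self.data = json.load(f)

# One thread per date file, then merge everything into a central list
threads = [JsonReadThread(p) for p in ["2020-01-01.json", "2020-01-02.json"]]
for t in threads:
    t.start()
all_data = []
for t in threads:
    t.join()
    all_data.extend(t.data)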

2020-01-30

  • I started researching piping into the Python project. The process of piping is rather simple and easy to parse. This way I can pipe in an ip itself, or I can pipe an entire Json entry in. What I figure is I can do the following:
    • If no pipe or com arg: Ask for IP, do full historic scan
    • If only ip arg or only IP pipe: Use IP to do full historic scan
    • If Json is piped: Run a fast analysis, use the temp database or grab only 30 days of data for a fast scan (see the sketch after this list)
  • I also added comments to the code I have completed and simplified the defined methods. The program at this point allows for input and grabs the actual data, temp data is created, and displays average and standard deviation.
  • Note: I might add the ability to pipe in or add a file name from command line for live analysis of all Json entries in that file.
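
A small sketch of that branching using sys.stdin (illustrative only; the real argument handling later moved to argparse, and the returned tags here are my own):

import json
import sys

def get_input():
    """Decide between an interactive prompt, a piped-in IP, and a piped-in JSON entry."""
    if sys.stdin.isatty():
        return ("ip", input("Enter host IP: "))   # no pipe: ask for an IP, full historic scan
    piped = sys.stdin.read().strip()
    try:
        return ("json", json.loads(piped))        # a JSON entry was piped in: fast 30-day scan
    except json.JSONDecodeError:
        return ("ip", piped)                      # otherwise treat the piped text as an IP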

2020-02-08

  • I added in the ability to read an entire Json file for comparisons. Usage: Once a Json file is created, you can pipe that file name into the software and it will run a quick analysis on every host in the file.
  • Note: It takes an incredibly long time to do this without the use of the .tJson files, which means the .tJson files are critical to the performance of the software. I will need to change how the .tJson files get repopulated; it is not worth the time to only refresh every 7 days, which could greatly slow down performance. I will likely need to just keep a 30 day queue: when a new entry comes in, add it to the .tJson file and drop the oldest entry (sketched below). This way the temp files are always up to date and never need to pull from the full database unless absolutely necessary. I will work on this next.
  • I will also consider the possibility of creating a separate Python script to prepare the temp data base the very first time.
  • The program at this point will accept an IPv4 address, gather the data from the Json database, and display the average number of events and the standard deviation for that host.
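
A sketch of that rolling 30-day window (the .tJson layout is assumed here to be a plain JSON array, and the cutoff is applied by epoch; both are assumptions for illustration):

import json

THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds

def append_entry(tmp_path, entry):
    """Add a new report to a host's .tJson window and drop anything older than 30 days."""
    try:
        with open(tmp_path) as f:
            window = json.load(f)
    except FileNotFoundError:
        window = []
    if entry not in window:                               # skip duplicate reports
        window.append(entry)
    cutoff = entry["epoch"] - THIRTY_DAYS
    window = [e for e in window if e["epoch"] >= cutoff]  # rotate out the oldest entries
    with open(tmp_path, "w") as f:
        json.dump(window, f)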

2020-02-11

  • The goal today was to get two things fixed.
    • Get the temp files to auto-rotate data every time a new entry is input, which it now does. When a new entry is detected, it will leave off the oldest entries and add on the newest entry (as long as it is not a duplicate)
    • I clarified with GuardSight that the "events" number is not per report, it reports the total number of events since the last day or reboot (reports reset to 0 at 0 hours). This means I had to make sure that I was looking at the span between reports as the number of events.
  • This leaves a number of odd things that need to be considered. I may need to consider the time between reports, this way a 2 hour gap is broken down versus a 15 minute gap. I will also need to consider the option of comparing data to only the data from the same time each year. I am hoping to get an up-to-date data set at the next meeting. This way I can work on the newer data so that my messed up fake data is replaced.
  • Also, I will need to create a separate Python script to create the entire temp database for first-time use. This will save a lot of time in the future. If a new host is created the existing program will take care of it. Creating a single new file takes only 5-8 seconds, which is fine if it's few and far between, but even creating the temp files for a single report Json can take many hours.

2020-02-17

  • I have been researching different types of terminal plots, and I have not been able to find any that I believe will work well for what will be needed. I will be designing and implementing my own.
  • While implementing this plotter, I noticed that August 2019 data was extremely odd and out of standard range. This was near the beginning of the data entries, but the data seemed odd for a couple of days. I'm not sure why the values didn't reset back to 0, but I will be talking to GuardSight about these anomalies.
  • I was able to complete the new visualization software for now. It shows the data from the current time back a couple of days (depending on the length of the plot). It will show the AVG and STD DEV of both the overall data and the current display data. The entries are color coded: white is within range, yellow is within at least one of the std devs, and red is outside of both std devs (the rule is condensed in the sketch after this entry). For testing I have the data being read in via Linux grep... I think I may need to discuss options with GuardSight, as grep literally takes seconds to run and is much faster than Python reading in the data alone.
  • This new visualization software in conjunction with the standard software is able to detect anomalies as far as being out of the std devs.
  • This has led me to begin thinking about algorithms to detect anomalies and when to report them. Using visualizations and the comparisons of a set amount of time vs all time gives us a good idea of what may indeed be too high or too low. This basic algorithm will likely be the start of all algorithms finding anomalies.
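
The color rule above boils down to checking each point against two bands, the overall avg ± std dev and the displayed window's avg ± std dev; a tiny sketch with the band boundaries passed in as plain numbers (not the actual plotter code):

def point_color(value, overall_low, overall_high, window_low, window_high):
    """White: inside both bands; yellow: outside exactly one; red: outside both."""
    bands_outside = sum([
        not (overall_low <= value <= overall_high),
        not (window_low <= value <= window_high),
    ])
    return ("white", "yellow", "red")[bands_outside]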

2020-02-18

  • I really like the way the visualization software ended up working. I have decided to keep the Linux commands for the nature of speed. I want to make this the starting point for the historical analysis, and allow the user to choose an IP, then select a date they want to view. I have mapped out the stages/phases that will be implemented.

2020-02-19

  • I made some major changes to the visualization software. I got the plot fitted and all the screens necessary for it to work as intended. I have learned a lot about the data set, and I have found that the original algorithm to detect anomalies may need to be altered. We may also need to do anomaly testing in two cases: (1) the day; and (2) the day compared to overall. There is definite room for improvement, but I feel better about where this part of the project is at. I will be meeting with GuardSight on Thursday 2/20/2020 to show what has been done, and at this point I feel I will be able to discuss what needs to be detected with better familiarity with the data.

2020-02-20

  • I met with GuardSight today. We looked over what was done already, redefined some of the objectives, and discussed certain things that need to be done. I have created a revised contract that will need to be approved by both GuardSight and Dr. Cantrell.
  • I will need to implement a SQLite database for storing the data so that it can be accessed faster. We also discussed the algorithms and that distance is not a concern. I will need to go back and make sure that the program accepts command line arguments.

2020-03-11

  • I started to break things into multiple parts. I started and completed the Python script that reads all of the Json data into a new SQLite3 database. I am happy to report that it is much more efficient and faster.
  • I also had to research how to read in command line args for Python. I think I have settled on the built in Python library "argparse". This makes it easy and I will plan to use it for all other commands.
  • I also started the separation of the data reader and GuardPlot. This way GuardPlot can call the reader and not have to be changed; only the reader needs to change if the database format or type changes.
  • Overall, today I completed the new SQLite3 database implementation and Python script to read data in efficiently and with control.

2020-03-11

  • I continued working on the read in portion and got that far enough along for testing. I also made sure that the data being pulled from the database matches the Json pulls from the old version. Good news, they match.
  • I also changed how colors are handled in the graphical format. It does a full analysis of all points as it's writing data. Instead of assigning colors based on graphical boundaries, I write those colors based on actual analysis. This means the number of anomalies on the date screen matches the number of red values on the graph.
  • As far as what needs to be done by Thursday:
    • Allow for command line args for the main program
    • Allow the user to view a graph of all time (or last 96 days) where each point is a day and represents the average for that day.
    • When an anomaly is found, find a way to compare it to other hosts to determine if everyone experienced the same anomaly (GuardSight's end) or if it is unique to the host (Host's end).

2020-03-16

  • After working on some of the analysis steps, I've determined that the single table with logs will not be sufficient to do an on-the-fly analysis. I decided to open up the loadDB.py file again and add more functionality. I added 3 more tables: once the initial logs are loaded into the database, analysis is performed on all the hosts so that we can quickly access statistics for any host without having to do an on-the-fly analysis. However, while this does save a tremendous amount of time while the code is running, the initial setup of the database can take close to 10 hours to populate for the very first use. After the first use it doesn't take nearly as long to add new data since the analysis on previous data is already complete. The tables for analysis are as follows (an approximate schema sketch follows this entry):
    • LOGS
      -host
      -events
      -date
      -epoch
    • INDVSTATS (new stats for every entry)
      -host
      -d_events (since events is cumulative)
      -is_reset (determines if a reset occurred at the time)
      -date
      -epoch
    • DAILYSTATS (stats for a host on a specific date)
      -host
      -low (avg - stddev)
      -avg (average number of events for a given date)
      -high (avg + stddev)
      -stddev
      -date (based on epoch)
    • OVERALLSTATS (stats for a host overall)
      -host
      -low (avg - stddev)
      -avg (average number of events overall)
      -high (avg + stddev)
      -stddev
  • This takes a long time to test since it takes 10 hours to populate the stats tables (reading in all logs simply takes 2-3 minutes from an HDD). None of the run time is being counted in the total hours. I cannot progress on this until the database is in place.
  • The reason for this change is because one of the anomaly detection tests we want to do is to see if other hosts are reporting similar anomalous data at the same epochs. In order to do that quickly we need to have the statistical analysis already done, or it could take up to 45 minutes per anomaly detected, or multiple hours to display one graph.
  • It is also an important change since we want to eventually do live analysis of data as it comes in. In order to report if a data point is truly anomalous, we need that fast access to the statistics. At this point, once testing is over, we will have a working database and the loader script will be complete.
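
For reference, the column lists above translate roughly into the following SQLite schema; the types (and the exact DDL) are my own assumptions, not necessarily what the loader script creates:

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS LOGS         (host TEXT, events INTEGER, date TEXT, epoch INTEGER);
CREATE TABLE IF NOT EXISTS INDVSTATS    (host TEXT, d_events INTEGER, is_reset INTEGER, date TEXT, epoch INTEGER);
CREATE TABLE IF NOT EXISTS DAILYSTATS   (host TEXT, low REAL, avg REAL, high REAL, stddev REAL, date TEXT);
CREATE TABLE IF NOT EXISTS OVERALLSTATS (host TEXT, low REAL, avg REAL, high REAL, stddev REAL);
"""

con = sqlite3.connect("data/HostData.db")
con.executescript(SCHEMA)
con.close()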

2020-03-17

  • Outside of waiting for the Database to either work or fail, I was simply debugging errors as they appeared.

2020-03-18

  • The database is actually finally done. After much toil and time watching it succeed until the very end, we finally have a working database with all of the data we need. My focus today was to get the classes made and the link to the database complete. A lot of the code in these relied on coming up with different ways to find anomalies.
  • As has been the case since the beginning, a severe anomaly for a given point is determined by comparing it to the overall std devs and daily std devs. In addition to these I had to determine the ideas behind two other anomalies.
  • The first is global anomalies (the purpose behind the really slow database build). A global anomaly is determined by checking the number of anomalies reported by every host for a given epoch; if over 70% of those are also anomalous, we determine the point to be globally anomalous.
  • The second is the daily anomaly, or whether a given day's overall is anomalous. It is moderately anomalous if the avg events for the day are outside of the overall std devs. It is severely anomalous if the data is moderately anomalous in addition to over 30% of that day's points being anomalous as well.
  • This has all been implemented, and the classes and links are done (the detection rules are condensed in the sketch after this entry).
  • The biggest challenge today was DST. After some research I was able to determine what time zone and DST offset a given epoch has, which wasn't too bad. But the hardest part is epochs that occur as daylight saving switches, which either means two epochs result in the same time, or a given epoch maps to a time that doesn't exist (I feel this is an oversight in the time zone libraries for Python). But everything at this point works. Tomorrow I need to update the driver.
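
A condensed sketch of those detection rules, with the statistics passed in as plain values (the 70% and 30% thresholds come from the notes above; the function boundaries and the exact combination of bands are my own simplifications):

def is_point_anomalous(d_events, daily_low, daily_high, overall_low, overall_high):
    """Severe point anomaly: the point falls outside both the daily and the overall avg +/- stddev bands."""
    outside_daily = not (daily_low <= d_events <= daily_high)
    outside_overall = not (overall_low <= d_events <= overall_high)
    return outside_daily and outside_overall

def is_global_anomaly(anomalous_flags):
    """Global anomaly: over 70% of the hosts reporting at this epoch are also anomalous."""
    return bool(anomalous_flags) and sum(anomalous_flags) / len(anomalous_flags) > 0.70

def daily_anomaly_level(day_avg, overall_low, overall_high, frac_points_anomalous):
    """Moderate: the day's avg is outside the overall band; severe: over 30% of the day's points are anomalous too."""
    if overall_low <= day_avg <= overall_high:
        return "none"
    return "severe" if frac_points_anomalous > 0.30 else "moderate"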

2020-03-19

  • Today I worked on the driver. The driver now communicates with the DBlinker. I got it working with the new backend and allowed for command line arguments and the ability to see global anomalies on the graph. Overall, everything is working well, and it is more visually pleasing. All I think I have left to do is the following 4 items:
    • Be able to view a month at a time, globally, with each point as the avg of a day
    • Show lists of anomalies and possible causes rather than plots (not sure if this will be a com arg or a toggle-able var in the program)
    • Be able to determine whether a live data point is anomalous
    • Fix/allow time gaps in the plot (this must be done before the others).

2020-03-28

  • I started working on the last couple of items that are on the list of things that needed to be done. We got the gaps figured out for the most part (DST is still causing some issues). And I started working on live anomaly detection.
  • The live anomalies are compared to other data points from the same time of day from the same host, but not compared to the overall events since the delta events is not known for the event. This means there was no clear way to do the analysis compared to the overall data. I also wasn't able to sort the data coming from the database query since live analysis needed to be fast and sorting the result took too much time. 1-2 seconds doesn't seem like a lot of time, but when it's for 1400 hosts at once, it adds up pretty quickly.
  • Regardless, some form of live analysis is complete. If I have time I would like to revisit the idea.
  • I then worked on the list view of the anomalies rather than the graph view. This just allows for a more detailed view of each day. Only severe anomalies are shown. And eventually a list of possible causes should be added (this is the last thing to do).
  • I also got the overall view done and the rest of the command line args completed. Overall, the only thing that is left to do is a final objective in the contract to have a list and log of causes for certain types of anomalies. I don't have a lot of time and I'm not finding a lot on the research of that. I will continue working on this last portion.

2020-04-01

  • Today I finished the anomaly logs and the anomaly reasons. The anomaly reasons don't work nearly as well as I was anticipating since there is no concrete way to show what event happened in the past. There was also very little data in my research that talked about anything network-traffic related that wasn't super high traffic or no traffic, so there is no given reason when those types of data points are hit.
  • This also brings me to the point where there are now multiple severities of anomaly based on how far above the std dev line the data point resides. Those new levels are now reported and recorded while in anomaly list view.
  • I also took the time to finish commenting the code and making it look presentable and readable. The code is complete and a readme has been constructed.
  • The Capstone project is complete to the best of my abilities in the time that I was given.

Final Product

The actual code and software project are located on my GitHub account. tybayn/guardplot


Video Demo


Reflection

The goal was to write software that could go through historical logs, find inconsistencies and anomalies in the data, and report on those. This project was actually quite difficult; it pushed me to many different limits and forced me to go beyond them. Almost the entire project was made of new material for me. I had to learn Python3, the Linux command line, big data analysis, and data science. But it also utilized skills I have learned during my time in school such as advanced algorithms, object-oriented programming, databases/JSON, and patterns-based programming, to name a few.

Overall, I really enjoyed this project. I appreciate how far the project pushed me and my skills and forced me to learn more. I have learned many valuable lessons and skills that I would not have been able to learn before. This project was the perfect display of everything I have learned during my time in school, and I feel that it truly is the Capstone of my degree.


Version 1.0 (1 April, 2020)

Initial Release