Statistical Techniques for Network Security
Latest Publications


TOTAL DOCUMENTS

12
(FIVE YEARS 0)

H-INDEX

0
(FIVE YEARS 0)

Published By IGI Global

9781599047089, 9781599047102

Author(s):  
Yu Wang

Data represent the natural phenomena of our real world. A dataset is constructed from rows and columns: usually the rows represent the observations and the columns represent the variables. Observations, also called subjects, records, or data points, represent phenomena in the real world, and variables, also known as data elements or data fields, represent the characteristics of the observations. Variables take different values for different observations, and the observations themselves are typically assumed to be independent of each other. Figure 4.1 illustrates a section of TCP/IP traffic data, in which the rows are individual network connections and the columns, separated by spaces, are characteristics of the traffic. In this example, the first column is a session index for each connection and the second column is the date when the connection occurred. In this chapter, we will discuss some fundamental features of variables and network data. We will present detailed discussions of variable characteristics and distributions in Sections Random Variables and Variables Distributions, and describe network data modules in Section Network Data Modules. The material covered in this chapter will help readers who do not have a solid background in this area gain an understanding of the basic concepts of variables and data. Additional information can be found in Introduction to the Practice of Statistics by Moore and McCabe (1998).
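To make the row/column structure described above concrete, the following sketch parses space-separated traffic records into observations and variables. It is only an illustration: the sample line and the columns beyond the session index and date are hypothetical assumptions, not the book's actual data format.

```python
# Minimal sketch: read space-separated TCP/IP traffic records, where each
# row is an observation and each column a variable. The sample line and
# the columns after session_id and date are hypothetical.
from dataclasses import dataclass

@dataclass
class TrafficRecord:
    session_id: int       # first column: session index of the connection
    date: str             # second column: date the connection occurred
    fields: list[str]     # remaining columns: other traffic characteristics

def parse_traffic_line(line: str) -> TrafficRecord:
    cols = line.split()   # columns are separated by spaces
    return TrafficRecord(int(cols[0]), cols[1], cols[2:])

record = parse_traffic_line("10483 06/02/2005 tcp 80 1460")
print(record.session_id, record.date, record.fields)
```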


Author(s):  
Yu Wang

In this chapter we will focus on examining computer network traffic and data. A computer network combines a set of computers, physically and logically connecting them so that they can exchange information. Network traffic acquired from a network system provides information on data communications within the network and between networks or individual computers. The most common data types are log data, such as Kerberos logs, transmission control protocol/Internet protocol (TCP/IP) logs, central processing unit (CPU) usage data, event logs, user command data, Internet visit data, operating system audit trail data, intrusion detection and prevention system (IDS/IPS) logs, NetFlow data, and simple network management protocol (SNMP) reporting data. Such information is unique and valuable for network security, specifically for intrusion detection and prevention. Although we have already presented some essential challenges in collecting such data in Chapter I, we will discuss traffic data, as well as other related data, in greater detail in this chapter. Specifically, we will describe system-specific and user-specific data types in Sections System-Specific Data and User-Specific Data, respectively, and provide detailed information on publicly available data in Section Publicly Available Data.


Author(s):  
Yu Wang

Measurement plays a fundamental role in our modern world, and measurement theory uses statistical tools to measure and analyze data. In this chapter, we will examine several statistical techniques for measuring user behavior. We will first discuss the fundamental characteristics of user behavior, and then describe scoring and profiling approaches to measuring it. The fundamental idea of measurement theory is that measurements are not the same as the outcome being measured. Hence, if we want to draw conclusions about the outcome, we must take into account the nature of the correspondence between the outcome and the measurements. Our goal in measuring user behavior is to understand behavior patterns so that we can profile users or groups correctly. Readers who are interested in basic measurement theory should refer to Krantz, Luce, Suppes & Tversky (1971), Suppes, Krantz, Luce & Tversky (1989), Luce, Krantz, Suppes & Tversky (1991), Hand (2004), and Shultz & Whitney (2005). Any measurement can involve two types of errors: systematic errors and random errors. A systematic error remains in the same direction throughout a set of measurement processes, taking consistently all-positive or all-negative values. Generally, a systematic error is difficult to identify and account for. Systematic errors generally originate in one of two ways: (1) errors of calibration and (2) errors of use. An error of calibration occurs, for example, if network data is collected incorrectly. More specifically, if the allowable values for a variable should range from 1 to 1000 but we incorrectly limit the range to a maximum of 100, then all the collected traffic data corresponding to this variable will be affected in the same way, giving rise to a systematic error. An error of use occurs, for example, if the data is collected correctly but somehow transferred incorrectly. If we define a byte as the data type for a variable whose values can exceed 255, we should expect incorrect results for observations with values greater than 255. A random error varies from process to process and is equally likely to be positive or negative. Random errors arise because of either uncontrolled variables or specimen variations. In any case, the idea is to control all variables that can influence the result of the measurement, and to control them closely enough that the resulting random errors are no longer objectionable. Random errors can be addressed with statistical methods, and in most measurements only random errors contribute to estimates of probable error. One of the most common sources of random error in measuring user behavior is variance. A robust profiling measurement has to take into account variances in profiling patterns on (1) the network system side, such as variances in network groups or domains, traffic volume, and operating systems, and (2) the user side, such as job responsibilities, working schedules, department categorization, security privileges, and computer skills. The profiling measurement must be able to separate the variances arising from the system side and the user side, so that overhauling the network infrastructure or changes in employment have less of an impact on the overall profiling system. Recently, the hierarchical generalized linear model has been increasingly used to address such variances; we will further discuss this modern technique later in this chapter.
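To make the error-of-use example concrete, the short sketch below shows how storing values in an unsigned byte silently wraps anything above 255, distorting every affected observation in the same way; the traffic counts are invented for illustration.

```python
# Sketch: a systematic "error of use" caused by an undersized data type.
# An unsigned byte holds 0-255; casting larger values wraps them modulo
# 256, so every out-of-range observation is corrupted consistently.
import numpy as np

true_values = np.array([100, 250, 300, 1000])  # hypothetical traffic counts
stored = true_values.astype(np.uint8)          # stored in a one-byte field

print(stored)  # [100 250  44 232] -- 300 and 1000 wrap around
```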


Author(s):  
Yu Wang

Statistical software packages and their corresponding computing environments are essential factors in achieving efficient and better research. If we think of computing and classification algorithms as the roadmap to our final destination, a statistical package is the vehicle used to reach it. Figure 2.1 shows a basic roadmap of the roles that statistical software packages play in network security. One of the advantages of using a statistical package in network security is that it provides a fairly easy and quick way to explore data, test algorithms, and evaluate models. Unfortunately, not every package is suitable for analyzing network traffic. Given the natural characteristics of network traffic data (i.e., its large size and dynamically changing nature), several fundamental attributes are necessary. First, the package should have good data management capabilities, including the capacity to read large datasets and to output and save resulting files in different formats, the capability to merge and link processed data with other data sources, and the ability to create, modify, and delete variables within the data. Second, it should be able to process large amounts of data efficiently, because statistical analyses in network security are usually based on dynamic online data, which requires analyses to be conducted in a timely manner; this differs from areas such as healthcare, life science, and epidemiology, where statistical analyses are conducted on static offline data. Third, it should support modern modeling procedures and methods, such as Bayesian methods, the hidden Markov model, the hierarchical generalized linear model, etc. Finally, because usability is an important factor, we want the software to be both accessible and user-friendly. These attributes are particularly important during the development phase because they allow us to quickly test hypotheses and examine modeling strategies effectively. Since many commercial and research-oriented software packages may not have all of the aforementioned attributes, we may need to employ multiple packages, for example, one for data management, one for fitting a particular model, and one for displaying results graphically. In the end, we are more likely to use a general-purpose programming language, such as C, C++, or Java, to create a customized application that can later be integrated with the other components of the intrusion detection or prevention system. The results obtained from the statistical software can then be used as a gold-standard benchmark to validate the results from the customized application. In this chapter, we will introduce several popular commercial and research-oriented packages that have been widely used in the statistical analysis, data mining, bioinformatics, and computer science communities. Specifically, we will discuss SAS, Stata, and R in Sections The SAS System, Stata, and R, respectively, and briefly describe S-Plus, WinBUGS, and MATLAB in Section Other Packages. The goal of this chapter is to provide a quick overview of these analytic software packages, with some simple examples to help readers become familiar with the computing environments and statistical computing languages that will be referred to in the examples presented in the remaining chapters. We have included some fundamental materials in the Reference section for further reading for those readers who would like more detailed information on using these software packages.
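As a rough illustration of the explore-test-evaluate loop such packages enable, the sketch below reads a traffic dataset, summarizes it, and fits a quick classifier. The file name and column names are hypothetical, and pandas/scikit-learn stand in here for whichever statistical environment is at hand; this is not the book's own example.

```python
# Sketch: the rapid explore/test/evaluate workflow a statistical package
# supports. "traffic_sample.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("traffic_sample.csv")   # data management: read the data
print(df.describe())                     # quick exploratory summary

X = df[["bytes_sent", "duration"]]       # candidate predictor variables
y = df["is_anomalous"]                   # labeled response variable
model = LogisticRegression().fit(X, y)   # test a candidate algorithm
print(model.score(X, y))                 # rough in-sample evaluation
```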


Author(s):  
Yu Wang

Increasing the accuracy of classification has been a constant challenge in the network security area. With the volume of network traffic expanding rapidly and network bandwidth steadily increasing, many classification algorithms used for intrusion detection and prevention face high false positive and false negative rates. A stream of network traffic data with many positive predictors might not necessarily represent a true attack, and a seemingly anomaly-free stream could represent a novel attack. Depending on the infrastructure of a network system, traffic data can become very large. As a result of such large volumes of data, even a very low misclassification rate can yield a large number of alarms; for example, a system with 22 million hourly connections and a 1% misclassification rate would raise roughly 61 alarms per second (excluding repeated connections). Manually validating every such case is not practical. To address this challenge we can improve the data collection process and develop more robust algorithms. Unlike other research areas, such as the life sciences, healthcare, or economics, where an analysis can often be achieved with a single statistical approach, a robust intrusion detection scheme needs to be constructed hierarchically with multiple algorithms, for example, by profiling and classifying user behavior hierarchically using hybrid algorithms (e.g., combining statistics and AI). On the other hand, we can improve the precision of classification by carefully evaluating the results. Several key elements are important for statistical evaluation in classification and prediction: reliability, sensitivity, specificity, misclassification, and goodness-of-fit. We also need to evaluate the goodness of the data (consistency and repeatability), the goodness of the classification, and the goodness of the model. We will discuss these topics in this chapter.
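The evaluation quantities named above follow directly from a confusion matrix; the sketch below, using invented counts, shows the standard definitions of sensitivity, specificity, and the misclassification rate.

```python
# Sketch: core evaluation measures from a confusion matrix.
# The counts are invented purely for illustration.
TP, FN = 90, 10        # true attacks: detected vs. missed
FP, TN = 200, 99_700   # normal traffic: false alarms vs. correctly passed

sensitivity = TP / (TP + FN)                        # true positive rate
specificity = TN / (TN + FP)                        # true negative rate
misclassification = (FP + FN) / (TP + FN + FP + TN)

print(f"sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.4f}, "
      f"misclassification={misclassification:.5f}")
```

Note how, with traffic volumes in the millions, even the small misclassification rate above would still translate into a large absolute number of alarms.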


Author(s):  
Yu Wang

Exploratory data analysis discovers data structures and patterns by treating all variables as a whole, but it does not particularly focus on seeking associations between response variables and predictor variables. In this chapter, we will discuss how to identify and measure this response-predictor relationship, which is an essential element in intrusion detection and prevention. Even though models for association and prediction can take a broad range of forms, in general the goals of modeling for association and prediction in network security are two-fold: (1) to identify variables that are significantly associated with the response variable, and (2) to assess the robustness of these variables, if any, in predicting the response. Although the term model is perhaps confusing to many people, a model is just a simplified representation of some aspect of the real world, whether an object, an observation, a situation, or a process. Models are of particular importance in network security because of the size of the data and the complex relationships among the variables and the desired outcomes. The statistical modeling procedures available for analyzing the response-predictor phenomenon mainly include bivariate analysis and multiple regression-based analysis. Bivariate analysis focuses on the relationship between two variables (e.g., a response and a predictor) without taking into account any impact from other predictor variables on the response variable. The multiple regression modeling approach, on the other hand, establishes a regression relationship between a response variable and a set of potential predictor variables, and assesses the predictive power of each predictor as adjusted for the others. Therefore, a variable that is significantly associated with the response in a bivariate analysis may no longer show such an association in a regression analysis after adjusting for the other variables. In the following sections, we will review and discuss these two main approaches in detail. Readers who would like to attain more general knowledge of modeling associations should refer to Mandel (1964), Press & Wilson (1978), Cohen & Cohen (1983), Berry & Feldman (1985), Cox & Snell (1989), McCullagh & Nelder (1989), Agresti (1996), Ryan (1997), Long (1997), Burnham & Anderson (1998), Pampel (2000), Tabachnick & Fidell (2001), Agresti (2002), Myers, Montgomery & Vining (2002), Menard (2002), and O'Connell (2006). Comprehensive reviews of data mining and statistical learning can be found in Vapnik (1998, 1999), Hastie, Tibshirani & Friedman (2001), and Bozdogan (2003).
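To illustrate how a bivariate association can weaken after adjustment, the sketch below compares a single-predictor logistic fit with a multiple-predictor fit. The data are synthetic and the variable names (bytes_sent, port_scan_rate) are hypothetical; statsmodels is used only as a convenient fitting tool.

```python
# Sketch: a predictor that looks strongly associated with the response in
# a bivariate model can lose that association once the true driver is
# adjusted for. Synthetic data; hypothetical variable names.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
port_scan_rate = rng.normal(size=n)               # true driver of attacks
bytes_sent = port_scan_rate + rng.normal(size=n)  # correlated bystander
y = (port_scan_rate + 0.5 * rng.normal(size=n) > 1).astype(int)

# Bivariate: bytes_sent alone appears clearly associated with y.
m1 = sm.Logit(y, sm.add_constant(bytes_sent)).fit(disp=0)

# Adjusted: with port_scan_rate in the model, bytes_sent's effect shrinks.
X = sm.add_constant(np.column_stack([bytes_sent, port_scan_rate]))
m2 = sm.Logit(y, X).fit(disp=0)

print(m1.params[1], m2.params[1])  # coefficient on bytes_sent, before/after
```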


Author(s):  
Yu Wang

This chapter discusses several data reduction techniques that are important in intrusion detection and prevention. Network traffic data includes rich information about system and user behavior, but the raw data itself can be difficult to analyze because of its large size. Being able to efficiently reduce data size is one of the key challenges in network security and has been raised by many researchers over the past decade (Lam, Hui & Chung, 1996; Mukkamala, Tadiparthi, Tummala, & Janoski, 2003; Chebrolu, Abraham & Thomas, 2005; Khan, Awad & Thuraisingham, 2007). Recall the concept of the data cube presented in Chapter IV; using various approaches, it is possible to reduce the size of data along all three cube dimensions (variables, observations, and occasions). More specifically, we can reduce the total number of observations by sampling network traffic, reduce the total number of variables by eliminating variables that are not robust and are not associated with the outcome of interest, and reduce the number of occasions by taking a sample of the time-related events. We will discuss these approaches in the following sections, including data structure detection, sampling, and sample size determination. In addition to statistical approaches to data reduction, we need to carefully select a data type for each variable to ensure that the final size of a given dataset does not grow because of inappropriate data types; for example, a binary variable can be stored in a single byte. Data reduction relies heavily on multivariate analysis, on which a large body of literature is available. Readers who are interested in gaining a better understanding of detailed and advanced multivariate analysis can refer to Thomson (1951), Bartholomew (1987), Snook & Gorsuch (1989), Everitt & Dunn (1991), Kachigan (1991), Loehlin (1992), Hatcher & Stepanski (1994), Rencher (1995), Tabachnick (2000), and Everitt (2005).
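As a small sketch of reducing two of the cube dimensions, the code below draws a simple random sample of traffic observations and then compresses the variables with principal component analysis; the dataset shape and the latent structure are synthetic assumptions made for illustration.

```python
# Sketch: reduce observations by simple random sampling, then reduce
# variables with PCA. The data are synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_obs, n_vars = 100_000, 20
latent = rng.normal(size=(n_obs, 3))            # 3 hidden drivers
noise = 0.1 * rng.normal(size=(n_obs, n_vars))
traffic = latent @ rng.normal(size=(3, n_vars)) + noise

# Reduce observations: keep a 1% simple random sample.
idx = rng.choice(n_obs, size=n_obs // 100, replace=False)
sample = traffic[idx]

# Reduce variables: keep enough components to explain ~95% of variance.
reduced = PCA(n_components=0.95).fit_transform(sample)
print(sample.shape, "->", reduced.shape)        # (1000, 20) -> (1000, 3)
```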


Author(s):  
Yu Wang

In this chapter, we will provide a brief overview of network security, introduce essential concepts of intrusion detection and prevention, and review their basic principles and guidelines. Then, we will discuss statistical approaches in practice as well as statistical opportunities, roles, and challenges in network security. Network security has become a very popular topic; a simple Google search on the keyword "network security" returned 2.2 million items on February 29, 2008. Network security aims to protect the entire infrastructure of a computer network and its corresponding services from unauthorized access. The two key elements of network security are risk assessment and risk management. There are several fundamental components in network security: (1) security-specific infrastructures, such as hardware- and software-based firewalls and physical security approaches; (2) security policies, which include security protocols, user authentication, authorization, access control, and information integrity and confidentiality; (3) detection of malicious programs, including viruses, worms, Trojan horses, and spyware or malware; and (4) intrusion detection and prevention, which encompasses network traffic surveillance and the analysis and profiling of user behavior. Since the topic of network security links a great number of research areas and disciplines, we will focus on the component of intrusion detection and prevention in this book. Readers who are interested in other components or want to gain more detailed information on the entire topic may refer to Smedinghoff (1996), Curtin (1997), Garfinkel and Spafford (1997), McClure, Scambray, and Kurtz (1999), Strebe and Perkins (2000), Bishop (2003), Maiwald (2003), Stallings (2003), Lazarevic, Ertoz, Kumar, Ozgur, & Srivastava (2003), Bragg, Rhodes-Ousley, & Strassberg (2004), McNab (2007), and Dasarathy (2008). For wireless network security, Vacca (2006) provides an essential step-by-step guide that explains the wireless-specific security challenges and tasks, and for mobile phone related intrusion detection refer to Isohara, Takemori & Sasase (2008). Finally, for an overall introduction to network security, including key tools and technologies used to secure network access, refer to Network Security Principles and Practices by Malik (2003) and Network Security Fundamentals by Laet & Schauwers (2005).


Author(s):  
Yu Wang

The requirement of supervised learning techniques for a labeled response variable in the training data may not be satisfiable in some situations, particularly in dynamic, short-term, and ad-hoc wireless network access environments. Being able to conduct classification without a labeled response variable is an essential challenge in modern network security and intrusion detection. In this chapter we will discuss some unsupervised learning techniques, including probability, similarity, and multidimensional models, that can be applied in network security. These methods also provide a different angle from which to analyze network traffic data. For comprehensive knowledge of unsupervised learning techniques, please refer to the machine learning references listed in the previous chapter; for their applications in network security see Carmines, Edward & McIver (1981), Lane & Brodley (1997), Herrero, Corchado, Gastaldo, Leoncini, Picasso & Zunino (2007), and Dhanalakshmi & Babu (2008). Unlike in supervised learning, where for each vector X = (x_1, x_2, ..., x_n) we have a corresponding observed response Y, in unsupervised learning we only have X; Y is not available, either because we could not observe it or because its frequency is too low to be fitted with a supervised learning approach. Unsupervised learning is of great practical importance because, in many circumstances, the available network traffic data may not include any anomalous events, or any known anomalous events (e.g., traffic collected from a newly constructed network system). As high-speed mobile wireless and ad-hoc network systems have become popular, the importance of, and need for, new unsupervised learning methods that allow network traffic to be modeled from anomaly-free training data have significantly increased.
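As a toy illustration of unsupervised analysis with only X and no labeled Y, the sketch below clusters unlabeled traffic feature vectors with k-means and flags members of unusually small clusters as candidate anomalies. The features, the cluster count, and the size threshold are all hypothetical choices, not a method prescribed by the book.

```python
# Sketch: unsupervised anomaly flagging on unlabeled vectors X (no Y).
# Rare, small clusters are treated as candidate anomalies. The feature
# values, k, and the 2% size threshold are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (990, 4)),   # bulk of normal traffic
               rng.normal(6, 1, (10, 4))])   # a few unusual vectors

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_, minlength=5)
small_clusters = np.flatnonzero(sizes < 0.02 * len(X))

flagged = np.isin(km.labels_, small_clusters)
print("flagged observations:", np.flatnonzero(flagged))
```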


Author(s):  
Yu Wang

Decision analysis, a derivative of game theory, was introduced by von Neumann in the early 1920s and was adopted in economics in the late 1940s (Von Neumann and Morgenstern, 1947). It is a systematic, quantitative approach for assessing the relative value of one or more decision options based on existing and new information and knowledge. Figure 11.1 shows a general decision-making process graphically. Network security involves both offline and online decision-making processes. The offline decision-making process involves fundamental security issues, such as determining classification thresholds, selecting sampling methods and sample sizes for collecting network traffic, and deciding on baseline patterns for profiling. Offline decisions usually require more statistical analysis and take more time, aiming not merely for a reasonable, good, or better solution, but for the "best" one. The online decision-making process, however, usually requires a quick response, which can make it more difficult to achieve a good solution. For instance, when an alarm emerges, an immediate decision is needed as to whether the alarm indicates a real attack or is a false alarm. In such a circumstance, we do not have time to conduct a complex analysis; we have to act on that alarm instantaneously. Many online decisions, if analyzed fully, would involve a sequence of interrelated decisions that we cannot work through quickly. As a result, online decision-making is more likely to aim for a reasonable, good, or better solution rather than the best solution. In particular, given the uncertainty in decision-making processes, we may never be able to reach the best solution for either offline or online decision-making in many network security circumstances. Decision-making is also associated with network management, which is about knowledge: if we know what our network and servers are doing, making decisions becomes easier. The primary challenge in the decision-making process is uncertainty. To address it, we need to assess risks; risk assessment, which utilizes the theory of probability, is a fundamental element of decision analysis (Figure 11.2). There is no doubt that risk and uncertainty are important concepts to address in supporting decision-making in many situations. Our goals for decision analysis are to define what may happen in the future and to choose the "best" (or at least a good or better) solution from among the alternatives. Under the primary challenge of uncertainty, decision analysis has several tasks, including how to describe and assess risks, how to measure uncertainties, how to model them, and how to communicate them. None of these tasks is easy to accomplish, because the tasks themselves cannot be clearly defined. For example, even though we have a general idea of what risk means, if we were asked to measure it, we would find little consensus on its definition. Nevertheless, decision analysis provides a tool for finding a solution in confusing and uncertain territory. It gives us a technique for finding a robust and better solution from among many alternatives. In this chapter, we will introduce some decision analysis methods, including analyzing uncertainty, statistical control charts, and statistical ranking methods, but we will not discuss the decision tree, a classical decision analysis technique.
Readers who are interested in essential decision analysis material (e.g., decision trees) should refer to Raiffa (1968), Hattis & Burmaster (1994), Zheng & Frey (2004), Gelman, Carlin, Stern & Rubin (2004), Aven (2005), and Lindley (2006).
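As a brief sketch of the control-chart idea mentioned above, the code below computes Poisson-based three-sigma limits (a c-chart) for hourly alarm counts and flags hours falling outside them; the counts and the choice of a c-chart are illustrative assumptions, not the book's worked example.

```python
# Sketch: a c-chart (Shewhart control chart for counts) on hourly alarm
# totals. The counts are invented; the limits follow the standard
# c_bar +/- 3*sqrt(c_bar) rule for Poisson-like counts.
import numpy as np

alarms = np.array([12, 9, 11, 10, 13, 8, 12, 11, 30, 10])  # hourly counts
c_bar = alarms.mean()                        # center line
ucl = c_bar + 3 * np.sqrt(c_bar)             # upper control limit
lcl = max(c_bar - 3 * np.sqrt(c_bar), 0.0)   # lower control limit

for hour, count in enumerate(alarms):
    if not lcl <= count <= ucl:
        print(f"hour {hour}: {count} alarms outside [{lcl:.1f}, {ucl:.1f}]")
```

In an online setting, a point outside the limits would prompt the quick decision discussed above: treat the hour as a potential attack and investigate, rather than attempt a full statistical analysis on the spot.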

