What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is the first step in the data analysis process.
Researchers and data analysts use EDA to understand and summarize the contents of a dataset, typically with a specific question in mind, or to prepare for more advanced statistical modeling in future stages of data analysis.
EDA relies on data visualizations that enable researchers to identify and define patterns and characteristics in the dataset that they otherwise would not have known to look for.
EDA was originally developed by John Tukey, an American mathematician, in the 1970s. It’s often thought of as more of a philosophical approach to data analysis than a statistical method.
While performing exploratory data analysis, researchers make sense of the data they have so that they can decide what questions to ask, how to frame those questions, and how to approach survey respondents to uncover any insights that might be missing.
EDA entails examining patterns, trends, outliers, and unexpected results in existing survey data, and using visual and quantitative methods to highlight the story that the data tells.
Researchers who conduct exploratory data analysis are able to:
- Identify mistakes that have been made during data collection, and areas where data might be missing.
- Map out the underlying structure of the data.
- Identify the most influential variables in the dataset.
- List and highlight anomalies and outliers.
- Test previously proposed hypotheses.
- Establish a parsimonious model.
- Estimate parameters, determine confidence intervals, and define margins of error.
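One item from the list above, highlighting outliers, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the common 1.5 × IQR convention known as Tukey's fences; the function name `tukey_outliers` and the sample `ages` values are hypothetical and not taken from any real dataset.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical survey responses: respondent ages with one suspicious entry.
ages = [34, 29, 41, 38, 35, 120, 31, 36]
outliers = tukey_outliers(ages)
print(outliers)
```

A flagged value is a prompt for a closer look, not proof of an error: it may be a data entry mistake or a genuine but unusual respondent.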
The Purpose of Exploratory Data Analysis
The primary purpose of EDA is to examine a dataset without making any assumptions about what it might contain.
By leaving assumptions at the door, researchers and data analysts can recognize patterns and potential causes for observed behaviors.
This ultimately helps to answer a particular question of interest or to inform decisions about which statistical model would be best to use in later stages of data analysis.
Exploratory data analysis is used to validate technical and business assumptions, and to identify patterns.
The assumptions that analysts tend to make about raw datasets can be placed in one of two categories: technical assumptions and business assumptions.
To be confident that the most appropriate analytical models and algorithms are used in data analysis, and that the resulting findings are accurate, specific technical assumptions about the data must hold.
For example, the technical assumption that no data is missing from the dataset, or that no data is corrupted in any way, must be correct so that the insights derived from statistical analysis later on hold true.
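A check of that technical assumption might look like the sketch below. It is written in plain Python under stated assumptions: records are dictionaries of strings, an empty or absent value counts as missing, and a value in a numeric field that cannot be parsed as a number counts as corrupted. The helper `audit_fields` and the sample `rows` are hypothetical.

```python
def audit_fields(rows, numeric_fields):
    """Per field, count missing entries and entries that fail to parse as numbers."""
    report = {}
    # Collect every field name that appears in at least one record.
    fields = {name for row in rows for name in row}
    for name in sorted(fields):
        missing = 0
        corrupted = 0
        for row in rows:
            value = row.get(name)  # None when the field is absent from the record
            if value is None or str(value).strip() == "":
                missing += 1
            elif name in numeric_fields:
                try:
                    float(value)
                except (TypeError, ValueError):
                    corrupted += 1
        report[name] = {"missing": missing, "corrupted": corrupted}
    return report

# Hypothetical survey records with a blank, an unparseable, and an absent value.
rows = [
    {"respondent": "a1", "age": "34", "income": "52000"},
    {"respondent": "a2", "age": "", "income": "61000"},
    {"respondent": "a3", "age": "forty", "income": None},
    {"respondent": "a4", "age": "29"},
]
report = audit_fields(rows, numeric_fields={"age", "income"})
for field, counts in report.items():
    print(field, counts)
```

If any count is nonzero, the assumption of complete, uncorrupted data does not hold, and the affected fields need attention before modeling.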
The second category of assumptions is business assumptions. Business assumptions often go unrecognized, and can influence the problem at hand and how it’s framed without the researcher being consciously aware of it.
A classic business assumption example is one in which researchers expect the users of a product to be significantly experienced in a particular field, but in reality the average user of that product is more at the novice or beginner level.
When an assumption like that turns out to be wrong, researchers and the businesses that employ them need to stay flexible and consider a whole new set of questions to inform product development.
To validate technical and business assumptions, data scientists must systematically drill into the contents of each data field and examine its interactions with other variables.
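One simple way to examine how fields interact is to compute a correlation coefficient for every pair of numeric fields. The sketch below uses the Pearson coefficient in plain Python; the choice of Pearson (which only captures linear relationships), the helper name `pearson`, and the sample `fields` data are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / (sd_x * sd_y)

# Hypothetical product-usage data, one list per field.
fields = {
    "experience_years": [1, 3, 5, 7, 9],
    "tasks_completed": [12, 15, 21, 24, 30],
    "support_tickets": [9, 7, 6, 3, 1],
}

# Correlation for every pair of fields.
pairs = {
    (a, b): pearson(fields[a], fields[b])
    for a, b in combinations(fields, 2)
}
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

A coefficient near 1 or -1 suggests a strong linear relationship worth plotting; a value near 0 does not rule out a nonlinear one, which is one reason to look at the visualizations as well.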
By creating data visualizations and strategically examining them side by side, researchers are able to leverage the human mind’s natural skill at pattern recognition.
Pattern recognition allows these analysts to identify potential causes of a particular behavior, highlight problematic data points, and form hypotheses to test. Those hypotheses in turn inform the choice of statistical model for later analysis of the data.
Using R to Conduct Exploratory Data Analysis
R is a programming language and environment for statistical computing that can be used to conduct exploratory data analysis. It’s versatile, powerful, and best of all it’s open-source, meaning that it’s free to use!
During exploratory data analysis and the modeling that follows it, R enables researchers to perform various statistical functions, including but not limited to:
- Cluster analysis
- Univariate visualization of and summary statistics for each field in the original dataset
- Bivariate visualization and summary statistics that enable researchers to examine and assess the relationship between each of the variables in the dataset and a specific variable of interest
- Multivariate visualizations that enable researchers to uncover insight into the interactions between different fields in the data
- K-means clustering
- Predictive models, such as linear regression
Carrying out these statistical functions allows researchers and data analysts to validate previously established assumptions and highlight patterns that will then help them to better understand the problem at hand and select a predictive model accordingly.
By doing so, researchers can ensure high-quality data analysis and confirm that the data has been collected and organized in the way that was expected.
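To make the last item in the list above concrete, here is a minimal sketch of a simple linear regression fitted by ordinary least squares. It is written in plain Python purely as an illustration; in R, the built-in `lm()` function fits the same kind of model. The helper name `fit_line` and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: one predictor and one response.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

Whether a straight line is a sensible model at all is exactly the kind of question the earlier univariate and bivariate exploration is meant to answer.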
What has your exploratory data analysis process looked like in the past? Feel free to share your story with us! Drop us a line in the comments below.