Exploratory Data Analysis (EDA) Part I – Engineering Data Analysis

A. What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is a crucial step in any data processing pipeline. It is an approach to data analysis that emphasizes visual exploration of data to generate hypotheses and derive insights. EDA involves a lot of techniques that can help in understanding the nature and distribution of variables, spotting outliers and anomalies, and establishing relationships between variables. In essence, EDA is about getting to know your data, understanding its quirks, and potentially unearthing a valuable nugget of information that provides a fruitful direction for further analysis or modeling.

B. Steps involved in EDA

The steps involved in EDA vary depending on the nature of the dataset and the questions you’re trying to answer. However, the following steps provide a general structure that you can follow:

1. Formulating the Questions

Before delving into data analysis, it’s essential to define the questions you are looking to answer. Are you trying to identify trends? Do you want to understand the relationships between certain variables? The questions you ask will guide your EDA process.

2. Data Wrangling: Loading, cleaning, and transforming data.

The next step is data wrangling, also known as data preprocessing. This involves loading your dataset, dealing with missing values, outliers, and other anomalies, and possibly transforming data into a form more suitable for analysis. This step is vital because clean and well-structured data makes the analysis process smoother and more reliable.

3. Descriptive Statistics: Analyzing data summary which includes mean, median, mode, standard deviation, and IQR.

This step involves generating summary statistics for your data, including measures of central tendency like the mean, median, and mode, and measures of dispersion like the standard deviation and interquartile range (IQR). These statistics provide a concise summary of your data and can often reveal surprising insights.

4. Visualization: Using plots like histograms, boxplots, and scatter plots to understand the data better.

Visualization is a powerful tool for understanding the distribution and relationships within your data. Histograms can help you understand the distribution of a single variable, boxplots can reveal outliers and the spread of the data, and scatter plots can show the relationship between two variables. When done correctly, data visualization can tell a compelling story about the data.

5. Statistical Approach: Forming a hypothesis and validate it with suitable statistical tests.

Finally, depending on the nature of your data and the questions you’re trying to answer, you may employ various statistical tests to validate the hypotheses formed during your exploratory analysis. This could involve t-tests, chi-square tests, ANOVA, or other statistical tests depending on the nature of your data.

By following these steps, you can conduct a comprehensive EDA that will help you understand your data and guide you in the subsequent analysis or predictive modeling. Remember, every dataset is unique, and EDA is more of an art than a science. The key is to remain flexible and adaptable, and to let the data guide your exploration.


Step 1: Formulating the Questions Use the dataset below:

  1. Which hardware specifications significantly influence the performance of the computer?
  2. Is there a correlation between CPU speed and performance?
  3. How does the type of hard disk (HDD vs. SSD) affect performance?

Step 2: Data Wrangling

  • Open your Excel dataset.
  • Even though this is a clean, hypothetical dataset, in a real-world scenario, you would need to check for missing or inconsistent data entries. You can replace missing values with a suitable placeholder or estimate, depending on the data.

Step 3: Descriptive Statistics

  • Use Excel’s built-in functions like AVERAGE, MEDIAN, MODE, STDEV.P, and PERCENTILE to get a statistical summary of your data. For example, calculate the average performance score, median CPU speed, etc.
  • This step will provide you with an overview of your data distribution.

Step 4: Visualization

  • Use Excel’s charting tools to create visualizations.
  • For example, to explore the relationship between CPU speed and performance, you could create a scatter plot with CPU Speed on the x-axis and Performance Score on the y-axis.
  • To compare the performance of computers with HDD vs. SSD, a boxplot or a bar chart could be appropriate.

(

Creating a Scatter Plot in Excel

To create a scatter plot with CPU Speed on the x-axis and Performance Score on the y-axis, follow these steps:

  1. Highlight the data you want to plot. In this case, it’s the ‘CPU Speed’ and ‘Performance Score’ columns.
  2. Go to the ‘Insert’ tab on the Excel ribbon.
  3. In the Charts group, click on the ‘Insert Scatter (X, Y) or Bubble Chart’ button.
  4. Choose ‘Scatter’ from the dropdown list.
  5. Excel will insert the scatter plot into your worksheet.

The scatter plot will allow you to visually inspect if there’s a relationship between CPU Speed and Performance Score. If you see a trend upwards, it suggests a positive correlation.

Creating a Boxplot or Bar Chart in Excel

To create a boxplot or bar chart to compare the performance of computers with HDD vs. SSD, follow these steps:

For a bar chart:

  1. Arrange your data in two columns. The first column would contain the type of Hard Disk (HDD, SSD), and the second column would contain the corresponding Performance Score.
  2. Highlight your data.
  3. Go to the ‘Insert’ tab on the Excel ribbon.
  4. In the Charts group, click on the ‘Insert Column or Bar Chart’ button.
  5. Choose ‘Clustered Bar’ from the dropdown list.
  6. Excel will insert the bar chart into your worksheet.

For a boxplot:

  1. Arrange your data as described above.
  2. Highlight your data.
  3. Go to the ‘Insert’ tab on the Excel ribbon.
  4. In the Charts group, click on the ‘Insert Statistic Chart’ button.
  5. Choose ‘Box and Whisker’ from the dropdown list.
  6. Excel will insert the boxplot into your worksheet.

Both the bar chart and boxplot will allow you to compare the performance scores of systems with HDD and SSD hard disks.

)

Step 5: Statistical Approach

  • Formulate a hypothesis based on your initial findings. For example, “Systems with SSDs have higher Performance Scores than those with HDDs.”
  • You can use Excel’s data analysis toolpack to perform a t-test (if your data meets the assumptions for this test) to compare the means of performance scores between the SSD and HDD systems.
  • The result of the test would help you validate or reject your hypothesis.

(

Formulating a Hypothesis

After you’ve made some initial observations from the data, you might hypothesize that “Systems with SSDs have higher Performance Scores than those with HDDs.” This statement will be your null hypothesis (H0) – which is essentially a statement of no effect or no difference. The alternative hypothesis (H1) – the one you’re trying to prove – would then be “Systems with SSDs do not have higher Performance Scores than those with HDDs.”

Performing a t-test using Excel’s Data Analysis ToolPak

The Data Analysis ToolPak in Excel allows you to perform a t-test to compare the means of performance scores between the SSD and HDD systems. Here’s how:

  1. Go to the ‘File’ tab and click on ‘Options.’
  2. In the Excel Options dialog, click on ‘Add-Ins’ on the left sidebar.
  3. At the bottom of the window, where it says ‘Manage,’ make sure ‘Excel Add-ins’ is selected and click ‘Go.’
  4. In the Add-Ins dialog, check the box for ‘Analysis ToolPak’ and click ‘OK.’ You should now see ‘Data Analysis’ in your ‘Data’ tab.

Once the Data Analysis ToolPak is loaded, you can conduct a t-test:

  1. Go to the ‘Data’ tab and select ‘Data Analysis.’
  2. In the Data Analysis dialog, select ‘t-Test: Two-Sample Assuming Equal Variances’ and click ‘OK.’
  3. In the t-Test dialog, for Variable 1 Range, select your data for SSD performance scores. For Variable 2 Range, select your data for HDD performance scores.
  4. Input a Hypothesized Mean Difference of 0 (since your null hypothesis is that there’s no difference).
  5. Choose an output range, then click ‘OK.’

Excel will output the results of the t-test. Look at the ‘t Stat’ and the ‘P(T<=t) one-tail’ values. If the P-value is less than 0.05, you would reject your null hypothesis, suggesting there is a significant difference in performance scores between SSD and HDD systems, supporting your initial hypothesis.

)

Following these steps will provide a comprehensive understanding of how different hardware specifications impact computer performance. It will help to make informed decisions about hardware selection or further research.

Dataset

ID,Name,CPU Speed(GHz),RAM(GB),Hard Disk,GPU(MHz),Performance Score
1,Computech Alpha,3.1,8,SSD,1250,7.8
2,Computech Beta,2.5,16,HDD,1350,7.4
3,DeltaMachine Z,3.5,8,SSD,1400,8.2
4,OmegaComp X1,2.8,16,SSD,1150,8
5,VegaSystem T4,3.2,16,HDD,1300,7.7
6,Computech Gamma,3,8,SSD,1200,7.9
7,DeltaMachine Y,2.6,16,HDD,1300,7.5
8,OmegaComp X2,3.6,8,SSD,1450,8.3
9,VegaSystem T5,2.9,16,SSD,1200,8.1
10,Computech Delta,3.3,16,HDD,1350,7.8
11,OmegaComp X3,2.7,16,SSD,1100,8.2
12,VegaSystem T6,3,8,SSD,1300,7.7
13,Computech Epsilon,2.8,16,HDD,1200,7.6
14,DeltaMachine X,3.7,8,SSD,1350,8.4
15,OmegaComp X4,2.9,16,SSD,1250,8
16,VegaSystem T7,3.4,16,HDD,1400,7.9
17,Computech Zeta,2.6,8,SSD,1150,7.7
18,DeltaMachine W,3,16,HDD,1400,7.8
19,OmegaComp X5,3.8,8,SSD,1400,8.5
20,VegaSystem T8,2.9,16,SSD,1200,8.2

Related Posts

Leave a Reply

Your email address will not be published. Required fields are marked *