EUP – Special Course Data Analysis Using Excel™ 12/25/2017
Stanley T. Schuyler, D.Sc. Advanced Excel™ Course Notes
Data Analysis Using Excel™
Notes for Data Set Cleaning
A typical “raw” data sheet one might receive from a marketing firm or a survey might look like this (a small excerpt, there are about 1900 row entries):
The key characteristics of such data sets are that:
-
There are blank entries that indicate missing values (often easy to overlook in large data sets)
-
There are erroneous entries, or entries with undocumented values (harder to “see” than blank cells).
-
There are entries with repeated data values (see Location, Age, Gender, etc.)
It is the responsibility of the analyst to quickly identify anomalous entries (1 and 2) and get familiar with the entire data set qualitatively before performing any quantitative analysis (to avoid inclusion of, or misinterpretation of, anomalous entries ).
1.1Using Conditional Formatting to highlight Missing Information 1.1.1Flag all empty cells (make them jump out)
Objective: Set a special format (say a Red Filled Area) for any cell in the data area that is blank (or empty). Method: Use Conditional Formatting
-
Select the Entire Data Area
-
Home Ribbon: Pull Down Conditional Formatting
-
Select New Rule
-
Format Only Cells that contain
-
Cell Value Pull Down Select “Blanks”
-
Format Button … set format desired
-
Select Apply, then OK
-
Another Technique is to add a column labeled “Incomplete”
-
In the first cell on the first row in that column we can use the functions IF(), OR() and ISBLANK() to detect any cell that is blank; if one or more is found, we have the cell display “Missing”; otherwise nothing (two double quotes in a row: “”)
-
=IF(OR(ISBLANK(A2),ISBLANK(B2), …, ISBLANK(P2)),”Missing”,””)
-
Then we add a modify the conditional formatting rules to also light a cell up red if the word “Missing” is displayed in it.
1.1.2Translate all qualitative codes to Meaningful Text
Objective: Use the survey variable definitions tables or data set code tables to review all coded (non-quantitative entries coded numerically). The “Variable Data definitions” for this market survey are attached at the end of these notes.
Methods: Use Excel IF(), HLOOKUP() and VLOOKUP() functions.
Tactics: compose a general function in the first row cell of a variable category that can be “filled” (or copied) down the rows to reinterpret in English phrases the coded data entries.
-
If the “Data Definitions” are not in the Excel raw data file, you must copy them into a blank sheet. For this class they have already been transcribed into the sheet named “DataDefintions” (from the attached definitions at the bottom of this document).
The process involved selecting the data definition table in the Word™ document, like this one, and doing a “Paste Special RTF” into a blank Excel worksheet (for some data documentation you need to “Paste Special unformatted text”). Then you fix the organization and alignment of the data as exemplified in the “DataDefinitions” sheet provided. The important aspect of the worksheet is that, for each variable or category:
-
The “code” values are horizontally aligned (for a horizontal table), or
-
vertically aligned (for a vertical table), and
-
In ascending order of numeric value. If the code is alphabetic, then in alphabetical order.
-
Make a copy of the sheet containing the raw data. This is done for two reasons: first to make sure the raw data are not inadvertently corrupted; and second, to make referencing the original entries easy.
-
put the mouse on the raw data sheet (tab) name, and right click
-
select “move or copy”
-
check the “create a copy” box
-
Select “(move to end”) in the “Before Sheet” selection window.
-
Select the copied sheet: e.g. “RawData (2)”
-
Select a cell in the first variable category of interest, say “Location”
-
Enter a formula that converts the entry code to an interpretive text string using the DataDefinitions sheet.
-
We will use the following functions in class to perform these translations:
-
IF statements
-
ISBLANK() and ISNA() logical functions
-
HLOOKUP()
-
VLOOKUP()
-
Space is left here for any personal notes, but an experienced Excel user is expected to be able to use these functions without detailed step by step instructions. If you need such steps, some note space is provided below.
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
-
Once the formula is entered and you see the correct translation, test that the formula indeed will “fill” down using the fill box (do five or ten rows, try to include a blank row if one is in view; there is no need to fill down the entire data set yet – that happens later after each variable category has been addressed in the first row).
-
Go back to step 4 and do the same formula composition operation for the next variable category of interest, until no other coded categories need to be translated.
-
Change the Numeric Variable categories in “RawData (2)” to reference the original raw data entries
-
In the first row, for each numeric variable not translated
-
Note the value in the cell (mental note)
-
Select to cell in row one, type (enter) an equal sign (“=”_ and then reference the corresponding cell in the “RawData” sheet.
-
Check that the displayed value matches your mental note
-
Check the numeric format makes sense:
-
If really a financial quantity, should the format be currency? Should cents be shown or is that too much detail?
-
If the entry is a fraction, how many decimals make sense?
-
If a percentage, what format makes sense?
-
Set the formats for these categories appropriately
-
Once all entries in the first row reference the raw data (either by functional translation or simple referencing), then go to the next step.
-
Mass “Fill Down” – but be careful:
-
Select the entire first row, locate the fill handle (or click “copy”)
-
Fill down the entire spreadsheet to the last row of entered data (or select all the rows to the last entry, and click “Paste”).
-
Now begin a quick spot check
-
check the last row, cell by cell across the row, is the translation and/or reference correct?
-
check the first row again?
-
pick rows with blanks and check
-
Look for Excel error messages (those that end in #, like “value#”, “Name#”, “NA”, etc.
-
If any of these errors are found, determine whether
-
the formula was in error, or
-
the value referenced in the “RawData” sheet is an error, or
-
the HLOOKUP() or VLOOKUP() table reference is in error.
-
etc.
-
Depending on what you find, fix the formula or references or decide what to do with bad data. I will provide some suggestions during the class. All fixes get applied to the first row cells, then you repeat steps 6 and 7 until all the anomalies are addressed.
-
Through this procedure you will have become very familiar with the data set provided to you.
1.2Filtering out Anomalous Entries, Enabling Sorts and Searches
At this point you generally need to clear the clutter out of the way. Often, you also need to group like entries together so you can get a “feel” for relationships that might be in the data set.
Tactics: Convert the “RawData (2)” sheet into a Excel data Table.
Methods:
-
Copy “RawData (2)” and create a “RawData (3)” sheet.
-
Select cell A1
-
Go to the “Insert” tab on the ribbon
-
click the “Table” icon near the left side.
-
the check box should indicate you have a header row
-
The result is a transformation of Rawdata (3) into an Excel Data Table.
-
To filter out blank entries (or cells coded as “missing” or any other coded value you want to hide)
-
select the category label pull down above the first row of data
-
In the value list uncheck any code value you do not want to see or to have processed.
-
click OK
-
I expect you to play with the (conditional) number filters on your own (the pull right menu on “Number filters”)..
-
To organize the data (sort it)
-
select the category label pull down above the first row of data
-
Indicate the type of sort and direction you want
-
The data rows are sorted across categories based on your selections, from the leftmost categories to the right.
1.3Producing a Histogram of Various Qualitative Characteristics
Objective: Produce a histogram of the distribution of participants in the survey by level of education.
Strategy: Use built-in functions to quickly count the numbers of each education level in either the RawData sheet or the RawData (3) table.
Method: Use the COUNTIF() function and DataDefintions worksheet tables to calculate the number of each type, and then plot a histogram of the calculations. This is demonstrated in class.
1.4Using Scatter Plots to Identify Relationships
Objective: to determine if there is a correlation or possible functional relationship between two categories.
Strategy: use the original or filtered data views and select all the entries for only two variables and use Excel scatter plot capabilities.
Summarizing Data Quickly using Pivot Tables
Objective: To reduce the focus of the search to a few categories of variables in relationship to each other. The behavior of a Pivot table depends on the values in the cells (e.g. codes, translated codes into strings, or numeric values). The problem with “codes” is that the Pivot table generator treats codes as if they are values. So it is usually better to use the Data Table views with the translated codes to produce the tables.
Method: We will demonstrate the production of Pivot tables (“Insert” tab on the ribbon) and basic display modifications you can use. It will be up to you to “play around” with them after class to get a real handle on how to leverage the Pivot Table capability in Excel.
Advanced Data Exploration (we may not get to this)
Objective: determine if there are significant inter-correlations between variable categories.
Strategy: Select specific data categories and use Excel functionality to detect correlations.
Method: Use the “Data” tab on the ribbon, and the “Data Analysis” options (far right) to explore relationships. We will demonstrate two of the capabilities: correlation and regression.
Variable Definitions for the Course Data Set
The Excel™ data set used for this course is based on a survey of customers at a retail establishment with three locations. Like many data sets, most of the data are codes representing something else. When such a data source is received it may not have in it the information that represents what the codes stand for or mean. Instead, they might be described in a document as shown below. The mapping of codes and meanings needs to be transcribed into the workbook. This needs to be done for conducting an efficient data analysis. Once codes are transcribed into Excel™ sheets, they will be used to facilitate cleaning, interpreting and analyzing the data.
Demographic Variables:
Location
|
Store identifier (1,2,3)
|
Age
|
Customer Age (in years)
|
Gender
|
Customer Gender (1=male, 2=female)
|
Fam-Size
|
Family size (number of persons in household)
|
Pres-Add
|
How long has the customer resided at the present address:
1=0-1 years
2=2-5 years
3=6-10 years
4=11-20 years
5=more than 20 years
|
Own-Rent
|
Residence ownership status (1=own, 2=rent)
|
Income
|
Annual household income:
1=Less than $10,000
2=$10,000-$24,999
3=$25,000-$50,000
4=Over $50,000
|
Educ
|
Education level:
1=Less than high school diploma
2=High school diploma or equivalent
3=Technical or trade school
4=Some college
5=Bachelor’s degree
6=Graduate degree
|
Employ
|
Currently employed? (1=yes, 2=no)
|
CCdebt
|
Current credit card debt (average monthly balance)
|
p(purchase)
|
Estimated probability of purchase on credit card in per month
|
Customer Satisfaction Rating Statements:
On a scale of 1 through 5, please mark your level of agreement or disagreement with the following statements, where 1 indicates strong disagreement; 3 indicates a neutral response, and 5 indicates strong agreement.
The store provides a clean shipping environment. [Cleanliness]
1 2 3 4 5
The store has convenient hours. [Hours]
1 2 3 4 5
The store prices are low compared to competitors. [Prices]
1 2 3 4 5
Store employees were friendly and helpful. [Service]
1 2 3 4 5
On a scale of 1 to 10 indicate your overall impression of the store, where 1 indicates a very poor overall impression and 10 indicates a very positive impression. [Overall-imp]
1 2 3 4 5 6 7 8 9 10
Transcribing to Excel from Word efficiently
In the course we will briefly do a partial transcription to demonstrate efficient ways to use the table on page 1, and the survey segment on page 2, to set up a mapping table in Excel™. Because of time limitations, we will not process every detail. Instead, a hidden sheet with all the information will be revealed and used for the course session.
The completed mapping of codes into an Excel™ sheet is shown below.
Dostları ilə paylaş: |