Course: Elements of Data Science Institute: EPFL Extension School Project Title: Project 2 Author: Elo Ogbokor Learner Name: mediumorchid-tangerine Date: 21 July 2024
Analysing a mystery survey through a series of exercises. I unfortunately cannot elaborate the project as a result of poor data science practices.
For this project we use a very minimalist dataset. Two groups (one of students and one of professors) have been asked a set of questions. You can find the excel file in the resources, with two different sheets (one for the students and one for the professors).
We don’t even know what the questions were, we only have each question ID in the first column. Moreover, whoever collected the data probably did not know about “good practices” since they decided to organise the data by respondent type (one table for students, one table for professors).
Load the data into R. Here you get more experience extracting data from a Excel file. You can use the {readxl} package to load the data, from the two different sheets, in two different tibbles
Answer to Part 1
Assigning students answers to studentsTable
studentsTable <- read_xlsx(path = "/Users/sunshine_in_a_bag/Documents/Adaire/Elements of Data Science/Project 2/data/likert_survey.xlsx",
sheet = "students")
#Use double quotes for sheet
#View(studentsTable)
studentsTable
Assigning professors answers to professorsTable
professorsTable <- read_xlsx(path = "/Users/sunshine_in_a_bag/Documents/Adaire/Elements of Data Science/Project 2/data/likert_survey.xlsx",
sheet = "professors")
#Use double quotes for sheet
#View(professorsTable)
#When using View, the table does not render, rather use table variable name to view table, rather than View
professorsTable
Once the data is loaded, answer the following questions:
Part 2, Question 2.1
What is the total number of answers obtained from students?
Note: If you have a code chunk where you used View directly in a pipe, and later use the table or variable using mutate in a new code chunk, an error might throw. The error will show in the mutate chunk, eventhough the View pipeline is the origin of the flaw.
#Each line represents one question and the columns contain the number of persons that gave a specific answer.
#Add four columns numbers
#Number of individuals who provided answer in each column = numbers of answers in each column
createStudentTypeColumn <- studentsTable %>%
mutate(Respondent_type = "Students")
## mutate: new variable 'Respondent_type' (character) with one unique value and 0%
## NA
createProfessorTypeColumn <- professorsTable %>%
mutate(Respondent_type = "Professors")
## mutate: new variable 'Respondent_type' (character) with one unique value and 0%
## NA
#insertprofessorTableunderstudentTable
insertprofessorTableunderstudentTable <- bind_rows(createStudentTypeColumn, createProfessorTypeColumn )
finalTable <- insertprofessorTableunderstudentTable %>%
group_by(Respondent_type) %>%
mutate(TotalanswersperQuestion = sum(TotallyAgree, Disagree, Agree, TotallyAgree),
(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion))) %>%
rename(GrandTotalAnswersEachRespondentType = `(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion))` ) %>%
ungroup()
## group_by: one grouping variable (Respondent_type)
## mutate (grouped): new variable 'TotalanswersperQuestion' (double) with 2 unique values and 0% NA
## new variable '(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion))' (double) with 2 unique values and 0% NA
## rename: renamed one variable (GrandTotalAnswersEachRespondentType)
## ungroup: no grouping variables remain
#View(finalTable)
finalTable
#When renaming or working with a column which holds a formula or is a formula, use backticks when using the column name
Answer to Part 2, Question 2.1
According to the column titled: “(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion)))” in finalTable above:
grand total of student answers: 272896
Part 2, Question 2.2
What is the total number of answers obtained from professors?
Answer to Part 2, Question 2.2
According to the column titled: “(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion)))” in finalTable above:
grand total of professor answers: 146112
Part 2, Question 2.3
Does each question have the same number of answers?
Answer to Part 2, Question 2.3
Yes, each question for student and professor respondent types contain the same number of answers, as evident in the “TotalanswersperQuestion” column. The reason is most probably the same group of students and professors answered each question, plus there were no missing answers.
Obtain side-by-side bar charts comparing the distribution of answers of the students versus the professors for each question. For this part, you might want to refresh your knowledge on mini charts with facets, covered in S10U07.
Steps to reach Answer to Part 3
#pivot
#full-join
#count responses
#ggplot2
changeTable <- finalTable %>%
select(-GrandTotalAnswersEachRespondentType, -TotalanswersperQuestion) %>%
group_by(Respondent_type) %>%
pivot_longer(cols = c (TotallyDisagree, Disagree, Agree, TotallyAgree),
names_to = "ResponseType",
values_to = "ResponseTypeCount") %>%
ungroup()
## select: dropped 2 variables (TotalanswersperQuestion, GrandTotalAnswersEachRespondentType)
## group_by: one grouping variable (Respondent_type)
## pivot_longer: reorganized (TotallyDisagree, Disagree, Agree, TotallyAgree) into (ResponseType, ResponseTypeCount) [was 64x6, now 256x4]
## ungroup: no grouping variables remain
#full_join(studentsTable, professorsTable, by = c("QuestionID")) %>%
#mutate(CountResponseTypeforEachQuestionperRespondentType = 0)
#ResponseType = Agree, Disagree. Count these for each question for respondent type.
#How many students stated Disagree to 0? ..
View(changeTable)
Answer to Part 3
generatingMiniBargraphs <- changeTable %>%
ggplot(aes(x = ResponseType,
y = ResponseTypeCount,
fill = Respondent_type)) +
geom_col(position = "dodge") +
#above unstacks bars. instead of bars resting on top of each other, they will be next to each other (individuals)
scale_fill_manual(values = c("#EB8DAB",
"#B39DE5")) +
facet_wrap(vars(QuestionID)) +
labs(title = "Series of bar graphs displaying Response Type Count per Question",
#Response Type = Agree, Disagree
subtitle = "Comparing the students and professors Response Types for each question",
caption = "Figure 3.1") +
theme(axis.text.x = element_text(angle = 90))
#Positions x-axis labels vertically
generatingMiniBargraphs
For this last part, provide a solution to the following exercise:
By binding, reshaping, grouping and filtering, create one single tibble that shows only the type of answer with the highest number of answers, for each question, and for each respondent type.
Find: the type of answer (Agree, Disagree) with the highest number of answers: in other words, mode for each question.
Note: In R when finding mode, you are provided the option to: - Use modeest package - Create your own function to generate the mode
Answer to Part 4
findMode <-
changeTable %>%
group_by(QuestionID, Respondent_type) %>%
mutate(ModeResponseTypeCount = max(ResponseTypeCount)) %>%
#creating new column to store mode response type
filter(ResponseTypeCount == ModeResponseTypeCount) %>%
#selecting rows which meet condition of mode
arrange(QuestionID) %>%
ungroup()
## group_by: 2 grouping variables (QuestionID, Respondent_type)
## mutate (grouped): new variable 'ModeResponseTypeCount' (double) with 47 unique values and 0% NA
## filter (grouped): removed 192 rows (75%), 64 rows remaining (removed 0 groups, 64 groups remaining)
## ungroup: no grouping variables remain
findMode
Credits:
Cover page image of briefcase: https://depositphotos.com/vector/cute-cartoon-pink-briefcase-buttons-documents-office-work-school-university-479725488.html
Project 2 End