An iconic Dutch windmill in a sunny Dutch meadow

Course: Elements of Data Science
Institute: EPFL Extension School
Project Title: Project 2
Author: Elo Ogbokor
Learner Name: mediumorchid-tangerine
Date: 21 July 2024

Project summary:

Analysing a mystery survey through a series of exercises. I unfortunately cannot elaborate the project as a result of poor data science practices.

Project Description:

Introduction

For this project we use a very minimalist dataset. Two groups (one of students and one of professors) have been asked a set of questions. You can find the excel file in the resources, with two different sheets (one for the students and one for the professors).

We don’t even know what the questions were, we only have each question ID in the first column. Moreover, whoever collected the data probably did not know about “good practices” since they decided to organise the data by respondent type (one table for students, one table for professors).

Exercises

Part 1

Load the data into R. Here you get more experience extracting data from a Excel file. You can use the {readxl} package to load the data, from the two different sheets, in two different tibbles

Answer to Part 1

Loading likert_survey.xlsx into two tables:

studentsTable
professorsTable

Assigning students answers to studentsTable

studentsTable <- read_xlsx(path = "/Users/sunshine_in_a_bag/Documents/Adaire/Elements of Data Science/Project 2/data/likert_survey.xlsx",
                           sheet = "students") 
#Use double quotes for sheet

#View(studentsTable)
studentsTable

Assigning professors answers to professorsTable

professorsTable <- read_xlsx(path = "/Users/sunshine_in_a_bag/Documents/Adaire/Elements of Data Science/Project 2/data/likert_survey.xlsx",
                           sheet = "professors") 
#Use double quotes for sheet

#View(professorsTable)
#When using View, the table does not render, rather use table variable name to view table, rather than View
professorsTable

Part 2

Once the data is loaded, answer the following questions:

Part 2, Question 2.1

What is the total number of answers obtained from students?

Note: If you have a code chunk where you used View directly in a pipe, and later use the table or variable using mutate in a new code chunk, an error might throw. The error will show in the mutate chunk, eventhough the View pipeline is the origin of the flaw.

#Each line represents one question and the columns contain the number of persons that gave a specific answer.
#Add four columns numbers 
#Number of individuals who provided answer in each column = numbers of answers in each column




createStudentTypeColumn <- studentsTable %>% 
  mutate(Respondent_type = "Students")

## mutate: new variable 'Respondent_type' (character) with one unique value and 0%
## NA

createProfessorTypeColumn <- professorsTable %>% 
    mutate(Respondent_type = "Professors")

## mutate: new variable 'Respondent_type' (character) with one unique value and 0%
## NA

#insertprofessorTableunderstudentTable 
insertprofessorTableunderstudentTable <- bind_rows(createStudentTypeColumn, createProfessorTypeColumn )


finalTable <- insertprofessorTableunderstudentTable  %>% 
  group_by(Respondent_type) %>% 
  

 mutate(TotalanswersperQuestion = sum(TotallyAgree, Disagree, Agree, TotallyAgree),
         (GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion)))  %>% 
  rename(GrandTotalAnswersEachRespondentType = `(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion))` ) %>% 

  ungroup()

## group_by: one grouping variable (Respondent_type)
## mutate (grouped): new variable 'TotalanswersperQuestion' (double) with 2 unique values and 0% NA
##                   new variable '(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion))' (double) with 2 unique values and 0% NA
## rename: renamed one variable (GrandTotalAnswersEachRespondentType)
## ungroup: no grouping variables remain

#View(finalTable)
finalTable

#When renaming or working with a column which holds a formula or is a formula, use backticks when using the column name

Answer to Part 2, Question 2.1

According to the column titled: “(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion)))” in finalTable above:

grand total of student answers: 272896

Part 2, Question 2.2

What is the total number of answers obtained from professors?

Answer to Part 2, Question 2.2

According to the column titled: “(GrandTotalAnswersEachRespondentType = sum(TotalanswersperQuestion)))” in finalTable above:

grand total of professor answers: 146112

Part 2, Question 2.3

Does each question have the same number of answers?

Answer to Part 2, Question 2.3

Yes, each question for student and professor respondent types contain the same number of answers, as evident in the “TotalanswersperQuestion” column. The reason is most probably the same group of students and professors answered each question, plus there were no missing answers.

Part 3

Obtain side-by-side bar charts comparing the distribution of answers of the students versus the professors for each question. For this part, you might want to refresh your knowledge on mini charts with facets, covered in S10U07.

Steps to reach Answer to Part 3

Pivoting finalTable to generate mini-bar graphs

#pivot
#full-join
#count responses
#ggplot2

changeTable <-  finalTable %>% 
  select(-GrandTotalAnswersEachRespondentType, -TotalanswersperQuestion) %>% 
  group_by(Respondent_type) %>% 
  pivot_longer(cols = c (TotallyDisagree, Disagree, Agree, TotallyAgree),
names_to = "ResponseType",
values_to = "ResponseTypeCount")  %>% 
    ungroup()

## select: dropped 2 variables (TotalanswersperQuestion, GrandTotalAnswersEachRespondentType)
## group_by: one grouping variable (Respondent_type)
## pivot_longer: reorganized (TotallyDisagree, Disagree, Agree, TotallyAgree) into (ResponseType, ResponseTypeCount) [was 64x6, now 256x4]
## ungroup: no grouping variables remain

  #full_join(studentsTable, professorsTable, by = c("QuestionID")) %>% 


  #mutate(CountResponseTypeforEachQuestionperRespondentType = 0)
#ResponseType = Agree, Disagree. Count these for each question for respondent type. 
#How many students stated Disagree to 0? ..



View(changeTable)

Answer to Part 3

generatingMiniBargraphs <- changeTable %>% 
ggplot(aes(x = ResponseType,
           y = ResponseTypeCount,
           fill = Respondent_type)) +
  geom_col(position = "dodge") +
  #above unstacks bars. instead of bars resting on top of each other, they will be next to each other (individuals)
   scale_fill_manual(values = c("#EB8DAB", 
                             "#B39DE5")) +
  facet_wrap(vars(QuestionID)) +
  labs(title = "Series of bar graphs displaying Response Type Count per Question",
       #Response Type = Agree, Disagree
       subtitle = "Comparing the students and professors Response Types for each question",
       caption = "Figure 3.1") +
theme(axis.text.x = element_text(angle = 90)) 
#Positions x-axis labels vertically



generatingMiniBargraphs

Part 4

For this last part, provide a solution to the following exercise:

By binding, reshaping, grouping and filtering, create one single tibble that shows only the type of answer with the highest number of answers, for each question, and for each respondent type.

Find: the type of answer (Agree, Disagree) with the highest number of answers: in other words, mode for each question.

Note: In R when finding mode, you are provided the option to: - Use modeest package - Create your own function to generate the mode

Answer to Part 4

findMode <-
changeTable %>% 
  group_by(QuestionID, Respondent_type) %>% 
  mutate(ModeResponseTypeCount = max(ResponseTypeCount)) %>% 
  #creating new column to store mode response type
   filter(ResponseTypeCount == ModeResponseTypeCount) %>% 
  #selecting rows which meet condition of mode
  arrange(QuestionID) %>% 
  ungroup()

## group_by: 2 grouping variables (QuestionID, Respondent_type)
## mutate (grouped): new variable 'ModeResponseTypeCount' (double) with 47 unique values and 0% NA
## filter (grouped): removed 192 rows (75%), 64 rows remaining (removed 0 groups, 64 groups remaining)
## ungroup: no grouping variables remain

findMode

Credits:
Cover page image of briefcase: https://depositphotos.com/vector/cute-cartoon-pink-briefcase-buttons-documents-office-work-school-university-479725488.html

Project 2 End

EPFL Extension School, Elements of Data Science, Project 2