EPFL Extension School, Elements of Data Science, Project 1

An iconic Dutch windmill in a sunny Dutch meadow

Course: Elements of Data Science
Institute: EPFL Extension School
Project Title: Project 1
Author: Elo Ogbokor
Learner Name: mediumorchid-tangerine
Date: 05 July 2024

Project Description:

Analysing data collected from a survey of participants at a conference in the Netherlands using survey.csv dataset through a series of exercises.

Part 1

Importing survey.csv

Load the data, make sure you have clean column names and answer the following questions:

Before tackling the exercises, the survey data-set is required to be imported.

Loading the dataset, plus ensuring columns are clean.

Below is the survey.csv dataset containing the results from a survey of participants at a conference in the Netherlands. Each row stands for one participant.

#Importing the survey.csv data-set
original_survey.csv <- read_csv("/Users/sunshine_in_a_bag/Documents/Adaire/Elements of Data Science/Project 1/data/survey.csv")

## New names:
## Rows: 30 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): University, Position, Sizes dbl (5): ...1, Number of employees, Number of
## PhD students, Number of postdo...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

#When i open survey.csv, the first column is title "1," and contains a number assigned to each row. Since this column is unnecessary, use code below to evaporate the first column using select, plus -1. Since the column is title 1 by R, however blank is excel or Numbers, add - in front of the column-name "1" to eliminate the column.
#survey.csv is re-assigned to remove the first unnecessary column
survey.csv <- select(original_survey.csv, -1)


survey.csv

Part 1, Question 1

How many survey participants were there? Show that you can find with code the number of lines in this dataset, since each line is one participant

Steps to reach Answer to Part 1, Question 1

survey.csv  %>%
  #Using nrow to determine number of rows (in other words, number of participants, since each line stands for one participant)
nrow()

## [1] 30

Answer to Part 1, Question 1

There were 30 survey participants.

Part 1, Question 2

What are the data types of each column? Show that you know at least one function that will show in its output the type of each column

Steps to reach Answer to Part 1, Question 2

#Plotting chart to display column names, plus indicate their datatypes.
vis_dat(survey.csv) +
  labs(title = "Bar chart displaying survey.csv column names, plus respective datatypes",
       subtitle = "The datatypes contained in survey.csv are: Numeric and Character.",
       caption = "Figure 1.2") +
   scale_fill_manual(values = c("#EB8DAB", 
                             "#ECC4D3"
   ))

Answer to Part 1, Question 2

As viewed in Figure 1.2 above, the datatypes contained in survey.csv are:

Numeric
Character

The columns classified under Numeric are:

Number of employees
Number of PhD students
Number of postdocs
Number of professors

The columns classified under Character are:

University
Position
Sizes

Part 1, Question 3

What are the top 3 universities with the most survey participants? Show that you know how to compute some summary values on a dataset and extract only the lines that interest you from the result

Steps to reach Answer to Part 1, Question 3

 addColumnWithTotalParticipantsperUniversity <- survey.csv  %>%
  
  group_by(University)  %>%
  
  #number of rows per university = number of participants per university, since one row is one participant

  mutate(TotalParticipantsperUniversity = n())  %>%
  #n() for number of rows
#If we count the number of rows per university, we will have the total participants per university

  ungroup() #Do not forget un-grouping practice


  #count university entries

addColumnWithTotalParticipantsperUniversity

Below we extract the columns (University, plus TotalParticipantsperUniversity) to only view the necessary content.

viewUniversityandTotalParticipants <- addColumnWithTotalParticipantsperUniversity   %>%
  select(University, TotalParticipantsperUniversity)   

viewUniversityandTotalParticipants

NewColumnofOnlyTotalParticipants <-
viewUniversityandTotalParticipants  %>%
  summarise(TotalParticipantsAppearOnce = unique(TotalParticipantsperUniversity))

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#unique to remove duplicates
  

NewColumnofOnlyTotalParticipants

#summarise only returns new column, unlike mutate

#Attempt to append university to total participants
  survey.csv  %>%
  group_by(University)  %>%
  summarise()  %>%
  #creating table where universities appear only once
  
  #mutate to append the totalparticipants column created using summarise to the universities
  mutate(NewColumnofOnlyTotalParticipants )  %>%
  
  arrange(desc(NewColumnofOnlyTotalParticipants ))

#sort participants in descending order, since we're seeking the highest values

Answer to Part 1, Question 3

According to the table above, the top three universities (3 universities with the highest number of survey participants) are:

Delft University of Technology (9 participants)
Eindhoven University of Technology (7 participants)
Erasmus University Rotterdam (6 participants)

Part 2

Part 2, Question 1

Obtain a bar chart of the number of survey participants for the different positions given in the column Position.

Steps to reach Answer to Part 2, Question 1

What is required?

Number of survey participants for the different positions?
Table with column name: positions , column name: number of participants

#Creating fresh column to count participants per position
countParticipantsperPosition <- 
  survey.csv  %>%
  group_by(Position) %>%
  mutate(Count = n())  %>%
  ungroup()

#using mutate to create a column which stores the number of participants in each position
#n() counts the number of rows (in other words, number of participants since each line is a single participant) per position

countParticipantsperPosition

#Below we extract the necessary columns "Position" and "countParticipantsPerPosition" 
PositionandcountParticipantsPerPositionTable <-
countParticipantsperPosition  %>%
  select(Position, Count )  


PositionandcountParticipantsPerPositionTable

#unique to eliminate duplicate rows

 positionandCountforbarplot <-
  unique(PositionandcountParticipantsPerPositionTable)


 positionandCountforbarplot

Answer to Part 2, Question 1

Bar graph plotting number of participants per Position

ggplot(data =  positionandCountforbarplot,
mapping = aes(x = Position,
              y = Count ,
              fill = Position)) +
  geom_col() +
  coord_flip() +
  

  
  
  
  scale_fill_manual(values = c("#EA7DAF", 
                             "#FFA9C0", 
                             "#F6C8DB",
                             "#F6E0E4",
                             "#AD9EC4",
                             "#D3C0F1",
                             "#EAC1F1",
                             "#EAE2F1"
                             )) +
  
  
  labs(title = "Number of survey participants per Position",
       subtitle = "To conclude, the majority of the participants hold the titles: Postdoc and PhD Student",
       caption = "Figure 2.1")

Part 3

There is a column (named Sizes) in the dataset that is supposed to group the data in the column Number of employees in the following bins:

Small (0 to 20) Medium (21 to 50) Large (51 to 200) Extra large (over 200) Unfortunately, somebody made mistakes in that column…

Part 3, Question 1

Recreate this column correctly,

Part 3, Question 2

then plot a bar chart with {ggplot2} of the number of values in each of the bins. This is an opportunity to verify that you understand how to add a column in your dataset with values depending on the values of another columns (if other column equal x than this column equal y…)

Steps to reach Answer to Part 3, Question 1

Viewing the columns “Number of employees” and “Sizes” to unveil the incorrect columns.

EmployeesSortinSizes <- survey.csv %>%
  select(`Number of employees`, Sizes)

EmployeesSortinSizes

Answer to Part 3, Question 1

Recreating “Sizes” column correctly.

 tableOfEmployeesinCorrectSizes <-
 EmployeesSortinSizes  %>% 
  mutate(Sizes = case_when(
    `Number of employees` <=0  |  `Number of employees` <=20  ~ "Small (0 to 20)",
    `Number of employees` <=21  |  `Number of employees` <=50  ~ "Medium (21 to 50)",
    `Number of employees` <=51  |  `Number of employees` <=200  ~ "Large (51 to 200)",
    `Number of employees` >200  ~ "Extra large (over 200)"
    
  ))
  
  #mutate to override "Sizes" column with correct values.

tableOfEmployeesinCorrectSizes

Steps to reach Answer to Part 3, Question 2

Try number of rows per size, not total

numberofentriesperSize <-
tableOfEmployeesinCorrectSizes %>% 
  
  group_by(Sizes)  %>% 
  
  mutate(numberofEntriesperSize = n()) #n() for number of rows per size.  

      #%>%
 # ungroup()
#ungroup causes error here


numberofentriesperSize

#Removing  `Number of employees` column since we are only interested in Sizes and numberofentriesperSize for the bar-plot.
 SizesandnumberofentriesperSize <-
numberofentriesperSize  %>%
  select(- `Number of employees`)


 SizesandnumberofentriesperSize

SizesandnumberofentriesperSizeOnlyOnce <-
distinct( SizesandnumberofentriesperSize) %>%
  
  rename(NumberofParticipants = numberofEntriesperSize)


SizesandnumberofentriesperSizeOnlyOnce

Answer to Part 3, Question 2

Plotting bar chart with {ggplot2} of the number of values in each of the bins.

ggplot(data =  SizesandnumberofentriesperSizeOnlyOnce,
mapping = aes(x = NumberofParticipants,
              y = Sizes,
              fill = Sizes)) +
  geom_col() +
  
  coord_flip() +
  
  
  
  
  
  scale_fill_manual(values = c("#EA7DAF", 
                             "#ECC4D3", 
                             "#EAC1F1",
                             "#EAE2F1"
                             )) +
  

  labs(title = "Bar plot illustrating number of participants sorted into category of Size", 
       subtitle = "It is evident most participants belong in a 'Small' size category. 
       On the other hand, the minority of participants belong to the 'Extra large' size.",
       caption = "Figure 3.2")

#Note: NEVER wrap ggplot2 column names in quotes. It will produce a strange plot.

Part 4

Part 4, Question 1

Obtain a stacked horizontal bar plot with {ggplot2} that gives the percentage of employees that are Ph.D. students, postdocs and professors for each research group. So now we are interested in calculating and visualising percentage rather than absolute values. This way, all bars in our bar chart will have the same total size and we will be able to see difference in proportions

Steps to reach Answer to Part 4, Question 1

calculatePercentages <-
  #creating tibble to calculate the percentages
original_survey.csv   %>% 
  select(-Position, -University, -Sizes) %>% 
  # - to remove unnecessary columns
  rename("Research group Id" = 1) %>% 
  #Renaming the first column from "1" to "Research group Id"
  mutate("Percentage of Ph.D. students" = `Number of PhD students` / `Number of employees` * 100,
         #mutate to create three new columns, in which the formula is provided to calculate the percentage for each column
         "Percentage of postdocs" = `Number of postdocs` / `Number of employees` * 100 ,
         
         "Percentage of professors " =  `Number of professors` / `Number of employees` * 100 )


calculatePercentages

finalColumns <- calculatePercentages  %>% 
  select(`Research group Id`, 
         `Percentage of Ph.D. students`, 
         `Percentage of postdocs`, 
         `Percentage of professors `)

#%>% 
  
  #group_by(`Research group Id`) %>% 

#ungroup()


finalColumns

reShapeFinalColumns  <-
finalColumns %>% 
  
  #select columns to pivot (join together?)
  #pivot_longer serves function of editing table to result in more rows, plus less columns
pivot_longer(cols = c(`Percentage of Ph.D. students`, `Percentage of postdocs`, `Percentage of professors `),
             names_to = "Employee type",
             #names_to: title of column containing employee position: PhD, postdoc, professors
             #In other words, column titles: Percentage of Ph.D. students, Percentage of postdocs, Percentage of professors under ONE heading, in a single column
             values_to = "Percentage"
             #values_to:  
  
             #mapping the percentages onto the "Employee type" titles
) 
  

reShapeFinalColumns

Answer to Part 2, Question 1

Pivoting link

ggplot(data = reShapeFinalColumns,
       mapping = aes(x = factor (`Research group Id`),
                     #factor to display all y-values: id values
                     y =  `Percentage`,
                     fill  = `Employee type`
 
 )) +
  
  scale_fill_manual(values = c("#EB8DAB", 
                             "#ECC4D3", 
                             "#F6DBE1")) +
  
  geom_col() +
  coord_flip()  +
  
  labs(title = "Stacked Bar Plot displaying employee type ratios for each research group",
       subtitle = "Comparing the employee type percentage ratios per group")

Credits:
Cover page image: 21 Things You Should Know Before Moving to the Netherlands

End of Project 1