Course: Elements of Data Science Institute: EPFL Extension School Project Title: Project 1 Author: Elo Ogbokor Learner Name: mediumorchid-tangerine Date: 05 July 2024
Analysing data collected from a survey of participants at a conference in the Netherlands using survey.csv dataset through a series of exercises.
Load the data, make sure you have clean column names and answer the following questions:
Before tackling the exercises, the survey data-set is required to be imported.
Loading the dataset, plus ensuring columns are clean.
Below is the survey.csv dataset containing the results from a survey of participants at a conference in the Netherlands. Each row stands for one participant.
#Importing the survey.csv data-set
original_survey.csv <- read_csv("/Users/sunshine_in_a_bag/Documents/Adaire/Elements of Data Science/Project 1/data/survey.csv")
## New names:
## Rows: 30 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): University, Position, Sizes dbl (5): ...1, Number of employees, Number of
## PhD students, Number of postdo...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
#When i open survey.csv, the first column is title "1," and contains a number assigned to each row. Since this column is unnecessary, use code below to evaporate the first column using select, plus -1. Since the column is title 1 by R, however blank is excel or Numbers, add - in front of the column-name "1" to eliminate the column.
#survey.csv is re-assigned to remove the first unnecessary column
survey.csv <- select(original_survey.csv, -1)
survey.csv
Part
1, Question 1
How many survey participants were there? Show that you can find with code the number of lines in this dataset, since each line is one participant
Steps to reach Answer to Part 1, Question 1
survey.csv %>%
#Using nrow to determine number of rows (in other words, number of participants, since each line stands for one participant)
nrow()
## [1] 30
Answer to Part 1, Question 1
There were 30 survey participants.
Part 1,
Question 2
What are the data types of each column? Show that you know at least one function that will show in its output the type of each column
Steps to reach Answer to Part 1, Question 2
#Plotting chart to display column names, plus indicate their datatypes.
vis_dat(survey.csv) +
labs(title = "Bar chart displaying survey.csv column names, plus respective datatypes",
subtitle = "The datatypes contained in survey.csv are: Numeric and Character.",
caption = "Figure 1.2") +
scale_fill_manual(values = c("#EB8DAB",
"#ECC4D3"
))
Answer to Part 1, Question 2
As viewed in Figure 1.2 above, the datatypes contained in survey.csv are:
The columns classified under Numeric are:
The columns classified under Character are:
Part 1, Question 3
What are the top 3 universities with the most survey participants? Show that you know how to compute some summary values on a dataset and extract only the lines that interest you from the result
Steps to reach Answer to Part 1, Question 3
addColumnWithTotalParticipantsperUniversity <- survey.csv %>%
group_by(University) %>%
#number of rows per university = number of participants per university, since one row is one participant
mutate(TotalParticipantsperUniversity = n()) %>%
#n() for number of rows
#If we count the number of rows per university, we will have the total participants per university
ungroup() #Do not forget un-grouping practice
#count university entries
addColumnWithTotalParticipantsperUniversity
Below we extract the columns (University, plus TotalParticipantsperUniversity) to only view the necessary content.
viewUniversityandTotalParticipants <- addColumnWithTotalParticipantsperUniversity %>%
select(University, TotalParticipantsperUniversity)
viewUniversityandTotalParticipants
NewColumnofOnlyTotalParticipants <-
viewUniversityandTotalParticipants %>%
summarise(TotalParticipantsAppearOnce = unique(TotalParticipantsperUniversity))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#unique to remove duplicates
NewColumnofOnlyTotalParticipants
#summarise only returns new column, unlike mutate
#Attempt to append university to total participants
survey.csv %>%
group_by(University) %>%
summarise() %>%
#creating table where universities appear only once
#mutate to append the totalparticipants column created using summarise to the universities
mutate(NewColumnofOnlyTotalParticipants ) %>%
arrange(desc(NewColumnofOnlyTotalParticipants ))
#sort participants in descending order, since we're seeking the highest values
Answer to Part 1, Question 3
According to the table above, the top three universities (3 universities with the highest number of survey participants) are:
Part 2, Question 1
Obtain a bar chart of the number of survey participants for the different positions given in the column Position.
Steps to reach Answer to Part 2, Question 1
What is required?
#Creating fresh column to count participants per position
countParticipantsperPosition <-
survey.csv %>%
group_by(Position) %>%
mutate(Count = n()) %>%
ungroup()
#using mutate to create a column which stores the number of participants in each position
#n() counts the number of rows (in other words, number of participants since each line is a single participant) per position
countParticipantsperPosition
#Below we extract the necessary columns "Position" and "countParticipantsPerPosition"
PositionandcountParticipantsPerPositionTable <-
countParticipantsperPosition %>%
select(Position, Count )
PositionandcountParticipantsPerPositionTable
#unique to eliminate duplicate rows
positionandCountforbarplot <-
unique(PositionandcountParticipantsPerPositionTable)
positionandCountforbarplot
Answer to Part 2, Question 1
Bar graph plotting number of participants per Position
ggplot(data = positionandCountforbarplot,
mapping = aes(x = Position,
y = Count ,
fill = Position)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("#EA7DAF",
"#FFA9C0",
"#F6C8DB",
"#F6E0E4",
"#AD9EC4",
"#D3C0F1",
"#EAC1F1",
"#EAE2F1"
)) +
labs(title = "Number of survey participants per Position",
subtitle = "To conclude, the majority of the participants hold the titles: Postdoc and PhD Student",
caption = "Figure 2.1")
There is a column (named Sizes) in the dataset that is supposed to group the data in the column Number of employees in the following bins:
Small (0 to 20) Medium (21 to 50) Large (51 to 200) Extra large (over 200) Unfortunately, somebody made mistakes in that column…
Part 3, Question 1
Recreate this column correctly,
Part 3, Question 2
then plot a bar chart with {ggplot2} of the number of values in each of the bins. This is an opportunity to verify that you understand how to add a column in your dataset with values depending on the values of another columns (if other column equal x than this column equal y…)
Steps to reach Answer to Part 3, Question 1
Viewing the columns “Number of employees” and “Sizes” to unveil the incorrect columns.
EmployeesSortinSizes <- survey.csv %>%
select(`Number of employees`, Sizes)
EmployeesSortinSizes
Answer to Part 3, Question 1
Recreating “Sizes” column correctly.
tableOfEmployeesinCorrectSizes <-
EmployeesSortinSizes %>%
mutate(Sizes = case_when(
`Number of employees` <=0 | `Number of employees` <=20 ~ "Small (0 to 20)",
`Number of employees` <=21 | `Number of employees` <=50 ~ "Medium (21 to 50)",
`Number of employees` <=51 | `Number of employees` <=200 ~ "Large (51 to 200)",
`Number of employees` >200 ~ "Extra large (over 200)"
))
#mutate to override "Sizes" column with correct values.
tableOfEmployeesinCorrectSizes
Steps to reach Answer to Part 3, Question 2
Try number of rows per size, not total
numberofentriesperSize <-
tableOfEmployeesinCorrectSizes %>%
group_by(Sizes) %>%
mutate(numberofEntriesperSize = n()) #n() for number of rows per size.
#%>%
# ungroup()
#ungroup causes error here
numberofentriesperSize
#Removing `Number of employees` column since we are only interested in Sizes and numberofentriesperSize for the bar-plot.
SizesandnumberofentriesperSize <-
numberofentriesperSize %>%
select(- `Number of employees`)
SizesandnumberofentriesperSize
SizesandnumberofentriesperSizeOnlyOnce <-
distinct( SizesandnumberofentriesperSize) %>%
rename(NumberofParticipants = numberofEntriesperSize)
SizesandnumberofentriesperSizeOnlyOnce
Answer to Part 3, Question 2
Plotting bar chart with {ggplot2} of the number of values in each of the bins.
ggplot(data = SizesandnumberofentriesperSizeOnlyOnce,
mapping = aes(x = NumberofParticipants,
y = Sizes,
fill = Sizes)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("#EA7DAF",
"#ECC4D3",
"#EAC1F1",
"#EAE2F1"
)) +
labs(title = "Bar plot illustrating number of participants sorted into category of Size",
subtitle = "It is evident most participants belong in a 'Small' size category.
On the other hand, the minority of participants belong to the 'Extra large' size.",
caption = "Figure 3.2")
#Note: NEVER wrap ggplot2 column names in quotes. It will produce a strange plot.
Part 4, Question 1
Obtain a stacked horizontal bar plot with {ggplot2} that gives the percentage of employees that are Ph.D. students, postdocs and professors for each research group. So now we are interested in calculating and visualising percentage rather than absolute values. This way, all bars in our bar chart will have the same total size and we will be able to see difference in proportions
Steps to reach Answer to Part 4, Question 1
calculatePercentages <-
#creating tibble to calculate the percentages
original_survey.csv %>%
select(-Position, -University, -Sizes) %>%
# - to remove unnecessary columns
rename("Research group Id" = 1) %>%
#Renaming the first column from "1" to "Research group Id"
mutate("Percentage of Ph.D. students" = `Number of PhD students` / `Number of employees` * 100,
#mutate to create three new columns, in which the formula is provided to calculate the percentage for each column
"Percentage of postdocs" = `Number of postdocs` / `Number of employees` * 100 ,
"Percentage of professors " = `Number of professors` / `Number of employees` * 100 )
calculatePercentages
finalColumns <- calculatePercentages %>%
select(`Research group Id`,
`Percentage of Ph.D. students`,
`Percentage of postdocs`,
`Percentage of professors `)
#%>%
#group_by(`Research group Id`) %>%
#ungroup()
finalColumns
reShapeFinalColumns <-
finalColumns %>%
#select columns to pivot (join together?)
#pivot_longer serves function of editing table to result in more rows, plus less columns
pivot_longer(cols = c(`Percentage of Ph.D. students`, `Percentage of postdocs`, `Percentage of professors `),
names_to = "Employee type",
#names_to: title of column containing employee position: PhD, postdoc, professors
#In other words, column titles: Percentage of Ph.D. students, Percentage of postdocs, Percentage of professors under ONE heading, in a single column
values_to = "Percentage"
#values_to:
#mapping the percentages onto the "Employee type" titles
)
reShapeFinalColumns
Answer to Part 2, Question 1
ggplot(data = reShapeFinalColumns,
mapping = aes(x = factor (`Research group Id`),
#factor to display all y-values: id values
y = `Percentage`,
fill = `Employee type`
)) +
scale_fill_manual(values = c("#EB8DAB",
"#ECC4D3",
"#F6DBE1")) +
geom_col() +
coord_flip() +
labs(title = "Stacked Bar Plot displaying employee type ratios for each research group",
subtitle = "Comparing the employee type percentage ratios per group")
Credits:
Cover page image:
21
Things You Should Know Before Moving to the Netherlands
End of Project 1