Data Science is, well, all about data. But data lives somewhere.
Where? How do we find them? In this session we will learn about the
basic I/O (Input/Output) operations in R: how to load the disk stored
data written in various formats in R, and how to store them back. We
will learn that sometimes it makes no difference if the dataset lives on
the Internet or on our local hard drives. We will learn to perform the
basic I/O operations in base R before we start meeting the friendly tidyverse packages - readr, for example - that
provide improved and somewhat more comfortable to work with procedures.
We will learn more about the data.frame
in R: what is
subsetting and is it done, how to summarize a dataset represented by a
data.frame
in R, how to bind dataframes together. We
continue to play with lists too and start learning about simple
operations on strings in base R.
_data
in the directory where
you want to store your R code for this session.listings.csv
csv
file (under the Amsterdam, North Holland, The
Netherlands section). Let’s learn something right now:
.csv
is short for comma separated values
and represents one the most frequently observed file formats used in
practice. The entries in this file are separated by commas: hence the
name. The csv
format has many close relatives, of which
tsv
- tab separated values - is probably
the most famous one.listings.csv
file in Microsoft Excel or Libre
Calc and save it using the same filename but as an .xlsx
file in your _data
directory.Again, everything happens in a directory somewhere. You really need to keep the organization of your directories neat!
Remember, when we work in R, there is always something called a working directory. I know you have opened and R session in RStudio: ask yourself “Where am I?”:
getwd()
## [1] "/Users/goransm/Work/___DataKolektiv/__freeDataKolektiv/_devsify"
list.files(getwd())
## [1] "_data" "_devsify.Rproj"
## [3] "_site" "_site.yml"
## [5] "01_think_together_dkserver.png" "02_explore_possibilities_dkserver.png"
## [7] "03_training_dkserver.png" "04_wikidata_dkserver.png"
## [9] "about.html" "about.Rmd"
## [11] "academic.png" "AdvAnalyticsR2024_Banner.jpeg"
## [13] "aleksandar_cvetkovic_introMLpython.jpg" "aleksandar_cvetkovic_introMLpython.png"
## [15] "AleksandarNarancic175.png" "AleksandarPanovski175.png"
## [17] "analysttraining.png" "businesstraining.png"
## [19] "DataScienceSessions.jpg" "devsify_DK_logo.png"
## [21] "devsify_GSM500px__Banner.png" "DK_Logo_100.png"
## [23] "DK_Logo_400.png" "DK_logo_horizontal.png"
## [25] "DK_Logo_White_150.png" "DK_Logo_White_NoTitle_25.png"
## [27] "dkmlconfpalic2023.png" "dkserver_consulting.jpg"
## [29] "dkserver_courses_CustomRTraining.jpg" "dkserver_courses_DSSV1.jpg"
## [31] "dkserver_courses_IntroDataManagement.jpg" "dkserver_courses.jpg"
## [33] "dkserver_wikidata.jpg" "DSCEurope2021.png"
## [35] "dscEurope2022_BDTTechTutorial.png" "DSCEurope2022_panel.png"
## [37] "dsceurope2023.png" "dss_vol05_python2023_500px.png"
## [39] "DSS_Vol06_Py_Banner_OPENED_02.png" "dss03python2023_banner500px.png"
## [41] "dss04python2023_banner500px.png" "DSS05_2023_srb.png"
## [43] "dsss2022_courses.jpg" "dsss2022_header.jpg"
## [45] "dssvol06_py_banner.jpeg" "eRum2018.png"
## [47] "eRum2020.png" "favicon.png"
## [49] "filipinterviewInstitutZaFiziku.jpg" "footer.html"
## [51] "foundationaltraining.png" "freeDK_Expectations.png"
## [53] "freeDK_Organization.png" "freeDK_Program.png"
## [55] "freeDK_Rules.png" "freeDK_TechStack.png"
## [57] "Full_GPT_architecture.png" "github.png"
## [59] "goran_milovanovic_introMLpython.jpg" "goran_milovanovic_introMLpython.png"
## [61] "GSM_Krakow2024_invert.jpg" "GSM_Krakow2024.jpg"
## [63] "gsm_round_175px.png" "Hallucination_CronenbergStereo.png"
## [65] "header.html" "HeikeThomys175px.jpg"
## [67] "HipkonLajt2020.png" "igf_identifying_the_impact.png"
## [69] "IlijaLazarevic175px.jpg" "index.html"
## [71] "index.Rmd" "InspiraHub2023Meetup.png"
## [73] "instagram.png" "IntroMLPython_Head_eng.png"
## [75] "IntroMLPython_Head.png" "JelisavetaMilovanovic175px.jpeg"
## [77] "l01_lists.html" "l01_lists.Rmd"
## [79] "linkedin.png" "Logo-2C-21px-TM.png"
## [81] "machinery.png" "media.html"
## [83] "media.Rmd" "MilanoR2019.png"
## [85] "MilicaDjurkovic175px.jpg" "MilovanMinic175px.jpeg"
## [87] "MilovanovicVreme.jpg" "MLSS03_ConvexFunctions.png"
## [89] "MLSS03_NeuralNetwork.png" "MLSS03_WebPage_Notebook_Banner.png"
## [91] "MLSS03_WebPageBanner.png" "n00_installations.html"
## [93] "n00_installations.Rmd" "n00_intro_session.html"
## [95] "n00_intro_session.html.R" "n00_intro_session.R"
## [97] "n00_intro_session.txt.R" "n01_intro_session.R"
## [99] "n01_introR_session.R" "NedaRadibratovic175px.jpg"
## [101] "Netokracija2023.png" "NetokracijaIntervju2023.png"
## [103] "PE_banner_2024.png" "PE_GeneralBanner.png"
## [105] "PE_GeneralBanner02_web.png" "PE_GeneralBanner02.png"
## [107] "PE_GeneralBannerDecember2023.png" "PE_sep2023.png"
## [109] "prerequisites.html" "prerequisites.nb.html"
## [111] "prerequisites.Rmd" "RadmilaVelickovic175px.png"
## [113] "RadovanSimikic175px.jpg" "regression_mle.png"
## [115] "renv" "renv.lock"
## [117] "resources.html" "resources.Rmd"
## [119] "Rlogo.png" "s00_installations.html"
## [121] "s00_installations.Rmd" "S01_01_Functions.jpeg"
## [123] "S01_02_Arrays_and_Lists.jpeg" "s01_introduction_to_r_files"
## [125] "s01_introduction_to_r.html" "s01_introduction_to_r.nb.html"
## [127] "s01_introduction_to_r.Rmd" "S02_01_DataFrames.jpeg"
## [129] "s02_dataframe.Rmd" "S03_01_FunctionalProgramming.jpeg"
## [131] "s03_functional.Rmd" "S04_01_RProgramming.jpeg"
## [133] "S07_01_ProbabilityFunctions.jpeg" "S17_01_ROC.jpeg"
## [135] "S17_02_LikelihoodFunction.jpeg" "S21_01_RandomForest.jpeg"
## [137] "S22_01_Bias-Variance_Tradeoff.jpeg" "session00_logos.png"
## [139] "site_libs" "sitelinks1.jpg"
## [141] "sitelinks2.jpg" "sitelinks3.jpg"
## [143] "sitelinks4.jpg" "skc_kahneman_delfi_2015.jpg"
## [145] "startit_add_dsss2022_800.jpeg" "StartitMeetup2016_InfRetreival.png"
## [147] "StartitMeetup2016.png" "talasAI.png"
## [149] "twitter.png" "undp_machinery2023.jpg"
## [151] "UvodR_startit_banner.jpg" "video.Rmd"
## [153] "VladimirMijatovic175px.jpg" "wd_thinkpad_official.png"
## [155] "webit2023.jpeg" "WikidataCon2019.png"
## [157] "WikidataCon2021.png" "wikimediablog2018.jpg"
Ok, so I do have a _data
directory in my working
directory. Let’s pronounce the _data
directory in R:
rootDir <- getwd()
dataDir <- paste0(rootDir, '/_data/')
analyticsDir <- paste0(rootDir, "/_analytics/")
list.files(dataDir)
## [1] "_listings" "listings_selection.csv" "listings.csv"
## [4] "listings.xlsx"
^^ And here is the listings.csv
file. It represents the
Airbnb summary information and metrics for listings in Amsterdam. We
will use it to practice our data frame skills in R.
We want to read listings.csv
to R and make it a
data.frame
. This is how we do it:
filename <- paste0(dataDir, "listings.csv")
listings <- read.csv(file = filename,
header = TRUE,
check.names = FALSE,
stringsAsFactors = FALSE)
What has just happened:
read.csv
is an R function to read csv
files from disk (or Internet, as we will see later) into the RAM memory
of your machine which is put to R’s availability;file
argument is the complete path to the file in
your local filesystem;header
argument tells R that the first row of the
file contains column names;check.names
argument, set to FALSE
in
this example, tells R not to check whether the column names are
syntacticaly valid R column names for a data frame;stringsAsFactors
argument, set to
FALSE
, is a bit annoying: the read.csv()
default value for this argument is TRUE
and using it that
way would turn any character valued columns into a data type known as
factor
in R, and more often than not that is not what you
want to do; never forget to set stringsASFactors = F
in
read.csv()
if you are not sure that you want all chacter
valued columns automatically converted into factors.You can now inspect listings
in RStudio from the
Environment panel. We will take a look at the structure
of this data frame now: head(someDataFrame, how many rows)
shows us the top how many rows
from the
someDataFrame
data frame; tail()
looks at the
provided number of rows found at the bottom of the data frame:
top_listings <- head(listings, 20)
head(listings, 10)
## id name host_id host_name neighbourhood_group
## 1 2818 Quiet Garden View Room & Super Fast WiFi 3159 Daniel NA
## 2 20168 Studio with private bathroom in the centre 1 59484 Alexander NA
## 3 25428 Lovely apt in City Centre (w.lift) near Jordaan 56142 Joan NA
## 4 27886 Romantic, stylish B&B houseboat in canal district 97647 Flip NA
## 5 28871 Comfortable double room 124245 Edwin NA
## 6 29051 Comfortable single room 124245 Edwin NA
## 7 31080 2-story apartment + rooftop terrace 133488 Nienke NA
## 8 41125 Amsterdam Center Entire Apartment 178515 Fatih NA
## 9 43109 Oasis in the middle of Amsterdam 188098 Aukje NA
## 10 43980 View into park / museum district (long/short stay) 65041 Ym NA
## neighbourhood latitude longitude room_type price minimum_nights
## 1 Oostelijk Havengebied - Indische Buurt 52.36575 4.94142 Private room 59 3
## 2 Centrum-Oost 52.36509 4.89354 Private room 236 1
## 3 Centrum-West 52.37297 4.88339 Entire home/apt 125 14
## 4 Centrum-West 52.38761 4.89188 Private room 135 2
## 5 Centrum-Oost 52.36610 4.88953 Private room 75 2
## 6 Centrum-Oost 52.36607 4.89100 Private room 55 2
## 7 Zuid 52.35132 4.84838 Entire home/apt 219 3
## 8 Centrum-West 52.37891 4.88321 Entire home/apt 160 4
## 9 Centrum-West 52.37537 4.88932 Entire home/apt 211 3
## 10 Zuid 52.35746 4.86124 Entire home/apt 67 30
## number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
## 1 278 2020-02-14 1.95 1 123
## 2 339 2020-04-09 2.58 2 3
## 3 5 2020-02-09 0.14 1 33
## 4 219 2020-07-25 2.01 1 219
## 5 336 2020-09-20 2.68 2 346
## 6 481 2020-08-28 4.05 2 360
## 7 32 2017-10-16 0.28 1 0
## 8 89 2020-02-10 0.73 1 0
## 9 60 2020-03-07 4.31 1 0
## 10 61 2018-02-18 0.50 2 184
tail(listings, 10)
## id name host_id host_name
## 18513 46910585 luxurious 2bdr apartment 378943951 Martin
## 18514 46911127 FELICE house 373125933 Dario
## 18515 46912324 studio In Amsterdam shared bathroom 375398295 Danila
## 18516 46913926 Room for 2 facing private patio - Amsterdam centre 104495359 Sebastian
## 18517 46937483 ♛♝♕ WOW 3 BD APT - Oosterpark Amsterdam ♛♝♕ 379277730 Jean
## 18518 46938808 NiceAmsterdam 379294518 Elizabeth
## 18519 46940774 Fantastic Apartment with beautiful views 279181956 Ramon
## 18520 46944602 cozy apartment in amsterdam 375398295 Danila
## 18521 46953753 Great luxurious apartment in city centre 379498141 Lisa
## 18522 46960026 Classic houseboat in Amsterdam city centre 213284349 Jochum
## neighbourhood_group neighbourhood latitude longitude room_type
## 18513 NA Centrum-Oost 52.37867 4.90730 Entire home/apt
## 18514 NA Centrum-West 52.37706 4.89396 Entire home/apt
## 18515 NA De Pijp - Rivierenbuurt 52.34381 4.90639 Entire home/apt
## 18516 NA Centrum-West 52.37495 4.89378 Private room
## 18517 NA Oud-Oost 52.35830 4.92186 Entire home/apt
## 18518 NA Centrum-Oost 52.36978 4.91443 Entire home/apt
## 18519 NA Centrum-Oost 52.36404 4.87967 Entire home/apt
## 18520 NA De Pijp - Rivierenbuurt 52.34071 4.90854 Entire home/apt
## 18521 NA Centrum-Oost 52.36003 4.89317 Private room
## 18522 NA Oostelijk Havengebied - Indische Buurt 52.37701 4.91797 Private room
## price minimum_nights number_of_reviews last_review reviews_per_month
## 18513 130 3 0 NA
## 18514 46 1 0 NA
## 18515 50 3 0 NA
## 18516 40 1 0 NA
## 18517 275 3 0 NA
## 18518 330 3 0 NA
## 18519 350 4 0 NA
## 18520 60 28 0 NA
## 18521 67 2 0 NA
## 18522 96 2 0 NA
## calculated_host_listings_count availability_365
## 18513 1 347
## 18514 7 135
## 18515 2 163
## 18516 1 13
## 18517 1 350
## 18518 1 362
## 18519 1 21
## 18520 2 346
## 18521 1 179
## 18522 2 89
We use str()
to obtain a description of the data frame -
its columns and the respective data types:
str(listings)
## 'data.frame': 18522 obs. of 16 variables:
## $ id : int 2818 20168 25428 27886 28871 29051 31080 41125 43109 43980 ...
## $ name : chr "Quiet Garden View Room & Super Fast WiFi" "Studio with private bathroom in the centre 1" "Lovely apt in City Centre (w.lift) near Jordaan" "Romantic, stylish B&B houseboat in canal district" ...
## $ host_id : int 3159 59484 56142 97647 124245 124245 133488 178515 188098 65041 ...
## $ host_name : chr "Daniel" "Alexander" "Joan" "Flip" ...
## $ neighbourhood_group : logi NA NA NA NA NA NA ...
## $ neighbourhood : chr "Oostelijk Havengebied - Indische Buurt" "Centrum-Oost" "Centrum-West" "Centrum-West" ...
## $ latitude : num 52.4 52.4 52.4 52.4 52.4 ...
## $ longitude : num 4.94 4.89 4.88 4.89 4.89 ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Private room" ...
## $ price : int 59 236 125 135 75 55 219 160 211 67 ...
## $ minimum_nights : int 3 1 14 2 2 2 3 4 3 30 ...
## $ number_of_reviews : int 278 339 5 219 336 481 32 89 60 61 ...
## $ last_review : chr "2020-02-14" "2020-04-09" "2020-02-09" "2020-07-25" ...
## $ reviews_per_month : num 1.95 2.58 0.14 2.01 2.68 4.05 0.28 0.73 4.31 0.5 ...
## $ calculated_host_listings_count: int 1 2 1 1 2 2 1 1 1 2 ...
## $ availability_365 : int 123 3 33 219 346 360 0 0 0 184 ...
Note. str()
works on lists:
someList <- list(a = 1, b = 2, c = 3)
str(someList)
## List of 3
## $ a: num 1
## $ b: num 2
## $ c: num 3
print(someList)
## $a
## [1] 1
##
## $b
## [1] 2
##
## $c
## [1] 3
names(someList)
## [1] "a" "b" "c"
head()
and tail()
also do lists:
someList <- list(1, 'a', 'Belgrade', 3, 3.14, 909, 'R', TRUE, F, '00')
head(someList, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] "Belgrade"
tail(someList, 3)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] FALSE
##
## [[3]]
## [1] "00"
If we are interested in the column names of the data frame only:
colnames(listings)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
Later we will see how to use colnames()
to set our own
column names on the existing data frame.
In the next step I want to change the listings
data
frame just a bit, by selecting only a few of its columns, and then store
it to disk in the csv
format but using a different file
name than listings.csv
:
listings_selection <- listings[ , 1:3]
head(listings_selection)
## id name host_id
## 1 2818 Quiet Garden View Room & Super Fast WiFi 3159
## 2 20168 Studio with private bathroom in the centre 1 59484
## 3 25428 Lovely apt in City Centre (w.lift) near Jordaan 56142
## 4 27886 Romantic, stylish B&B houseboat in canal district 97647
## 5 28871 Comfortable double room 124245
## 6 29051 Comfortable single room 124245
Similarly as we have used the read.csv()
function to
read a file into a data frame, we use write.csv()
to store
a data frame into a file in our local filesystem:
write.csv(x = listings_selection,
file = paste0(dataDir, 'listings_selection.csv'))
If you want to see what objects do you have instantiated in the
current R session, use ls()
:
ls()
## [1] "a" "analyticsDir" "awesome" "check"
## [5] "cities" "city" "columns" "counter"
## [9] "countries" "dataDir" "dataFramesCheck" "dataSet"
## [13] "emptyList" "filename" "files" "i"
## [17] "index" "iris" "l" "lattitudes"
## [21] "listings" "listings_selection" "ll" "logic"
## [25] "longitudes" "mean_price" "mtcars" "my_list"
## [29] "my_vector" "myArray1" "myArray2" "myCities"
## [33] "myFun" "myList" "myResult" "mySequence"
## [37] "names" "num_cols" "num_files" "num_pairs"
## [41] "num_rows" "numbers" "person" "persons"
## [45] "populations" "rootDir" "s" "search"
## [49] "separateCities" "set" "some_expression" "someList"
## [53] "this_animal" "this_number" "top_listings" "word"
## [57] "word1" "word2" "x" "z"
Some of them we do not need anymore and we want to remove them. Note. In real Data Science practice, most of the time we really need to look carefully after memory usage, because we typically work with large datasets. The datasets that we have in this session are rather small:
dim(listings)
## [1] 18522 16
listings
has 18522 rows and 16 columnes, while
listings_selection
has…
dim(listings_selection)
## [1] 18522 3
We do not need the listings_selection
data frame
anymore, so let’s remove it with rm()
:
rm(listings_selection)
ls()
## [1] "a" "analyticsDir" "awesome" "check" "cities"
## [6] "city" "columns" "counter" "countries" "dataDir"
## [11] "dataFramesCheck" "dataSet" "emptyList" "filename" "files"
## [16] "i" "index" "iris" "l" "lattitudes"
## [21] "listings" "ll" "logic" "longitudes" "mean_price"
## [26] "mtcars" "my_list" "my_vector" "myArray1" "myArray2"
## [31] "myCities" "myFun" "myList" "myResult" "mySequence"
## [36] "names" "num_cols" "num_files" "num_pairs" "num_rows"
## [41] "numbers" "person" "persons" "populations" "rootDir"
## [46] "s" "search" "separateCities" "set" "some_expression"
## [51] "someList" "this_animal" "this_number" "top_listings" "word"
## [56] "word1" "word2" "x" "z"
someList
is really small and we do not care to remove
it.
What if the listings
data frame was stored as a
Microsoft Excel file, with an .xlsx
extension in place of
.csv
? Well, one thing to do would be to first convert it to
csv
outside R, from Excel itself for example. Why would we
do that? Because we tend to be consistent in the way we code and what
procedures and standards do we use in our work: for example, we can
introduce a convention to keep all data as .csv
files in
the scope of some project. But sometimes we don’t and we simply need to
grab a file with a certain extension quickly. Now, base R does not have
a function to read .xlsx
files. But there are R
packages that provide such functions. Whenever we want
to use an R package, we need to install it first. The base R function to
install packages is install.packages()
:
install.packages('readxl')
When we want to use functions from an R package, we need to call the
package by library()
:
library(readxl)
Now, the readxl
package has a function to read Excel
files:
listings <- read_excel(paste0(dataDir, "listings.xlsx"),
col_names = TRUE)
## New names:
## • `` -> `...17`
head(listings)
## # A tibble: 6 × 17
## id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2818 Quie… 3159 Daniel <NA> Oostelijk Ha… 52.36575 4.94 Private … 59
## 2 20168 Stud… 59484 Alexander <NA> Centrum-Oost 52.36509 4.89 Private … 236
## 3 25428 Love… 56142 Joan <NA> Centrum-West 52.37297 4.88 Entire h… 125
## 4 27886 Roma… 97647 Flip <NA> Centrum-West 52.38761 4.89 Private … 135
## 5 28871 Comf… 124245 Edwin <NA> Centrum-Oost 52.3661 4.89 Private … 75
## 6 29051 Comf… 124245 Edwin <NA> Centrum-Oost 52.36607 4.89 Private … 55
## # ℹ 7 more variables: minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## # reviews_per_month <chr>, calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # ...17 <dbl>
Note how I have reused the variable name: listings
. It
was an existing data frame which is now overwritten by the same data
from a different file. Do not forget to use ?
from the R
console to obtain documentation on any new functions that you need to
learn: ?read_excel
, for example.
Note. Do class(listings)
:
class(listings)
## [1] "tbl_df" "tbl" "data.frame"
What is this: tbl_df
, tbl
? In short: the
classes were added to the data.frame
class in the
read_excel()
call and we will start meeting them frequently
once we begin to use the tidyverse
packages like
readr
. Nothing to worry about at this point. Let’s strip
them of the listings
object manually:
listings <- as.data.frame(listings)
class(listings)
## [1] "data.frame"
Ok. Back to the original listings.csv
now:
filename <- paste0(dataDir, 'listings.csv')
listings <- read.csv(file = filename,
header = T,
check.names = F,
stringsAsFactors = F)
Now we need to learn how to subset a data frame, to slice out exactly the data that we are interested in. Data frames can be sliced by conditions set on their rows, columns, and by any combinations of conditions set on rows and columns. For example, if we are interested in only the top five rows of listings, we can do:
listings_5 <- listings[1:5, ]
listings_5
## id name host_id host_name neighbourhood_group
## 1 2818 Quiet Garden View Room & Super Fast WiFi 3159 Daniel NA
## 2 20168 Studio with private bathroom in the centre 1 59484 Alexander NA
## 3 25428 Lovely apt in City Centre (w.lift) near Jordaan 56142 Joan NA
## 4 27886 Romantic, stylish B&B houseboat in canal district 97647 Flip NA
## 5 28871 Comfortable double room 124245 Edwin NA
## neighbourhood latitude longitude room_type price minimum_nights
## 1 Oostelijk Havengebied - Indische Buurt 52.36575 4.94142 Private room 59 3
## 2 Centrum-Oost 52.36509 4.89354 Private room 236 1
## 3 Centrum-West 52.37297 4.88339 Entire home/apt 125 14
## 4 Centrum-West 52.38761 4.89188 Private room 135 2
## 5 Centrum-Oost 52.36610 4.88953 Private room 75 2
## number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
## 1 278 2020-02-14 1.95 1 123
## 2 339 2020-04-09 2.58 2 3
## 3 5 2020-02-09 0.14 1 33
## 4 219 2020-07-25 2.01 1 219
## 5 336 2020-09-20 2.68 2 346
Remember head()
- we can use that too:
listings_5 <- head(listings, 5)
listings_5
## id name host_id host_name neighbourhood_group
## 1 2818 Quiet Garden View Room & Super Fast WiFi 3159 Daniel NA
## 2 20168 Studio with private bathroom in the centre 1 59484 Alexander NA
## 3 25428 Lovely apt in City Centre (w.lift) near Jordaan 56142 Joan NA
## 4 27886 Romantic, stylish B&B houseboat in canal district 97647 Flip NA
## 5 28871 Comfortable double room 124245 Edwin NA
## neighbourhood latitude longitude room_type price minimum_nights
## 1 Oostelijk Havengebied - Indische Buurt 52.36575 4.94142 Private room 59 3
## 2 Centrum-Oost 52.36509 4.89354 Private room 236 1
## 3 Centrum-West 52.37297 4.88339 Entire home/apt 125 14
## 4 Centrum-West 52.38761 4.89188 Private room 135 2
## 5 Centrum-Oost 52.36610 4.88953 Private room 75 2
## number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
## 1 278 2020-02-14 1.95 1 123
## 2 339 2020-04-09 2.58 2 3
## 3 5 2020-02-09 0.14 1 33
## 4 219 2020-07-25 2.01 1 219
## 5 336 2020-09-20 2.68 2 346
Again, the columns of listings
:
colnames(listings)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
And if we want to subset only the rows from 5 to 10 and only the
name
and room_type
columns:
listings[5:10, c('host_name', 'room_type')]
## host_name room_type
## 5 Edwin Private room
## 6 Edwin Private room
## 7 Nienke Entire home/apt
## 8 Fatih Entire home/apt
## 9 Aukje Entire home/apt
## 10 Ym Entire home/apt
Or:
listings[5:10, c(4, 9)]
## host_name room_type
## 5 Edwin Private room
## 6 Edwin Private room
## 7 Nienke Entire home/apt
## 8 Fatih Entire home/apt
## 9 Aukje Entire home/apt
## 10 Ym Entire home/apt
Or:
listings[c(5,6,7,8,9,10), c(4, 9)]
## host_name room_type
## 5 Edwin Private room
## 6 Edwin Private room
## 7 Nienke Entire home/apt
## 8 Fatih Entire home/apt
## 9 Aukje Entire home/apt
## 10 Ym Entire home/apt
Remember that c()
puts things together in R. We have
used two vectors, c(5,6,7,8,9,10)
, which can also be
written as a sequence 5:10
, and c(4,9)
in
which we have used column positions but we could have used column names
as well like in the c('host_name', 'room_type')
example to
subset the listings
data frame.
We can subset a data frame by imposing conditions on rows and/or columns too:
listings[listings$id > 42000 , c(2, 3)]
## name host_id
## 9 Oasis in the middle of Amsterdam 188098
## 10 View into park / museum district (long/short stay) 65041
## 11 Luxury design with canal view 187728
## 12 Cozy loft in central Amsterdam 207342
## 13 Multatuli Luxury Guest Suite in top location 225987
## 14 Luxurous Houseboat-Great Location 227530
## 15 Family Home (No drugs, smoking or parties) 231864
## 16 B & B de 9 Straatjes (city center) 231946
## 17 Amsterdam Aqua 185836
## 18 Spacious studio at the attic of a townhouse 246493
## 19 Nice room near centre with en suite bath 247822
## 20 Luminous central room 260785
## 21 Fully equiped house, PIJP area = great 262846
## 22 Greatly located, cozy atmosphere 263844
## 23 Apartment near Museumplein (centre) 264178
## 24 En Suite Apartment in a monumental canal house 278253
## 25 Charming, beautifully & sunny place with cat Luna 301340
## 26 Very nice appartment ALL of januari 306117
## 27 Luxury Houseboat 306192
## 28 Unique 3 bedroom house in Centre 312863
## 29 Amsterdam - The Pijp Apartment 1A 335166
## 30 Rebel - Private Room 336950
## 31 Yves Klein - Private Room 336950
## 32 Attic room with doublebed available 339333
## 33 Centre Museum Quarter Apt Roof Deck 381900
## 34 Garden Suite Studiohttps://www.airbnb.com/manage-l 399879
## 35 Cottage Room- Completely Private 399879
## 36 Red Studio Palamedes Amsterdam 400896
## 37 Yellow Studio Palamedes Amsterdam 400896
## 38 studio INN, bright and spacious 409579
## 39 TOP LOCATED Canalhouse B&B Jordaan 436145
## 40 The Backroom - Central private appt 186729
## 41 Bright apt in central Amsterdam 450453
## 42 Central Studio apartment in Trendy De Pijp 457263
## 43 Bed&Bike SleepOnNature - free bikes 459028
## 44 Privat room and shared Houseboat "Ramona" centre 494281
## 45 De Baronie B&B in historic centre 510526
## 46 Beautiful and spacious room 195859
## 47 A'dam Centre, Bright and Comfy 249794
## 48 Amsterdam at its best 563208
## 49 all private, city center @ home with art-1 538659
## 50 Large private room for rent Center / De Pijp 642044
## 51 Houseboat with beautiful view 673879
## 52 light modern & cosy apartment 689016
## 53 Bright room in lively Kinkerbuurt 715549
## 54 Apartment 2 minutes from Centrum! 748161
## 55 Modern app. with brand new design bathroom 776296
## 56 Amsterdam at your feet 790547
## 57 SUPERLOCATED Room in CANALHOUSE, Private bath 436145
## 58 Central,Rooftop 360',Zen,Canal-View, Bright,Quiet 806976
## 59 Spacious apartment next to center 815989
## 60 Authentic Amsterdam Experience@Townhouse w/Rooftop 818319
## 61 Pleasant,cosy & crazy apartment with small balcony 818479
## 62 Cosy Apt | IN CITY CENTRE | Close to Artis Zoo 834139
## 63 Cat lover's paradise in Amsterdam's hottest street 58458
## 64 Authentic Amsterdam apartment RAI 856540
## 65 Central, quiet, design apartment 876866
## 66 Metropolitan B&B centre Amsterdam 915591
## 67 Private Loft&Bathroom Ensuite Studio Ctr Amsterda 935742
## 68 The best spot to discover Amsterdam 967796
## 69 ART & DESIGN APARTMENT 976652
## 70 Studio next 2 V´park near Leidsepl. 539840
## 71 TOP location with private terrace in old centre 542540
## 72 Best Area - Large Luxury 3 Bdr Home 999036
## 73 Central yet quiet studio apartment 1011967
## 74 Spacious home w roof garden 887591
## 75 Amsterdam Apartment (waterview) 599238
## 76 LUXURIOUS GUESTHOUSE in CITY CENTER 1107389
## 77 Suite in Canalhouse with Jacuzzi 1135329
## 78 Cosy apartment in West | Quiet street | Netflix 1155595
## 79 Jordaan apt with views of canals and Westerkerk 1205924
## 80 Sonnenberg - Canal side & view - Private & Central 1005087
## 81 LOFT with terrace, super location! 1294098
## 82 Bellavista, central,with canal view 1298735
## 83 Modern studio near central AMS 1314658
## 84 Amazing apartment historical centre 1322517
## 85 The Flying Pancake B&B Amsterdam 1334704
## 86 Amazing apartment in city center 1337537
## 87 bed & bike with free parking \nthe house of finer 1344947
## 88 Caracteristic, family apartment 1350388
## 89 Great CENTRAL apartment De Pijp 1358307
## 90 Charming Studio with Roof Terrace 1376607
## 91 Private, central & luxurious apment 1391689
## 92 Quiet room in sunny apartment on canal 1417625
## 93 Luxury appartm, good neighbourhood 1426693
## 94 Cozy sleeping in popular Amsterdam West 1435990
## 95 Beautiful central & quiet apartment 1436134
## 96 nice apartment in centre amsterdam 1446698
## 97 Your central home in wonderful AMS! 1471195
## 98 5-bedrooms house; central location 1451657
## 99 Cozy and quiet 1 bedroom apartment 1173052
## 100 Center,Canal View, Private Bathroom 1491565
## 101 BandB de Baarsjes Amsterdam 1501422
## 102 Center, Large&Sunny Room,Pvt Bathrm 1491565
## 103 cosy bedroom | garden | near center | breakfast 1247363
## 104 Cozy & large bohemian style room 1511330
## 105 Sunny Apartment along river IJ 1493495
## 106 Volledig huis-apartement 1518360
## 107 Modern apartment + rooftop terrace @city centre 1532319
## 108 Cosy apartment in charming Jordaan 1550321
## 109 Sunny family flat in the Jordaan! 1557454
## 110 Quiet studio in the heart of town with Garden! 276068
## 111 Unique Beautiful Houseboat (ship) in Center 1565319
## 112 Smack dab middle of Amsterdam 1565512
## 113 Canal & Parkview Apartment 1573009
## 114 SunShine, A Sunny Amsterdam Center Special 1141485
## 115 Sailing Ship Amsterdam Centre (1) 1582128
## 116 SunBird, Sunny Apt in Quiet Street 1141485
## 117 SunFlower, A Sunny B&B + 2 Bicycles 1141485
## 118 Luxurious 17C Canal Villa Apt, Low-cost Parking 1570754
## 119 Cozy Central Apartment Amsterdam 1598424
## 120 *historic centre* *bright* *canal view* *jordaan* 1600010
## 121 40 m2 Appartment near Leidseplein 1023895
## 122 Ground floor apartment with patio in De Pijp 1613017
## 123 "The Roosevelt Apt" Amsterdam RAI 1616799
## 124 Elegant Appartement Central Location. 1640702
## 125 Familyhome easy airport connection! 1642815
## 126 beautiful designed ap.+bikes+garden 1672823
## 127 Lovely FAMILY house near Vondelpark 1687595
## 128 Stay at 'Perfect appartement in the center' by the 1730202
## 129 Amstel Nest - an urban retreat for two 1740785
## 130 SPACIOUS historical Amsterdam apartment 120m2 1751751
## 131 Luxe boat AMSTERDAM IJBURG!!! 1775032
## 132 Amsterdam appartment + southgarden 1811552
## 133 ★ STYLISH FAMILY HOME - CITY CENTER 1811745
## 134 ★ Stunning City Centre Apartment #3 1826139
## 135 Bright, newly renovated apartment in Center-East 1831568
## 136 PEACEFUL HOME IN HISTORIC CENTRE, NEXT TO AMSTEL 1853421
## 137 Charming, bright and quiet home center Amsterdam 1855417
## 138 Freedom on the water! 1901477
## 139 Private Studio Houseboat *BEST VIEW* @ Amstel 1907015
## 140 Central Loft, Great Rooftop Terrace 1927004
## 141 Lovely apt, Amsterdam /Jordaan. 1939619
## 142 Room in Jordaan Heart of Amsterdam 1979865
## 143 Comfortable design apartment with roof terrace 2008330
## 144 Beautiful spacious 130m family home 325306
## 145 Rooftop Terrace Luxury Loft Centre Amsterdam 2035506
## 146 Beautiful apartment in AMS! 2048100
## 147 Charming house, perfect for two! 2063261
## 148 Cent Adam Lux (2-4p) Apt 90 sqm(m2) 2063612
## 149 Sunny Amsterdam East Harbor View 1336633
## 150 Beautiful 19th Cent. City Apartment 2078292
## 151 ★ City Centre Loft #4 1826139
## 152 Sunny apartment with cat 1939922
## 153 Beautiful apartment @ true center 2144360
## 154 COOL DESIGN HOUSE with WiFi 2174719
## 155 Bright apt in central Amsterdam 2181530
## 156 Charming houseboat near Olympic St. 2191136
## 157 Family house in centre of Amsterdam 2193861
## 158 Fantastic apartment right in the city centre 2059471
## 159 cosy room with private entrance 2202945
## 160 Three colourful art rooms in trendy Pijp 1543378
## 161 creative ground floor house close to Vondelpark 2220492
## 162 Houseboat 'Saturnus' Amsterdam 2238440
## 163 FABULOUS MODERN LOFT IN JORDAAN 2243649
## 164 Superb, quiet home with garden! 81046
## 165 Great apartment near the centre 1997447
## 166 Central 17th Century Unique House 2297302
## 167 GROUNDFLOOR IN WESTERPARK + WIFI and a comfy bed 860750
## 168 Luxurious apartment @ Dam Square 2313040
## 169 Amsterdam-Beatiful entire apartment 2352274
## 170 Cosy apartment along the Amstel. 2378237
## 171 Live the life of A'dam on water! 1382273
## 172 Spacious studio apartment in A'dam 2395278
## 173 Modern 3 bedroom apt. + garden near Jordaan 2396890
## 174 Central quiet ap'ment with parcview 2418579
## 175 Family apartment 80m2 near Vondelpark 2397045
## 176 Light filled, split level, cosy and clean home 2480479
## 177 Quiet & spacious apt, near Museums & Concertgebouw 2483843
## 178 Centre Amsterdam perfect for family 1645998
## 179 Stylish Design Apartment 2492723
## 180 Light studio in centre of Amsterdam 2493425
## 181 ★ Combo City Centre Apartments 1826139
## 182 Typical Amsterdam apt. in center 2536209
## 183 Familyfriendly Suite incl breakfast 2536879
## 184 Apartment near centre with amazing view over city 2541099
## 185 4 bedroom apartment near the park 2041793
## 186 Bright Canal house with amazing view! @the Jordaan 2550339
## 187 Beautiful grand appartment (135 m2) 2569931
## 188 BEAUTIFUL 3 BEDROOM 2 BALCONY • HEART OF AMSTERDAM 2589433
## 189 Somewhere Else 2594559
## 190 Comfortable and pleasant apartment 2639261
## 191 Beautiful ground floor apartment 356740
## 192 Family Friendly Spacious Appartment 2674073
## 193 PLAYFUL SPLIT-LEVEL LOFT+GARDEN 2303229
## 194 Luxury apartment + private garden 2677807
## 195 3 bedrooms Apartment+Garden 2699033
## 196 Charming canal apartment (75m2) 2707013
## 197 Cosy Campervan Amsterdam 711896
## 198 Near city center, green area, WIFI 2460631
## 199 * Live like a King @ Prinsengracht 1600010
## 200 B&B WestergasTverblijf , Cosy! 618589
## 201 Lovely house close to the Jordaan! 2727759
## 202 Amsterdam - conveniently located 1829820
## 203 peaceful stay near city centre 2737834
## 204 Spacious duplex apartment + terrace near centre 2761838
## 205 Beautiful apartment + Lovely garden. 2778131
## 206 Bright central studio with terrace 2794849
## 207 Pretty apt in the center of A'dam 2797260
## 208 Super nice place in Amsterdam!! 2806790
## 209 Southern Comfort near Vondelpark 2439449
## 210 Spacious, 2 bedrooms & garden close to the centre! 21800150
## 211 Monumental Amsterdam Canalhouse 2809841
## 212 AMS CENTRAL FAMILY HOME with huge garden. 2815910
## 213 Loft right in the city centre 2830952
## 214 Appartment in centre of Amsterdam. 2831228
## 215 Private Studio on Spectacular Houseboat 2851886
## 216 Cosy apartment with canal view! 2859023
## 217 Comfy house with garden in De Pijp 1544456
## 218 Sunny & central located apartment near Albert Cuyp 2865047
## 219 Quiet large park side apartment 2865851
## 220 Bright & spacious design apt. in quiet area 405625
## 221 Jacky's place 2866493
## 222 Amsterdam Family Home 2868026
## 223 Warm Family Canal House in the Center 199266
## 224 Large 3 bedroom apartment surrounde 1759884
## 225 Casa Bacco @ Amsterdam city center 1501059
## 226 Sweet ap. with garden and 2 bedrooms... 2901110
## 227 Great 140 m2 app. lovely canalview. 2717366
## 228 Lovely apartment near city centre 2906309
## 229 Romantic Refuge 2906532
## 230 Our Cozy Third floor Home 2912339
## 231 amazing apartment in Amsterdam!! 2913467
## 232 Beautiful apartment near Vondelpark 2914846
## 233 CASA LOVE @ De Pijp 2655785
## 234 House on waterfront - free parking 2580855
## 235 Cozy room in Amsterdam's Center 2408429
## 236 OvernightWorkingPlaceConnectPeopleDeHallenA'Dam 2923773
## 237 Lovely apartment on TOP location 2871004
## 238 Beautiful Family appartement 2977834
## 239 Modern Houseboat | Canal Views! | JORDAAN AREA 3007402
## 240 Sleep at Amy's B'n'B Front Room 3015785
## 241 Huge luxurious (2bedr.) apt in vibrant 'oud west'! 2485229
## 242 Your own cottage in the countryside 3040748
## 243 Artist's flat overlooking park 749611
## 244 Light, spacious apartment in trendy Westerpark 2782084
## 245 Beautiful and spacious apartment with big rooftop 2695123
## 246 Specious family residence near the city centre. 2980639
## 247 Big apartment near Vondelpark 3093502
## 248 Beautifull apartment at Westerpark 3098348
## 249 Beautiful Apartment Amsterdam 3111814
## 250 Sleep at Amy's B'n'B Garden Room 3015785
## 251 House boat near museums in the Pijp 3113556
## 252 Luxury apartment with nice garden 2504685
## 253 Canal apartment on Singelgracht 3116801
## 254 Cntr. Amsterdam - roof top, 1 bedrm 6522250
## 255 Casa Don Mateo 3146203
## 256 'Westerpark Sanctuary', Office-Apt. across W-Park. 3078488
## 257 Beautiful, light apartment with huge garden! 2922757
## 258 Cozy, quiet home near Vondelpark 3181829
## 259 Spacious duplex with roof terrasse 3192579
## 260 Beautiful apartment in Amsterdam 3195385
## 261 Apartment with garden + 2 bikes 3195444
## 262 Adam Canal Center- /long stay ready. 3207136
## 263 Cute Condo in heart of Amsterdam 3209034
## 264 Proper & clean apartment, nice neighborhood 3209315
## 265 Sunny Apartm canal + roof terrace 3254029
## 266 Cozy apartment centre first canal 3255626
## 267 Most popular neighbourhood of A'Dam 2841990
## 268 Lovely apartment in dynamic Pijp for rent! 3280080
## 269 Sunny apartment with canal view 2369370
## 270 My perfect spot next to the Jordaan 3266011
## 271 VeryBig luxurious family house A"dam free parking 1634910
## 272 Characteristic canalhouse near Leidse square! 3287406
## 273 A-location, ensuite bathroom, centre - Milkmaid - 3278419
## 274 A-location ensuite bathroom centre 'Pearl Earring' 3278419
## 275 Style, History & Design - 2 Bedrm 3305102
## 276 Ground floor 3317069
## 277 Large 4 floor art-filled family-friendly townhouse 3336575
## 278 120m2 luxurious house with a garden 3350109
## 279 Boutique luxe apt. at PRINSENGRACHT 3351752
## 280 Design Waterfront flat near station 1406783
## 281 Artists house with two gardens. 3361425
## 282 Lovely, Spacious appartment! 3367431
## 283 Green, sportive experience in A'dam 3375589
## 284 Apartment in centre of Amsterdam 3375808
## 285 Cozy, unique apartment in Jordaan 2467597
## 286 House in city centre, sunny garden 3080120
## 287 Beautiful & Artistic APT | CITY CENTRE! 1609969
## 288 Long term rent: Spacious 4 room apartment 3418459
## 289 Lovely a'dam, groundfloor, garden 3433441
## 290 Central A'dam 4p 2bedr 3441165
## 291 Peaceful and Central at home 3451776
## 292 Penthouse with terrace in centrum 3460384
## 293 Walking Distance Waterlooplein 3440799
## 294 Nice Room near to RAI and downtown 2309485
## 295 Room privacy near centre Amsterdam 3498674
## 296 Cosy, quiet apartment near Vondelpark 3499336
## 297 City beach apartment 2272095
## 298 SPACIOUS LOFT in 17C CANAL MANSION 3529778
## 299 Luxury canal-apartment Amsterdam 3281295
## 300 2 rooms in beautiful appartment near Vondelpark 3548522
## 301 Quiet room in Amsterdam Center 3556421
## 302 Green oasis only 15 min from Station. 3564543
## 303 Cosy and stylish flat in Oud-West for cat lovers 2302855
## 304 Friendly ap. near centre Amsterdam 3557630
## 305 Museum Suite, near Leidse Square 3595270
## 306 Lovely apartment South A'dam (RAI) 3586216
## 307 Comfortable apartment near city centre 3505991
## 308 Cozy and light in Bos en Lommer, 96 m² 3607736
## 309 Oud West Vondelpark Centre Jordaan 3123495
## 310 Appartment 5 min from Vondelpark & Roofterrace!!! 3051231
## 311 luxe apartment in amsterdam city 3397629
## 312 Amsterdam Sweet Unit Museumplein 3648704
## 313 Unique spacious family home in the Jordaan 3517912
## 314 Independent Studio in 'De Pijp' 3644216
## 315 Cozy Spacious Clean Stylish Quiet in Lively Pijp 3656864
## 316 Light apartment in Old-West nearby the Vondelpark 3672783
## 317 Warm & Romantic apt in De Jordaan 3673389
## 318 Cozy houseboat near Vondelpark! 3678493
## 319 Sunny, Elegant, Spacious apartment in the Centre 3681248
## 320 Loft style home nearby city centre 3689867
## 321 Cozy app. @ Vondelpark (Oud-Zuid) 1725014
## 322 Canal house right in the Jordaan (center) 3713242
## 323 Sunny apt great location free bikes 3723928
## 324 Spacious and cozy apartment close to Westerpark 3727363
## 325 Luxury design appt - Centre De Pijp 3744541
## 326 Beautiful loft in center Amsterdam 3775786
## 327 Superb loft apartment in de Pijp! 1902079
## 328 Fantastic studio Amsterdam Centr 3781782
## 329 Great location modern apartment 3788198
## 330 Suite 2 Stay - studio Beautiful Barok - canal area 3788673
## 331 Bed & Breakfast in Oud West Area 3267693
## 332 Cosy apartment in upcoming Oost! 2460974
## 333 Bright Design Apt | Artful Stay 965667
## 334 FAMILY HOME WITH SUNNY PRIVATE GARDEN IN CENTER 3823268
## 335 Roof Terrace Apartm w. parking 232321
## 336 Fantastic apartment in Amsterdam's best area 3872261
## 337 Private Room at our LoveNest in Amsterdam 3846003
## 338 Love Nest for rent in the center!!! 3846003
## 339 Modern stylish villa: My Amsterdam Houseboat 3877342
## 340 Cosy/Creative/Canal Apt for 2/3 p 3880042
## 341 Sailing ship Amsterdam center 1582128
## 342 Best Deal, City Center!! 3888452
## 343 Kiki's Loft Apartment 3899684
## 344 Kiki's Garden View Apartment 01 3899684
## 345 Kiki's Garden View Apartment 02 3899684
## 346 Kiki's Penthouse Apartment 3899684
## 347 Private luxury B&B near Amstel 3518386
## 348 beautiful apartment in canal house 3713242
## 349 Spacious apt in great neighbourhood 3482922
## 350 nice cozy room near Vondelpark 3548522
## 351 Spacious Sunny Apartment - 4p 3966326
## 352 Beautiful apartment with big garden 3970623
## 353 Canal house - Heart of Amsterdam 3970753
## 354 A special spot near the Jordaan ideal for two 3975477
## 355 Bright & Cosy Canal View Apartment 17507481
## 356 Central apartment 50m2, Old West Jordaan. 1387938
## 357 Luxurious, Spacious and light 130m2 3997123
## 358 Spacious delightful Canalhouse 4000086
## 359 beautiful spacious appartment near Vondelpark 3548522
## 360 Spacious design apt. | Centre A'dam 4002861
## 361 Nice Apartment 50m2, East Side! 3661217
## 362 Private guest apartment on a shared boat 4009374
## 363 40% discount A'dam Apartment - Clean Quiet Safe 4013546
## 364 Great room in Amsterdam near Centre 3914844
## 365 LOFT at Canal : special fall-break & ADE price! 3678216
## 366 Room, brandnew luxurious houseboat 3877342
## 367 Fantastic 100m2 apartment. Top loca 4063638
## 368 Lovely Amsterdam Houseboat 4068486
## 369 City Centre Room with En-Suite Shower and Terrace 2313061
## 370 Nice and Cozy Room Offered in Amsterdam 195859
## 371 Charming monument in the Jordaan 4075972
## 372 Jordaan, the Heart of Amsterdam 2924870
## 373 the Boathouse bed&bike A'dam centre 4091515
## 374 Amsterdam Studio Garden Museumplein 3648704
## 375 Spacious, bright and comfortable 4106698
## 376 Sleep on board of our ship in North Amsterdam 3357851
## 377 Lovely apartment near city centre 4109239
## 378 Beautiful apartment in city centre 4117326
## 379 Design apartment nearby the Amstel 4117053
## 380 9-STREETS apartment in A'dam Wolvenstraat 4120466
## 381 Apartment for FAMILIES in city center | 4P 4125951
## 382 Lovely Cozy Place in De Baarsjes! 3161745
## 383 Great apartment Oud West close to foodhallen*** 4074819
## 384 Amsterdam Houseboat 'Centre' 2586026
## 385 Cozy design apartment in Amsterdam! 4150161
## 386 apartment near center Amsterdam 711884
## 387 Groundfloor Lofty Apt | Artful Stay 965667
## 388 Gorgeous Apartment near Vondelpark 3042726
## 389 Spacious apartment in trendy Baarsjes (2 - 4 pers) 4180511
## 390 Knus en gezellig appartement 3826985
## 391 Romantic Houseboat in the center 4184910
## 392 NEW! cozy and private Amsterdam studio for two 4188143
## 393 PERFECT LOCATION FOR TNW 4203731
## 394 Luxurious apartment 72m2 in Pijp 4207199
## 395 mediterranean ambiance appartment 4215174
## 396 Charlie + Coco's Apartment 1657181
## 397 Sleeping in a unique ship in the center of A'dam! 4221960
## 398 Stylish Loft with a Balcony - Couples retreat 3742898
## 399 Boss 336950
## 400 Spacious, modern loft in historic city centre 3449505
## 401 The Best 2-Floor CANAL Apartment! 4291496
## 402 Modern apt with canal and park view 4294992
## 403 Authentic houseboat with privacy and comfort 4296919
## 404 La Vie En Rose B&B Amsterdam Center large Suite 2 3788898
## 405 CANAL APTM. WITH GARDEN IN CENTRE 4297337
## 406 Cosy apartment next to Dam Square 673543
## 407 Stylish apartment for 4 persons with large garden 4323295
## 408 Very charming home and lush garden! 4346522
## 409 Apt & Garden in the Pijp near RAI 4312792
## 410 Spacious, bright apartment near RAI 4363352
## 411 180m2 Loft in former Monastry with garden @Pijp 4364112
## 412 B&B Mondriaan, bright and modern studio 4364462
## 413 Spacious Apt in the Rivierenbuurt 4119336
## 414 Occasional availability Centre Apt 3054367
## 415 Very cosy apt at top location 551716
## 416 Spacious apartment 60m2 for couple! 3557914
## 417 Spacious apartment! Great location! 437989
## 418 Large charming house A'dam centre 4419885
## 419 Lovely apartment @Vondelpark for families 4663170
## 420 Bright " feel good" apartment 4426149
## 421 Central design apt in pretty street 4438302
## 422 HOUSEBOAT NOVA 80m2 + FREE BIKES 4443833
## 423 Perfect location bright cosy appt! 4398387
## 424 Lovely top 2 floors. Canal view. 3 bedr. 2 bathr. 1347048
## 425 Fancy apartment heart of Amsterdam 4520552
## 426 Apartment in city centre (Jordaan) 4520718
## 427 Beautiful apartment in city centre 2620478
## 428 Sunny apartment Amsterdam Zuid behind Museumsquare 4543444
## 429 luxurious double room apartment 4552493
## 430 Spacious apartment in 'Pijp' area 1118702
## 431 *Comfortable apartment close to city center* 1314991
## 432 Bright apt in center with canal view and king bed! 4592706
## 433 The Weavery Boutique Bed&Breakfast2 4614541
## 434 Private, quiet room in old center 4615516
## 435 The Room Upstairs - Free parking 4615590
## 436 5 BR apartment / Quiet / Garden 4585922
## 437 The Weavery Boutique Bed&Breakfast1 4614541
## 438 Spacious room with private bathroom 3332206
## 439 Big room (& big bathroom) in canalhouse 4663978
## 440 ArtPark B&B 4682226
## 441 Luxurious and spacious mobile home 4698057
## 442 Art apartment Amsterdam 4682226
## 443 Lux 2-bed aprt + 2 bath + 2 terrace 4724613
## 444 Apartment near city centre + canals 4579810
## 445 Unique view on the Herengracht 4735700
## 446 Bright & spacious apt central A'dam 450453
## 447 Room in modern apartment with rooftop terrace 1608775
## 448 Spacious 2Bed apt in REDLIGHT district 4774482
## 449 Comfortabel 2p. appt. in de Pijp 4779243
## 450 Trendy apartment in Oud-West 4817958
## 451 A cosy Bed and Breakfast in A'dam. 488009
## 452 Private getaway Citycentre Adam 1903103
## 453 Big central Jordaan apt + rooftop! 4917532
## 454 Urban Villa, bright and spacious! 4922720
## 455 Luxury apartment near RAI (in Pijp) 4927986
## 456 Cute apartment with big garden 589967
## 457 Spacious family house in city centre 4937229
## 458 The Guestroom in Amsterdam 4947559
## 459 2 KM from Anne Frank house /Jordaan / Canals 3112142
## 460 Fully equipped apartment Amsterdam + Roofterrace 4959941
## 461 House with garden in quite oud-zuid 4965029
## 462 Spacious studio - Best location 1917728
## 463 Charming topfloor flat @ Vondelpark 1176934
## 464 Design apt 10 min from Dam Square 2570850
## 465 Houseboat: Our little paradise in Amsterdam 4281750
## 466 Spacious & central appt + private roof terrace! 5022352
## 467 Authentic modern garden family apt. 120m2 1422043
## 468 LOTUS, Classic Dutch Saling Barge 5037506
## 469 The A'dam experience! Spacious Loft 5062046
## 470 2 private floors near Rijksmuseum 5069954
## 471 Cozy apartment + bikes Westerpark 5073186
## 472 Child friendly classic houseboat 5076226
## 473 H&H's place - Light and colorful 4280207
## 474 Homy apartment 3775031
## 475 Private apartment near lively neighborhood Jordaan 5102632
## 476 Spacious room in quiet neighborhood 5134833
## 477 Comfortable family home with garden 5146783
## 478 Great house,Perfect place, S-Garden 5162447
## 479 Comfortable apartment in eco-house 5163484
## 480 Canalhouse in city center 5173586
## 481 Stylish, clean studio, high service 5183517
## 482 Private studio 25m2 for two near A'dam Arena 5183517
## 483 Canal House 3 floors 4663978
## 484 Modern, spacious designer house 4794187
## 485 central family home 120sm+garden child friendly! 2049167
## 486 All private, city center, art @home-2 538659
## 487 Space, glass floors & roof terrace 2633774
## 488 Living in a houseboat on the water 5297940
## 489 Private Garden room 4073717
## 490 Lovely Room in nice apartment 3879155
## 491 Clean, spacious, 2 bathroom LOFT 3971476
## 492 Nice Room in Centre Amsterdam 5417945
## 493 Travel and sleep like a local 75m2 3205205
## 494 Waterfront B&B 4415881
## 495 Lovely Canal Apartment Leidseplein 5472530
## 496 knus huisje met tuintje 5476741
## 497 Very large & new apartment! 5488484
## 498 The Canal House 5489703
## 499 Nice,new app. nearby Center/Jordaan 5527871
## 500 Beautiful apartment in centre 5530981
## 501 Fun apt @ the corner of the Jordan 5536020
## 502 Luxury BedonBoard on Beautiful Houseboat Amsterdam 3523553
## 503 A'dam West - room with double bed 4443157
## 504 Light Apartment with garden 5545294
## 505 Central, comfortable, bath and great kitchen! 5558015
## 506 Rustig gelegen zonnig appartement voor een familie 5039682
## 507 Canal app. 80m² Museum+shop area 4473576
## 508 Cosy private studio on houseboat in the centre!* 125667
## [ reached 'max' / getOption("max.print") -- omitted 18014 rows ]
Shall we set a condition on column names perhaps?
grepl()
to perform regex
matchlistings[1:10, grepl("^number", colnames(listings))]
## [1] 278 339 5 219 336 481 32 89 60 61
We already now that colnames(listings)
will return a
character vector encompassing all column names from
listings
. The grepl()
function operates on
characters. It’s task is to check if the regular
expression (regex) described by its first argument
(^number
in our example) matches any character sequence
found in its second argument (colnames(listings)
in our
example). The regular expression ^number
says search for
anything that begins with number
in the given string, so
^
is the character (precisely: a metacharacter) in the
regex syntax that stands for the beginning of the string. Similarly,
$
is a metacharacter that stands for the empty character at
the end of the string. Let’se see what grepl()
does:
string <- 'the quick brown fox jumps over the lazy dog'
grepl('^t', string)
## [1] TRUE
^^ Asks if string
begins with 'T'
.
string <- 'the quick brown fox jumps over the lazy dog'
grepl('g$', string)
## [1] TRUE
^^ Asks if string
ends with 'g'
.
string <- 'the quick brown fox jumps over the lazy dog'
grepl('x$', string)
## [1] FALSE
^^ Asks if string
ends with 'x'
.
strings <- c('the quick brown fox jumps over the lazy dog',
'Inland Empire',
'Wild at Heart')
grepl('e$', strings)
## [1] FALSE TRUE FALSE
We will learn more about regex
later in this course. But
the previous example illustrates something more than the usage of
grepl()
to check for character sequences in R. Pay
attention, please: we have defined a new character vector,
strings
, with three elements:
'the quick brown fox jumps over the lazy dog'
,
'Inland Empire'
, and 'Wild at Heart'
. We have
called grepl()
like this: grepl('e$', strings)
to ask if any of the strings in strings
matches the e$
regex (and e$
asks: does it end
with 'e'
?). R responded by a vector of logicals:
FALSE TRUE FALSE
, and the length of the output vector is 3
- exactly as the length of the input string strings
. In
other words, grepl()
is a vectorized
function: it can be applied to a vector of elements, and will
compute what it does on each element, pack its results back in another
vector and serve them in that form! Many R functions
are vectorized, and this is one the most powerful features of this
beautiful programming language. We will also learn much more about
vector programming and code vectorization with R in our future
sessions.
A glimpse of vectorization only, the essential feature of R - which is a member of the class of vector languages or vector programming languages:
someVector <- c(1, 2, 3, 4, 5)
someVector + 10
## [1] 11 12 13 14 15
someVector^2
## [1] 1 4 9 16 25
someVector %% 2 == 0
## [1] FALSE TRUE FALSE TRUE FALSE
read.csv()
from the InternetOh, one more thing. Remember that listings
live on the
Internet: here.
If you browse to that Inside Airbnb page and copy (right click!) the
listings.csv
link location -
http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2020-12-12/visualisations/listings.csv
in the time of writing of this Notebook - you can obtain it from
read.csv()
in R like this:
urlData <- 'https://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2024-09-05/visualisations/listings.csv'
listingsOnline <- read.csv(URLencode(urlData),
header = T,
check.names = F,
stringsAsFactors = F)
head(listingsOnline)
## id name host_id host_name neighbourhood_group
## 1 2818 Quiet Garden View Room & Super Fast Wi-Fi 3159 Daniel NA
## 2 27886 Romantic, stylish B&B houseboat in canal district 97647 Flip NA
## 3 28871 Comfortable double room 124245 Edwin NA
## 4 29051 Comfortable single room 124245 Edwin NA
## 5 44391 Quiet 2-bedroom Amsterdam city centre apartment 194779 Jan NA
## 6 48373 Cozy family home in Amsterdam South 220434 Vesna & Misha NA
## neighbourhood latitude longitude room_type price minimum_nights
## 1 Oostelijk Havengebied - Indische Buurt 52.36435 4.94358 Private room 83 3
## 2 Centrum-West 52.38761 4.89188 Private room 155 3
## 3 Centrum-West 52.36775 4.89092 Private room 79 2
## 4 Centrum-Oost 52.36584 4.89111 Private room 59 2
## 5 Centrum-Oost 52.37168 4.91471 Entire home/apt NA 3
## 6 Buitenveldert - Zuidas 52.32781 4.87680 Entire home/apt NA 3
## number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
## 1 389 2024-08-31 2.07 1 105
## 2 279 2024-09-03 1.81 1 193
## 3 638 2024-09-03 3.73 2 74
## 4 761 2024-09-04 4.64 2 91
## 5 42 2022-08-20 0.25 1 0
## 6 5 2024-04-28 0.36 1 0
## number_of_reviews_ltm license
## 1 35 0363 5F3A 5684 6750 D14D
## 2 18 0363 974D 4986 7411 88D8
## 3 100 0363 607B EA74 0BD8 2F6F
## 4 84 0363 607B EA74 0BD8 2F6F
## 5 0 0363 E76E F06A C1DD 172C
## 6 4 0363 4A2B A6AD 0196 F684
The URLencode()
function takes care of the
percent-encoding of characters in the URLs. Never forget to use it when
you need an online file. There are better solutions to this than the
base R URLencode()
function (see: urltools
package), but the base solution will do nicely as well - or at least in
the beginning of your work in Data Science.
Now we have a copy of listings
…
rm(listingsOnline)
Only our second session and we can already read data from the local filesystem, Microsoft Excel, and the Internet!
Ok, back to data.frame
. Here are some
principles of data frame subsetting in R:
listings
, some other data frames +
basic visualizations w. {ggplot2}It is time: star doing analytics with listings
!
We first provide a concise overview of what is found in this dataset. Let’s see:
str(listings)
## 'data.frame': 18522 obs. of 16 variables:
## $ id : int 2818 20168 25428 27886 28871 29051 31080 41125 43109 43980 ...
## $ name : chr "Quiet Garden View Room & Super Fast WiFi" "Studio with private bathroom in the centre 1" "Lovely apt in City Centre (w.lift) near Jordaan" "Romantic, stylish B&B houseboat in canal district" ...
## $ host_id : int 3159 59484 56142 97647 124245 124245 133488 178515 188098 65041 ...
## $ host_name : chr "Daniel" "Alexander" "Joan" "Flip" ...
## $ neighbourhood_group : logi NA NA NA NA NA NA ...
## $ neighbourhood : chr "Oostelijk Havengebied - Indische Buurt" "Centrum-Oost" "Centrum-West" "Centrum-West" ...
## $ latitude : num 52.4 52.4 52.4 52.4 52.4 ...
## $ longitude : num 4.94 4.89 4.88 4.89 4.89 ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Private room" ...
## $ price : int 59 236 125 135 75 55 219 160 211 67 ...
## $ minimum_nights : int 3 1 14 2 2 2 3 4 3 30 ...
## $ number_of_reviews : int 278 339 5 219 336 481 32 89 60 61 ...
## $ last_review : chr "2020-02-14" "2020-04-09" "2020-02-09" "2020-07-25" ...
## $ reviews_per_month : num 1.95 2.58 0.14 2.01 2.68 4.05 0.28 0.73 4.31 0.5 ...
## $ calculated_host_listings_count: int 1 2 1 1 2 2 1 1 1 2 ...
## $ availability_365 : int 123 3 33 219 346 360 0 0 0 184 ...
So,
id
is just some id of the listing,name
is (I guess, it’s Airbnb’s dataset) the title of
the listing as it was advertised,host_id
is, obviously, the host id,host_name
is also self-explanatory,neighbourhood_group
has a lot of NA
values, we will learn about NA
soon,neighbourhood
seems to represent a particular
neighbourhood of Amsterdam,latitude
and longitude
are
self-explanatory,room_type
- the room type,price
- we do not know the units, say EUR,minimun_nights
- the minimum nights for a stay in this
property,number_of_reviews
- how many reviews did a particular
listing receive,last_review
- the timestamp of the latest review for
this listing, YYYY-MM-DD
format,last_review
- how many reviews were received for this
listing,reviews_per_month
- how many reviews per month, we do
not know the time frame across which was this measure aggregated,calculated_host_listings_count
- I have no idea what
this is, we will do some research on this later, and finallyavailability_365
- how many days in a years is this
available.listings
!Ok. First, I would like to learn more about the
calculated_host_listings_count
column, which I did not
understand immediately. I have an intuition about it: it could
be the number of different listings with the same
host_id
/host_name
.
Step 0: reload listings.csv
filename <- paste0(dataDir, 'listings.csv')
listings <- read.csv(file = filename,
header = T,
check.names = F,
stringsAsFactors = F)
str(listings)
## 'data.frame': 18522 obs. of 16 variables:
## $ id : int 2818 20168 25428 27886 28871 29051 31080 41125 43109 43980 ...
## $ name : chr "Quiet Garden View Room & Super Fast WiFi" "Studio with private bathroom in the centre 1" "Lovely apt in City Centre (w.lift) near Jordaan" "Romantic, stylish B&B houseboat in canal district" ...
## $ host_id : int 3159 59484 56142 97647 124245 124245 133488 178515 188098 65041 ...
## $ host_name : chr "Daniel" "Alexander" "Joan" "Flip" ...
## $ neighbourhood_group : logi NA NA NA NA NA NA ...
## $ neighbourhood : chr "Oostelijk Havengebied - Indische Buurt" "Centrum-Oost" "Centrum-West" "Centrum-West" ...
## $ latitude : num 52.4 52.4 52.4 52.4 52.4 ...
## $ longitude : num 4.94 4.89 4.88 4.89 4.89 ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Private room" ...
## $ price : int 59 236 125 135 75 55 219 160 211 67 ...
## $ minimum_nights : int 3 1 14 2 2 2 3 4 3 30 ...
## $ number_of_reviews : int 278 339 5 219 336 481 32 89 60 61 ...
## $ last_review : chr "2020-02-14" "2020-04-09" "2020-02-09" "2020-07-25" ...
## $ reviews_per_month : num 1.95 2.58 0.14 2.01 2.68 4.05 0.28 0.73 4.31 0.5 ...
## $ calculated_host_listings_count: int 1 2 1 1 2 2 1 1 1 2 ...
## $ availability_365 : int 123 3 33 219 346 360 0 0 0 184 ...
Step 1: ask R how many unique host_id
values there are in the dataset:
num_hosts <- length(unique(listings$host_id))
num_hosts
## [1] 16033
The unique()
function: you will be using it every now
and then. It’s easy: if vec <- c(1, 2, 3, 5, 5, 7)
is a
vector, unique(vec)
is c(1, 2, 3, 5, 7)
,
look:
vec <- c(1, 2, 3, 5, 5, 7)
unique(vec)
## [1] 1 2 3 5 7
Step 2: ask R to count how many different listings there are per
unique host_id
.
num_host_listings <- table(listings$host_id)
head(num_host_listings, 50)
##
## 3159 3592 7924 12085 30390 47517 49851 56142 58458 59484 61977 62341 62658 65041
## 1 1 1 1 1 1 1 1 1 2 1 1 1 2
## 72890 77950 81046 92194 92253 96492 97647 98297 98647 98844 109257 111635 113034 124245
## 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## 125667 126790 127938 133488 142145 149649 158271 166264 169567 178515 178521 179452 185619 185836
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## 186729 187728 188073 188098 190897 194523 195126 195537
## 1 1 1 1 1 1 1 1
The table()
function is your tool to obtain the
frequency distribution of a variable in R. It’s easy: if
vec <- c(1, 2, 3, 5, 5, 7, 7, 7)
is a vector,
table(vec)
is:
vec <- c(1, 2, 3, 5, 5, 7, 7, 7)
table(vec)
## vec
## 1 2 3 5 7
## 1 1 1 2 3
What is the class()
of the output of
table()
?
class(table(vec))
## [1] "table"
Yes, R has a plenty of specific types, and sometimes - and maybe most of the time - it is handy to turn them all into data frames:
freqHosts <- as.data.frame(table(listings$host_id),
stringsAsFactors = FALSE)
head(freqHosts)
## Var1 Freq
## 1 3159 1
## 2 3592 1
## 3 7924 1
## 4 12085 1
## 5 30390 1
## 6 47517 1
The Var1
represents the host_id
from the
listings
data frame, while Freq
is its
frequency: how many times does the respective value of Var1
appear in the listings
data.frame?
One check:
dim(freqHosts)[1] == num_hosts
## [1] TRUE
Of course, it has to be! So dim(x)
where x
is a data.frame
returns a vector of length two, of which
the first element is the number of rows in x
and the second
the number of columns in x
. In a frequency distribution,
every particular value of a discrete variables occurs only once, so of
course that dim(freqHosts)[1] == num_hosts
must evaluate to
T
.
Now, if my intuition that calculated_host_listings_count
column stands for the number of different listings with the same
host_id
/host_name
, then its values must be the
same as those that I have produced by table()
in
freqHosts
. How do we test this hypothesis?
Step 3. Extract only host_id
and
calculated_host_listings_count
from listings
;
if I am right, there will be duplicated values in this selection:
testHosts <- listings[ , c('host_id', 'calculated_host_listings_count')]
head(testHosts)
## host_id calculated_host_listings_count
## 1 3159 1
## 2 59484 2
## 3 56142 1
## 4 97647 1
## 5 124245 2
## 6 124245 2
Ok, now: are there any duplicated rows present?
d <- duplicated(testHosts)
table(d)
## d
## FALSE TRUE
## 16033 2489
Of course there are, but the result most probably does not mean too
much at this point. Step by step, the duplicated()
function
R returns a logical vector (TRUE
or
FALSE
):
vec <- c(1, 2, 3, 2, 2, 4, 5, 5, 7)
duplicated(vec)
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
Note how each first appearance of an element in a vector
receives FALSE
- because it is not duplicated - while every
subsequent appearance of the same element receives TRUE
-
because it is duplicated. duplicated()
works for data
frames too, in which case it looks at all the values across all of the
rows and returns as many logicals as there are rows following the same
logic: first apperance is marked as FALSE
and then all
repetitions are marked as TRUE
:
d <- duplicated(testHosts)
head(d)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
So, what happened when I did table(d)
is that R has
counted how many duplicates (TRUE
) there were in
testHosts
:
table(d)
## d
## FALSE TRUE
## 16033 2489
There are 2489
duplicated values. Interesting enough, I
can use the logical vector that duplicated()
outputs to
clean up my testHosts
data frame from duplicated entries in
this way:
dim(testHosts)
## [1] 18522 2
testHosts <- testHosts[!duplicated(testHosts), ]
dim(testHosts)
## [1] 16033 2
to keep only the 16033
rows that were never
repeated.
We get back to the hypothesis:
calculated_host_listings_count
column stands for the number
of different listings with the same
host_id
/host_name
. If this is true, than the
frequency counts in freqHosts$Freq
must be the same, across
the host_id
values, as the values found in
testHosts$calculated_host_listings_count
, correct? How do
we proceed to find out?
What do we have are two data frames:
head(freqHosts)
## Var1 Freq
## 1 3159 1
## 2 3592 1
## 3 7924 1
## 4 12085 1
## 5 30390 1
## 6 47517 1
and the de-duplicated testHosts
:
head(testHosts)
## host_id calculated_host_listings_count
## 1 3159 1
## 2 59484 2
## 3 56142 1
## 4 97647 1
## 5 124245 2
## 7 133488 1
and do not forget that we know that Var1
in
freqHosts
encompasses the same values of
host_id
from listings
as
testHosts$host_id
. I would know like to put
freqHosts
and testHosts
side by side and
inspect if the values of freqHosts$Var1
and
testHosts$host_id
are really the same.
Check one thing first:
dim(freqHosts)
## [1] 16033 2
dim(testHosts)
## [1] 16033 2
Of course. But the order of Var1
in
freqHosts
and host_id
in
testHosts
does not seem to be the same. Let’s fix that by
using order()
. How does it work?
someDataFrame <- data.frame(someNum = c(5, 9, 1, 3, 4, 10),
someChar = c ('a', 'b', 'c', 'd', 'e', 'f'),
stringsAsFactors = F)
print(someDataFrame)
## someNum someChar
## 1 5 a
## 2 9 b
## 3 1 c
## 4 3 d
## 5 4 e
## 6 10 f
Now:
someDataFrame[order(someDataFrame$someNum), ]
## someNum someChar
## 3 1 c
## 4 3 d
## 5 4 e
## 1 5 a
## 2 9 b
## 6 10 f
Take a look at the following:
vec <- c(5, 9, 1, 3, 4, 10)
order(vec)
## [1] 3 4 5 1 2 6
or
vec <- c(5, 9, 1, 3, 4, 10)
vec[order(vec)]
## [1] 1 3 4 5 9 10
So, the output of order()
tells us the following: which
element of a vector (by position) should be placed where in order to
have the original vector sorted out. This is why
someDataFrame[order(someDataFrame$someNum), ]
## someNum someChar
## 3 1 c
## 4 3 d
## 5 4 e
## 1 5 a
## 2 9 b
## 6 10 f
works: order(someDataFrame$someNum)
returns an ordering
of rows such that the data frame is sorted by someNum
.
We now sort freqHosts
and testHosts
so that
all host_id
values are aligned (remember, Var1
in freqHosts
represents hosts_id
in
testHosts
). Before we do that, take a look at the
following:
str(freqHosts)
## 'data.frame': 16033 obs. of 2 variables:
## $ Var1: chr "3159" "3592" "7924" "12085" ...
## $ Freq: int 1 1 1 1 1 1 1 1 1 2 ...
str(testHosts)
## 'data.frame': 16033 obs. of 2 variables:
## $ host_id : int 3159 59484 56142 97647 124245 133488 178515 188098 65041 187728 ...
## $ calculated_host_listings_count: int 1 2 1 1 2 1 1 1 2 1 ...
What I do not like is that Var1
in freqHosts
is a character
, while host_id
in
testHosts
is a numeric. Fix:
freqHosts$Var1 <- as.numeric(freqHosts$Var1)
str(freqHosts)
## 'data.frame': 16033 obs. of 2 variables:
## $ Var1: num 3159 3592 7924 12085 30390 ...
## $ Freq: int 1 1 1 1 1 1 1 1 1 2 ...
Ok, sort freqHosts by Var1
:
freqHosts <- freqHosts[order(freqHosts$Var1), ]
head(freqHosts, 10)
## Var1 Freq
## 1 3159 1
## 2 3592 1
## 3 7924 1
## 4 12085 1
## 5 30390 1
## 6 47517 1
## 7 49851 1
## 8 56142 1
## 9 58458 1
## 10 59484 2
and sort testHosts
by host_id
:
testHosts <- testHosts[order(testHosts$host_id), ]
head(testHosts, 10)
## host_id calculated_host_listings_count
## 1 3159 1
## 15914 3592 1
## 9837 7924 1
## 9176 12085 1
## 1579 30390 1
## 1405 47517 1
## 10848 49851 1
## 3 56142 1
## 63 58458 1
## 2 59484 2
Now the two data frames should be nicely aligned. Let’s put them side by side in a new data frame:
testDataFrame <- cbind(freqHosts, testHosts)
head(testDataFrame, 40)
## Var1 Freq host_id calculated_host_listings_count
## 1 3159 1 3159 1
## 2 3592 1 3592 1
## 3 7924 1 7924 1
## 4 12085 1 12085 1
## 5 30390 1 30390 1
## 6 47517 1 47517 1
## 7 49851 1 49851 1
## 8 56142 1 56142 1
## 9 58458 1 58458 1
## 10 59484 2 59484 2
## 11 61977 1 61977 1
## 12 62341 1 62341 1
## 13 62658 1 62658 1
## 14 65041 2 65041 2
## 15 72890 1 72890 1
## 16 77950 1 77950 1
## 17 81046 1 81046 1
## 18 92194 1 92194 1
## 19 92253 1 92253 1
## 20 96492 1 96492 1
## 21 97647 1 97647 1
## 22 98297 1 98297 1
## 23 98647 1 98647 1
## 24 98844 1 98844 1
## 25 109257 1 109257 1
## 26 111635 1 111635 1
## 27 113034 1 113034 1
## 28 124245 2 124245 2
## 29 125667 2 125667 2
## 30 126790 1 126790 1
## 31 127938 1 127938 1
## 32 133488 1 133488 1
## 33 142145 1 142145 1
## 34 149649 1 149649 1
## 35 158271 1 158271 1
## 36 166264 1 166264 1
## 37 169567 1 169567 1
## 38 178515 1 178515 1
## 39 178521 1 178521 1
## 40 179452 1 179452 1
Are the two data frames perfectly aligned? To find out we ask how
many matches there are between testDataFrame$Var1
and
testDataFrame$host_id
:
sum(testDataFrame$Var1 == testDataFrame$host_id)
## [1] 16033
A new R function: sum()
. This is how it works:
sum(c(5, 6))
## [1] 11
But also:
sum(c(TRUE, FALSE, TRUE, TRUE))
## [1] 3
sum()
across logical vectors in R: TRUE
counts as 1
, FALSE
counts as 0
.
So if we derive an index vector, say something indicating the number of
matches of some kind, sum()
can help us find out how many
times a match was successful.
Also, the data frames should match in all positions, so
dim(testDataFrame)[1]
- the number of rows in
testDataFrame
should be the same:
dim(testDataFrame)[1]
## [1] 16033
Ok, they match. Are the values in Freq
and
calculated_host_listings_count
really the same?
sum(testDataFrame$Freq == testDataFrame$calculated_host_listings_count)
## [1] 16033
Yes, the values in
testDataFrame$calculated_host_listings_count
and
testDataFrame$Freq
match perfectly, so our
hypothesis about the structure of the listings
data set
holds!
At this point you might ask yourself: is it possible that we need to invest all this work just to figure out the meaning of one single column in a data frame? The answer is: yes, and no. To consider the ‘no’ answer first: I am doing this intentionally, to provide an exercise, a training opportunity to you. In practice, there are many finer, more elaborated, and more comfortable ways to do exactly the same in R, and you will learn a lot about them in the future sessions. As of the ‘yes’ answer: we are dealing with a very small dataset here, a one encompassing barely 16K rows and several columns. In the wild, if you start applying Data Science in R or any other language in practice, the datasets that you will be facing will probably be orders of magnitude larger. Can you inspect by eye a dataset encompassing a few hundreds of millions of rows and say tens, or hundreds of columns? Well, no. And that is why you have to be prepared to invest serious work in understanding the structure of just any dataset at hand. It is hard, it is tedious, it takes a lot of patientce, and it certainly makes Data Science less than the most sexiest profession of the 21st century.
listings
!Q1. What is the distribution of
room_type
in listings
? Let’s see and
illustrate:
roomType <- as.data.frame(table(listings$room_type))
colnames(roomType) <- c('room_type', 'frequency')
roomType <- roomType[order(roomType$frequency), ]
head(roomType)
## room_type frequency
## 4 Shared room 52
## 2 Hotel room 141
## 3 Private room 3896
## 1 Entire home/apt 14433
What if we want to place the most frequent room_type
in
the top?
roomType <- roomType[order(-roomType$frequency), ]
head(roomType)
## room_type frequency
## 1 Entire home/apt 14433
## 3 Private room 3896
## 2 Hotel room 141
## 4 Shared room 52
How many different room_type
values do we observe?
roomTypes <- as.data.frame(table(listings$room_type))
colnames(roomTypes) <- c('room_type', 'count')
roomTypes
## room_type count
## 1 Entire home/apt 14433
## 2 Hotel room 141
## 3 Private room 3896
## 4 Shared room 52
Ok, four only. Visualize this (note: we are jumping a bit ahead here, but there is a reason to it):
library(ggplot2)
ggplot(data = roomTypes,
aes(x = room_type, y = count)) +
geom_bar(stat = 'identity', fill = "cadetblue3") +
xlab('Room type') +
ylab('Count') +
ggtitle('Room type distribution in Airbnb Listings') +
theme_bw() +
theme(panel.border = element_blank())
{ggplot2} is an industry standard R package for static data visualizations (in spite of the fact that you can produce interactive visualizations with {ggplot2}). Before the end of this session, I would like to begin analyzing the {ggplot2} code: we will be using so much of it in the future.
Observe the first part:
ggplot(data = roomTypes,
aes(x = room_type, y = count))
This produced nothing. Not really: it is an empty plot, but
it does have the horizontal (x) axis as well as the vertical
(Y) axis defined. The ggplot()
function asks for a
data
argument: the data.frame
encompassing the
data that we want to visualize. A call to another function,
aes()
, is made inside of ggplot()
, defining
the x
and y
arguments: what goes on the
horizontal and vertical axes of the plot. Note: when using
{ggplot2}, once the dataset is determined in the data
argument of ggplot()
there is no need to make a reference
to it in aes()
or elsewhere: we can make references to its
columns directly, as we did in
aes(x = room_type, y = count)
.
What happens after the first +
in the code?
ggplot(data = roomTypes,
aes(x = room_type, y = count)) +
geom_bar(stat = 'identity', fill = "cadetblue3")
The addition of geom_bar()
adds a layer
to the {ggplot2} plot. Any {ggplot2} visualization is produced in this
way:
data
and the mapping of variables in
aes()
in a ggplot()
call, thengeom_bar()
in this example, andggtitle()
or theme()
.Note how geom_bar()
also has arguments that help style
the plot, e.g. fill = "cadetblue3"
which picked
cadetblue3
as the fill color for the bar plot. More on
colors in R: colors in
R.
What happens next is the introduction of the x
and
y
labels to the plot, as well as the plot title:
ggplot(data = roomTypes,
aes(x = room_type, y = count)) +
geom_bar(stat = 'identity', fill = "cadetblue3") +
xlab('Room type') +
ylab('Count') +
ggtitle('Room type distribution in Airbnb Listings')
Finally we chose to strip some default settings by using
theme_bw()
and style the plot by removing the
panel.border
in an additional theme()
call:
theme(panel.border = element_blank())
:
ggplot(data = roomTypes,
aes(x = room_type, y = count)) +
geom_bar(stat = 'identity', fill = "cadetblue3") +
xlab('Room type') +
ylab('Count') +
ggtitle('Room type distribution in Airbnb Listings') +
theme_bw() +
theme(panel.border = element_blank())
theme()
is used to access the details of a {ggplot2}
plot; for example, center the plot title:
ggplot(data = roomTypes,
aes(x = room_type, y = count)) +
geom_bar(stat = 'identity', fill = "cadetblue3") +
xlab('Room type') +
ylab('Count') +
ggtitle('Room type distribution in Airbnb Listings') +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = 0.5))
or control the behavior of the axes:
ggplot(data = roomTypes,
aes(x = room_type, y = count)) +
geom_bar(stat = 'identity', fill = "cadetblue3") +
xlab('Room type') +
ylab('Count') +
ggtitle('Room type distribution in Airbnb Listings') +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(size = 13))
There are so many nice tricks that can be pulled with {ggplot2}, you will see. At this point, it is important to remember the following:
data
argument in the
ggplot()
call, andaes()
call inside the ggplot()
call;+
, while various
geom
things define the type of the plot (bar plot, in our
example; we will soon learn about line plots, scatter plots, density
plots, etc); finally,theme()
function calls.R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemunds.
E1. Produce a new column in
listings
: listings$endsIn_e
. The type of the
column should be logical (i.e. TRUE
, FALSE
):
TRUE
if host_name
ends with e
,
FALSE
otherwise. Hint: grepl()
, and remember
that grepl()
is vectorized.
E2. Use table()
so to produce a
neighbourhood
X room_type
two-way contingency
table and turn it into a data.frame
.
E3. Reproduce the following chart with {gplot2}
by relying on the code presented in this session + ggplot2 documentation +
Hadley’s book + steal code from Stack Overflow… whatever, just try to
reproduce it. Hints. In aes()
, you need to
define group =
, and fill =
; the density plot
in {ggplot2} is produced by geom_density()
which has an
alpha =
argument to set for color transparency; natural
logarithm in R is log()
(and then there is
log10()
too).
## Warning: Removed 16 rows containing non-finite outside the scale range (`stat_density()`).
E5. Compute the average
number_of_reviews
by neighbourhood
in
listings
. Hint: use mean()
,
very similar to sum()
. Put the results in the
data.frame
with the columns named
neighbourhood
and average_reviews
.
E6. What are the top five listings in
listings
with the best price per night ratio?
E7. What is the average price per night ratio
across the neighborhoods in listings
?
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.
Contact: goran.milovanovic@datakolektiv.com
Impressum
Data Kolektiv, 2004, Belgrade.