DALT7014 Data Mining
Week 4 - XML Manipulation and Joining Data
2023-02-23
Week 4 Practical
UK Top 40 Singles Chart
https://www.officialcharts.com/charts/uk-top-40-singles-chart/
Inspect the HTML for each single. The chart is not visible when you open the local version, but the data is still in the HTML.
Using the methods shown in this lecture, recreate the same data frame as the one read in from top40s.csv. A local copy of the webpage is available in the code files as top40s.html.
## Rows: 40 Columns: 2
## -- Column specification ----------------------------------------------
## Delimiter: ","
## chr (2): song, artist
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 40 x 2
##    song                        artist
##    <chr>                       <chr>
##  1 STICK SEASON                NOAH KAHAN
##  2 MURDER ON THE DANCEFLOOR    SOPHIE ELLIS-BEXTOR
##  3 BEAUTIFUL THINGS            BENSON BOONE
##  4 LOSE CONTROL                TEDDY SWIMS
##  5 PRAISE JAH IN THE MOONLIGHT YG MARLEY
##  6 PRADA                       CASSO/RAYE/D-BLOCK EUROPE
##  7 CRUEL SUMMER                TAYLOR SWIFT
##  8 GREEDY                      TATE MCRAE
##  9 TEXAS HOLD 'EM              BEYONCE
## 10 YES AND                     ARIANA GRANDE
## 11 HOMESICK                    NOAH KAHAN & SAM FENDER
## 12 CARNIVAL                    KANYE WEST/TY DOLLA SIGN
## 13 HOUDINI                     DUA LIPA
## 14 UNWRITTEN                   NATASHA BEDINGFIELD
## 15 ALIBI                       ELLA HENDERSON FT RUDIMENTAL
## 16 POPULAR                     WEEKND/PLAYBOI CARTI/MADONNA
## 17 BURN                        KANYE WEST/TY DOLLA SIGN
## 18 BACK TO ME                  KANYE WEST/TY DOLLA SIGN
## 19 REDRUM                      21 SAVAGE
## 20 NEVER LOSE ME               FLO MILLI
## 21 LEAVEMEALONE                FRED AGAIN & BABY KEEM
## 22 DNA (LOVING YOU)            BILLY GILLIES FT HANNAH BOLEYN
## 23 I REMEMBER EVERYTHING       ZACH BRYAN FT KACEY MUSGRAVES
## 24 WHATEVER                    KYGO & AVA MAX
## 25 EXES                        TATE MCRAE
## 26 RICH BABY DADDY             DRAKE FT SEXYY RED & SZA
## 27 NOTHING MATTERS             LAST DINNER PARTY
## 28 LOVIN ON ME                 JACK HARLOW
## 29 ASKING                      SONNY FODERA/MK/DOUGLAS
## 30 ON MY LOVE                  ZARA LARSSON & DAVID GUETTA
## 31 FOREVER                     NOAH KAHAN
## 32 SCARED TO START             MICHAEL MARCAGI
## 33 TOXIC                       SONGER
## 34 HOME                        GOOD NEIGHBOURS
## 35 PERFECT (EXCEEDER)          MASON/PRINCESS SUPERSTAR
## 36 ONE OF THE GIRLS            WEEKND/JENNIE/LILY ROSE DEPP
## 37 ABRACADABRA                 WES NELSON FT CRAIG DAVID
## 38 FAST CAR                    TRACY CHAPMAN
## 39 SELFISH                     JUSTIN TIMBERLAKE
## 40 LIL BOO THANG               PAUL RUSSELL
You will need to process the number one single separately from the other 39 as it has different attributes.
Each chart item contains a collection of 11 span elements. The fourth and fifth span elements hold the song and the artist, respectively.
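As a hint, the idea of picking spans by position can be tried out on a small fragment first. The sketch below uses a made-up, heavily simplified chart item (the markup, `toy_html`, and the placeholder text are assumptions for illustration, not the real page), so it only has 5 spans rather than 11. Nested spans are counted too, because `.//span` returns every descendant in document order.

```r
library(xml2)

# Hypothetical, simplified chart item; the real page has more spans and attributes.
toy_html <- '
<div class="chart-item">
  <span class="chart-key"><span class="sr-only">Number </span>2</span>
  <span class="movement-icon"></span>
  <span>SONG TITLE</span>
  <span>ARTIST NAME</span>
</div>'

toy_doc <- read_html(toy_html)
item <- xml_find_first(toy_doc, ".//div[@class='chart-item']")

# All descendant spans, in document order (the nested sr-only span is number 2).
spans <- xml_find_all(item, ".//span")
length(spans)

song <- xml_text(spans[4])
artist <- xml_text(spans[5])
song
artist
```

Printing `spans` before indexing is a quick way to confirm which positions hold the values you want.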
Week 4 Practical Solution
#download.file("https://www.officialcharts.com/charts/uk-top-40-singles-chart/", file.path("data", "top40s.html"))
library(xml2)
library(dplyr)
library(readr)
top40_soln <- read_csv(file.path("data","top40s.csv"))
## Rows: 40 Columns: 2
## -- Column specification ----------------------------------------------
## Delimiter: ","
## chr (2): song, artist
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
html_doc <- read_html(file.path("data", "top40s.html"))
top <- xml_find_all(html_doc, ".//div[@class=\"primis chart-item relative text-right\"]")
top39 <- xml_find_all(html_doc, ".//div[@class=\"chart-item relative text-right\"]")
top_spans <- xml_find_all(top, ".//span")
top_spans
## {xml_nodeset (11)}
## [1] <span class="digits1 chart-key font-bold"><span class="sr-only ...
## [2] <span class="sr-only">Number </span>
## [3] <span class="movement-icon"></span>
## [4] <span>STICK SEASON</span>
## [5] <span>NOAH KAHAN</span>
## [6] <span title="Last week">LW: <span class="text-brand-pink font- ...
## [7] <span class="text-brand-pink font-bold">1</span>
## [8] <span class="hidden sm:inline-block">, </span>
## [9] <span class="text-brand-cobalt font-bold">1</span>
## [10] <span>, </span>
## [11] <span class="text-brand-pink font-bold">20</span>
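The `@class="..."` tests above compare the whole class string, which is why the number one (with its extra `primis` class) needs its own query. One possible alternative, shown here only as a sketch on made-up markup, is to test for a single class token with the XPath 1.0 functions `contains()`, `concat()` and `normalize-space()`. The fragment and the names `toy_html` and `token_xpath` are assumptions for illustration; whether this returns exactly 40 nodes on the real page would need checking.

```r
library(xml2)

# Hypothetical fragment: two chart items and one unrelated div with a similar class.
toy_html <- '
<div>
  <div class="primis chart-item relative text-right"><span>A</span></div>
  <div class="chart-item relative text-right"><span>B</span></div>
  <div class="chart-item-content"><span>not a chart item</span></div>
</div>'
toy_doc <- read_html(toy_html)

# Match any div whose class list contains the whole token "chart-item".
# Padding both sides with spaces stops "chart-item-content" from matching.
token_xpath <- ".//div[contains(concat(' ', normalize-space(@class), ' '), ' chart-item ')]"
items <- xml_find_all(toy_doc, token_xpath)

length(items)
xml_text(items)
```

The exact-match queries used in this solution are stricter, which can be an advantage if other elements on the page share the same token.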
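# Extract the song and artist from a single chart-item node.
select_song_and_artist <- function(node) {
  spans <- xml_find_all(node, ".//span")
  # The fourth and fifth spans hold the song title and the artist.
  song <- xml_text(spans[4])
  artist <- xml_text(spans[5])
  return(tibble(song = song, artist = artist))
}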
top_row <- select_song_and_artist(top)
top39_rows <- lapply(top39, select_song_and_artist)
top40 <- bind_rows(top_row, top39_rows)
# top40 %>% write_csv(file.path("data","top40s.csv"))
print(top40)
## # A tibble: 40 x 2
##    song                        artist
##    <chr>                       <chr>
##  1 STICK SEASON                NOAH KAHAN
##  2 MURDER ON THE DANCEFLOOR    SOPHIE ELLIS-BEXTOR
##  3 BEAUTIFUL THINGS            BENSON BOONE
##  4 LOSE CONTROL                TEDDY SWIMS
##  5 PRAISE JAH IN THE MOONLIGHT YG MARLEY
##  6 PRADA                       CASSO/RAYE/D-BLOCK EUROPE
##  7 CRUEL SUMMER                TAYLOR SWIFT
##  8 GREEDY                      TATE MCRAE
##  9 TEXAS HOLD 'EM              BEYONCE
## 10 YES AND                     ARIANA GRANDE
## # i 30 more rows
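Since the task is to recreate the data frame read from top40s.csv, it can be worth checking the result programmatically rather than by eye. The sketch below uses two small made-up tibbles, `scraped` and `from_csv`, as stand-ins for `top40` and `top40_soln` (the names and values are assumptions for illustration). Columns are compared directly because a tibble returned by `read_csv()` may carry extra attributes that a whole-object comparison would flag.

```r
library(dplyr)

# Toy stand-ins for the scraped result and the reference CSV.
scraped <- tibble(
  song = c("SONG A", "SONG B"),
  artist = c("ARTIST A", "ARTIST B")
)
from_csv <- tibble(
  song = c("SONG A", "SONG B"),
  artist = c("ARTIST A", "ARTIST B")
)

# Column-wise check: same values in the same order.
same_song <- identical(scraped$song, from_csv$song)
same_artist <- identical(scraped$artist, from_csv$artist)

# Rows in the scraped data with no exact match in the reference data.
unmatched <- anti_join(scraped, from_csv, by = c("song", "artist"))

same_song && same_artist
nrow(unmatched)
```

If the two sources disagreed, `unmatched` would list the scraped rows that have no counterpart in the CSV, which makes stray whitespace or a misaligned span index easier to spot.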