R programming
devtools::install_version("dplyr", version = "0.8.5", repos = "http://cran.us.r-project.org", lib="C:/Users/zhao752/Documents/R/win-library/OLDPKGs")
Economic distance
People live in a 3-D world, because human beings cannot really understand the logic of higher dimensions.
Gravity models in trade economics are useful for understanding higher-dimensional meanings of distance, which need not be physical.
For example, credit cards promote spending by decreasing the distance (e.g., processing time, security, risk, and rewards) of the monetary flow.
Likewise, today, I bought DataCamp to decrease the virtual distance to learning advanced R programming.
Let's see how this goes.
The basic R skill track has 13 courses and about 50 hours of learning time.
Finished Introduction to R today (~4hr)
Pretty basic, but I still learned something new.
Lists are very useful. Use [[ ]] to select a single item from a list.
str() is short for structure
tbl_df vs. data.frame
Intermediate R (~6hr)
&& and || look only at the first element of a vector, unlike the element-wise & and |. Since R 4.3.0, passing them a vector longer than one is an error.
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
Sys.Date()
Sys.time()
dplyr & tidyverse (<4hr)
count(var, sort = TRUE)
counties_selected %>%
group_by(state) %>%
top_n(1, population)
counties %>%
select(state, county, drive:work_at_home) %>%
select(state, county, starts_with("income"))
contains()
starts_with()
ends_with()
last_col()
counties %>%
transmute(state, county, fraction_men = men / population)
geom_point() + expand_limits(y = 0)
library(broom)
tidy(model)
library(tidyr)
library(purrr)
by_year_country %>% nest(-country) %>%
mutate(models = map(data, ~ lm(percent_yes ~ year, .))) %>%
mutate(tidied = map(models, tidy)) %>%
unnest(tidied)
https://stackoverflow.com/questions/22713325/fitting-several-regression-models-with-dplyr
example <- c("apple", "banana", "apple", "orange")
recode(example,
apple = "plum",
banana = "grape")
v <- list(1, 2, 3)
map(v, ~ . * 10)
expand.grid( )
.data %>% is.na() %>% colSums()
dplyr::pull() returns a vector, like
df[["Sepal.Length"]]
select() returns a data frame, like
df["Sepal.Length"]
library(simputation)
nhanes_imp <- impute_lm(nhanes, Height + Weight ~ .)
impute_logreg <- function(df, formula) {
# Extract name of response variable
imp_var <- as.character(formula[2])
# Save locations where the response is missing
missing_imp_var <- is.na(df[[imp_var]])
# Fit logistic regression model
logreg_model <- glm(formula, data = df, family = binomial)
# Predict the response for every row, including rows dropped from the fit
preds <- predict(logreg_model, newdata = df, type = "response")
# Sample the predictions from binomial distribution
preds <- rbinom(length(preds), size = 1, prob = preds)
# Impute missing values with predictions
df[missing_imp_var, imp_var] <- preds[missing_imp_var]
return(df)
}
library(VIM)
tao_imp <- hotdeck(tao)
# Create boolean masks for where is_hot and humidity are missing
missing_is_hot <- tao_imp$is_hot_imp
missing_humidity <- tao_imp$humidity_imp
for (i in 1:3) {
# Set is_hot to NA in places where it was originally missing and re-impute it
tao_imp$is_hot[missing_is_hot] <- NA
tao_imp <- impute_logreg(tao_imp, is_hot ~ sea_surface_temp)
# Set humidity to NA in places where it was originally missing and re-impute it
tao_imp$humidity[missing_humidity] <- NA
tao_imp <- impute_lm(tao_imp, humidity ~ sea_surface_temp + air_temp)
}
Cluster analysis
# Calculate the Distance
dist_players <- dist(lineup, method = 'euclidean')
# Perform the hierarchical clustering using the complete linkage
hc_players <- hclust(dist_players, method = 'complete')
# Calculate the assignment vector with a k of 2
clusters_k2 <- cutree(hc_players, 2)
# Create a new data frame storing these results
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)
ggplot2::ggplot(lineup_k2_complete) + ggplot2::geom_point(ggplot2::aes(x,y, color = cluster, size = cluster))
# Prepare the Distance Matrix
dist_players <- dist(lineup)
# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")
# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage')
plot(hc_single, main = 'Single Linkage')
plot(hc_average, main = 'Average Linkage')
library(dendextend)
# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)
# Plot the dendrogram
plot(color_branches(dend_players, h = 20))
library(ggdendro)
Did a test
map(values, ~.x + 5)
map_dbl(vectors, mean); lapply(vectors, mean)
case_when(
day == "Saturday" ~ "Weekend",
day == "Sunday" ~ "Weekend",
TRUE ~ "Weekday" )
microbenchmark::microbenchmark()
Shell script
Wildcards in the shell
? matches a single character, so 201?.txt will match 2017.txt or 2018.txt, but not 2017-01.txt.
[...] matches any one of the characters inside the square brackets, so 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt.
{...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt,*.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf. Do not put a space after the comma.
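A quick way to check these three wildcards is to try them in a scratch directory. This is only a sketch: it assumes bash (brace expansion is not available in every shell), and the file names are made up to mirror the examples above. The brace pattern is written without a space after the comma, because a space splits it into two words and stops the expansion.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so the patterns only see files created here.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 2016.txt 2017.txt 2018.txt 2017-01.txt notes.csv report.pdf

# ? matches exactly one character, so 2017-01.txt is left out.
question_mark=$(echo 201?.txt)

# [...] matches one character from the set, so 2016.txt is left out.
brackets=$(echo 201[78].txt)

# {...} expands to each comma-separated pattern before globbing happens.
braces=$(echo {*.txt,*.csv})

echo "? pattern:   $question_mark"
echo "[] pattern:  $brackets"
echo "{} pattern:  $braces"
```

If a pattern matches nothing, bash by default leaves it as typed, so an unexpected literal 201?.txt in the output means no file matched.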
To create a shell variable, you simply assign a value to a name:
training=seasonal/summer.csv
without any spaces before or after the = sign
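A small sketch of the assignment rule, assuming a POSIX-style shell such as bash. The second variable, label, is made up here to show why quoting the expansion matters.

```shell
# Assignment: no spaces around the = sign.
training=seasonal/summer.csv

# Read the value back by prefixing the name with $.
echo "$training"

# With spaces, the shell would instead try to run a command named "training"
# with "=" and the path as its arguments, so this line is left commented out:
# training = seasonal/summer.csv

# Quoting the expansion keeps a value containing spaces as a single word.
label="summer data"
printf '%s\n' "$label"
```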
for filetype in gif jpg png; do echo $filetype; done
chmod +x script-name-here.sh
./script-name-here.sh
bash script-name-here.sh
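The two ways of running a script can be compared side by side. This sketch assumes bash is installed and that the temporary directory allows executing files. The script name list-types.sh is hypothetical, and its body is the for loop from above.

```shell
workdir=$(mktemp -d)
script="$workdir/list-types.sh"   # hypothetical file name

# Write a tiny script containing the for loop.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
for filetype in gif jpg png; do echo "$filetype"; done
EOF

# Option 1: pass the file to bash; no execute permission is needed.
via_bash=$(bash "$script")

# Option 2: mark it executable, then run it by its path.
chmod +x "$script"
direct=$("$script")

echo "$via_bash"
echo "$direct"
```

Both runs print the same three lines. The difference is that the second one relies on the execute bit and on the #! line at the top of the script to pick the interpreter.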
$ echo "Welcome To The Geek Stuff" | sed 's/\(\b[A-Z]\)/\(\1\)/g'
(W)elcome (T)o (T)he (G)eek (S)tuff
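The \( \) and \1 pieces in that command are a capture group and a back-reference. A smaller sketch shows them on their own, using only basic sed syntax without \b, which is a GNU extension. The inputs 2017-01 and summer are made up for illustration.

```shell
# \( ... \) captures part of the match; \1 and \2 refer back to the captures.
swapped=$(echo "2017-01" | sed 's/\([0-9][0-9]*\)-\([0-9][0-9]*\)/\2-\1/')
echo "$swapped"

# & stands for the whole match, so no group is needed to wrap it in brackets.
wrapped=$(echo "summer" | sed 's/[a-z][a-z]*/[&]/')
echo "$wrapped"
```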
which bash
The name cron comes from the Greek word for time, chronos.
Unix has a bewildering variety of text editors.