Partial string matching in R
I am trying to remove 'bad' email addresses from a CSV. I have a column of emails such as "abd@no.com", "123@none.com", "@", or "a". There is a wide range of such email formats, and I want to try to find and remove them all.
My initial idea is to look strictly at the end of the email string, the "@..." part. I would also check the length in characters: if an email has a length of 1 or 2, it is not valid.
If I have a list of bad emails, I want to generate a new list of emails with the bad ones replaced by NA.
Below is the code I have so far. It does not work: it looks for exact matches on the pattern, not at the end of the string.
    email_clean <- function(email, invalid = NA) {
      email <- trimws(email)  # remove whitespace
      email[nchar(email) %in% c(1,2)] <- invalid
      bad_email <- c("\\@no.com", "\\@none.com","\\@email.com","\\@noemail.com")
      pattern = paste0("(?i)\\b",paste0(bad_email,collapse="\\b|\\b"),"\\b")
      emails <- gsub(pattern,"",sapply(csv_file$email,as.character))
      email
    }

    cleaned_email <- email_clean(csv_file$email)
Thanks for the help!
Your function is pretty close. Note a few tweaks:
    email_clean <- function(email, invalid = NA) {
      email <- trimws(email)  # remove whitespace
      email[nchar(email) %in% c(1,2)] <- invalid
      bad_email <- c("\\@no.com", "\\@none.com","\\@email.com","\\@noemail.com")
      pattern = paste0("(?i)\\b",paste0(bad_email,collapse="\\b|\\b"),"\\b")
      # replace on the function argument (not csv_file$email) and use `invalid`
      email <- gsub(pattern, invalid, sapply(email,as.character))
      unname(email)  # drop the names added by sapply()
    }

    emails <- c("pierre@gmail.com", "hi@none.com", "@", "a")
    email_clean(emails)
    # [1] "pierre@gmail.com" NA                 NA
    # [4] NA
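If the goal is to look only at the part after the "@", as the question describes, one possible alternative is to extract the domain and compare it against a block list instead of building a regular expression. The following is only a sketch under some assumptions: the function name `email_clean_domain` and the `bad_domains` argument are made up for illustration, and entries with no "@" at all are also treated as invalid, which goes slightly beyond the length check in the question.

```r
# Hypothetical helper (not from the original post): blank out emails whose
# domain, i.e. the text after the last "@", is on a block list.
email_clean_domain <- function(email, bad_domains, invalid = NA_character_) {
  email <- trimws(as.character(email))       # remove surrounding whitespace

  # Everything after the last "@", lowercased for case-insensitive comparison.
  # Entries without "@" are returned unchanged by sub().
  domain <- tolower(sub("^.*@", "", email))

  is_bad <- !grepl("@", email, fixed = TRUE) |  # assumption: no "@" means invalid
    nchar(email) < 3 |                          # too short, e.g. "@" or "a"
    domain %in% tolower(bad_domains)            # whole-domain match, so dots are literal

  email[is_bad] <- invalid
  email
}

emails <- c("pierre@gmail.com", "hi@none.com", "@", "a")
cleaned <- email_clean_domain(emails, c("no.com", "none.com", "email.com", "noemail.com"))
# Only the first address is kept; the other three become NA.
```

Because the comparison is on the whole domain rather than a pattern, there is no need to escape the dots or to anchor the match at the end of the string. The trade-off is that it only catches domains listed exactly in `bad_domains`.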