Character data represents textual content.
The data type character is intended to represent textual data such as actual texts, names of objects, and other content that helps both you and the audience you are trying to reach better understand your data.
name <- "Dyer"
sport <- "Frolf"
The two variables above each hold a sequence of characters enclosed in double quotes. You can use single quotes instead; however, the opening and closing quote characters must be the same (e.g., you cannot start with a single quote and end with a double).
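As a small sketch of the quoting rule, the example below shows one case where the choice of quote character matters: text that itself contains quotation marks. The variable names a, b, and same are made up for illustration.

```r
# Double quotes inside a single-quoted string need no escaping
a <- 'She said "Frolf"'

# Inside a double-quoted string, each inner double quote is escaped with a backslash
b <- "She said \"Frolf\""

# Both forms produce exactly the same character value
same <- identical(a, b)
print(same)
```

Either style works, so a reasonable habit is to pick the one that needs fewer escapes for the text at hand.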
The length of a character variable is a measure of how many elements it contains, not the number of characters within it. For example, the length of name is
length(name)
[1] 1
because it only has one element, but the number of characters within it is:
nchar(name)
[1] 4
Length is defined specifically as the number of elements in a vector, and technically the variable name is a vector of length one. If we concatenate name and sport into a vector (see the vector content)
phrase <- c( name, sport )
we find that it has a length of 2
length(phrase)
[1] 2
And if we ask how many characters are in the elements it contains, nchar() gives us a numeric vector holding the number of characters in each element.
nchar(phrase)
[1] 4 5
The binary + operator has not been defined for objects of class character, which is understandable once we consider all the different ways we may want to put the values contained in the variables together. If you try it, R will complain.
name + sport
Error in name + sport: non-numeric argument to binary operator
The paste() function is designed to take a collection of character variables and smush them together. By default, it inserts a space between each of the variables and/or values passed to it.
paste( name, "plays", sport )
[1] "Dyer plays Frolf"
However, you can use any kind of separator you like:
paste(name, sport, sep=" is no good at ")
[1] "Dyer is no good at Frolf"
The elements you pass to paste() do not need to be held in variables; you can put quoted character values in there as well.
paste( name, " the ", sport, "er", sep="")
[1] "Dyer the Frolfer"
If you pass a vector of character types, paste() by default applies the pasting operation to every element of the vector.
paste( phrase , "!")
[1] "Dyer !" "Frolf !"
However, if your intention is to take the elements of the vector and paste them together, then you need to specify that using the optional collapse argument. By default, it is set to NULL, and that state tells the function to apply the paste()-ing to each element. If you set collapse to something other than NULL, it will use that value as the separator and combine all the elements into a single result.
paste( phrase, collapse = " is not good at ")
[1] "Dyer is not good at Frolf"
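The sep and collapse arguments can also be combined. The sketch below redefines the same variables so that it runs on its own; the names nickname and shout are made up for illustration. It also uses paste0(), the base R shorthand for paste() with sep = "".

```r
name <- "Dyer"
sport <- "Frolf"
phrase <- c(name, sport)

# paste0() behaves like paste(..., sep = ""), so no spaces are inserted
nickname <- paste0(name, " the ", sport, "er")
print(nickname)

# sep joins the arguments element by element, then collapse joins
# the resulting elements into a single string
shout <- paste(phrase, "!", sep = "", collapse = " ")
print(shout)
```

Here sep controls what goes between the arguments within each element, while collapse controls what goes between the elements once they are merged.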
Many times, we need to extract components from within a longer character element. Here is a longer bit of text as an example.
corpus <- "An environmental impact statement (EIS), under United States environmental law, is a document required by the 1969 National Environmental Policy Act (NEPA) for certain actions 'significantly affecting the quality of the human environment'.[1] An EIS is a tool for decision making. It describes the positive and negative environmental effects of a proposed action, and it usually also lists one or more alternative actions that may be chosen instead of the action described in the EIS. Several U.S. state governments require that a document similar to an EIS be submitted to the state for certain actions. For example, in California, an Environmental Impact Report (EIR) must be submitted to the state for certain actions, as described in the California Environmental Quality Act (CEQA). One of the primary authors of the act is Lynton K. Caldwell."
We can split the original string into several components by specifying which particular character or set of characters we wish to use to break it apart. Here is an example using the space character to pull it apart into words.
library(stringr)
str_split( corpus, pattern=" ", simplify=TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "An" "environmental" "impact" "statement" "(EIS)," "under" "United"
[,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "States" "environmental" "law," "is" "a" "document" "required" "by"
[,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
[1,] "the" "1969" "National" "Environmental" "Policy" "Act" "(NEPA)" "for"
[,24] [,25] [,26] [,27] [,28] [,29] [,30]
[1,] "certain" "actions" "'significantly" "affecting" "the" "quality" "of"
[,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39]
[1,] "the" "human" "environment'.[1]" "An" "EIS" "is" "a" "tool" "for"
[,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47]
[1,] "decision" "making." "It" "describes" "the" "positive" "and" "negative"
[,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55]
[1,] "environmental" "effects" "of" "a" "proposed" "action," "and" "it"
[,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64]
[1,] "usually" "also" "lists" "one" "or" "more" "alternative" "actions" "that"
[,65] [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73]
[1,] "may" "be" "chosen" "instead" "of" "the" "action" "described" "in"
[,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82]
[1,] "the" "EIS." "Several" "U.S." "state" "governments" "require" "that" "a"
[,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91]
[1,] "document" "similar" "to" "an" "EIS" "be" "submitted" "to" "the"
[,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99]
[1,] "state" "for" "certain" "actions." "For" "example," "in" "California,"
[,100] [,101] [,102] [,103] [,104] [,105] [,106] [,107]
[1,] "an" "Environmental" "Impact" "Report" "(EIR)" "must" "be" "submitted"
[,108] [,109] [,110] [,111] [,112] [,113] [,114] [,115]
[1,] "to" "the" "state" "for" "certain" "actions," "as" "described"
[,116] [,117] [,118] [,119] [,120] [,121] [,122]
[1,] "in" "the" "California" "Environmental" "Quality" "Act" "(CEQA)."
[,123] [,124] [,125] [,126] [,127] [,128] [,129] [,130] [,131]
[1,] "One" "of" "the" "primary" "authors" "of" "the" "act" "is"
[,132] [,133] [,134]
[1,] "Lynton" "K." "Caldwell."
which shows 134 words in the text.
Note that I used the simplify=TRUE option to str_split(). Had I not done that, it would have returned a list object that contained the individual vector of words. There are various reasons that it returns a list, none of which I can frankly understand; that is just the way the authors of the function made it.
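To make the list return value concrete, here is a minimal sketch using short made-up strings rather than the corpus above. One plausible motivation for the list (an assumption on my part, not something the text states) is that str_split() accepts a vector of strings, and each string can break into a different number of pieces, which does not fit neatly into a single vector.

```r
library(stringr)

# Without simplify = TRUE the result is a list with one entry per input string
sentence <- "An EIS is a tool for decision making."
pieces <- str_split(sentence, pattern = " ")

# Pull the character vector of words out of the first (and only) list entry
words <- pieces[[1]]
print(length(words))

# Two input strings that split into different numbers of pieces
both <- str_split(c("a b c", "d e"), pattern = " ")
print(lengths(both))
```

When only one string is being split, indexing with [[1]] gives a plain character vector, which is often easier to work with than the one-row matrix returned under simplify=TRUE.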
There are two different things you may want to do with substrings: find them and replace them. Here are some ways to figure out where they are.
str_detect(corpus, "Environment")
[1] TRUE
str_count( corpus, "Environment")
[1] 3
str_locate_all( corpus, "Environment")
[[1]]
start end
[1,] 125 135
[2,] 637 647
[3,] 754 764
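The start and end positions are useful mostly as inputs to other functions. The sketch below, which uses a short made-up string instead of the corpus, feeds the positions from str_locate() into str_sub() to pull the match back out. It also shows one way to make the search case insensitive with regex(ignore_case = TRUE). The names txt, loc, hit, and n_any_case are illustrative only.

```r
library(stringr)

# Short illustrative string (not the corpus above)
txt <- "National Environmental Policy Act"

# str_locate() returns a matrix with "start" and "end" columns for the first match
loc <- str_locate(txt, "Environment")

# Use those positions to extract the matching substring
hit <- str_sub(txt, loc[1, "start"], loc[1, "end"])
print(hit)

# Matching is case sensitive by default; regex(ignore_case = TRUE) relaxes that
n_any_case <- str_count("Environmental law and the environment",
                        regex("environment", ignore_case = TRUE))
print(n_any_case)
```

Case sensitivity explains why a capitalized search pattern does not count lowercase occurrences of the same word.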
We can also replace instances of one substring with another.
str_replace_all(corpus, "California", "Virginia")
[1] "An environmental impact statement (EIS), under United States environmental law, is a document required by the 1969 National Environmental Policy Act (NEPA) for certain actions 'significantly affecting the quality of the human environment'.[1] An EIS is a tool for decision making. It describes the positive and negative environmental effects of a proposed action, and it usually also lists one or more alternative actions that may be chosen instead of the action described in the EIS. Several U.S. state governments require that a document similar to an EIS be submitted to the state for certain actions. For example, in Virginia, an Environmental Impact Report (EIR) must be submitted to the state for certain actions, as described in the Virginia Environmental Quality Act (CEQA). One of the primary authors of the act is Lynton K. Caldwell."
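Two details are easy to miss here: str_replace_all() changes every match while str_replace() changes only the first one, and both return a new value rather than modifying the original variable. A minimal sketch with a made-up string (the names label, first_only, and every_one are illustrative):

```r
library(stringr)

label <- "a-b-c"

# str_replace() swaps only the first match
first_only <- str_replace(label, "-", "+")
print(first_only)

# str_replace_all() swaps every match
every_one <- str_replace_all(label, "-", "+")
print(every_one)

# The original is untouched unless the result is assigned back to it
print(label)
```

To keep a replacement, assign the result to a variable, either a new one or the original name.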
There is a lot more fun stuff to do with string-based data.