Text Manipulation
Command Line Tools for Manipulating Text Files
File manipulation operations: browse through and parse text files, and/or extract data from them.
cat (concatenate)
- Used to read and print files, as well as for simply viewing file contents:

      cat filename

- The main purpose of `cat` is often to combine (concatenate) multiple files together.
- The `tac` command (`cat` spelled backwards) prints the lines of a file in reverse order. Each line remains the same, but the order of lines is inverted.

      tac file
      tac file1 file2 > newfile

| Command | Usage |
| --- | --- |
| `cat file1 file2` | Concatenate multiple files and display the output; i.e. the entire content of the first file is followed by that of the second file |
| `cat file1 file2 > newfile` | Combine multiple files and save the output into a new file |
| `cat file >> existingfile` | Append a file to the end of an existing file |
| `cat > file` | Any subsequent lines typed will go into the file, until CTRL-D is typed |
| `cat >> file` | Any subsequent lines are appended to the file, until CTRL-D is typed |
Using cat Interactively
`cat` can be used to read from standard input (such as the terminal window) if no files are specified. You can use the `>` operator to create and add lines into a new file, and the `>>` operator to append lines (or files) to an existing file.

- To create a new file, at the command prompt type `cat > filename` and press the Enter key.
- This command creates a new file and waits for the user to enter the text. After you finish typing the required text, press CTRL-D at the beginning of the next line to save the file and exit.
- Another way to create a file at the terminal is `cat > filename << EOF`. A new file is created and you can type the required input. To finish, enter EOF on a line by itself.
  - Note that EOF is case sensitive. One can also use another word, such as STOP.
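The here-document form can also be used non-interactively in a script. The following is a minimal sketch; the scratch directory and the file name `notes.txt` are hypothetical and chosen only for illustration.

```shell
# Hypothetical scratch directory so the example does not touch real files.
workdir=$(mktemp -d)

# Everything up to the line containing only EOF is written to notes.txt.
cat > "$workdir/notes.txt" << EOF
first line
second line
EOF

# Append one more line, this time using STOP as the delimiter word.
cat >> "$workdir/notes.txt" << STOP
third line
STOP

# Read the file back and display it.
notes=$(cat "$workdir/notes.txt")
echo "$notes"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The delimiter word is only recognized when it appears alone on a line, which is why ordinary text containing "EOF" in the middle of a line would not end the input.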
echo (displays text)
- `echo` can be used to display a string on standard output (i.e. the terminal), to place it in a new file (using the `>` operator), or to append it to an already existing file (using the `>>` operator).

      echo string

- The `-e` option enables special character sequences (escape sequences), such as the newline character or horizontal tab:
  - `\n` represents newline
  - `\t` represents horizontal tab
- `echo` is particularly useful for viewing the values of environment variables (built-in shell variables).
  - For example, `echo $USER` will print the name of the user who has logged into the current terminal (some systems also set `$USERNAME`).

| Command | Usage |
| --- | --- |
| `echo string > newfile` | The specified string is placed in a new file |
| `echo string >> existingfile` | The specified string is appended to the end of an already existing file |
| `echo $variable` | The contents of the specified environment variable are displayed |
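The redirection forms in the table can be tried out safely in a scratch directory. This is a small sketch; the file name `log.txt` and its contents are made up for the example.

```shell
# Hypothetical scratch directory and file name, used only for this example.
workdir=$(mktemp -d)

# '>' creates the file (or overwrites it if it already exists).
echo "first entry" > "$workdir/log.txt"

# '>>' appends to the end of the existing file.
echo "second entry" >> "$workdir/log.txt"

# Read the file back and display it.
log_contents=$(cat "$workdir/log.txt")
echo "$log_contents"

# Display the value of an environment variable.
echo "Home directory: $HOME"

rm -rf "$workdir"
```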
Working with Large and Compressed Files
Working with Large Files
- Directly opening a large file in an editor can cause problems due to high memory utilization, as an editor will usually try to read the whole file into memory first.
- Use `less` to view the contents of such a large file, scrolling up and down page by page, without the system having to place the entire file in memory before starting.
- Viewing `somefile` can be done by typing either of the two following commands:

      less somefile
      cat somefile | less

- `head` reads the first few lines of each named file (10 by default) and displays them on standard output.
- The number of lines can be increased or decreased using `-n`, where `N` is the number of lines to display:

      head -n N somefile

- `tail` prints the last few lines of each named file and displays them on standard output. By default, it displays the last 10 lines.
- `tail` is especially useful when you are troubleshooting an issue using log files, as you probably want to see the most recent lines of output.
- The number of lines can be increased or decreased using `-n`, where `N` is the number of lines to display:

      tail -n N somefile

- To continually monitor new output in a growing log file:

      tail -f somefile.log

  This command will continuously display any new lines of output in `somefile.log` as soon as they appear. Thus, it enables you to monitor any current activity that is being reported and recorded.
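To see `head` and `tail` with an explicit line count, the following sketch generates a 20-line file with `seq` and extracts both ends. The file name `numbers.txt` is hypothetical.

```shell
# Hypothetical scratch directory and a 20-line sample file.
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# First three lines of the file.
first_three=$(head -n 3 "$workdir/numbers.txt")

# Last two lines of the file.
last_two=$(tail -n 2 "$workdir/numbers.txt")

echo "head -n 3:"
echo "$first_three"
echo "tail -n 2:"
echo "$last_two"

rm -rf "$workdir"
```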
Viewing Compressed Files
- When working with compressed files, many standard commands cannot be used directly.
- For many commonly-used file and text manipulation programs, there is also a version especially designed to work directly with compressed files.
- These associated utilities have the letter `z` prefixed to their name.
- For example, we have utility programs such as `zcat`, `zless`, `zdiff` and `zgrep`.
- Note that if you run `zless` on an uncompressed file, it will still work and ignore the decompression stage.

| Command | Description |
| --- | --- |
| `zcat compressed-file.txt.gz` | To view a compressed file |
| `zless somefile.gz` or `zmore somefile.gz` | To page through a compressed file |
| `zgrep -i less somefile.gz` | To search inside a compressed file |
| `zdiff file1.txt.gz file2.txt.gz` | To compare two compressed files |

- There are also equivalent utility programs for other compression methods besides `gzip`. For example, we have `bzcat` and `bzless` associated with `bzip2`, and `xzcat` and `xzless` associated with `xz`.
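A short sketch of the `z` utilities in action, assuming a Linux system with `gzip` installed. The file `sample.txt` and its three lines are invented for the example.

```shell
# Hypothetical scratch directory and sample file.
workdir=$(mktemp -d)
printf 'alpha\nbeta\nLess is more\n' > "$workdir/sample.txt"

# gzip replaces sample.txt with sample.txt.gz.
gzip "$workdir/sample.txt"

# Count the lines of the compressed file without unpacking it on disk.
line_count=$(zcat "$workdir/sample.txt.gz" | wc -l)

# Case-insensitive search inside the compressed file.
match=$(zgrep -i less "$workdir/sample.txt.gz")

echo "lines: $line_count"
echo "match: $match"

rm -rf "$workdir"
```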
sed and awk
sed (stream editor)
- `sed` is a powerful text processing tool. It is used to modify the contents of a file or input stream, usually placing the contents into a new file or output stream.
- `sed` can filter text, as well as perform substitutions in data streams.
- Data from an input source/file (or stream) is taken and moved to a working space.
- The entire list of operations/modifications is applied over the data in the working space and the final contents are moved to the standard output space (or stream).

| Command | Usage |
| --- | --- |
| `sed -e command filename` | Specify editing commands at the command line, operate on file and put the output on standard out (e.g. the terminal) |
| `sed -f scriptfile filename` | Specify a scriptfile containing `sed` commands, operate on file and put output on standard out |
| `echo "I hate you" \| sed s/hate/love/` | Use `sed` to filter standard input, putting output on standard out |

The `-e` option allows you to specify multiple editing commands at the command line, one per `-e`. It is unnecessary if you only have one operation invoked.
sed Basic Operations
- `pattern` is the current string and `replace_string` is the new string:

| Command | Usage |
| --- | --- |
| `sed s/pattern/replace_string/ file` | Substitute first string occurrence in every line |
| `sed s/pattern/replace_string/g file` | Substitute all string occurrences in every line |
| `sed 1,3s/pattern/replace_string/g file` | Substitute all string occurrences in a range of lines |
| `sed -i s/pattern/replace_string/g file` | Save changes for string substitution in the same file |

- Use the `-i` option with care, because the action is not reversible. It is always safer to use `sed` without the `-i` option and then replace the file yourself, as shown in the following example:

      sed s/pattern/replace_string/g file1 > file2

  The above command reads `file1`, replaces all occurrences of `pattern` with `replace_string`, and writes the result to `file2`, leaving `file1` unchanged. The contents of `file2` can be viewed with `cat file2`.

  If you approve, you can then overwrite the original file with `mv file2 file1`.
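The difference between substituting the first occurrence and all occurrences on a line is easiest to see on a small sample. This is a sketch with made-up file contents; `file1` and `file2` live in a hypothetical scratch directory.

```shell
# Hypothetical scratch directory and sample input.
workdir=$(mktemp -d)
printf 'I hate rain, I hate fog\nNo match here\n' > "$workdir/file1"

# Without the g flag only the first occurrence on each line is replaced.
first_only=$(sed 's/hate/love/' "$workdir/file1")

# With the g flag every occurrence on each line is replaced.
all_matches=$(sed 's/hate/love/g' "$workdir/file1")

# Safe workflow: write the result to a second file and leave the original alone.
sed 's/hate/love/g' "$workdir/file1" > "$workdir/file2"
original_first_line=$(head -n 1 "$workdir/file1")

echo "$first_only"
echo "$all_matches"

rm -rf "$workdir"
```

Quoting the `sed` expression in single quotes, as done here, keeps the shell from interpreting any special characters in the pattern.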
- `awk` is used to extract and then print specific contents of a file and is often used to construct reports.
- It was created at Bell Labs in the 1970s and derived its name from the last names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan.
- `awk` has the following features:
  - It is a powerful utility and interpreted programming language.
  - It is used to manipulate data files, and for retrieving and processing text.
  - It works well with fields (containing a single piece of data, essentially a column) and records (a collection of fields, essentially a line in a file).

| Command | Usage |
| --- | --- |
| `awk 'command' file` | Specify a command directly at the command line |
| `awk -f scriptfile file` | Specify a file that contains the script to be executed |
awk Basic Operations
- The input file is read one line at a time, and, for each line, `awk` matches the given pattern in the given order and performs the requested action.
- The `-F` option allows you to specify a particular field separator character.
- For example, the `/etc/passwd` file uses `:` to separate the fields, so the `-F:` option is used with the `/etc/passwd` file.
- The command/action in `awk` needs to be surrounded with apostrophes (single quotes, `'`).

| Command | Usage |
| --- | --- |
| `awk '{ print $0 }' /etc/passwd` | Print entire file |
| `awk -F: '{ print $1 }' /etc/passwd` | Print first field (column) of every line, using `:` as the field separator |
| `awk -F: '{ print $1 $7 }' /etc/passwd` | Print first and seventh field of every line, joined with nothing in between (use `print $1, $7` to separate them with a space) |
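The field commands above can be tried on a small file in the same colon-separated layout as `/etc/passwd`. The two sample records below are invented for illustration and are not real accounts.

```shell
# Hypothetical scratch directory with a two-line file in /etc/passwd format.
workdir=$(mktemp -d)
printf 'root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000:Alice:/home/alice:/bin/zsh\n' \
  > "$workdir/passwd.sample"

# First field of every record.
names=$(awk -F: '{ print $1 }' "$workdir/passwd.sample")

# Adjacent expressions are concatenated with nothing in between.
glued=$(awk -F: '{ print $1 $7 }' "$workdir/passwd.sample")

# A comma makes awk insert its output field separator (a space by default).
spaced=$(awk -F: '{ print $1, $7 }' "$workdir/passwd.sample")

echo "$names"
echo "$glued"
echo "$spaced"

rm -rf "$workdir"
```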
File Manipulation Utilities
- `sort` is used to rearrange the lines of a text file, in either ascending or descending order according to a sort key.
- The default sort key is the order of the ASCII characters (i.e. essentially alphabetically); the exact ordering can vary with the locale settings.

| Syntax | Usage |
| --- | --- |
| `sort filename` | Sort the lines in the specified file, according to the characters at the beginning of each line |
| `cat file1 file2 \| sort` | Combine the two files, then sort the lines and display the output on the terminal |
| `sort -r filename` | Sort the lines in reverse order |
| `sort -k 3 filename` | Sort the lines by the 3rd field on each line instead of the beginning |

- When used with the `-u` option, `sort` checks for unique values after sorting the records (lines). It is equivalent to running `uniq` (which we shall discuss) on the output of `sort`.
- `uniq` removes duplicate consecutive lines in a text file and is useful for simplifying the text display. Because `uniq` requires that the duplicate entries be consecutive, one often runs `sort` first and then pipes the output into `uniq`; if `sort` is used with the `-u` option, it can do all this in one step.
- To remove duplicate entries from multiple files at once, use the following command:

      sort file1 file2 | uniq > file3

  or

      sort -u file1 file2 > file3

- To count the number of duplicate entries, use the following command (each output line is prefixed with the number of consecutive times it occurred):

      uniq -c filename

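As a quick illustration of why `sort` usually comes before `uniq`, the sketch below deduplicates and counts entries in an unsorted list. The file name `fruits.txt` and its contents are hypothetical.

```shell
# Hypothetical scratch directory with an unsorted list containing duplicates.
workdir=$(mktemp -d)
printf 'pear\napple\npear\nbanana\napple\npear\n' > "$workdir/fruits.txt"

# Sort and remove duplicates in one step.
unique=$(sort -u "$workdir/fruits.txt")

# Sort first so duplicates become consecutive, then count each entry.
counts=$(sort "$workdir/fruits.txt" | uniq -c)

echo "$unique"
echo "$counts"

rm -rf "$workdir"
```

Running `uniq -c` directly on the unsorted file would report several separate counts for `pear`, because its occurrences are not adjacent.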
`paste` accepts the following options:
- `-d` delimiters, which specify a list of delimiters to be used instead of tabs for separating consecutive values on a single line. Each delimiter is used in turn; when the list has been exhausted, `paste` begins again at the first delimiter.
- `-s`, which causes `paste` to append the data in series rather than in parallel; that is, in a horizontal rather than vertical fashion.
Using paste
- `paste` can be used to combine fields from different files, as well as combine lines from multiple files.
- For example, line one from `file1` can be combined with line one of `file2`, line two from `file1` can be combined with line two of `file2`, and so on.
- To paste contents from two files one can do:

      paste file1 file2

- The syntax to use a different delimiter is as follows:

      paste -d, file1 file2

- Common delimiters are space, tab, `|`, comma, etc.

`join` is an enhanced version of `paste`. It first checks whether the files share common fields and then joins the lines in two files based on a common field.
Using join
- To combine two files on a common field, at the command prompt type `join file1 file2` and press the Enter key. By default the first field of each line is used as the common field, and both files should be sorted on it.
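The contrast between `paste` and `join` shows up clearly when two files share a key in their first column. The files `names.txt` and `phones.txt` below are hypothetical, and the phone numbers are placeholders.

```shell
# Hypothetical scratch directory with two files keyed by the same first field.
workdir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$workdir/names.txt"
printf '1 555-0100\n2 555-0101\n3 555-0102\n' > "$workdir/phones.txt"

# paste glues corresponding lines together, here with a comma between them.
pasted=$(paste -d, "$workdir/names.txt" "$workdir/phones.txt")

# join matches lines on the first field and prints that field only once.
joined=$(join "$workdir/names.txt" "$workdir/phones.txt")

echo "$pasted"
echo "$joined"

rm -rf "$workdir"
```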
`split` is used to break up (or split) a file into equal-sized segments for easier viewing and manipulation, and is generally used only on relatively large files.
- By default, `split` breaks up a file into 1000-line segments.
- The original file remains unchanged, and a set of new files is created, each named with a common prefix followed by a suffix (`aa`, `ab`, `ac`, and so on).
- By default, the prefix is `x`. To split a file into segments, use the command `split infile`.
- To split a file into segments using a different prefix, use the command `split infile Prefix`.
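A brief sketch of the default behavior, assuming a 2500-line input so that the last segment is shorter than the others. The names `infile` and `part_` are illustrative only.

```shell
# Hypothetical scratch directory with a 2500-line input file.
workdir=$(mktemp -d)
seq 1 2500 > "$workdir/infile"

# Default split: 1000-line segments named xaa, xab, xac in the current directory.
(cd "$workdir" && split infile)

# Same split, but with a custom prefix: part_aa, part_ab, part_ac.
(cd "$workdir" && split infile part_)

# Count the segments and check the size of the last default segment.
default_segments=$(ls "$workdir" | grep -c '^x')
prefixed_segments=$(ls "$workdir" | grep -c '^part_')
last_segment_lines=$(wc -l < "$workdir/xac")

echo "default segments: $default_segments"
echo "prefixed segments: $prefixed_segments"
echo "lines in xac: $last_segment_lines"

rm -rf "$workdir"
```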
Regular Expressions and Search Patterns
- Regular expressions are text strings used for matching a specific pattern, or to search for a specific location, such as the start or end of a line or a word. Regular expressions can contain both normal characters and so-called meta-characters, such as `*` and `$`.

| Search Pattern | Usage |
| --- | --- |
| `.` (dot) | Match any single character |
| `a\|z` | Match a or z |
| `$` | Match end of a line |
| `^` | Match beginning of a line |
| `*` | Match preceding item 0 or more times |
grep and strings
- `grep` is extensively used as a primary text searching tool.
- It scans files for specified patterns and can be used with regular expressions, as well as simple strings.

| Command | Usage |
| --- | --- |
| `grep [pattern] filename` | Search for a pattern in a file and print all matching lines |
| `grep -v [pattern] filename` | Print all lines that do not match the pattern |
| `grep '[0-9]' filename` | Print the lines that contain any digit from 0 through 9 |
| `grep -C 3 [pattern] filename` | Print the matching lines together with context (the specified number of lines above and below each match). Here, the number of lines is specified as 3 |
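The options in the table can be compared side by side on a tiny log file. The file `app.log` and its four lines are made up for this sketch.

```shell
# Hypothetical scratch directory and sample log file.
workdir=$(mktemp -d)
printf 'error: disk full\ninfo: started\nerror code 42\ninfo: done\n' > "$workdir/app.log"

# Lines that contain the string "error".
matching=$(grep error "$workdir/app.log")

# Lines that do NOT contain "error".
non_matching=$(grep -v error "$workdir/app.log")

# Lines that contain at least one digit; the quotes stop the shell from
# treating the brackets as a filename pattern.
with_digits=$(grep '[0-9]' "$workdir/app.log")

# Lines that begin with "info", using the ^ anchor.
info_lines=$(grep '^info' "$workdir/app.log")

echo "$matching"
echo "$non_matching"
echo "$with_digits"
echo "$info_lines"

rm -rf "$workdir"
```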
strings
- `strings` is used to extract all printable character strings found in the file or files given as arguments. It is useful in locating human-readable content embedded in binary files; for text files one can just use `grep`.
- For example, to search for the string `my_string` in a spreadsheet:

      strings book1.xls | grep my_string

Miscellaneous Text Utilities
- The `tr` utility is used to translate specified characters into other characters or to delete them. The general syntax is as follows:

      tr [options] set1 [set2]

- The items in the square brackets are optional.
- `tr` requires at least one argument and accepts a maximum of two.
- The first, designated `set1`, lists the characters in the text to be replaced or removed.
- The second, `set2`, lists the characters that are to be substituted for the characters listed in the first argument.
- Sometimes these sets need to be surrounded by apostrophes (single quotes, `'`) so that the shell does not interpret characters that have a special meaning to it. It is usually safe (and may be required) to use single quotes around each of the sets.

| Command | Usage |
| --- | --- |
| `tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ` | Convert lower case to upper case |
| `tr '{}' '()' < inputfile > outputfile` | Translate braces into parentheses |
| `echo "This is for testing" \| tr '[:space:]' '\t'` | Translate white-space to tabs |
| `echo "This is for testing" \| tr -s '[:space:]'` | Squeeze repetition of characters using `-s` |
| `echo "the geek stuff" \| tr -d 't'` | Delete specified characters using the `-d` option |
| `echo "my username is 432234" \| tr -cd '[:digit:]'` | Complement the sets using the `-c` option |
| `tr -cd '[:print:]' < file.txt` | Remove all non-printable characters from a file |
| `tr -s '\n' ' ' < file.txt` | Join all the lines in a file into a single line |
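A few of these `tr` invocations are sketched below with their results captured in variables. The input strings are arbitrary, and the character classes `[:lower:]` and `[:upper:]` are used in place of the spelled-out alphabets as a shorter equivalent.

```shell
# Translate lower case to upper case using character classes.
upper=$(echo "This is for testing" | tr '[:lower:]' '[:upper:]')

# Squeeze runs of repeated spaces down to a single space.
squeezed=$(echo "too    many   spaces" | tr -s ' ')

# Delete every occurrence of the letter t.
without_t=$(echo "the geek stuff" | tr -d 't')

# Keep only digits: -c complements the set, -d deletes everything in it.
digits=$(echo "my username is 432234" | tr -cd '[:digit:]')

echo "$upper"
echo "$squeezed"
echo "$without_t"
echo "$digits"
```

The sets are quoted throughout so that the shell passes the brackets to `tr` unchanged.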
- `tee` takes the output from any command, and, while sending it to standard output, it also saves it to a file.
- In other words, it tees the output stream from the command: one stream is displayed on the standard output and the other is saved to a file.
- For example, to list the contents of a directory on the screen and save the output to a file, at the command prompt type:

      ls -l | tee newfile

  Typing `cat newfile` will then display the output of `ls -l`.

- `wc` (word count) counts the number of lines, words, and characters in a file or list of files.

| Option | Description |
| --- | --- |
| `-l` | Displays the number of lines |
| `-c` | Displays the number of bytes |
| `-w` | Displays the number of words |

- `cut` is used for manipulating column-based files and is designed to extract specific columns.
- The default column separator is the `tab` character. A different delimiter can be given as a command option.
- For example, to display the third column delimited by a blank space, at the command prompt type:

      ls -l | cut -d" " -f3

  Note that `cut` treats every single space as a separator, so the padding that `ls -l` uses to align its columns can shift which text ends up in field 3.
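The three utilities combine naturally in a pipeline. In the sketch below, the colon-separated records and the file name `users.txt` are invented for illustration.

```shell
# Hypothetical scratch directory.
workdir=$(mktemp -d)

# tee shows the records on standard output and also saves them to users.txt.
printf 'alice:x:1000\nbob:x:1001\n' | tee "$workdir/users.txt"

# Count the lines and words that were saved.
line_count=$(wc -l < "$workdir/users.txt")
word_count=$(wc -w < "$workdir/users.txt")

# Extract the third colon-delimited column.
third_column=$(cut -d: -f3 "$workdir/users.txt")

echo "lines: $line_count, words: $word_count"
echo "$third_column"

rm -rf "$workdir"
```

Using a single-character delimiter such as `:` avoids the alignment problem that arises when cutting space-padded output.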