We are currently meeting as a reading group to work through the examples in Haddock and Dunn's Practical Computing for Biologists. We have just completed Chapter 3, and this post walks through the section called "Putting it all together" (pg. 38), showing how to work through the example from the command line with a simple shell script. The input is the example file "Ch3observations.txt".
To start, the easiest way to manipulate the file is to change all the field separators (delimiters) to one common symbol. I chose a comma ",", since that is the delimiter used in *.csv files, which are easily imported into Excel, R and other programs used for data analysis. The whole process takes 7 lines in a simple shell script.
1) The first step is to remove the final column from the tab-delimited input file using awk, storing the result in a temporary file, which I have called "test.remove", that will be deleted after the script runs.
awk -F '\t' 'BEGIN{OFS= "," ;} {print $1, $2, $3}' ./Ch3observations.txt > test.remove
Here awk calls the program awk, and -F '\t' defines the field separator of the input file as a tab. Likewise, BEGIN{OFS= "," ;} defines the output field separator as a comma. The program then prints the 1st through 3rd columns to the temp file with {print $1, $2, $3}. Awk refers to columns as $ followed by the column number: $1 stands for the first column and $5 for the fifth.
2) The second step replaces the word "at" and the ":" between the hour and minute with commas, using the find-and-replace command sed.
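To see what this awk call does without touching the real file, here is a minimal sketch run on a single made-up line. The date follows the "13 January, 1752" format quoted in this post; the time, the two numbers and the comment text are placeholder values I invented for illustration.

```shell
# Hypothetical sample line: four tab-separated fields (values are made up).
step1=$(printf '13 January, 1752 at 13:53\t-1.414\t5.781\tsome comment\n' |
    awk -F '\t' 'BEGIN{OFS=","} {print $1, $2, $3}')

# The fourth field is dropped and the tabs between the rest become commas.
printf '%s\n' "$step1"
```

The comma that was already inside the date is untouched; awk only inserts OFS between the fields it prints.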
sed 's/at/,/' ./test.remove | sed 's/:/,/' | sed 's/^[[:space:]]//' > test2.remove
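The pipe symbol "|" lets you use the output of one command as the input to the next. The final part of the command removes a single whitespace character from the beginning of the line (^[[:space:]]); to strip every leading space you would write ^[[:space:]]* instead. Note also that s/at/,/ replaces the first "at" found anywhere on the line, so it relies on no "at" appearing earlier in the date.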
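A small sketch of this pipeline on one line may help. The sample line is made up (it is what step 1 would give for a hypothetical observation), and the second half shows how the leading-whitespace pattern behaves with and without a trailing *.

```shell
# Hypothetical output of step 1 for one made-up observation.
step1='13 January, 1752 at 13:53,-1.414,5.781'

# Same three sed commands as in the script, chained with pipes.
step2=$(printf '%s\n' "$step1" | sed 's/at/,/' | sed 's/:/,/' | sed 's/^[[:space:]]//')
printf '%s\n' "$step2"

# ^[[:space:]] matches exactly one leading whitespace character;
# ^[[:space:]]* matches all of them.
padded='   leading spaces'
one=$(printf '%s\n' "$padded" | sed 's/^[[:space:]]//')
all=$(printf '%s\n' "$padded" | sed 's/^[[:space:]]*//')
printf '[%s]\n[%s]\n' "$one" "$all"
```

Each s/…/…/ without a g flag changes only the first match on a line, which is why only the colon in the time is replaced.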
3) There is one last field separator that needs to be added. "13 January, 1752" requires a comma between the 13 and the January so that each can be put in a separate column. This is also done with sed.
sed 's/\([[:digit:]]*\)\([[:space:]]\)/\1, /' ./test2.remove > ./test3.remove
To do this, sed searches for a digit ([[:digit:]]) repeated zero or more (*) times, followed by a whitespace character ([[:space:]]). The match is replaced by the captured digits (\1) followed by a comma and a space. The digits of the day (13) are captured by enclosing that part of the pattern in escaped parentheses, \( and \).
4) Next the months are abbreviated to 3 letters. This is again done by capturing and recalling groups in sed.
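Here is a minimal sketch of the capture-and-recall idea on one made-up line (the values after the date are placeholders, not data from the book).

```shell
# Hypothetical output of step 2 for one made-up observation.
step2='13 January, 1752 , 13,53,-1.414,5.781'

# \(...\) captures the run of digits; \1 in the replacement recalls it.
# Only the first match on the line (the day) is changed.
step3=$(printf '%s\n' "$step2" | sed 's/\([[:digit:]]*\)\([[:space:]]\)/\1, /')
printf '%s\n' "$step3"
```

The second group, \([[:space:]]\), is captured but never recalled; the replacement simply writes a literal space after the comma.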
sed 's/\([[:lower:]]\)\([[:lower:]]\)\([[:lower:]]\)\([[:lower:]]*\)/\1\2./' ./test3.remove > ./test4.remove
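[[:lower:]] is the POSIX character class for lower-case letters. The pattern captures the first three lower-case letters in a row plus any that follow; only the first two (\1\2) are kept, and together with the capital letter before them they give the three-letter abbreviation followed by a period.
5) A header row is then written to the final output file "Observations.csv" using the command echo.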
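The sketch below tries the same sed expression on three made-up dates. The helper name abbrev is hypothetical and exists only for this illustration; the third date assumes a month that is already abbreviated in the input.

```shell
# Hypothetical helper wrapping the sed command from step 4.
abbrev() {
    printf '%s\n' "$1" |
        sed 's/\([[:lower:]]\)\([[:lower:]]\)\([[:lower:]]\)\([[:lower:]]*\)/\1\2./'
}

jan=$(abbrev '13, January, 1752')   # "anuary" matches, "an." is kept after the J
mar=$(abbrev '17, March, 1961')     # "arch" matches, "ar." is kept after the M
oct=$(abbrev '1, Oct., 2002')       # only two lower-case letters in a row: no match
printf '%s\n' "$jan" "$mar" "$oct"
```

Because the pattern needs three lower-case letters in a row, a month that is already short is left alone.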
echo "Year, Mon., Day, Hour, Minute, X data, Y data" > Observations.csv
6) The next line appends the data to the file and changes the order of the columns from day, month, year to year, month, day using awk.
awk -F ',' 'BEGIN{OFS= "," ;} {print $3, $2, $1, $4, $5, $6, $7 }' ./test4.remove >> Observations.csv
Notice that ./test4.remove > Observations.csv would have overwritten the header row that we just created. To append to the file instead, we need to use ./test4.remove >> Observations.csv.
7) Finally, to remove the temp files, the command rm -f *.remove deletes every file whose name ends in .remove. This is why it is useful to always give temp files the same extension, and one that is NOT a common extension: if you used *.txt, all the text files in that folder would be deleted too. Please find my commented script attached; I hope it helps.
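The difference between > and >> is easy to check on a scratch file. This is a minimal sketch: the data line is made up, and it writes to a file from mktemp (assumed to be available on your system) rather than to Observations.csv.

```shell
out=$(mktemp)   # scratch file standing in for Observations.csv
step4='13, Jan., 1752 , 13,53,-1.414,5.781'   # hypothetical output of step 4

# Write the header with >, then append the reordered data with >>.
echo "Year, Mon., Day, Hour, Minute, X data, Y data" > "$out"
printf '%s\n' "$step4" |
    awk -F ',' 'BEGIN{OFS=","} {print $3, $2, $1, $4, $5, $6, $7}' >> "$out"
appended=$(cat "$out")

# The same awk command with > replaces the file, so the header is lost.
printf '%s\n' "$step4" |
    awk -F ',' 'BEGIN{OFS=","} {print $3, $2, $1, $4, $5, $6, $7}' > "$out"
overwritten=$(cat "$out")

printf 'with >>:\n%s\n\nwith >:\n%s\n' "$appended" "$overwritten"
rm -f "$out"
```

In this sketch the spaces that surrounded the original commas stay attached to the fields, so the year comes out as " 1752 " with a space on each side.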
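If you want to convince yourself of what the wildcard will and will not delete, here is a small sketch run in a throwaway directory. It assumes mktemp -d is available, and notes.txt is a made-up file name.

```shell
scratch=$(mktemp -d)   # throwaway directory so nothing real is at risk
touch "$scratch/test.remove" "$scratch/test2.remove" "$scratch/notes.txt"

# Only names ending in .remove match the wildcard.
rm -f "$scratch"/*.remove
remaining=$(ls "$scratch")
printf '%s\n' "$remaining"

rm -rf "$scratch"   # clean up the throwaway directory itself
```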

observation_ex.sh