TDM 20100: Project 5 — 2023

Motivation: awk is a utility designed for text processing. While Python and R definitely have their place in the data science world, awk is a handy way to process data with just one line of analysis.

Context: awk is a powerful tool that can be used to perform a variety of the tasks for which we previously used other UNIX utilities. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner, in pipelines of tools.

Scope: awk, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pipe the output data from one tool to become the input data for the next tool.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

While the UNIX tools we’ve used up to this point are very useful, awk enables many new capabilities, and can even replace major functionality of other tools.

awk is a text-processing tool in Unix. It scans and processes text based on patterns. awk is a versatile tool, ideally used for tasks in data that is organized in columns. It provides, pattern matching, field-based calculations, and file formatting tasks, all performed efficiently from the command line.

Here is an example to use awk to find the number of people in each salary range in the data set /anvil/projects/tdm/data/stackoverflow/processed/2011.csv

cat /anvil/projects/tdm/data/stackoverflow/processed/2011.csv | awk -F";" '{print $16}' | sort | uniq -c | sort -n

The cat command prints the entire file,

but instead of outputing the entire file, we send it to the awk command.

In the awk command, we use the semicolor as a separator, and we print the 16th field, which contains the salary information.

Then we sort this data, so that all entries that are the same are next to each other.

Then we find how many values of each type occur.

Finally, we sort the responses according to how many times that they occur.

To make this example more interesting, we can simply add the 14th field as well, and then we are classifying responses according the salary range and according to the person’s favorite operating system.

cat /anvil/projects/tdm/data/stackoverflow/processed/2011.csv | awk -F";" '{print $16, $14}' | sort | uniq -c | sort -n

Here is another example:

The prices of the purchases for this file are in the 19th field:

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{print $19}' | head

We can add all of the prices as follows:

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{myprices += $19} END{print myprices}'

There are 283 million dollars of sales altogether!

We can find the amount of sales of BOURBON like this:

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | grep "BOURBON" | awk -F";" '{myprices += $19} END{print myprices}'

or like this:

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{if ($0 ~ /BOURBON/) {myprices += $19}} END{print myprices}'

Either way, bourbon accounts for 24 million dollars of the sales.

Champagne sales, on the other hand, are only 10206 dollars together:

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{if ($0 ~ /CHAMPAGNE/) {myprices += $19}} END{print myprices}'

or equivalently:

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | grep "CHAMPAGNE" | awk -F";" '{myprices += $19} END{print myprices}'

Questions

Question 1 (1 pt)

  1. What is the total cost of purchases with WHISKIES in the title?

  2. What is the total cost of all purchases from CEDAR RAPIDS (not just WHISKIES; consider all purchases)

Question 2 (2 pts)

  1. What Store Name had the largest number of purchases (not the largest total cost, but the largest number of purchases; please consider each line to be 1 purchase)

  2. Using the Store Name identified Question 2A, what was the total cost of all purchases from this Store Name?

Question 3 (2 pt)

  1. Please compute the total volume (in liters) of all purchases sold in the file iowa_liquor_sales_cleaner.txt

  2. Please compute the total volume (in liters) of VODKA 80 PROOF sold in the file iowa_liquor_sales_cleaner.txt

Question 4 (2 pts)

  1. When looking at which location has the largest numbers of purchases, if we use the address (instead of the store name), we should include the Address, City, and Zip Code. Using these three variables (together), what location has the largest number of purchases?

  2. Does your answer to Question 4A agree with your answer to Question 2A? How do you know? (Please explain why, and/or use some analysis to justify your answer.)

Question 5 (1 pt)

  1. awk is powerful, and this liquor dataset is pretty interesting! We haven’t covered everything awk (and we won’t). Look at the dataset and ask yourself an interesting question about the data. Use awk to solve your problem (or, at least, get you closer to answering the question). Optionally: You can explore various stackoverflow questions about awk and awk guides online. Try to incorporate an awk function you haven’t used, or a awk trick you haven’t seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new.

Please be sure to put a brief explanation about your work in Question 5 using awk to study something interesting that YOU FOUND in the data in Question 5.

You do not need to limit yourself to just use awk, but try to do as much using just awk as you are able.

Project 05 Assignment Checklist

  • Jupyter Lab notebook with your code and comments for the assignment

    • firstname-lastname-project05.ipynb.

  • A .sh text file with all of your bash code and comments written inside of it

    • bash code and comments used to solve questions 1 through 5

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.