TDM 20100: Project 6 — 2022

Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk.

Context: This is the second of three projects where we introduce awk. awk is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.

Scope: awk, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/craigslist/vehicles_clean.txt

  • /anvil/projects/tdm/data/donorschoose/Donations.csv

  • /anvil/projects/tdm/data/whin/weather.csv

Questions

Question 1

Use awk to determine how many columns and rows are in the following dataset: /anvil/projects/tdm/data/craigslist/vehicles_clean.txt.

Make sure the output is formatted as follows.

output
rows: 12345
columns: 12345
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

What are the possible "conditions" of the vehicles being sold: /anvil/projects/tdm/data/craigslist/vehicles_clean.txt? Use awk to answer this question. How many cars of each condition are in the dataset? Make sure to format the output as follows.

output
Condition Number of cars
--------- --------------
AAA       12345
bb        99999
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Use awk to determine the years (for example, 2020, 2021, etc) of the donations in the dataset: /anvil/projects/tdm/data/donorschoose/Donations.csv?

The substr function in awk will be useful.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Use awk to determine the total donations (in dollars) by year: /anvil/projects/tdm/data/donorschoose/Donations.csv?

Use printf and the unix column utility to format the output as follows.

output
Year Donations in dollars
2020 $1234.56
2021 $9999.99
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Use awk to determine the average temperature_high by month: /anvil/projects/tdm/data/whin/weather.csv. Make sure the output is sorted by month (you can use sort for that). If you are feeling adventurous, try and use awk to output a horizontal bar plot using just ascii and awk. This last part is not required, but could be fun.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.