Notes

Command Line Cheat Sheet for Data Wrangling

This document lists a series of examples of managing data on the command line. All commands are run as shown on macOS 26 (Tahoe), but should work just as well on Linux.

These notes are from a presentation at the Computational Humanities (CMP 25) programme on the 20th of August 2025. They are deliberately terse and intended as inspiration for how to get started with using the command line for exploring and validating files.

When techniques like the ones shown in these notes are used in a data wrangling pipeline, they should be contained in scripts to facilitate reproducibility. In other words, what is shown here is meant as inspiration—not as examples of how to create actual data cleaning pipelines.

Get started and create some synthetic test data

Create a folder to use as a sandbox. Folders in a UNIX system are typically called directories.


exploring-files $ mkdir exploring-files
	

change focus to the new folder


exploring-files $ cd exploring-files
	

Maybe open the folder in the Finder. This is a macOS-specific command, but Windows has a corresponding command that can be found on the WWW.


exploring-files $ open .
	

List the files in this folder. This ls command has a -l after its name. We call this an option. All the commands shown in this cheat sheet have many options, which can all be explored using the manual, e.g., with the command man ls for ls.

Let's list all files in this folder


exploring-files $ ls -l
	      
total 0
	  

Of course, we have no files, as we just created this folder ;-).

Now, adjust the prompt. The prompt is the letters and numbers appearing in front of your cursor, waiting for your input. I don't know how the prompt looks on your specific system, but you can adjust it like this


exploring-files $ export PS1=" %1d :> "
	

Go up one level in the folder hierarchy. cd means change directory and .. means up.


exploring-files :> cd ..
	

Go back to where we came from. - means: the previous folder. The command line is built for efficient use and has a lot of shortcuts.


exploring-files :> cd -
	

I've created a prompt to make a Large Language Model (LLM) produce some synthetic data. Which LLM you use does not matter. In this cheat sheet, I'll use a local model managed by the LM Studio application. I use the very small model called qwen3-4b-2507, which seems sufficient for the purpose of this cheat sheet.


exploring-files :> cat ../create-data.prompt

create a pseudo data set in CSV format 100 rows with the following five columns: <id> <name> <age> <gender> <occupation> <voting party>, where

<id>: a running unique id
<name>: a made up danish sounding <given name> <surname>
<age>: a integer taken from a distribution normal distribution
<gender>: M (for male), F (for female) or "-" for undefined
<birth country>: This is simulation a Danish survey, and this variable represents people born in countries other than Denmark. Use counties from all over the world and remember the global south
<education>: type of education, e.g., elementary school, apprenticeship, college, university
<occupation>: an occupation related to education with outliers. E.g., have some college people working as unskilled work
<religion>: which religion does the person adhere to. Include all world religions and also a few "ateist" and "agnostic"
<voting party>: <far right>, <right>, <center>, <left>, <far left> with a normal distribution

output ONLY the data and no preliminary or ending notes!
	

LM Studio provides a command line integration. The | character is called a pipe and sends the output of the left command into the right command. I.e., the output of cat is sent into lms. The > character sends the output of the left command into a file named by the right word. So, this command line feeds the file create-data.prompt from the parent folder into the LLM and stores the output of the LLM in a file called survey.csv.


exploring-files :> cat ../create-data.prompt | lms chat > survey.csv
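The pipe-and-redirect mechanics can be tried without an LLM. A minimal sketch (the file name shout.txt is just an example) pipes echo through tr and redirects the result into a file:

```shell
# Pipe the output of echo into tr (lowercase to uppercase),
# then redirect the result into the file shout.txt.
echo "hello pipes" | tr 'a-z' 'A-Z' > shout.txt
cat shout.txt
# HELLO PIPES
```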
	

Explore the file

Let's see what we got


exploring-files :> ls -lh

total 16
-rw-r--r--  1 au15929  staff   7.0K Aug 21 11:39 survey.csv
	

How much data did we get? The wc command reports the number of lines, words, and bytes.


exploring-files :> wc survey.csv

101     277    7128 survey.csv
	

Just count the number of lines (-l)


exploring-files :> wc -l survey.csv

101 survey.csv
	

List the first five lines


exploring-files :> head -5 survey.csv

id,name,age,gender,birth_country,education,occupation,religion,voting_party
1,Line Hansen,34,M,Denmark,university,teacher,Christianity,center
2,Anette Møller,29,F,Norway,apprenticeship,barista,Protestant,right
3,Anders Jørgensen,41,M,Sweden,college,librarian,agnostic,center
4,Ingrid Larsen,38,F,Denmark,elementary school,caretaker,Christianity,far left
	

List the last five lines


exploring-files :> tail -5 survey.csv

96,Thomas Søndergaard,41,M,Denmark,university,academic,Protestant,center
97,Line Nygaard,39,F,India,high school,barista,Protestant,center
98,Jonas Højgaard,53,M,Denmark,college,teacher,Protestant,center
99,Amalie Fenger,23,F,China,elementary school,weaver,Protestant,right
100,Preben Fjord,46,M,Denmark,university,professor,Protestant,center
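head and tail can also be combined to slice a range out of the middle of a file; a sketch that prints rows 4 through 7 of survey.csv:

```shell
# Keep the first 7 lines, then the last 4 of those,
# i.e. lines 4 to 7 of the file.
head -7 survey.csv | tail -4
```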
	

What kind of file did we get? The file command is used to identify file types and can also detect the character encoding used for a given file.


exploring-files :> file --mime-type --mime-encoding survey.csv

survey.csv: text/csv; charset=utf-8
	

The file extension is of no significance to these commands:


exploring-files :> cp survey.csv unknown-file
exploring-files :> file --mime-type --mime-encoding unknown-file

unknown-file: text/csv; charset=utf-8
	

Let's extract values from the second column


exploring-files :> cat survey.csv | cut -f 2 | head -5

id,name,age,gender,birth_country,education,occupation,religion,voting_party
1,Line Hansen,34,M,Denmark,university,teacher,Christianity,center
2,Anette Møller,29,F,Norway,apprenticeship,barista,Protestant,right
3,Anders Jørgensen,41,M,Sweden,college,librarian,agnostic,center
4,Ingrid Larsen,38,F,Denmark,elementary school,caretaker,Christianity,far left
	

That did not work! It gave us all fields. That's because cut by default operates on data where columns are separated by the tab character. This behaviour can be changed with the -d option like:


exploring-files :> cat survey.csv | cut -d, -f 2 | head -5

name
Line Hansen
Anette Møller
Anders Jørgensen
Ingrid Larsen
	

We can now easily sort the data. Alas, we have a lot of name reuse, which is an artefact of using a very small (but fast) LLM.


exploring-files :> cat survey.csv | cut -d, -f 2|sort | head -10

Amalie Andersen
Amalie Fenger
Amalie Fenger
Amalie Fenger
Amalie Fenger
Amalie Højgaard
Amalie Højgaard
Amalie Krog
Amalie Søndergaard
Amalie Søndergaard
	

Let's look at unique names


exploring-files :> cat survey.csv | cut -d, -f 2|sort|uniq | tail -5

Thomas Kjær
Thomas Madsen
Thomas Nygaard
Thomas Søndergaard
Thomas Sørensen
	

Let's count the number of different names


exploring-files :> cat survey.csv | cut -d, -f 2|sort | uniq | wc -l

75
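With the -c option, uniq also counts how many times each name occurs; piping through sort -rn then puts the most reused names on top. A sketch:

```shell
# uniq -c prefixes each unique line with its count;
# sort -rn orders by that count, largest first.
cut -d, -f 2 survey.csv | sort | uniq -c | sort -rn | head -5
```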
	

How many do we have with the name 'Madsen'?


exploring-files :> grep Madsen survey.csv

11,Stig Madsen,49,M,Canada,apprenticeship,plumber,Christianity,right
19,Thomas Madsen,42,M,Denmark,elementary school,unskilled laborer,agnostic,right
35,Thomas Madsen,56,M,Denmark,apprenticeship,plumber,agnostic,right
62,Elisabeth Madsen,40,F,Indonesia,high school,weaver,Christianity,center
79,Anders Madsen,36,M,Denmark,high school,unskilled laborer,Protestant,center
93,Elisabeth Madsen,37,F,Chile,high school,weaver,Protestant,center
	

Count them with the -c option, which counts matching lines


exploring-files :> grep -c Madsen survey.csv

6
	

Now, let's look at the distribution of occupations based on the F and M values in the gender column.


exploring-files :> grep ',M,' survey.csv | head -5
	

First make two new files, one for the unique occupations for the M category and one for the F category.


exploring-files :> grep ',M,' survey.csv | cut -d, -f7|sort|uniq>m
exploring-files :> grep ',F,' survey.csv | cut -d, -f7|sort|uniq>f
	

Then compare these two lists. < means the value is only in the f file and > only in the m file, and yes, the small Qwen LLM does have an occupation bias based only on the letters F and M!


exploring-files :> diff f m

1,6c1,7
< barista
< barmaid
< caretaker
< doctor
< domestic worker
< nurse
---
> academic
> agriculturist
> carpenter
> construction worker
> engineer
> librarian
> plumber
8,9d8
< street vendor
< student
11c10,11
< weaver
---
> university lecturer
> unskilled laborer
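For sorted lists like these, the comm command is an alternative to diff: it prints three columns (lines only in the first file, lines only in the second, lines in both), and columns can be suppressed. A sketch that lists the occupations appearing only in the f file:

```shell
# comm compares two sorted files; -23 suppresses column 2
# (lines only in m) and column 3 (lines in both),
# leaving the occupations found only in f.
comm -23 f m
```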
	

Other data formats

As an experiment, let us try to use the LLM for transforming the data from CSV.


exploring-files :> head -5 survey.csv|lms chat -p "convert to json" > survey.json
exploring-files :> head -5 survey.csv|lms chat -p "convert to xml" > survey.xml
	

Did it produce any files?


exploring-files :> ls -hl

total 64
-rw-r--r--  1 au15929  staff   102B Aug 21 11:56 f
-rw-r--r--  1 au15929  staff   136B Aug 21 11:56 m
-rw-r--r--  1 au15929  staff   7.0K Aug 21 11:39 survey.csv
-rw-r--r--  1 au15929  staff   1.0K Aug 21 12:06 survey.json
-rw-r--r--  1 au15929  staff   1.3K Aug 21 12:06 survey.xml
-rw-r--r--  1 au15929  staff   7.0K Aug 21 11:45 unknown-file
	

Yes, we got a new JSON and a new XML file, and they are identified correctly


exploring-files :> file --mime-type survey.*

survey.csv:  text/csv; charset=utf-8
survey.json: application/json; charset=utf-8
survey.xml:  text/xml; charset=utf-8
	

What about an Excel sheet?


exploring-files :> head -5 survey.csv|lms chat -p "convert to xlsx"

I can't directly create or send files like `.xlsx` in this chat interface. However, I can provide you with **Python code** that will convert your data into an Excel file (`.xlsx`) using the `pandas` and `openpyxl` libraries.

### ✅ Here's the Python script to convert your data to an XLSX file:

```python
import pandas as pd
...
	

No, it did not work. Instead, we got a Python program for the transformation. I'll leave it up to the reader to try whether it works...

Encodings

I've created a version of the survey.csv file in the Latin 1 encoding. The sed command is a so-called stream editor, i.e., it takes some input, edits it using a very terse editing language, and outputs the result. We will see more of this command later, but here I just use it to print only lines 4 to 7


exploring-files :> sed -n '4,7p' survey-latin1

3,Sofia Nielsen,28,F,India,University,Doctor,Muslim,right
4,Oscar Larsen,52,M,Sweden,Apprenticeship,Construction Worker,Christian,far right
5,Astrid M¯ller,31,F,Italy,University,Engineer,Christian,left
6,Mikkel Hansen,41,M,Denmark,Elementary School,Unskilled Work,Atheist,far left
	

Line five has the name Astrid M¯ller, where it should have been Astrid Møller, but as the file has an encoding different from that of the terminal, it displays incorrectly.

We can get file to detect the actual encoding of the file


exploring-files :> file --mime-encoding survey-latin1

survey-latin1: iso-8859-1
	

If we want to use this file, the best thing would be to transform it into the more modern encoding called UTF-8. To that end, we have the iconv command. As of today, it can handle 873 different encodings!


exploring-files :> iconv -l | wc -w
873
	

Now, transform from Latin 1 to UTF-8


exploring-files :> iconv -f latin1 -t utf8 survey-latin1 > survey-utf8
exploring-files :> file --mime-encoding survey-utf8

survey-utf8: utf-8
	

Now, what about a file with a mix of encodings?

	    
exploring-files :> file --mime-encoding survey-bad-encoding-latin1

survey-bad-encoding-latin1: unknown-8bit
	

In such cases, only manual inspection can help you. When you have identified the problems, it is very much possible to create commands and put them into scripts to automate the correction of many files.
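One way to locate the offending bytes is to round-trip the file through iconv and discard the output: iconv fails at the first illegal sequence and reports where it occurred. A sketch:

```shell
# Validate the file as UTF-8: iconv exits with an error
# message pointing at the first illegal byte sequence.
iconv -f utf-8 -t utf-8 survey-bad-encoding-latin1 > /dev/null
```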

Structural errors

What if we have a CSV file with structural errors? Look closely at the row with id 2 in this file


exploring-files :> head -5 survey-field-error

id,name,age,gender,birth country,education,occupation,religion,voting party
1,Emma Hansen,34,F,Germany,University,Software Engineer,Christian,left
2,Lars Jensen,45,Denmark,College,Teacher,Buddhist
3,Sofia Nielsen,28,F,India,University,Doctor,Muslim,right
4,Oscar Larsen,52,M,Sweden,Apprenticeship,Construction Worker,Christian,far right
	

Lars Jensen has no F or M letter and therefore one field fewer than the rest. We can still extract the 9th field of all rows with cut without running into system errors. We will just get an empty value, but we cannot deduce which field is missing.


exploring-files :> cut -d, -f9 survey-field-error|head -5
	    
voting party
left

right
far right
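With awk, we can deduce more: NF holds the number of fields on each line and NR the line number, so broken rows can be reported directly. A sketch that flags every row not having exactly 9 comma-separated fields:

```shell
# Report the line number and field count of every row
# that does not have exactly 9 fields.
awk -F, 'NF != 9 { print NR ": " NF " fields" }' survey-field-error
```

On the file above, the Lars Jensen row would be reported as having 7 fields.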
	

If we then try to ingest the data file into a system that transforms the CSV data into an internal representation, it will hopefully fail. Such systems could be Python Pandas, R Tidyverse, SPSS, or, as in this example, the Wolfram Language


wl 'Import["survey-field-error","CSV"]//ToTabular'
	    
...cannot be converted to a Tabular object since it is not a list of rows that are all associations or lists of the same length.
	

Broken data

Okay, so here is another issue that I have encountered in the wild. All characters in a file have a number representation. Simply put, a could be 1, b could be 2, and so on. This number is then represented as a binary number and can therefore be stored in computer memory or on a storage system. Besides the characters that we are able to see, we also need codes for e.g., space, newline, tabulator and so on. One of these invisible codes is 0. So, what happens if we have the 0 code within a string of characters?

For illustration, I have created such a file, where I inserted the 0 (or null) character after the E in Emma. If we then look at the file using head, it looks fine.


exploring-files :> head -5 survey-null-char.csv
	    
id,name,age,gender,birth country,education,occupation,religion,voting party
1,Emma Hansen,34,F,Germany,University,Software Engineer,Christian,left
2,Lars Jensen,45,M,Denmark,College,Teacher,Buddhist,center
3,Sofia Nielsen,28,F,India,University,Doctor,Muslim,right
4,Oscar Larsen,52,M,Sweden,Apprenticeship,Construction Worker,Christian,far right
	

One could also open it in a spreadsheet program, and might not see anything wrong.

But then, if we try to count the length of the names in the second column, we see something different


exploring-files :> awk -F, '{print $2, length($2)}' survey-null-char.csv | head -5

name 4
E 1
Lars Jensen 11
Sofia Nielsen 13
Oscar Larsen 12
	

Emma's name is now just an E! That is because in a lot of systems, the null character signals end-of-string. To awk, the string of characters representing Emma's name ends after the initial E. So, invisible stuff in your data can have serious consequences.

If we really want to see how the file looks, we can use the hexdump command, which displays the codes for all characters in a file. In the right column, we can see a dot after the E in Emma. If we then find the same position in the middle column, we read 00 and have identified the problem.


exploring-files :> hexdump -C survey-null-char.csv|sed -n '1,8p'

00000000  69 64 2c 6e 61 6d 65 2c  61 67 65 2c 67 65 6e 64  |id,name,age,gend|
00000010  65 72 2c 62 69 72 74 68  20 63 6f 75 6e 74 72 79  |er,birth country|
00000020  2c 65 64 75 63 61 74 69  6f 6e 2c 6f 63 63 75 70  |,education,occup|
00000030  61 74 69 6f 6e 2c 72 65  6c 69 67 69 6f 6e 2c 76  |ation,religion,v|
00000040  6f 74 69 6e 67 20 70 61  72 74 79 0a 31 2c 45 00  |oting party.1,E.|
00000050  6d 6d 61 20 48 61 6e 73  65 6e 2c 33 34 2c 46 2c  |mma Hansen,34,F,|
00000060  47 65 72 6d 61 6e 79 2c  55 6e 69 76 65 72 73 69  |Germany,Universi|
00000070  74 79 2c 53 6f 66 74 77  61 72 65 20 45 6e 67 69  |ty,Software Engi|
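Once identified, the null bytes can be stripped with tr, which deletes every character listed after -d (the output file name is just an example):

```shell
# Delete all null bytes (octal 000) from the file.
tr -d '\000' < survey-null-char.csv > survey-fixed.csv
```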
	

Converting from English numbers to Danish numbers

I have used lms to create a small synthetic data set of some animals and their weights


exploring-files :> cat animals

id,animal,weight
1,"Lion",190.50
2,"Elephant",5400.25
3,"Giraffe",1350.75
4,"Monkey",6.25
5,"Whale",12000.00
6,"Bear",320.10
7,"Zebra",180.45
8,"Wolf",38.90
9,"Rabbit",2.30
10,"Hippo",1400.00
	

Now, the numbers use the English decimal character, the dot, but let us assume that we need to feed them to a program made for the Danish decimal character, the comma. We cannot just convert all dots to commas, as the comma is already the field separator in the data file. Thus, we first need to convert all the commas to semicolons, and then all the dots to commas. Let us do that using the sed command mentioned above, but this time use it for actual editing


exploring-files :> sed 's/,/;/g;s/\./,/g' animals

id;animal;weight
1;"Lion";190,50
2;"Elephant";5400,25
3;"Giraffe";1350,75
4;"Monkey";6,25
5;"Whale";12000,00
6;"Bear";320,10
7;"Zebra";180,45
8;"Wolf";38,90
9;"Rabbit";2,30
10;"Hippo";1400,00
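A caveat: sed here rewrites every comma and every dot in the file, which would also mangle commas or dots inside the quoted animal names. A more targeted sketch with awk changes only the weight column (assuming the weight is always field 3):

```shell
# Output the fields separated by ";" and swap the decimal
# dot for a comma in the weight column (field 3) only.
awk -F, 'BEGIN { OFS = ";" } { gsub(/\./, ",", $3); print $1, $2, $3 }' animals
```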
	

Other relevant commands

man, history, ssh, python, Rscript, alias