Gregor Gorjanc (gg): Functions dim, nrow, and ncol for shell

2012-12-28

Ever wanted to quickly find the number of rows and/or columns in a file directly from the terminal? There are many ways to skin this cat. Here is what I used for the number of rows for quite a while:

wc -l filename

What about the number of columns? "Easy", just combine the head and awk commands:

head -n 1 filename | awk '{ print NF }'

This is not a big problem (there are likely better ways to do it), but it is long and tedious to type.
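One possibly neater alternative, offered as a sketch rather than a benchmarked recommendation: awk can print the field count of the first record and then stop, so head is not needed. The function name ncol_first and the throwaway demo file are made up for illustration.

```shell
# Hypothetical helper: number of whitespace-separated fields on the first line.
ncol_first() {
    # Print NF for the first record, then exit so the rest of the file is not processed.
    awk '{ print NF; exit }' "$1"
}

# Small demonstration on a temporary file (name chosen by mktemp).
demo=$(mktemp)
printf 'a b c d\ne f g h\n' > "$demo"
ncol_first "$demo"
rm -f "$demo"
```

For the two-line demo file this prints 4. As with the head and awk combination, only the first line is inspected.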

I got sick of typing the commands above and assembled them into three easy-to-use functions with R-like names: nrow, ncol, and dim. The functions are simply a collection of the above ideas and assume that the file has a "rectangular shape", i.e., a table, a matrix, etc.

dim()
{
    # Print "rows columns filename" for each file given as an argument
    for FILE in "$@"; do
        NROW=$(nrow "$FILE" | awk '{ print $1 }')
        NCOL=$(ncol "$FILE" | awk '{ print $1 }')
        echo "$NROW $NCOL $FILE"
        unset NROW NCOL
    done
}
export -f dim

nrow ()
{
    # Print "rows filename" for each file, as reported by wc -l
    for FILE in "$@"; do
        wc -l "$FILE"
    done
}
export -f nrow

ncol ()
{
    # Print "columns filename" for each file, counting fields on the first line only
    for FILE in "$@"; do
        TMP=$(head -n 1 "$FILE" | awk '{ print NF }')
        echo "$TMP $FILE"
        unset TMP
    done
}
export -f ncol

Add these functions to your .bashrc or .profile or something similar and you can then simply type:

nrow filename

ncol filename

dim filename

A simple test:

touch file.txt

echo "line1 with four columns" >> file.txt

echo "line2 with four columns" >> file.txt

nrow file.txt

2 file.txt

ncol file.txt

4 file.txt

dim file.txt

2 4 file.txt
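Since ncol looks only at the first line, the reported dimensions are meaningful only when every line has the same number of fields. A minimal sketch of a check for that, assuming whitespace-separated fields; the name is_rect and the temporary demo file are hypothetical and not part of the functions above:

```shell
# Hypothetical helper: exit status 0 if every line has as many
# whitespace-separated fields as the first line, 1 otherwise.
is_rect() {
    awk '
        NR == 1 { n = NF }     # remember the width of the first line
        NF != n { exit 1 }     # stop at the first line whose width differs
    ' "$1"
}

# Small demonstration on a temporary file (name chosen by mktemp).
demo=$(mktemp)
printf 'line1 with four columns\nline2 with four columns\n' > "$demo"
if is_rect "$demo"; then
    echo "rectangular"
else
    echo "ragged"
fi
rm -f "$demo"
```

The demo file has four fields on both lines, so this prints "rectangular". Unlike ncol, the check reads the whole file (until the first mismatch), so it will be slower on large tables.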

4 comments:

Anonymous said...: Hi Gregor,

a shorter and simpler version of :

head -n 1 filename | awk '{ print NF }'

could be :

awk 'END{print NF}' filename

Likewise, the number of lines could be:
awk 'END{print NR}' filename

But I admit these may be less obvious.; 28 December 2012 at 16:57
Gorjanc Gregor said...: I like your stuff - it might actually be more efficient. I knew people would propose better ways to make this even faster and neater, which is one of the reasons I bothered posting this on the blog! Thanks!!!; 28 December 2012 at 21:37
Gorjanc Gregor said...: Francois, I did some testing on a file with 550 rows and 100001 columns with 100 repetitions (the code is below) and behold:

"My" approach:
- nrow ~6.8sec
- ncol ~1.3sec

"Pure awk" approach:
- nrow ~17.7sec
- ncol ~19.0sec

So the "pure awk" approach is way slower, though a few seconds up or down do not really matter unless this were truly mission critical. Perhaps there is a way to tell awk to read just the first column in the nrow case and only the first line in the ncol case.

time for i in $(seq 1 1 100); do wc -l F2Chip9Genotype.txt > tmp; done

time for i in $(seq 1 1 100); do head -n 1 F2Chip9Genotype.txt | awk '{ print NF }' > tmp; done

time for i in $(seq 1 1 100); do awk 'END{ print NR }' F2Chip9Genotype.txt > tmp; done

time for i in $(seq 1 1 100); do awk 'END{ print NF }' F2Chip9Genotype.txt > tmp; done; 29 December 2012 at 06:48
David Duffy said...: awk 'BEGIN{ getline; print NF }'

may be pretty quick too ;)

I compiled my earlier Awk program for this using Awka

http://awka.sourceforge.net

as it reads right through to check for any unequal lines.

My newer beautiful Fortran code gives output like:

file wc beagle_chr21.genotypes_chr21.gz.gprobs.gz

Field counts for "beagle_chr21.genotypes_chr21.gz.gprobs.gz":

L 1 Len 86548 NFields 9801: "l.4977 col.4979 col.4979 col.4979 col.4981 col.498"

Number of lines = 33826
Length of longest line = 86548 chars
Total number of words = 331528626
Maximum words per line = 9801
Constant word count per line? = T
Length of longest word = 10 chars; 11 June 2013 at 09:16