2012-12-28

Functions dim, nrow, and ncol for shell

Ever wanted to quickly find the number of rows and/or columns in a file directly from the terminal? There are many ways to skin this cat. Here is what I used for the number of rows for quite a while:

wc -l filename

What about the number of columns? "Easy", just combine the head and awk commands:

head -n 1 filename | awk '{ print NF }'

This is not a big problem (there are likely better ways to do it), but it is long and tedious to type.

I got sick of typing the commands above and assembled them into three easy-to-use functions with R-like names: nrow, ncol, and dim. The functions are simply a collection of the above ideas and assume that the file is of "rectangular shape", i.e., a table, a matrix, etc.


dim()
{
  # Print "rows columns filename" for each file
  for FILE in "$@"; do
    NROW=$(nrow "$FILE" | awk '{ print $1 }')
    NCOL=$(ncol "$FILE" | awk '{ print $1 }')
    echo "$NROW $NCOL $FILE"
    unset NROW NCOL
  done
}
export -f dim

nrow ()
{
  # Print "rows filename" for each file (wc -l counts newline characters)
  for FILE in "$@"; do
    wc -l "$FILE"
  done
}
export -f nrow

ncol ()
{
  # Print "columns filename" for each file, based on the first line only
  for FILE in "$@"; do
    TMP=$(head -n 1 "$FILE" | awk '{ print NF }')
    echo "$TMP $FILE"
    unset TMP
  done
}
export -f ncol

Add these functions to your .bashrc or .profile or something similar and you can then simply type:

nrow filename
ncol filename
dim filename

A simple test:

touch file.txt
echo "line1 with four columns" >> file.txt
echo "line2 with four columns" >> file.txt

nrow file.txt
2 file.txt

ncol file.txt
4 file.txt

dim file.txt
2 4 file.txt
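
Since all three functions assume a rectangular file, it can be worth checking that assumption before trusting ncol or dim. Here is a minimal sketch of such a check. The name isrect is made up for this illustration and is not one of the three functions above. It lets awk read the file until it meets a line whose number of whitespace-separated fields differs from the first line:

```shell
# isrect: hypothetical helper, not part of the original three functions.
# Prints "TRUE filename" if every line has as many whitespace-separated
# fields as the first line, and "FALSE filename" otherwise.
isrect ()
{
  for FILE in "$@"; do
    TMP=$(awk '
      NR == 1 { n = NF }           # remember the field count of line 1
      NF != n { bad = 1; exit }    # stop at the first line that differs
      END     { res = bad ? "FALSE" : "TRUE"; print res }
    ' "$FILE")
    echo "$TMP $FILE"
    unset TMP
  done
}
```

With the test file from above:

isrect file.txt
TRUE file.txt

Note that an empty file is reported as TRUE, which may or may not be what you want.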

4 comments:

genomeek said...

Hi Gregor,

a shorter and simpler version of :

head -n 1 filename | awk '{ print NF }'

could be :

awk 'END{print NF}' filename

Likewise, the number of lines could be :
awk 'END{print NR}' filename

But I admit these may be less obvious.

Gregor Gorjanc said...

I like your stuff - it might actually be more efficient. I knew people would propose better ways to make this even faster and neater, which is one of the reasons I bothered posting this on the blog! Thanks!!!

Gregor Gorjanc said...

Francois, I did some testing on a file with 550 rows and 100001 columns with 100 repetitions (the code is below) and behold:

"My" approach:
- nrow ~6.8sec
- ncol ~1.3sec

"Pure awk" approach:
- nrow ~17.7sec
- ncol ~19.0sec

So the "pure awk" approach is way slower. Still, a few seconds more or less hardly matter unless this were really mission critical. Perhaps there is a way to tell awk to read just the first column in the nrow case and only the first line in the ncol case.

time for i in $(seq 1 1 100); do wc -l F2Chip9Genotype.txt > tmp; done

time for i in $(seq 1 1 100); do head -n 1 F2Chip9Genotype.txt | awk '{ print NF }' > tmp; done

time for i in $(seq 1 1 100); do awk 'END{ print NR }' F2Chip9Genotype.txt > tmp; done

time for i in $(seq 1 1 100); do awk 'END{ print NF }' F2Chip9Genotype.txt > tmp; done
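
For the ncol case, one way to make awk read only the first line is to call exit inside the main rule, so that it prints NF for the first record and stops. The following is only a sketch that has not been timed on the file above, and ncol_awk is a hypothetical name used for illustration:

```shell
# ncol_awk: hypothetical awk-only variant of ncol.
# awk prints the field count of the first line and exits immediately,
# so the rest of the file is never read.
ncol_awk ()
{
  for FILE in "$@"; do
    TMP=$(awk '{ print NF; exit }' "$FILE")
    echo "$TMP $FILE"
    unset TMP
  done
}
```

For the nrow case there is no comparable shortcut, because counting lines requires reading the file to the end.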

David Duffy said...

awk 'BEGIN{ getline; print NF }'

may be pretty quick too ;)

I compiled my earlier Awk program for this using Awka

http://awka.sourceforge.net

as it reads right through to check for any unequal lines.

My newer beautiful Fortran code gives output like:

file wc beagle_chr21.genotypes_chr21.gz.gprobs.gz

Field counts for "beagle_chr21.genotypes_chr21.gz.gprobs.gz":

L 1 Len 86548 NFields 9801: "l.4977 col.4979 col.4979 col.4979 col.4981 col.498"

Number of lines = 33826
Length of longest line = 86548 chars
Total number of words = 331528626
Maximum words per line = 9801
Constant word count per line? = T
Length of longest word = 10 chars