Ever wanted to quickly find out the number of rows and/or columns in a file, directly from the terminal? There are many ways to skin this cat. Here is what I used for the number of rows for quite a while:
wc -l filename
What about the number of columns? "Easy", just combine the head and awk commands:
head -n 1 filename | awk '{ print NF }'
Not a big problem (there are likely better ways to do this), but it is long and tedious to type.
I got sick of typing the commands above and assembled them into three easy-to-use functions with R-like names: nrow, ncol, and dim. The functions are simply a collection of the above ideas and assume that the file is of "rectangular shape", i.e., a table, a matrix, etc.
dim ()
{
    for FILE in "$@"; do
        NROW=$(nrow "$FILE" | awk '{ print $1 }')
        NCOL=$(ncol "$FILE" | awk '{ print $1 }')
        echo "$NROW $NCOL $FILE"
        unset NROW NCOL
    done
}
export -f dim
nrow ()
{
    for FILE in "$@"; do
        wc -l "$FILE"
    done
}
export -f nrow
ncol ()
{
    for FILE in "$@"; do
        TMP=$(head -n 1 "$FILE" | awk '{ print NF }')
        echo "$TMP $FILE"
        unset TMP
    done
}
export -f ncol
Add these functions to your .bashrc or .profile or something similar, and you can then simply type:
nrow filename
ncol filename
dim filename
A simple test:
touch file.txt
echo "line1 with four columns" >> file.txt
echo "line2 with four columns" >> file.txt
nrow file.txt
2 file.txt
ncol file.txt
4 file.txt
dim file.txt
2 4 file.txt
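Because each function loops over its arguments, they also accept several files at once. A small self-contained sketch (the file names are made up, and the functions are restated here with quoted variables so the snippet runs on its own):

```shell
# Restating the article's functions, with variables quoted.
nrow () { for FILE in "$@"; do wc -l "$FILE"; done; }
ncol () { for FILE in "$@"; do echo "$(head -n 1 "$FILE" | awk '{ print NF }') $FILE"; done; }
dim () {
    for FILE in "$@"; do
        echo "$(nrow "$FILE" | awk '{ print $1 }') $(ncol "$FILE" | awk '{ print $1 }') $FILE"
    done
}

# Two hypothetical rectangular files:
printf 'a b\nc d\ne f\n' > one.txt
printf '1 2 3\n'         > two.txt

dim one.txt two.txt
# 3 2 one.txt
# 1 3 two.txt
```

A shell glob such as `dim *.txt` works the same way, since the glob expands into the argument list.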
4 comments:
Hi Gregor,
a shorter and simpler version of:
head -n 1 filename | awk '{ print NF }'
could be:
awk 'END{print NF}' filename
Likewise, the number of lines could be:
awk 'END{print NR}' filename
But I admit these may be less obvious.
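As a side note, these two one-liners can be combined into a single pass, since in awk's END block NR holds the last record number and NF the field count of the last line read (a sketch assuming a whitespace-delimited rectangular file; the file name is made up):

```shell
# A 2-row, 4-column whitespace-delimited file:
printf 'a b c d\ne f g h\n' > table.txt

# One pass gives both dimensions at once:
awk 'END{ print NR, NF }' table.txt
# 2 4
```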
I like your stuff - it might actually be more efficient. I knew people would propose better ways to make this even faster and neater, which is one of the reasons I bothered posting this on the blog! Thanks!!!
Francois, I did some testing on a file with 550 rows and 100001 columns with 100 repetitions (the code is below) and behold:
"My" approach:
- nrow ~6.8sec
- ncol ~1.3sec
"Pure awk" approach:
- nrow ~17.7sec
- ncol ~19.0sec
So the "pure awk" approach is way slower. Though a few seconds up or down do not really save the day, unless this were really mission critical. Perhaps there is a way to tell awk, in the nrow case, to read just the first column and, in the ncol case, to read only the first line.
time for i in $(seq 1 1 100); do wc -l F2Chip9Genotype.txt > tmp; done
time for i in $(seq 1 1 100); do head -n 1 F2Chip9Genotype.txt | awk '{ print NF }' > tmp; done
time for i in $(seq 1 1 100); do awk 'END{ print NR }' F2Chip9Genotype.txt > tmp; done
time for i in $(seq 1 1 100); do awk 'END{ print NF }' F2Chip9Genotype.txt > tmp; done
awk 'BEGIN{ getline; print NF }'
may be pretty quick too ;)
I compiled my earlier Awk program for this using Awka (http://awka.sourceforge.net), as it reads right through the file to check for any unequal lines.
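The full-file scan described above can be sketched in plain awk (a sketch only, not the compiled Awka program; the file name is made up): remember the first line's field count and flag any line that differs.

```shell
# A deliberately ragged file: the last line has only 2 fields.
printf 'a b c\nd e f\ng h\n' > ragged.txt

# Report every line whose field count differs from line 1;
# exit status 1 signals a non-rectangular file.
awk 'NR == 1 { n = NF }
     NF != n { printf "line %d has %d fields (expected %d)\n", NR, NF, n; bad = 1 }
     END     { exit bad }' ragged.txt
# line 3 has 2 fields (expected 3)
```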
My newer beautiful Fortran code gives output like:
file wc beagle_chr21.genotypes_chr21.gz.gprobs.gz
Field counts for "beagle_chr21.genotypes_chr21.gz.gprobs.gz":
L 1 Len 86548 NFields 9801: "l.4977 col.4979 col.4979 col.4979 col.4981 col.498"
Number of lines = 33826
Length of longest line = 86548 chars
Total number of words = 331528626
Maximum words per line = 9801
Constant word count per line? = T
Length of longest word = 10 chars