Extracting lines ending with specific characters using sed or grep

A quick note (to myself mostly) about how to extract lines from a text file that end with a specific set of characters. In Linux, you can do this very easily using ‘grep’ or ‘sed’. But first, a little bit of background.

The online tool ‘Reverb’ from NASA lets you select and download MODIS data tiles. It doesn’t let you download the data directly, but generates a text file which contains the download links for the selected items. You can then download the tiles with a data transfer tool like wget:

wget -i text_file_with_url.txt
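
If a big batch gets interrupted halfway, wget can pick up where it left off: adding the ‘-c’ flag makes it resume partially downloaded files instead of starting over.

wget -c -i text_file_with_url.txt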

The text file contains not only links to the actual data files, but also to preview images and xml files. What if you don’t want to download those? In the text file, each download link is on its own line. If you are only interested in the data files, which are of the type “hdf”, you’ll want to select all lines ending with “.hdf” and delete the other lines. The easiest solution is using grep:

grep '\.hdf$' input.txt > output.txt
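
Note the backslash in front of the dot: in a regular expression an unescaped ‘.’ matches any character, so ‘\.hdf$’ makes sure you match a literal “.hdf”. To see what the filter does, take a made-up excerpt of such a link list (the URLs below are just placeholders):

http://example.org/MOD14A1.A2010001.h18v03.005.hdf
http://example.org/MOD14A1.A2010001.h18v03.005.hdf.xml
http://example.org/MOD14A1.A2010001.h18v03.005.1.jpg

Only the first line survives: the ‘$’ anchors the pattern to the end of the line, so the “.hdf.xml” and “.jpg” links don’t match.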

As an alternative, you can use the “sed” command: ‘sed -n /pattern/p’ duplicates the function of grep.

sed -ne '/\.hdf$/p' input.txt > output.txt
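
You can also turn the logic around and delete every line that does not match, which reads closer to “delete the other lines” from above:

sed '/\.hdf$/!d' input.txt > output.txt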

Check out this link for a clear explanation of the different options and syntax of the “sed” command.

But… I wasn’t there yet. The Reverb tool only lets you generate download links for 2000 tiles (or granules, as they call them) per query. So if you are trying to download more, you’ll need to submit several queries covering shorter periods of time or smaller areas. For example, the MOD14A1 data (daily MODIS/Terra Thermal Anomalies/Fire occurrences) is available for the period 2002-2012.

I split up my queries, each covering a two-year period. This means that for the whole period, I got 6 text files with download links. These I need to combine first. Next, I want to remove all links except the ones for the actual data files, as explained above. I also noticed that, for some reason I haven’t figured out yet, the download text files contained duplicate entries, so I had to remove the duplicates as well.

That turns out to be fairly easy too. All it takes is the four lines below, including a line to remove the intermediate files.

cat *.txt > tmp1
grep '\.hdf$' tmp1 > tmp2
sort -u tmp2 > download_hdf.csv
rm tmp1 tmp2

I am sure you can combine these commands into fewer lines, but I don’t care too much about the number of lines. What I do care about is that this takes about 5 seconds to type and a second to run. Try that using a spreadsheet.
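
And for completeness: if you do want fewer lines, the whole thing collapses into a single pipeline. The ‘-h’ option stops grep from prefixing each match with the name of the file it was found in, which it otherwise does when searching multiple files:

grep -h '\.hdf$' *.txt | sort -u > download_hdf.csv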
