Big Data Geek - Finding and Loading NOAA Hourly Cl...

Former Member · ‎11-04-2013

So I loved the idea of the Data Geek challenge, but I'm more of Big Data Geek. And it turns out that the National Oceanic and Atmospheric Administration (NOAA) have publicly available hourly climate data, since 1901.

It's available at ftp://ftp.ncdc.noaa.gov/pub/data/noaa and you can download all of it freely. I've been kind to my IT folks in my office network so I've been slowly downloading it over the last few weeks, and I'm up to 2003 and I've got 39GB of compressed data so far. It is GZIP compressed and expands about 10:1... and in addition, the data volumes grow massively since 1973 and keep growing to the present day. I expect 600GB-1TB of raw data before we're done.

So we're going to need SAP HANA and Lumira to find some interesting information, and I thought I'd take you on the journey of how to do this. This is Part 1 of however many it takes me to find some meaning in the data.

Update: Part 2 is now available here! In part 2 we look at Tammy Powlas' question "is it getting warmer in Virginia?"

Downloading the data

I've gone about this using the following command:

wget --mirror ftp://ftp.ncdc.noaa.gov/pub/data/noaa

There are other ways you could use to get the data much faster, including aria2, pcurl or httrack but I wanted to be kind on my IT team and this doesn't use too much bandwidth or mess up our office network. It will take me a few weeks to get all the data and then I can keep it up to date any time!

Loading Data

The data comes in a pretty incomprehensible format and the lines look a bit like this:

0081999999999992002010100004+17900-075900FM-13+9999ELML V02099999999999999999N9999999N1+02791+02791101361REMSYN072BBXX ELML7 0100/ 99179 70759 4//// ///// 10279 20279 40136 8//// 222//;

You can download a PDF document of how this is all formatted here:

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf

It turns out that it is a complex fixed format file, so I'm just going to use the mandatory fields, which are the first 34 fields. This gives us data like location, timestamp, temperature, wind, visibility. Pretty comprehensive. I'm a big fan of UNIX scripts to reformat this stuff so I wrote an awk script to reformat the fixed format files into CSV.

awk 'BEGIN { FIELDWIDTHS = "4 6 5 4 2 2 2 2 1 6 7 5 5 5 4 3 1 1 4 1 5 1 1 1 6 1 1 1 5 1 5 1 5 1" } {

VAR_CHARS=$1

STATION_USAF=$2

STATION_NCDC=$3

YEAR=$4

MONTH=$5

DAY=$6

HOUR=$7

MIN=$8

SOURCE_FLAG=$9

LATITUDE=$10

LONGITUDE=$11

OBSERVATION_TYPE=$12

ELEVATION=$13

CALL_LETTER=$14

QUALITY_CONTROL=$15

WIND_DIR=$16

WIND_DIR_QUAL=$17

WIND_TYPE=$18

WIND_SPEED=$19

WIND_SPEED_QUAL=$20

SKY_CEILING=$21

SKY_CEILING_QUAL=$22

SKY_CEILING_METH=$23

SKY_CEILING_CAVOK=$24

VISIBILITY=$25

VISIBILITY_QUAL=$26

VISIBILITY_VAR=$27

VISIBILITY_VAR_QUAL=$28

AIR_TEMP=$29

AIR_TEMP_QUAL=$30

DEW_TEMP=$31

DEW_TEMP_QUAL=$32

PRESSURE=$33

PRESSURE_QUAL=$34

printf ("%s,%s,%s,%s-%s-%s %s:%s:00,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", VAR_CHARS, STATION_USAF, STATION_NCDC, YEAR, MONTH, DAY, HOUR, MIN, SOURCE_FLAG, LATITUDE, LONGITUDE, OBSERVATION_TYPE, ELEVATION, CALL_LETTER, QUALITY_CONTROL, WIND_DIR, WIND_DIR_QUAL, WIND_TYPE, WIND_SPEED, WIND_SPEED_QUAL, SKY_CEILING, SKY_CEILING_QUAL, SKY_CEILING_METH, SKY_CEILING_CAVOK, VISIBILITY, VISIBILITY_QUAL, VISIBILITY_VAR, VISIBILITY_VAR_QUAL, AIR_TEMP, AIR_TEMP_QUAL, DEW_TEMP, DEW_TEMP_QUAL, PRESSURE, PRESSURE_QUAL)

}'

Now I can put all 12653 files that constitute 2002's data into one big CSV file!

for a in `find 2002/*2002`; do ./test.sh < $a >> 2002.csv; done

Now I have a pretty human readable format - we can make some sense of this!

0173,010010,99999,2002-01-01 00:00:00,4,+70930,-008660,FM-12,+0009,ENJA ,V020,320,1,N,0100,1,22000,1,9,N,070000,1,N,1,-0039,1,-0087,1,10032

Whilst that script ran (it takes a few minutes per year), I'm went ahead and created the HANA DDL for this table, as well as for the reference data that exist within the PDF, for things like Quality, Source Flags, Observation Types, Wind Types, Sky Ceiling Methods and Visibility. This will make thinks like FM-12 human readable like "SYNOP Report of surface observation form a fixed land station". I've attached the DDL as a SQL script so you can run it on your own system.

A few minutes later, we have 63,653,095 readings for 2002 to pull into SAP HANA. We dispatch that in 2 minutes flat - got to love HANA's bulk loader performance. Now all that remains is to build an Analytic View to combine our fact table and reference data into a readable model for Lumira. The ISH reference data also contains location information for the sensors, though it is very rough, by Country or US State.

Visualization

And now we can have some fun with Lumira! First, we can connect up to our Analytic View in HANA and see our attributes and measures. I've created a bunch of calculated measures so we can do averages within Lumira and push the calculations back into HANA. This makes it super fast.

But first, let's start with Geographic Enrichment. We can add Country, and State:

Very quickly, we can do a Chloropleth Chart showing Average Wind Speed by Country in 2002

Let's switch to Air Temperature:

And drill into the United States:

And now we're going to filter by the month of August:

Next Steps

I need to go ahead and load the rest of the data from 1901, which will take a while. Then, we can go ahead and do some time-based analysis.

What are the questions that you'd like to be able to answer?