Exploratory Data Analysis in R
Mt. Washington, home of some of the world's worst weather: Situated at the intersection of three storm fronts, the summit of Mt. Washington experiences hurricane-force winds and subzero temperatures. Wind gusts of 231 mph were recorded at the summit in 1934, and for many years, that measurement stood as the highest recorded wind speed on Earth (Ben Cosgrove, Time). Yet, Mt. Washington also remains a popular tourist destination and an iconic part of the trail system through the White Mountains of NH.
The Mt. Washington Observatory sits atop the summit of Mt. Washington, and each day, researchers record the weather patterns there (Mt. Washington Observatory). Conveniently, that data is accessible via the NOAA (National Oceanic and Atmospheric Administration), so I decided to take a look.
What follows is an exploration of Mt. Washington's weather patterns since 1948.
I downloaded data from the Mt. Washington Observatory for January 1, 1948 to January 31, 2018. For convenience, I uploaded the data, documentation, and my associated code to my GitHub.
I started by reading in the data and transforming DATE to a date variable. I noticed that in the raw data, 15 dates from January, 1948 were missing. To add those dates back into the dataset, I created a dataframe of all dates between 1/1/1948 and 1/31/2018, then joined it to the original data.
R Code - Get Data
There are several weather type indicators included in the original data, but many of these indicate similar things. For instance, WT01 indicates fog, but may also include heavy fog, which is indicated by WT02. For clarity, I renamed and combined similar indicators.
R Code - Clean Data
Time for some graphs! For these graphs, I decided to limit my observations to data collected after 2014. The dplyr::filter option ensures that R will use the filter function from the dplyr package rather than from stats. I tried out the plotly package, which allows users to create interactive graphs in R.
R Code - Daily Graphs
Daily Minimum Temperatures:
Daily Maximum Temperatures:
One thing that struck me about these temperatures was that there seemed to be considerable seasonal variation, with the summer months experiencing warmer temperatures. However, the daily maximum temperature from 2014 to 2018 never exceeded 70 degrees F. In the winter, minimum temperatures could get as low as -40 degrees F.
Incidentally, -40 degrees F is the same as -40 degrees C.
A look at daily precipitation from 2014-2018:
To take a look at the average temperatures and precipitation at the summit of Mt. Washington by year, I first aggregated the data with the data.table package. This package provides convenient and fast methods for grouping data. For more information, see the package documentation here and a brief tutorial here.
R Code - Yearly Averages
The average yearly maximum temperature has generally stayed in the low 30s, with the average yearly minimum temperature hovering in the low 20s.
In February of 1969, a 100-hour snowstorm swept through the state, dumping over 4 feet of snow on the summit in a single day. Total snowfall during that storm exceed 8 feet (Mike Squires, Climate.gov). This likely contributed to the Mt. Washington's record yearly precipitation of over 130 inches in 1969.
I wanted to take a closer look at peak wind gust and the weather type variables (e.g., fog, thunder, etc.). However, many of these values were missing. So, I took a look at the missing values by finding the proportion of data that was missing each year.
R Code - Proportion of Data Missing Per Year
Peak wind gusts appeared to be regularly recorded between 1980 and 2005, but not during other time periods.
Each weather type variable was an indicator. Hence, if a certain weather type was observed, the indicator was marked as '1,' and if not, the value was left as 'NA.' For instance, if the day was foggy, FOG would be marked 1. Otherwise, FOG would be NA.
This method meant that while we could tell when weather was observed, we could not distinguish between when a weather type variable was missing because
1. That weather type was not observed that day (e.g., there was no fog), or
2. Nothing was recorded for the day (a true NA value).
In particular, the HIGH WIND and THUNDER weather type indicators were missing close to 100% of the time.
We do have a good amount of FOG observations, though. Here's a closer look:
R Code - Fog
This graph suggests that the summit of Mt. Washington experiences fog or mist during at least part of the day during about 75-95% of each year.
The original data from the Mt. Washington Observatory listed 6 different ways to measure the speed of wind. I decided to focus on the peak gust wind speed, where a gust is a sudden increase in wind speed that lasts a short amount of time (see this article for more information). Since most of the data on peak gust wind speeds were collected from 1980-2005, I looked at this time period only.
R Code - Peak Gust Wind Speed
Wind gusts exceeding 80 mph appear to be a regular occurrence on Mt. Washington. No wonder the summit buildings need to be chained down!
All things considered, I consider myself lucky to have climbed Mt. Washington 3 times (twice by foot, once by car). One final image: Sunrise on Mt. Washington.
This January, I spent some time refreshing myself on SQL. I used SQL in graduate classes and an internship - in the context of PROC SQL statements in SAS. Time to learn more!
I've collected some resources and information in this blog post to help those getting started in SQL. Some are resources that I've found useful this month, and others are resources recommended by others. I also touch upon installing a database management system on your own computer at the end of the post.
What is SQL and why is it useful?
SQL (Structured Query Language) is a language used to create, manage, and access data stored in databases called Relational Database Management Systems (RDBMSs). Some of the more popular RDBMSs are MySQL, Oracle, PostgreSQL, and Microsoft SQL Server (MSSQL).
Data analysts and data scientists use SQL to clean, explore, and extract data from datasets that can exceed the capacity of R or other analytical software. SQL can be used in conjunction with other software to analyze data. A recent blog by Bob Muenchen plotted the popularity of different data tools based on the number of job advertisements mentioning each tool. SQL was more commonly mentioned than R, Python, and SAS.
Online tools to learn SQL
The Khan Academy SQL videos cover the basics of the SELECT statement. Short exercises let you practice writing statements using an online SQL interpreter.
SQLZoo guides you through a series of exercises on SQL using an online SQL interpreter. I find that SQLZoo covers more ground and has more examples than Khan Academy.
Mode SQL Tutorial
Though I have not started the Mode SQL Tutorial, I read several articles recommending it. Mode combines written explanations with exercises relating to each concept. Concepts are divided into the beginner, intermediate, or advanced categories.
TutorialsPoint provides a handy SQL reference with concise explanations.
Data World offers an introductory guide to SQL with opportunities to practice SQL queries
Books on SQL
Learning MySQL: Get a Handle on Your Data
I read chapters 1-7 in this book. Unfortuately, the chapters on installing MySQL were outdated. The book provides an overview of RDBMSs and creating databases before diving into SQL in chapters 5-7. I found chapters 5-7 most useful, though the information in those chapters is freely available online.
SQL Practice Problems
This book provides 57 SQL exercises (varying in difficulty) for readers to solve in MSSQL. I am currently 2/3 of the way through the book, and find it useful for practicing SQL queries. It is written for beginner/intermediate users.
Installing a Relational Database Management System
There are a variety of options for installing a RDBMS on your computer to practice SQL outside of an online SQL interpreter.
Here are some options: MSSQL Express, PostgreSQL, and MariaDB (a variation of MySQL).
I have been practicing SQL with MSSQL Express and MariaDB. MSSQL Express has a fairly straightforward installation process (though I ran into some compatibility issues on Windows 10) and comes with a graphical user interface (GUI). There are a few quirky differences between MariaDB and MSSQL syntax, but the main ideas are the same!
If you decide to work with MariaDB, you might consider installing XAMPP. XAMPP allows you to run MariaDB by command line as well as through a GUI called phpMyAdmin. Apache, MariaDB, a PHP interpreter, and a Perl interpreter are included in the default installation of XAMPP.
Starting MariaDB in XAMPP
XAMPP is one of the software suites recommended in Learning MySQL: Get a Handle on Your Data. It took me awhile to figure out how to start MariaDB in XAMPP so that I could practice SQL on my own. Here is how to do it:
(These instructions are up to date as of 1/31/2018.)
1. First, open the XAMPP Control Panel and press Start on the MySQL module.
2. You have a couple options here. You can either run MariaDB (MySQL) via the command line in the MariaDB monitor, or through the phpMyAdmin GUI.
Running MariaDB through the command line
Running MariaDB via phpMyadmin
3. One note:
MariaDB supports different users. The root user has all privileges. phpMyAdmin logs on as the root user by default. You can log into the root account by typing mysql -u root -p on the command line, then pressing enter when pressed for a password (the root user has no password at the start).
For greater security, you can choose to change the password for the root user from the MariaDB monitor by typing
and replacing 'new password' with your desired password. For more information, see here.
If you do this, you will need to change the configuration file for phpMyAdmin to use MariaDB through phpMyAdmin. See here for a stackoverflow answer that addresses that problem.
4. To exit from MariaDB, type quit in the MariaDB monitor and stop active modules in the XAMPP control panel.
Please feel free to comment if you have any additional comments/suggestions!