Mt. Washington, home of some of the world's worst weather: Situated at the intersection of three storm fronts, the summit of Mt. Washington experiences hurricane-force winds and subzero temperatures. Wind gusts of 231 mph were recorded at the summit in 1934, and for many years, that measurement stood as the highest recorded wind speed on Earth (Ben Cosgrove, Time). Yet, Mt. Washington also remains a popular tourist destination and an iconic part of the trail system through the White Mountains of NH.

The Mt. Washington Observatory sits atop the summit of Mt. Washington, and each day, researchers record the weather patterns there (Mt. Washington Observatory). Conveniently, that data is accessible via the NOAA (National Oceanic and Atmospheric Administration), so I decided to take a look.

What follows is an exploration of Mt. Washington's weather patterns since 1948.

I downloaded data from the Mt. Washington Observatory for January 1, 1948 to January 31, 2018. For convenience, I uploaded the data, documentation, and my associated code to my GitHub.

R Setup

I started by reading in the data and transforming DATE to a date variable. I noticed that in the raw data, 15 dates from January, 1948 were missing. To add those dates back into the dataset, I created a dataframe of all dates between 1/1/1948 and 1/31/2018, then joined it to the original data.

R Code - Get Data

There are several weather type indicators included in the original data, but many of these indicate similar things. For instance, WT01 indicates fog, but may also include heavy fog, which is indicated by WT02. For clarity, I renamed and combined similar indicators.

R Code - Clean Data

Time for some graphs! For these graphs, I decided to limit my observations to data collected after 2014. The dplyr::filter option ensures that R will use the filter function from the dplyr package rather than from stats. I tried out the plotly package, which allows users to create interactive graphs in R.

R Code - Daily Graphs

Daily Minimum Temperatures:

Daily Maximum Temperatures:

One thing that struck me about these temperatures was that there seemed to be considerable seasonal variation, with the summer months experiencing warmer temperatures. However, the daily maximum temperature from 2014 to 2018 never exceeded 70 degrees F. In the winter, minimum temperatures could get as low as -40 degrees F.

Incidentally, -40 degrees F is the same as -40 degrees C.

A look at daily precipitation from 2014-2018:

Incidentally, -40 degrees F is the same as -40 degrees C.

A look at daily precipitation from 2014-2018:

To take a look at the average temperatures and precipitation at the summit of Mt. Washington by year, I first aggregated the data with the data.table package. This package provides convenient and fast methods for grouping data. For more information, see the package documentation here and a brief tutorial here.

R Code - Yearly Averages

The average yearly maximum temperature has generally stayed in the low 30s, with the average yearly minimum temperature hovering in the low 20s.

In February of 1969, a 100-hour snowstorm swept through the state, dumping over 4 feet of snow on the summit in a single day. Total snowfall during that storm exceed 8 feet (Mike Squires, Climate.gov). This likely contributed to the Mt. Washington's record yearly precipitation of over 130 inches in 1969.

I wanted to take a closer look at peak wind gust and the weather type variables (e.g., fog, thunder, etc.). However, many of these values were missing. So, I took a look at the missing values by finding the proportion of data that was missing each year.

I wanted to take a closer look at peak wind gust and the weather type variables (e.g., fog, thunder, etc.). However, many of these values were missing. So, I took a look at the missing values by finding the proportion of data that was missing each year.

R Code - Proportion of Data Missing Per Year

Peak wind gusts appeared to be regularly recorded between 1980 and 2005, but not during other time periods.

Each weather type variable was an indicator. Hence, if a certain weather type was observed, the indicator was marked as '1,' and if not, the value was left as 'NA.' For instance, if the day was foggy, FOG would be marked 1. Otherwise, FOG would be NA.

This method meant that while we could tell when weather was observed, we could not distinguish between when a weather type variable was missing because

1. That weather type was not observed that day (e.g., there was no fog), or

2. Nothing was recorded for the day (a true NA value).

In particular, the HIGH WIND and THUNDER weather type indicators were missing close to 100% of the time.

We do have a good amount of FOG observations, though. Here's a closer look:

This method meant that while we could tell when weather was observed, we could not distinguish between when a weather type variable was missing because

1. That weather type was not observed that day (e.g., there was no fog), or

2. Nothing was recorded for the day (a true NA value).

In particular, the HIGH WIND and THUNDER weather type indicators were missing close to 100% of the time.

We do have a good amount of FOG observations, though. Here's a closer look:

R Code - Fog

This graph suggests that the summit of Mt. Washington experiences fog or mist during at least part of the day during about 75-95% of each year.

The original data from the Mt. Washington Observatory listed 6 different ways to measure the speed of wind. I decided to focus on the peak gust wind speed, where a gust is a sudden increase in wind speed that lasts a short amount of time (see this article for more information). Since most of the data on peak gust wind speeds were collected from 1980-2005, I looked at this time period only.

R Code - Peak Gust Wind Speed

Wind gusts exceeding 80 mph appear to be a regular occurrence on Mt. Washington. No wonder the summit buildings need to be chained down!

All things considered, I consider myself lucky to have climbed Mt. Washington 3 times (twice by foot, once by car). One final image: Sunrise on Mt. Washington.

This January, I spent some time refreshing myself on SQL. I used SQL in graduate classes and an internship - in the context of PROC SQL statements in SAS. Time to learn more!

I've collected some resources and information in this blog post to help those getting started in SQL. Some are resources that I've found useful this month, and others are resources recommended by others. I also touch upon installing a database management system on your own computer at the end of the post.

SQL (Structured Query Language) is a language used to create, manage, and access data stored in databases called Relational Database Management Systems (RDBMSs). Some of the more popular RDBMSs are MySQL, Oracle, PostgreSQL, and Microsoft SQL Server (MSSQL).

Data analysts and data scientists use SQL to clean, explore, and extract data from datasets that can exceed the capacity of R or other analytical software. SQL can be used in conjunction with other software to analyze data. A recent blog by Bob Muenchen plotted the popularity of different data tools based on the number of job advertisements mentioning each tool. SQL was more commonly mentioned than R, Python, and SAS.

Data analysts and data scientists use SQL to clean, explore, and extract data from datasets that can exceed the capacity of R or other analytical software. SQL can be used in conjunction with other software to analyze data. A recent blog by Bob Muenchen plotted the popularity of different data tools based on the number of job advertisements mentioning each tool. SQL was more commonly mentioned than R, Python, and SAS.

Khan Academy

The Khan Academy SQL videos cover the basics of the SELECT statement. Short exercises let you practice writing statements using an online SQL interpreter.

SQLZoo

SQLZoo guides you through a series of exercises on SQL using an online SQL interpreter. I find that SQLZoo covers more ground and has more examples than Khan Academy.

Mode SQL Tutorial

Though I have not started the Mode SQL Tutorial, I read several articles recommending it. Mode combines written explanations with exercises relating to each concept. Concepts are divided into the beginner, intermediate, or advanced categories.

TutorialsPoint

TutorialsPoint provides a handy SQL reference with concise explanations.

Data World

Data World offers an introductory guide to SQL with opportunities to practice SQL queries

SQL Cheat Sheet from WebsiteSetup

Intro to SQL commands, data types, functions, and views from WebsiteSetup

The Khan Academy SQL videos cover the basics of the SELECT statement. Short exercises let you practice writing statements using an online SQL interpreter.

SQLZoo

SQLZoo guides you through a series of exercises on SQL using an online SQL interpreter. I find that SQLZoo covers more ground and has more examples than Khan Academy.

Mode SQL Tutorial

Though I have not started the Mode SQL Tutorial, I read several articles recommending it. Mode combines written explanations with exercises relating to each concept. Concepts are divided into the beginner, intermediate, or advanced categories.

TutorialsPoint

TutorialsPoint provides a handy SQL reference with concise explanations.

Data World

Data World offers an introductory guide to SQL with opportunities to practice SQL queries

SQL Cheat Sheet from WebsiteSetup

Intro to SQL commands, data types, functions, and views from WebsiteSetup

Learning MySQL: Get a Handle on Your Data

I read chapters 1-7 in this book. Unfortuately, the chapters on installing MySQL were outdated. The book provides an overview of RDBMSs and creating databases before diving into SQL in chapters 5-7. I found chapters 5-7 most useful, though the information in those chapters is freely available online.

SQL Practice Problems

This book provides 57 SQL exercises (varying in difficulty) for readers to solve in MSSQL. I am currently 2/3 of the way through the book, and find it useful for practicing SQL queries. It is written for beginner/intermediate users.

I read chapters 1-7 in this book. Unfortuately, the chapters on installing MySQL were outdated. The book provides an overview of RDBMSs and creating databases before diving into SQL in chapters 5-7. I found chapters 5-7 most useful, though the information in those chapters is freely available online.

SQL Practice Problems

This book provides 57 SQL exercises (varying in difficulty) for readers to solve in MSSQL. I am currently 2/3 of the way through the book, and find it useful for practicing SQL queries. It is written for beginner/intermediate users.

There are a variety of options for installing a RDBMS on your computer to practice SQL outside of an online SQL interpreter.

Here are some options: MSSQL Express, PostgreSQL, and MariaDB (a variation of MySQL).

I have been practicing SQL with MSSQL Express and MariaDB. MSSQL Express has a fairly straightforward installation process (though I ran into some compatibility issues on Windows 10) and comes with a graphical user interface (GUI). There are a few quirky differences between MariaDB and MSSQL syntax, but the main ideas are the same!

If you decide to work with MariaDB, you might consider installing XAMPP. XAMPP allows you to run MariaDB by command line as well as through a GUI called phpMyAdmin. Apache, MariaDB, a PHP interpreter, and a Perl interpreter are included in the default installation of XAMPP.

Here are some options: MSSQL Express, PostgreSQL, and MariaDB (a variation of MySQL).

I have been practicing SQL with MSSQL Express and MariaDB. MSSQL Express has a fairly straightforward installation process (though I ran into some compatibility issues on Windows 10) and comes with a graphical user interface (GUI). There are a few quirky differences between MariaDB and MSSQL syntax, but the main ideas are the same!

If you decide to work with MariaDB, you might consider installing XAMPP. XAMPP allows you to run MariaDB by command line as well as through a GUI called phpMyAdmin. Apache, MariaDB, a PHP interpreter, and a Perl interpreter are included in the default installation of XAMPP.

XAMPP is one of the software suites recommended in *Learning MySQL: Get a Handle on Your Data.* It took me awhile to figure out how to start MariaDB in XAMPP so that I could practice SQL on my own. Here is how to do it:

(These instructions are up to date as of 1/31/2018.)

**1.** First, open the XAMPP Control Panel and press Start on the MySQL module.

(These instructions are up to date as of 1/31/2018.)

****Click on*Shell*to get the command line interface to pop up.- Once at the interface, type
*mysql*. You should get a message that you have reached the MariaDB monitor. - You can now type in SQL commands to create tables or practice select queries.

Running MariaDB via phpMyadmin

- Start Apache from the XAMPP Control Panel.

- Open a browser window. In the address, type
*localhost.* Press enter. This should take you to a XAMPP welcome page. Click on phpMyAdmin.

- You are now running MariaDB through phpMyAdmin. There is a list of existing databases to the left.

MariaDB supports different users. The root user has all privileges. phpMyAdmin logs on as the root user by default. You can log into the root account by typing

For greater security, you can choose to change the password for the root user from the MariaDB monitor by typing

and replacing 'new password' with your desired password. For more information, see here.

If you do this, you will need to change the configuration file for phpMyAdmin to use MariaDB through phpMyAdmin. See here for a stackoverflow answer that addresses that problem.

**4. **To exit from MariaDB, type *quit* in the MariaDB monitor and stop active modules in the XAMPP control panel.

Please feel free to comment if you have any additional comments/suggestions!

]]>If you do this, you will need to change the configuration file for phpMyAdmin to use MariaDB through phpMyAdmin. See here for a stackoverflow answer that addresses that problem.

Please feel free to comment if you have any additional comments/suggestions!

Introduction

Image sources: Monty Hall | Marilyn vos Savant

I came across the Monty Hall problem while revisiting my old statistics notes recently. The problem asks,

"Suppose you're on a game show, and you're given the choice of three doors. Behind one door is a car. Behind the others, goats.

You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat.

He then says to you, 'Do you want to pick door No. 2?'

Is it to your advantage to switch your choice?"

- Monty Hall Problem on Wikipedia:

https://en.wikipedia.org/wiki/Monty_Hall_problem

These assumptions are made:

"(1) The host must always open a door that was not picked by the contestant.

(2) The host must always open a door to reveal a goat and never the car. [When the host chooses between two doors with a goat, he picks between them randomly.]

(3) The host must always offer the chance to switch between the originally chosen door and the remaining closed door."

- Monty Hall Problem on Wikipedia:

https://en.wikipedia.org/wiki/Monty_Hall_problem

The Monty Hall problem became famous after Marilyn vos Savant published her answer that switching doors is the best strategy. She was right, to the surprise of many contemporary mathematicians. Now, this problem is a classic puzzle taught in statistics courses. The first time I saw the Monty Hall problem was in my AP Statistics class, and I remember how counterintuitive the solution seemed to me.

I thought I'd have some fun by simulating the Monty Hall problem in R and writing up a mathematical explanation for the solution.

I coded up this function to simulate playing the Monty Hall game

If you want to play around with this function, copy and paste it into R.

R function for Monty Hall problem

Note that this function uses the graphical package ggplot2. If you have not installed that package in R, use the following code before copying and pasting the above:

Code to install ggplot2 and dplyr in R

Here is an example of the Monty Hall simulation function in action.

Using the Monty Hall function to simulate playing the Monty Hall game 100,000 times

The simulation results? When you always

result relativefrequency

0 loss 0.66625

1 win 0.33375

When you always

result relativefrequency

0 loss 0.33192

1 win 0.66808

The simulation shows that

Let's first explore an intuitive explanation of the Monty Hall problem. Suppose you decide to always switch doors. You pick a door randomly.

- If a car is behind your first door, Monty will open one of the doors with a goat. Since you decided to always switch doors, you'll switch to the door with a goat - and lose.
- If a goat is behind your first door, Monty will not open the door with the car. Instead, Monty will open the remaining door that contains a goat. Since you decided to switch doors, you'll switch to the door with a car - and win.

Now for a mathematical explanation using conditional probabilities. Suppose you start out by picking door 1 and Monty reveals door 3, as stated in the original problem.

Define the following events: \[C_{1} = car \ behind \ door \ 1 \] \[C_{2} = car \ behind \ door \ 2\] \[C_{3}= car \ behind \ door \ 3\] \[M_{3}= Monty \ picks \ door \ 3.\]

If you do not switch doors after Monty reveals door 3, the only way you can win is if the car is behind your original door: door 1. We'll calculate \[P(C_{1} | M_{3})\], the probability that the car is behind door 1 given that Monty reveals door 3.

By the definition of conditional probability,

\[P(C_{1} | M_{3}) = \frac{P(C_{1} \cap M_{3})}{P(M_{3})} \]

Note that \[P(C_{1} \cap M_{3}) = P(M_{3} | C_{1})P(C_{1})\] from the definition of conditional probability.

Also, since \[ C_{1}, C_{2}, C_{3} \] is a partition of the sample space, the law of total probability tells us that \[P(M_{3}) = P(M_{3} | C_{1})P(C_{1}) + P(M_{3} | C_{2})P(C_{2}) + P(M_{3} | C_{3})P(C_{3})\].

Substituting this into our equation, we get

\[P(C_{1} | M_{3}) = \frac{P(M_{3} | C_{1})P(C_{1})}{P(M_{3} | C_{1})P(C_{1}) + P(M_{3} | C_{2})P(C_{2}) + P(M_{3} | C_{3})P(C_{3})} \]

\[= \frac{ (\frac{1}{2})(\frac{1}{3}) }{ (\frac{1}{2})(\frac{1}{3}) + (1)(\frac{1}{3}) + (0)(\frac{1}{3}) } \]

\[= \frac{1}{3} \]

If you do switch doors after Monty reveals door 3, you will end up with whatever is behind door 2. You win if the car is behind door 2. We'll calculate \[P(C_{2} | M_{3})\], the probability that the car is behind door 2 given that Monty reveals door 3. The calculations follow the same pattern as in the first case.

\[P(C_{2} | M_{3}) = \frac{P(C_{2} \cap M_{3})}{P(M_{3})} = \frac{P(M_{3} | C_{2})P(C_{2})}{P(M_{3} | C_{1})P(C_{1}) + P(M_{3} | C_{2})P(C_{2}) + P(M_{3} | C_{3})P(C_{3})} \]

\[ = \frac{ (1)(\frac{1}{3}) }{ (\frac{1}{2})(\frac{1}{3}) + (1)(\frac{1}{3}) + (0)(\frac{1}{3}) } \]

\[ = \frac{2}{3} \]

The mathematical explanation and simulation results match. The probability of winning game if you switch doors is 2/3, while the probability of winning if you stay with your original choice is 1/3. Hence, **switching is the better strategy.**