Data Set Citations

List of data sets used in this online book

The Intuition Behind Correlation

Automobile MPG Data Set: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Download curated data set

Example of a nonlinear relationship (Image by Author)

Average monthly maximum temperatures recorded in Boston, Massachusetts. Data is taken from National Centers for Environmental Information. Download curated data set

Monthly average maximum temperature of Boston, MA from Jan 1998 to Jun 2019. Weather data source: National Centers for Environmental Information (Image by Author)

Understanding Partial Auto-correlation And The PACF

Southern Oscillation Index (SOI) data was downloaded from the Weather Indices page of the United States National Weather Service’s Climate Prediction Center. Download curated data set.

Southern Oscillations. Data source: NOAA (Image by Author)

Average monthly maximum temperatures recorded in Boston, Massachusetts. Data is taken from National Centers for Environmental Information. Download curated data set

Monthly average maximum temperature of Boston, MA from Jan 1998 to Jun 2019. Weather data source: National Centers for Environmental Information (Image by Author)

How To Adjust For Inflation In Monetary Data Sets

Wages and salaries by Occupation: Total wage and salary earners (series id: CXU900000LB1203M). U.S. Bureau of Labor Statistics

Wages and salaries by Occupation: Total wage and salary earners (series id: CXU900000LB1203M). U.S. Bureau of Labor Statistics (Image by Author)

How To Isolate Trend, Seasonality And Noise From A Time Series

U.S. Census Bureau, Retail Sales: Used Car Dealers [MRTSSM44112USN], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MRTSSM44112USN, June 17, 2020, under FRED copyright terms. Download curated data set.

Retail Used Car Sales. Data source: US FRED (Image by Author)

U.S. Census Bureau, E-Commerce Retail Sales [ECOMNSA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/ECOMN, under FRED copyright terms.

Retail eCommerce sales. Data source: US FRED (Image by Author)

SILSO, World Data Center - Sunspot Number and Long-term Solar Observations, Royal Observatory of Belgium, on-line Sunspot Number catalogue: http://www.sidc.be/SILSO/, 1818–2020 (CC BY-NC)

Daily sunspot count. Data source: SILSO (Image by Author)

Samuel H. Williamson, “Daily Closing Values of the DJA in the United States, 1885 to Present,” MeasuringWorth, 2020. URL: http://www.measuringworth.com/DJA/

Dow Jones % change in closing price from previous year (1880–2020). Data source: MeasuringWorth.com via Wikipedia (Image by Author)

The White Noise Model

Restaurant decibel levels data is copyright Sachin Date under CC-BY-NC-SA. Download curated data set

Time series of decibel levels at a restaurant (Image by Author)

The Assumptions Of Linear Regression, And How To Test Them

Combined Cycle Power Plant Data Set: downloaded from the UCI Machine Learning Repository and used under the following citation requests:

  • Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126–140, ISSN 0142-0615.
  • Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13–18 (Mar. 2012, Dubai).

Scatter plots of Power_Output against each explanatory variable (Image by Author)

Introduction to Heteroskedasticity

U.S. Census Bureau, Retail Sales: Beer, Wine, and Liquor Stores [MRTSSM4453USN], retrieved from FRED, Federal Reserve Bank of St. Louis; June 19, 2021.


U.S. Bureau of Labor Statistics, Export Price Index (End Use): Non-monetary Gold [IQ12260], retrieved from FRED, Federal Reserve Bank of St. Louis; June 19, 2021. Download curated data set

Export price index of gold plotted for 132 consecutive monthly time periods from Jan 2001 to Dec 2011 (Image by Author)

Building Robust Linear Models For Nonlinear, Heteroscedastic Data

U.S. Bureau of Labor Statistics, Export Price Index (End Use): Non-monetary Gold [IQ12260], retrieved from FRED, Federal Reserve Bank of St. Louis; June 19, 2021. Curated version for download

Export price index of gold plotted for 132 consecutive monthly time periods from Jan 2001 to Dec 2011 (Image by Author)

The Poisson Regression Model

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use. Curated data set for download.

Source: Bicycle Counts for East River Bridges (Data source: NYC OpenData) (Image by Author)

The Negative Binomial Regression Model

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use. Curated data set for download.

Source: Bicycle Counts for East River Bridges (Data source: NYC OpenData) (Image by Author)

The Generalized Poisson Regression Model

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use. Curated data set for download.

Source: Bicycle Counts for East River Bridges (Data source: NYC OpenData) (Image by Author)

Fitting Linear Regression Models on Count Based Data Sets

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use. Curated data set for download.

Source: Bicycle Counts for East River Bridges (Data source: NYC OpenData) (Image by Author)

Introduction to Regression With ARIMA Errors Model

The Air Quality measurements data set is from the UCI Machine Learning Repository and is available for research purposes. Curated data set download link

Paper citation for original data set: S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750–757, ISSN 0925-4005.

Top 10 rows of the Air Quality data set (Image by Author)

Holt-Winters Exponential Smoothing

U.S. Census Bureau, Retail Sales: Used Car Dealers [MRTSSM44112USN], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MRTSSM44112USN, June 17, 2020, under FRED copyright terms. Download link to curated data set.

Retail Used Car Sales. Data source: US FRED (Image by Author)

Merck & Co., Inc. (MRK), NYSE — Historical Adjusted Closing Price. Currency in USD, https://finance.yahoo.com/quote/MRK/history?p=MRK, 23-Jul-2020. Copyright Yahoo Finance and NYSE

An illustration of a positive trend between two consecutive levels (Image by Author)

SILSO, World Data Center - Sunspot Number and Long-term Solar Observations, Royal Observatory of Belgium, on-line Sunspot Number catalogue: http://www.sidc.be/SILSO/, 1818–2020 (CC BY-NC)

Daily sunspot count. Data source: SILSO (Image by Author)

Poisson Regression Models For Time Series Data

The STRIKES Data set. Source: R data sets

The STRIKES Data set (Source: R data sets) (Image by Author)

The Binomial Regression Model

The Titanic data set has been downloaded from Stanford’s CS109 class website. Curated data set download link.

Top 10 rows of the Titanic data set (Image by Author)

Introduction to Survival Analysis – Concepts, Techniques and Regression models

The Stanford heart transplant data set is taken from https://statistics.stanford.edu/research/covariance-analysis-heart-transplant-survival-data and available for personal/research purposes only. Curated data set download.

What is the probability of survival at T > t_6? (Image by Author)

The Stratified Cox Proportional Hazards Regression Model

The VA lung cancer data appears in the following book: The Statistical Analysis of Failure Time Data, Second Edition, by John D. Kalbfleisch and Ross L. Prentice.

The first few rows of the VA lung cancer data set (Image by Author)

Testing For Normality of Residual Errors Using Skewness And Kurtosis Measures

Wages and salaries by Occupation: Total wage and salary earners (series id: CXU900000LB1203M). U.S. Bureau of Labor Statistics under US BLS Copyright Terms. Curated data set link for download

Source: Wages and salaries by Occupation: Total wage and salary earners (series id: CXU900000LB1203M). U.S. Bureau of Labor Statistics under US BLS Copyright Terms (Image by Author)

The Nonlinear Least Squares (NLS) Regression Model

The Bike sharing data set has been downloaded from UCI Machine Learning Repository. Curated data set download link.

Data set cited in paper: Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1–15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

Plot of predicted versus actual bicycle user counts on the test data set (Image by Author)

Estimation of Vaccine Efficacy Using a Logistic Regression Model

COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. URL: https://github.com/CSSEGISandData/COVID-19.

Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533–534. doi: 10.1016/S1473-3099(20)30120-1


The Akaike Information Criterion

Average monthly maximum temperatures recorded in Boston, Massachusetts. Data is taken from National Centers for Environmental Information. Download link for the curated data set

Monthly average maximum temperature of Boston, MA from Jan 1998 to Jun 2019. Weather data source: National Centers for Environmental Information (Image by Author)

R-squared, Adjusted R-squared and Pseudo-R-squared

The Taiwan House prices data set was retrieved from the UCI Machine Learning Repository. Curated data set download

Data set citation: I-Cheng Yeh, Tzu-Kuang Hsu, Building real estate valuation models with comparative approach through case-based reasoning, Applied Soft Computing, Volume 65, 2018, Pages 260-271, ISSN 1568-4946

The Linear Model and the Mean Model (Image by Author) (Data set: House prices in New Taipei City, Taiwan)

The Chi Squared Test

The TAKEOVER BIDS data set has been referenced from the following paper: Jaggia, S., Thosar, S. Multiple bids as a consequence of target management resistance: A count data approach. Rev Quant Finan Acc 3, 447–457 (1993). https://doi.org/10.1007/BF02409622. PDF download link. Download link for the curated data set.


Testing The Assumptions Of the Cox Proportional Hazards Model Using Schoenfeld Residuals

The Stanford heart transplant data set is taken from https://statistics.stanford.edu/research/covariance-analysis-heart-transplant-survival-data and available for personal/research purposes only. Download curated data set.

Top 5 rows of the Stanford heart transplant data set (Image by Author)

Estimating The Range Of A Population Parameter: A Guide To Interval Estimation

The DOHMH Beach Water Quality Data is taken from NYC OpenData under their Terms of Use. Download curated data set.

Water quality data samples taken from the beaches of New York City between 2005 and 2021 (Source: NYC OpenData under Terms of Use) (Image by Author)

Estimator Bias, And The Bias — Variance Tradeoff

The North East Atlantic Real Time Sea Surface Temperature data set was downloaded from data.world under CC BY 4.0.


The Auto-Regressive Poisson Model

The Poisson INAR(1) Regression Model

The Manufacturing strikes data set is one of several data sets available for public use and experimentation in statistical software, most notably as an R package. The data set has been made accessible for use in Python by Vincent Arel-Bundock via vincentarelbundock.github.io/rdatasets under a GPL v3 license.

The STRIKES Data set (Source: R data sets) (Image by Author)

Hidden Markov Models

U.S. Bureau of Labor Statistics, Unemployment Rate [UNRATE], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/UNRATE, October 29, 2021. Available under Public license.

Monthly unemployment rate in the US (Data source: US FRED under public domain license) (Image by Author)

The Markov Switching Dynamic Regression Model

U.S. Bureau of Economic Analysis, Personal Consumption Expenditures [PCE], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/PCE, November 11, 2021. Available under Public license.

Percentage change in Personal Consumption Expenditures (data source US FRED under public domain license) (Image by Author)

University of Michigan, Survey Research Center, Surveys of Consumers. The Index of Consumer Sentiment. Available under public license.

Percentage change in Consumer Sentiment Index (data source: U. of Michigan under public domain license) (Image by Author)

Hamilton, James, Dates of U.S. recessions as inferred by GDP-based recession indicator [JHDUSRGDPBR], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/JHDUSRGDPBR, November 12, 2021.

Markov state probabilities super-imposed with the GDP-based recession indicator (Image by Author)

The Pooled OLS Regression Model for Panel Data Sets
The Fixed Effects Regression Model For Panel Data Sets
The Random Effects Regression Model for Panel Data Sets

World Development Indicators data from World Bank under CC BY 4.0 license. Download curated dataset

A panel data set (Source: World Development Indicators data under CC BY 4.0 license) (Image by Author)
