Commit d91aaee

Update data attributions (#2)
* Change to just file `LICENSE`, add `@source` for CTIS data
* describe format for all datasets; unify doc section naming format
* documentation wording
* move data dictionary docs to separate section to ensure heritability
* build docs
* add data contributors as authors
* remove dangling partial attribution for CTIS data
* remove copyright holder info for CTIS. Resolve in issue cmu-delphi/epiprocess#348
* note use in epiprocess vignettes
* build docs
* archive, cases_deaths doc wording
* county and outlier doc line wrapping
* add description tag to keep external use in top-level description
* outlier dataset uses nj, not ca data
* add nat as maintainer
* title suggestions
  Co-authored-by: brookslogan <lcbrooks+github@andrew.cmu.edu>
* attribute COVID Canada working group
* match wording between data attributions
* build docs with new data titles
* convert archive..._dt to archive
* generate all datasets with explicit as_of
* document all as_ofs
* check if epiprocess installed on load
* remove on-load error
* on data load, convert to epiprocess object if epiprocess installed
  rename all the underlying data to *_dt
  move dates into `local` and use :::
* describe archive format of can_prov_cases
* use helper to modify rather than overwrite sysdata.rda
* warn on data load if epiprocess not installed
* update links and types in docs
* document new links and types
* remove discontinued direction field from jhu
* zero-pad state census fips
* document JHU missings as "NA"
* covid_case_death_rates should only import starting from Dec 2020
* add canadian grad income dataset
* add attribution for doctor-visits
* double-check attribution and modifications to all datasets
* don't wrap href
* helper avoid error when sysdata.rda is empty
* use stronger compression on external data files
* use stronger compression on internal data
* move tibble to Enhances
  Co-authored-by: brookslogan <lcbrooks+github@andrew.cmu.edu>
* _helper.R sysdata env to inherit from empty environment
  Co-authored-by: brookslogan <lcbrooks+github@andrew.cmu.edu>
* spelling in documentation
  Co-authored-by: brookslogan <lcbrooks+github@andrew.cmu.edu>
* spelling in documentation
  Co-authored-by: brookslogan <lcbrooks+github@andrew.cmu.edu>
* require newer version of epiprocess
* rename starter datasets from _dt to _tbl
* rebuild docs
* save datasets with version=2 in `save` for backwards compatibility

---------

Co-authored-by: Nat DeFries <42820733+nmdefries@users.noreply.github.com>
1 parent 0632a77 commit d91aaee
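
Several bullets in the commit message describe a helper that modifies `sysdata.rda` rather than overwriting it, tolerates an empty file, loads into an environment inheriting from the empty environment, and saves with `version = 2` under stronger compression. The commit page does not show the contents of `_helper.R`, so the following is only a minimal base-R sketch of that pattern. The function name `update_sysdata` and the choice of `"xz"` compression are assumptions made for illustration.

```r
# Hypothetical helper; the name and the "xz" setting are illustrative
# assumptions, not the package's actual _helper.R.
update_sysdata <- function(path, ...) {
  new_objects <- list(...)

  # Load existing objects into an environment whose parent is the empty
  # environment, so nothing from the calling scope leaks into the file.
  env <- new.env(parent = emptyenv())
  if (file.exists(path) && file.size(path) > 0) {
    load(path, envir = env)
  }

  # Add or replace only the objects passed in; keep everything else.
  for (name in names(new_objects)) {
    assign(name, new_objects[[name]], envir = env)
  }

  # version = 2 keeps the file readable by older R versions.
  save(
    list = ls(env, all.names = TRUE), envir = env, file = path,
    version = 2, compress = "xz"
  )
  invisible(ls(env, all.names = TRUE))
}
```

Calling the helper twice on the same path, for example `update_sysdata(path, a = 1)` followed by `update_sysdata(path, b = "x")`, should leave both `a` and `b` stored in the file.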

51 files changed, with 1132 additions and 461 deletions

DESCRIPTION

Lines changed: 15 additions & 5 deletions
@@ -2,32 +2,42 @@ Type: Package
 Package: epidatasets
 Title: Epidemiological Data for Delphi Tooling Examples
 Version: 0.0.1
-Authors@R:
-person(c("Daniel", "J."), "McDonald", , "daniel@stat.ubc.ca", role = c("cre", "aut"))
+Authors@R: c(
+person(c("Daniel", "J."), "McDonald", email="daniel@stat.ubc.ca", role = c("aut")),
+person("Nat", "DeFries", email="ndefries@andrew.cmu.edu", role = c("cre", "aut")),
+person("Johns Hopkins University Center for Systems Science and Engineering", role = "dtc", comment = "Owner of COVID-19 cases and deaths data from the COVID-19 Data Repository"),
+person("Johns Hopkins University", role = "cph", comment = "Copyright holder of COVID-19 cases and deaths data from the COVID-19 Data Repository"),
+person("Carnegie Mellon University Delphi Group", role = "dtc", comment = "Owner of masking and social-distancing data from the COVID-19 Trends and Impacts Survey. Owner of claims-based CLI data from the Delphi Epidata API"),
+person("The COVID-19 Canada Open Data Working Group", role = "dtc", comment = "Owner of Canadian COVID-19 cases rates from the Covid19Canada data repository"),
+person("Statistics Canada", role = "dtc", comment = "Owner of Canadian graduate employment income data from the Statistics Canada website")
+)
 Description: This package contains data sets used to compile vignettes and
 other documentation in Delphi R Packages. The goal is to avoid calls
 to the Delphi Epidata API, and to deposit some examples here for easy
 offline use.
-License: MIT + file LICENSE
+License: file LICENSE
 Depends:
 R (>= 2.10)
 Suggests:
 covidcast,
+data.table,
 dplyr,
 epidatr,
-epiprocess,
 here,
 httr,
 jsonlite,
 lubridate,
 magrittr,
 purrr,
 readr
+Enhances:
+epiprocess (>= 0.9.0),
+tibble
 Remotes:
 cmu-delphi/epidatr,
 cmu-delphi/epiprocess
 Encoding: UTF-8
 LazyData: true
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.3.1
+RoxygenNote: 7.3.2
 URL: https://cmu-delphi.github.io/epidatasets/
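
The DESCRIPTION change moves `epiprocess` from `Suggests` to `Enhances`, and the commit message mentions converting to an epiprocess object on data load if epiprocess is installed, with a warning otherwise. The loading code itself is not shown in this diff, so the following is a hedged sketch of how such a conditional conversion could look. The function `load_dataset` and its `convert` and `pkg` arguments are hypothetical names introduced for illustration.

```r
# Hypothetical loader; `load_dataset`, `convert`, and `pkg` are illustrative
# names, not part of epidatasets.
load_dataset <- function(raw, convert, pkg = "epiprocess") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # The optional package is available: return the richer object.
    convert(raw)
  } else {
    # Fall back to the plain table and tell the user why.
    warning(
      "Package '", pkg, "' is not installed; returning the plain table.",
      call. = FALSE
    )
    raw
  }
}
```

With this shape the data stays usable without the optional dependency, which is consistent with listing `epiprocess` under `Enhances` rather than `Imports`.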

LICENSE

Lines changed: 2 additions & 2 deletions
@@ -1,2 +1,2 @@
-YEAR: 2023
-COPYRIGHT HOLDER: epidatasets authors
+This contains a collection of data from different sources under different
+licenses; please see the documentation for each object for license information.

LICENSE.md

Lines changed: 0 additions & 21 deletions
This file was deleted.

R/epipredict-data.R

Lines changed: 150 additions & 27 deletions
@@ -1,12 +1,15 @@
-#' Subset of JHU daily state cases and deaths
+#' JHU daily COVID-19 cases and deaths rates from all states
 #'
-#' This data source of confirmed COVID-19 cases and deaths
-#' is based on reports made available by the Center for
-#' Systems Science and Engineering at Johns Hopkins University.
-#' This example data ranges from Dec 31, 2020 to Dec 31, 2021,
-#' and includes all states.
+#' This data source of confirmed COVID-19 cases and deaths is based on reports
+#' made available by the Center for Systems Science and Engineering at Johns
+#' Hopkins University, as downloaded from the CMU Delphi COVIDcast Epidata
+#' API. This example data is a snapshot as of March 20, 2024, and
+#' ranges from December 31, 2020 to December 31, 2021. It
+#' includes all states. It is used in the {epiprocess} correlation vignette.
 #'
-#' @format A tibble with 20,496 rows and 4 variables:
+#' @format An [`epiprocess::epi_df`] (object of class `c("epi_df", "tbl_df", "tbl", "data.frame")`) with 37576 rows and 4 columns.
+#' @section Data dictionary:
+#' The data has columns:
 #' \describe{
 #' \item{geo_value}{the geographic value associated with each row
 #' of measurements.}
@@ -38,47 +41,104 @@
 #'
 #' Data set on state populations, from the 2019 US Census.
 #'
-#' @format Data frame with 57 rows (including one for the United States as a
-#' whole, plus the District of Columbia, Puerto Rico Commonwealth,
-#' American Samoa, Guam, the U.S. Virgin Islands, and the Northern Mariana,
-#' Islands).
+#' @format A [`tibble::tibble`] (object of class `c("tbl_df", "tbl", "data.frame")`) with 57 rows and 4 columns.
+#' @section Data dictionary:
+#' The data includes 57 regions (all US states, the United
+#' States as a whole, the District of Columbia, Puerto Rico Commonwealth,
+#' American Samoa, Guam, the U.S. Virgin Islands, and the Northern Mariana
+#' Islands) with columns:
 #'
 #' \describe{
-#' \item{fips}{FIPS code}
+#' \item{fips}{2-digit FIPS code}
 #' \item{name}{Full name of the state or territory}
 #' \item{pop}{Estimate of the location's resident population in
 #' 2019.}
 #' \item{abbr}{Postal abbreviation for the location}
 #' }
 #'
-#' @source United States Census Bureau, at
+#' @source
+#' This object is derived from several datasets from the United States
+#' Census Bureau, Population Division, at
 #' \url{https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.pdf},
 #' \url{https://www.census.gov/data/tables/time-series/demo/popest/2010s-total-puerto-rico-municipios.html},
-#' and \url{https://www.census.gov/data/tables/2010/dec/2010-island-areas.html}
+#' and \url{https://www.census.gov/data/tables/2010/dec/2010-island-areas.html}.
+#' It is made available through the `covidcast` package. This data is
+#' public domain.
 "state_census"
 
 # Epipredict Vignette Data ----------------------------------------------------
 
-#' CTIS COVID Behaviours
+#' Subset of CTIS COVID-19-related behaviours from 5 states
 #'
 #' Data set for a handful of states on masking and distancing behaviours
-#' during the COVID-19 Pandemic and downloaded from the CMU Delphi COVIDcast
-#' Epidata API. This data set covers the period from
-#' June to December 2021.
+#' during the COVID-19 Pandemic, and downloaded from the CMU Delphi COVIDcast
+#' Epidata API. This example data is a snapshot as of March 20, 2024, and
+#' ranges from June 4, 2021 to December 31, 2021.
+#' It is limited to California, Florida, Texas, New Jersey, and New York.
+#'
+#' @format A [`tibble::tibble`] (object of class `c("tbl_df", "tbl", "data.frame")`) with 1055 rows and 4 columns.
+#' @section Data dictionary:
+#' The data has columns:
+#' \describe{
+#' \item{geo_value}{the geographic value associated with each row
+#' of measurements.}
+#' \item{time_value}{the time value associated with each row of measurements.}
+#' \item{masking}{Estimated percentage of people who wore a mask for most or all of the time while in public in the past 7 days; those not in public in the past 7 days are not counted.}
+#' \item{distancing}{Estimated percentage of respondents who reported that all or most people they encountered in public in the past 7 days maintained a distance of at least 6 feet. Respondents who said that they have not been in public for the past 7 days are excluded.}
+#' }
+#'
+#' @source
+#' This object contains a modified part of the
+#' \href{https://cmu-delphi.github.io/delphi-epidata/symptom-survey/#covid-19-trends-and-impact-survey}{data
+#' aggregations in the API} that are prepared from the
+#' \href{https://www.pnas.org/doi/full/10.1073/pnas.2111454118}{COVID-19
+#' Trends and Impact Survey}; see the first link for more information on
+#' citing in publications.
+#' The data is made available via the
+#' \href{https://cmu-delphi.github.io/delphi-epidata/}{Delphi Epidata API}.
+#'
+#' These aggregations are licensed under the terms of
+#' the \href{https://creativecommons.org/licenses/by/4.0/}{Creative Commons
+#' Attribution license}.
+#'
+#' Modifications:
+#' * The data has been limited to a very small number of rows, the
+#' signal names slightly altered, and formatted into an `epi_df`.
 "ctis_covid_behaviours"
 
-#' COVID-19 Incident Cases and Deaths
+#' Subset of COVID-19 incident cases and deaths from 5 states
 #'
 #' Data set for 5 states containing COVID-19 Incident Cases and Deaths as
-#' reported
-#' by JHU-CSSE and downloaded from the CMU Delphi COVIDcast Epidata API.
-#' This data set covers the period from June 2021 to December 2021, and is
-#' used in the epipredict Vignette on ... .
+#' reported by JHU-CSSE and downloaded from the CMU Delphi COVIDcast Epidata
+#' API. This example data is a snapshot as of March 20, 2024, and
+#' ranges from June 4, 2021 to December 31, 2021. It
+#' is limited to California, Florida, Texas, New Jersey, and New York.
 #'
-#' @source This object contains a modified part of the \href{https://github.com/CSSEGISandData/COVID-19}{COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University} as \href{https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html}{republished in the COVIDcast Epidata API}. This data set is licensed under the terms of the
+#' @format An [`epiprocess::epi_df`] (object of class `c("epi_df", "tbl_df", "tbl", "data.frame")`) with 1055 rows and 4 columns.
+#' @section Data dictionary:
+#' The data has columns:
+#' \describe{
+#' \item{geo_value}{the geographic value associated with each row
+#' of measurements.}
+#' \item{time_value}{the time value associated with each row of measurements.}
+#' \item{cases}{Number of new confirmed COVID-19 cases, daily}
+#' \item{deaths}{Number of new confirmed COVID-19 deaths, daily}
+#' }
+#'
+#' @source This object contains a modified part of the \href{https://github.com/CSSEGISandData/COVID-19}{COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University}
+#' as \href{https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html}{republished in the COVIDcast Epidata API}.
+#' This data set is licensed under the terms of the
 #' \href{https://creativecommons.org/licenses/by/4.0/}{Creative Commons Attribution 4.0 International license}
 #' by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering.
 #' Copyright Johns Hopkins University 2020.
+#'
+#' Modifications:
+#' * \href{https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html}{From the COVIDcast Epidata API}:
+#' The signals are taken directly from the JHU CSSE
+#' \href{https://github.com/CSSEGISandData/COVID-19}{COVID-19 GitHub repository}
+#' without changes.
+#' * Furthermore, the data has been limited to a very small number of rows, the
+#' signal names slightly altered, and formatted into an `epi_df`.
 "counts_subset"
 
 #' Canadian COVID-19 case rates
@@ -93,13 +153,76 @@
 #' \href{https://github.com/ccodwg/CovidTimelineCanada}{ccodwg/CovidTimelineCanada GitHub repository},
 #' which also reports vaccine-related signals.
 #'
-#' This dataset contains versioned data covering the period from April 2020 to
-#' December 2021 and is used in the epipredict slide vignette.
+#' This dataset contains versioned data snapshots from February 1, 2021 to December
+#' 1, 2021 covering the period from April 2, 2020 to December 1, 2021. It is
+#' used in the epipredict slide vignette.
 #'
+#' @format An [`epiprocess::epi_archive`]. The DT attribute contains the data formatted as a [`data.table::data.table`] (object of class `c("data.table", "data.frame")`) with 65299 rows and 4 columns.
+#' @section Data dictionary:
+#' The data in the `epi_archive$DT` attribute has columns:
+#' \describe{
+#' \item{version}{the time value specifying the version for each row of measurements.}
+#' \item{geo_value}{the province or territory associated with each row of measurements.}
+#' \item{time_value}{the time value associated with each row of measurements.}
+#' \item{case_rate}{number of new confirmed cases due to COVID-19 per 100,000 population, daily}
+#' }
 #' @source This object contains a modified part of the COVID-19 Canada Open
 #' Data Working Group's
 #' \href{https://github.com/ccodwg/Covid19Canada}{Covid19Canada data repository} (archived).
 #' This data set is licensed under the terms of the
 #' \href{https://creativecommons.org/licenses/by/4.0/}{Creative Commons Attribution 4.0 International license}
-#' by the COVID-19 Canada Open Data Working Group.
+#' by the COVID-19 Canada Open Data Working Group. The COVID-19 Canada Open
+#' Data Working Group collected the data from publicly available sources such
+#' as government datasets and news releases.
+#'
+#' Modifications:
+#' * The case rate signal are calculated using the case count taken directly from the CCODWG
+#' \href{https://github.com/ccodwg/Covid19Canada}{ccodwg/Covid19Canada GitHub repository}
+#' and population data.
+#' * Furthermore, the data has been limited to a very small number of rows, the
+#' signal names slightly altered, some province names replaced with abbreviations, and
+#' formatted into an `epi_archive`.
+#'
+#' The population data used (but not included in the dataset itself) is from the
+#' \href{https://github.com/mountainMath/BCCovidSnippets/}{mountainMath/BCCovidSnippets GitHub repository}.
 "can_prov_cases"
+
+#' Subset of Statistics Canada median employment income for postsecondary graduates
+#'
+#' Data set for all territories (aggregated) and all 10 provinces containing
+#' yearly income data for postsecondary graduates as reported by Statistics
+#' Canada, downloaded from the Statistics Canada website at
+#' www.statcan.gc.ca. This example data is a snapshot as of September 18,
+#' 2024, and ranges from 2010 to 2017 (yearly).
+#'
+#' @format An [`epiprocess::epi_df`] (object of class `c("epi_df", "tbl_df", "tbl", "data.frame")`) with 10193 rows and 8 columns.
+#' @section Data dictionary:
+#' The data has columns:
+#' \describe{
+#' \item{geo_value}{The province in Canada associated with each
+#' row of measurements.}
+#' \item{time_value}{The time value, a year integer in YYYY format}
+#' \item{edu_qual}{The education qualification}
+#' \item{fos}{The field of study}
+#' \item{age_group}{The age group; either 15 to 34 or 35 to 64}
+#' \item{num_graduates}{The number of graduates for the given row of characteristics}
+#' \item{med_income_2y}{The median employment income two years after graduation}
+#' \item{med_income_5y}{The median employment income five years after graduation}
+#' }
+#' @source This object contains modified data adapted from
+#' Statistics Canada, \href{https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3710011501}{
+#' Table 37-10-0115-01 Characteristics and median employment income of
+#' longitudinal cohorts of postsecondary graduates two and five years after
+#' graduation, by educational qualification and field of study
+#' (primary groupings)}. This does not constitute an endorsement by Statistics Canada of this product.
+#'
+#' The data is licensed under the terms of the
+#' \href{https://www.statcan.gc.ca/en/reference/licence}{Statistics Canada Open License}.
+#'
+#' Modifications:
+#' * Only provincial and territorial regions are kept.
+#' * Only age group, field of study, and educational qualification are kept as
+#' covariates. For the remaining covariates, we keep aggregated values and
+#' drop the level-specific rows.
+#' * No modifications were made to the time range of the data.
+"grad_employ_subset"
