Morning Workshops (3 hrs except where noted)
- Automating Archive Policy Enforcement using Dataverse and iRODS
- Using Stata for Data Work
- 10,000 Steps a Day! A Journey in Data and GIS Literacy Using Non-Traditional Data Sources, For the New Data Professional
- Text Processing with Regular Expressions
Afternoon Workshops (2 hrs except where noted)
- Digital Data Harmonization with QuickCharmStats Software
- Creating GeoBlacklight Metadata: Leveraging Open Source Tools to Facilitate Metadata Genesis
- Visualizing Data in R with ggplot2
- Teaching Research Data Management Skills Using Resources and Scenarios Based on Real Data
Full Day Workshop (3+2 hrs)
- Intro to Python for Data Wrangling
Morning Workshops (3 hrs except where noted)
Automating Archive Policy Enforcement using Dataverse and iRODS
Presenters: Jonathan Crabtree, UNC Odum Institute; Thu-Mai Christian, UNC Odum Institute
Tentative location: Computer lab 215
Abstract: The workshop will highlight work of the Odum Institute as part of the DataNet Federation Consortium's effort to join the Odum Institute's archive platform with the Integrated Rule-Oriented Data System (iRODS). Participants will see how archive workflows within the Dataverse platform can be connected to iRODS and leverage the policy-based rule enforcement capabilities of iRODS. Participants will be able to create working Dataverse virtual archives that are integrated with the iRODS storage grid technology. The workshop will describe and utilize policy sets that have been selected from the new ISO 16363 audit standards for trustworthy digital repositories. These policies are written into iRODS rules that can be machine enforced. These data management and preservation rules will enforce and monitor a wide range of policies:
- Number of preservation copies
- Checksum calculations
- Frequency of integrity checks
- Creation of preservation formats
- Verification of preservation formats
- Movement of digital objects through a secure firewall
- Scans for sensitive information to protect human subjects
- Reporting of preservation status
- Verification of geographically distributed copies
- Enforcement and reporting of access controls
Participants will see machine-actionable rules in practice and be introduced to an environment where written policies can be expressed in ways that allow an archive to automate their enforcement.
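The checksum and integrity-check policies listed above can be sketched as a small, machine-enforceable fixity rule. The snippet below is a minimal illustration in Python, not actual iRODS rule language; the file path in the usage comment is hypothetical.

```python
import hashlib


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_fixity(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == recorded_checksum


# Hypothetical usage: record a checksum when an object is ingested,
# then re-verify it on a schedule (the "frequency of integrity checks" policy).
# recorded = sha256_of("dataset.csv")
# assert verify_fixity("dataset.csv", recorded)
```

A real deployment would store the recorded checksum as object metadata and raise a preservation event when verification fails, rather than simply returning a boolean.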
Using Stata for Data Work
Presenter: James Ng, University of Notre Dame
Tentative location: Computer lab 315
Stata is a leading statistical software package in the social sciences. Although not free, it has many of the hallmarks of open-source software, such as a user-contributed repository of add-on modules, an active community of users, and numerous third-party online guides and tutorials. Stata arguably strikes the best balance between sophistication and usability among statistical software packages.
This hands-on workshop will introduce participants to some of the ways Stata is used in empirical research in the social sciences. Participants will work through a series of exercises using data in commonly encountered formats. Many of the exercises will involve reproducing tables and graphs from scratch. Topics to be covered include reading data, cleaning data, manipulating data, combining data, and using the help system. Attention will be paid to reproducibility of results, which means that participants will be writing scripts in a do-file. Detailed notes will be provided to each participant for reference.
This workshop's target audience is social science librarians and other data service professionals. By the end of the workshop, participants should have gained enough familiarity with Stata to be able to start using it independently and to provide more in-depth help to their patrons who use Stata.
This is not a workshop in statistical methods, hence no knowledge of statistics is assumed. No knowledge of programming is required.
10,000 Steps a Day! A Journey in Data and GIS Literacy Using Non-Traditional Data Sources, For the New Data Professional
Presenters: Quin Shirk-Luckett, University of Guelph; Michelle Edwards, Cornell University; Teresa Lewitzky, University of Guelph
Tentative location: Computer lab 305
Abstract: The way that we look at and conceive of data has changed. Each of us is a walking data generator: online data is collected on our every page click and tweet, and our movements are tracked through our phones and at the places we visit. Literally millions of people are joining the data revolution to collect and analyse data on facets of ordinary life such as their house temperatures, health indicators, and daily step counts. "The data available are often unstructured - not organized in a database - and unwieldy, but there's a huge amount of signal in the noise, simply waiting to be released." (McAfee & Brynjolfsson, HBR, 2012)
Join us for an interactive workshop where we will take advantage of this data trove to learn strategies used to clean data, run key statistical tests, and visualize the data using basic GIS techniques. Our goal is to show you the fundamentals of working with data so you gain the knowledge of strategies and approaches that will work with these unique types of datasets that may cross your desk.
We are proposing to use SPSS and ESRI ArcGIS for the workshop, and will be prepared to discuss open source statistical and GIS software.
By the end of this workshop you will be able to:
- Prepare a dataset for analysis
- Import the data into SPSS and select and run basic statistical tests in SPSS
- Import the data into ArcGIS and prepare a map to visualize the data
Text Processing with Regular Expressions
Presenter: Harrison Dekker, UC Berkeley
Tentative location: Classroom TBD (NOTE: attendees should bring a laptop to use for this workshop)
Abstract: Regular expressions (regex) are a powerful and ubiquitous programming construct that facilitates a wide range of text manipulation procedures. Essentially, regular expressions provide a means of defining text patterns that can be used to perform text matching and modification operations without having to write a lot of code. Common uses include complex search-and-replace-style data cleaning operations and pattern-based data validation, such as detecting properly formatted telephone numbers, email addresses, or URLs. The most typical way of using a regular expression is through a function call in a programming language or directly on a command line, but they can also be used from within many text editors such as Sublime Text, TextMate, or Notepad++.
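The validation and search-and-replace uses mentioned above can be illustrated with a short sketch in Python, one of the languages covered in the workshop. The phone-number pattern below is a deliberately simplified, hypothetical example for common North American formats, not a production-grade validator.

```python
import re

# Simplified pattern for numbers like 555-123-4567 or (555) 123-4567
# (illustrative only; real-world phone validation is far messier).
PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]\d{4}$")


def is_phone(text):
    """Return True if text looks like a (simplified) phone number."""
    return bool(PHONE.match(text))


def squeeze_spaces(text):
    """A search-and-replace-style cleaning step: collapse runs of
    whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()
```

For example, `is_phone("(555) 123-4567")` matches while `is_phone("hello")` does not, and `squeeze_spaces` is the kind of one-line cleanup that would otherwise take a loop of string operations.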
In this workshop you'll learn regular expression syntax and how to use it in R, Python, and on the command line. Participants will use Jupyter, a browser-based notebook that enables literate computing, to experiment with the different flavors of regular expressions as implemented in these languages. The workshop will be example-driven, and you will be encouraged to follow along with live-coding demonstrations and complete in-workshop challenges. You will work with real data and perform representative data cleaning and validation operations in multiple languages.
Afternoon Workshops (2 hrs except where noted)
Digital Data Harmonization with QuickCharmStats Software
Presenter: Dr. Kristi Winters, GESIS Leibniz Institute for the Social Sciences
Tentative location: Computer lab 215
Abstract: QuickCharmStats 1.1 provides a digital solution for the problems of documenting how variables are harmonized. It is free and open-source software that facilitates organizing, documenting, and publishing data harmonization projects. We demonstrate how the CharmStats workflow collates metadata documentation, meets the scientific standards of transparency and replication, and encourages researchers to publish their harmonization work. Currently, those who contribute original data harmonization work to their discipline are not credited through citations. We review new peer review standards for harmonization documentation, a route to online publishing, and a referencing format to cite harmonization projects. Although CharmStats products are designed for social scientists who must harmonize abstract concepts, our adherence to the standards of the scientific method ensures our products can be used by researchers across the sciences.
Creating GeoBlacklight Metadata: Leveraging Open Source Tools to Facilitate Metadata Genesis
Presenters: Andrew Battista, New York University; Stephen Balogh, New York University
Tentative location: Computer lab 315
Abstract: This workshop is a hands-on experience in creating GeoBlacklight geospatial metadata. Using a re-configured installation of Omeka, an open-source content management system, we will demonstrate how to capture, export, and store GeoBlacklight metadata. This tool can be leveraged to assist researchers in the submission of GIS data and the creation of geospatial metadata, or to help librarians generate records to make collected data discoverable. Participants will generate sample data submissions and learn how to create individual or batch sets of metadata records, which can be ingested into a Solr index or incorporated into another discovery environment. While our primary example is geospatial metadata, we will gesture toward ways in which open-source platforms can be configured to capture metadata that adheres to multiple standards. We will situate Omeka as one among several options available to facilitate metadata creation for emerging data collections. It is our goal to demonstrate ways of thinking about local metadata solutions that are scalable and effective.
Visualizing Data in R with ggplot2
Presenters: Alicia Hofelich Mohr, University of Minnesota; Thomas Lindsay, University of Minnesota
Tentative location: Computer lab 305
R is a powerful tool for statistical computing, but its base capabilities for graphics can be limited, and complicated plots often require a considerable amount of code. The ggplot2 package is a popular extension of R's capability for data visualization, allowing users to produce attractive and complex graphics in a relatively simple way. This workshop will introduce the logic behind ggplot2 and give participants hands-on experience creating data visualizations with this package. This session will also introduce participants to related tools for creating interactive graphics from this syntax (such as plotly, plot.ly/feed).
Prerequisites: Participants should be comfortable working with quantitative data and should have some basic familiarity with R, but do not need any experience with ggplot2. Because ggplot2 uses a slightly different syntax than base R plotting, participants do not need prior experience using R for data visualization. This workshop will involve reading data into R and working in the RStudio environment.
By the end of this workshop, participants will:
- Understand the syntax and logic behind graphics in ggplot2
- Create a variety of visualizations and learn how to customize features of the graphs, such as color scales and labeling
- Learn about extensions for more advanced graphic capabilities using ggplot2 and additional resources for learning more
Teaching Research Data Management Skills Using Resources and Scenarios Based on Real Data
Presenters: Veerle Van den Eynden, UK Data Archive; Jared Lyle, ICPSR; Lynette Hoelter, ICPSR; Alexandra Stam, FORS; Brian Kleiner, FORS
Tentative location: Classroom TBD
Abstract: The need for researchers to enhance their research data management skills is currently high, in line with expectations for sharing and reuse of research data. Data librarians and data services specialists increasingly provide data management training to researchers. It is widely known that effective learning of skills is best achieved through active learning by making processes visible, through directly experiencing methods and through critical reflection on practice. The organisers of this workshop each apply these methods when teaching good data practices to academic audiences, making use of exercises, case studies and scenarios developed from real datasets.
We will showcase recent examples of how we have developed existing qualitative and quantitative datasets into rich teaching resources and fun scenarios to teach research data management practices to doctoral students and advanced researchers; how we use these resources in hands-on training workshops and what our experiences are of what works and does not work. Participants will then actively develop ideas and data management exercises and scenarios from existing data collections, which they can then use in teaching research data management skills to researchers.
This workshop is for people tasked with teaching good research data management practices to researchers.
Full Day Workshop (3 hr session in the morning AND 2 hr session in the afternoon)
Intro to Python for Data Wrangling
Presenter: Tim Dennis, UC San Diego
Tentative location: Computer Lab 205
09:00-15:30 (lunch break 12:00-13:30)
Abstract: Data professionals supporting social science researchers provide valuable services throughout the data management life cycle. According to recent surveys, up to 80% of a data scientist's time can be spent cleaning, harmonizing, and integrating data (a.k.a. data wrangling). While there are many useful tools available to assist with these types of workflows, knowledge of basic programming can be extremely empowering.
This full day workshop will provide an introduction to Python - one of the most popular and versatile languages in use today.
No prior programming experience required! The workshop will be split into two parts: "Basic Python Programming" in the morning, and "Working with Data using Python" in the afternoon.
Participants will be able to:
- proficiently use scientific notebooks in the cloud
- write basic Python programs
- integrate disparate CSV files
- transform web-fetched JSON data into CSV format
- reference great materials for deeper learning
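The JSON-to-CSV transformation listed above can be sketched with the Python standard library alone; the field names and sample payload below are hypothetical, standing in for data fetched from the web.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect every key that appears in any record, in a stable order.
    fieldnames = sorted({key for rec in records for key in rec})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical web-fetched payload:
payload = '[{"name": "Ada", "steps": 10000}, {"name": "Grace", "steps": 9500}]'
# print(json_records_to_csv(payload))
```

Gathering the union of keys up front means records with missing fields still produce well-formed rows, a common wrinkle when wrangling real web data.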