Research Data Management

Data Types Guidance

Data are raw or semi-processed (cleaned-up) information collected for the purpose of analysis. A data set is inherently neutral, may be qualitative or quantitative, born-digital or digitized, and is not necessarily tabular or even digital. Data may come in many formats, including text, numerical, tabular, multimedia, models and software, as well as discipline- and instrument-specific data.

The word "data" means different things to different people in different contexts. What type of data are you creating?

Sciences

Observational, like sensor readings/weather research

Experimental, like gene sequences/clinical trials

Model or simulation, like climate/economic models

Derived, compiled or computational, like text or data mining

Social Sciences

Quantitative data, like statistical data collected from surveys, recorded in spreadsheets and manipulated with tools such as R, SPSS, SAS or Excel

Qualitative data, like audio or video interviews and coded transcripts

Humanities

If you are in the humanities, anything could be data!

The digital humanities are an emerging field that combines methodologies from traditional humanities disciplines with tools provided by computing and digital publishing.

File Format Guidance

A file format is a standard way that information is encoded for storage in a computer file.

Experts have high confidence that relatively common, uncompressed (or at least losslessly compressed) and non-proprietary file formats can still be rendered in the future. File formats without these characteristics are more likely to become obsolete or corrupt. If you are trying out new software, file formats or compression techniques, you may want to keep two versions of your data, one experimental and one more stable.

At some point during your project, you may need to migrate your files to a new format. These types of transformations may affect the "look and feel" of a digital file, but should not affect its content. Extreme caution should be exercised when transforming "complex" digital objects like databases and websites. Embedded formulas in programs like Excel should be translated into values.

Data and database normalization are also important considerations, especially when conducting multi-institutional research. It is better to be explicit about how data will be collected at the outset of a new project than to try to clean things up after each institution has collected a large amount.

Text files made up of plain text are very basic, but if they work for you, they are probably your best bet for future-proofing your data.

Organizing Data

Research data files and folders should be named and structured in such a way that someone else could make sense of them without you around.

Unlike system-generated file names, which are not always intuitive (e.g., 000001.jpg, 000002.jpg, 000003.jpg, or Document1.docx), good file names are unique, descriptive and consistent, and make sense even to people outside of your project.

Good file names do not use symbols (e.g., ! ? @ # $ % ^ & ~ ` _ + - = . , ), brackets (e.g., ( ) { } [ ] ) or spaces, which can disrupt various systems and processes. Do not assume case sensitivity, and avoid excessively long file names, which may get truncated.
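
As one illustration of these rules, the following Python sketch cleans up a candidate file name. The function name clean_file_name, the 64-character limit and the choice to lower-case everything are assumptions made for this example, not requirements of any system. The dot before the file extension is kept so the file still opens in the right program.

```python
import re
from pathlib import PurePath

MAX_STEM_LENGTH = 64  # assumed limit; choose one that suits your systems


def clean_file_name(name: str, max_stem_length: int = MAX_STEM_LENGTH) -> str:
    """Return a lower-case name whose stem contains only letters and digits."""
    path = PurePath(name)
    # Drop symbols, brackets and spaces from the part before the extension.
    stem = re.sub(r"[^A-Za-z0-9]", "", path.stem).lower()
    if not stem:
        raise ValueError(f"nothing is left of {name!r} after cleaning")
    # Keep the extension (with its leading dot), cleaned the same way.
    suffix = re.sub(r"[^A-Za-z0-9.]", "", path.suffix).lower()
    # Truncate long stems so the name is less likely to be cut off elsewhere.
    return stem[:max_stem_length] + suffix


print(clean_file_name("Interview #3 (final).TXT"))  # interview3final.txt
```

Lower-casing every name is one simple way to avoid relying on case sensitivity: two names that differ only in case end up identical and the clash becomes visible.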

File versioning is an important consideration when creating a file naming convention, especially when research involves more than one person. You want to avoid accidental overwriting of files! Using Network Data Storage from GVSU Information Technology can help, but you can also keep track of versioning manually by adding a consistent suffix such as v01, v02 to your file names. You could also use version control software like Git or Mercurial.
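
The manual suffix approach can be sketched in Python as follows. The helper next_version_name and the example file names are hypothetical. The sketch only suggests the next unused name and does not create, rename or overwrite any files.

```python
import re

# Matches names such as "surveyresultsv02.csv": base, "v", digits, extension.
VERSION_PATTERN = re.compile(
    r"^(?P<base>.+)v(?P<number>\d{2,})(?P<ext>\.[A-Za-z0-9]+)$"
)


def next_version_name(existing_names: list[str], base: str, ext: str) -> str:
    """Return the next unused name of the form <base>v<NN><ext>."""
    highest = 0
    for name in existing_names:
        match = VERSION_PATTERN.match(name)
        if match and match["base"] == base and match["ext"] == ext:
            highest = max(highest, int(match["number"]))
    # Zero-pad to two digits so that v02 sorts before v10.
    return f"{base}v{highest + 1:02d}{ext}"


names = ["surveyresultsv01.csv", "surveyresultsv02.csv", "notes.txt"]
print(next_version_name(names, "surveyresults", ".csv"))  # surveyresultsv03.csv
```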

ReNamer is a very powerful and flexible file renaming tool we use at the library. It offers all of the standard renaming procedures, including prefixes, suffixes, replacements and case changes, as well as removing the contents of brackets, adding number sequences, changing file extensions, etc.

Storage and Security


Depending on your needs, Network Data Storage from GVSU Information Technology is likely the safest, most secure place for your research data during creation, processing and analysis.

Storage

Research data, like any other data, may be stored on networked drives, personal computers or laptops, external storage devices or the cloud. Wherever you store your data during data collection and analysis, remember to back up your files using at least two different types of storage media and to keep at least one copy of your data in a separate geographic location (i.e., original + external/local backup + external/remote backup).
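
The backup rule above can be written down as a small check. This is a minimal Python sketch: the StoredCopy class, its fields and the example labels are assumptions for illustration, and the check only knows what you tell it about your copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredCopy:
    label: str       # e.g. "original", "local backup"
    media_type: str  # e.g. "network drive", "external hard drive"
    location: str    # e.g. "campus", "off-site"


def backup_plan_problems(copies: list[StoredCopy]) -> list[str]:
    """List the ways a set of copies falls short of the rule above."""
    problems = []
    if len({item.media_type for item in copies}) < 2:
        problems.append("use at least two different types of storage media")
    if len({item.location for item in copies}) < 2:
        problems.append("keep at least one copy in a separate geographic location")
    return problems


plan = [
    StoredCopy("original", "network drive", "campus"),
    StoredCopy("local backup", "external hard drive", "campus"),
    StoredCopy("remote backup", "external hard drive", "off-site"),
]
print(backup_plan_problems(plan))  # []
```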

Personal computers or laptops are convenient, and removable media like external storage devices are convenient, cheap and portable. Neither of these types of devices, which are susceptible to physical damage, loss and theft, should be used for storing master versions of your research data. Never use CDs or DVDs; their media life is not reliable in the long term.

Cloud Computing is an attractive option, but posting sensitive research data, particularly human subjects data, could violate FERPA, HIPAA and other privacy protection laws.

Security

Data Security means protecting data from unauthorized users and corruption. It can be divided into three categories: network security, physical security and computer systems & files.

Regarding network security, keep confidential information off of the Internet, and put sensitive materials on computers not connected to the Internet.

Regarding physical security, restrict access to the buildings and rooms where computers and storage media are kept, and only let trusted individuals troubleshoot computer problems. Do not share your passwords, lock your office, and store external hard drives with highly sensitive data in a locked safe.

Finally, regarding computer systems & files, keep anti-virus protection up to date, don't send confidential data via e-mail or FTP, and use passwords on files and computers. Data encryption should be considered when sensitive data needs to be stored on devices other than GVSU secure servers.

You may also need to consider a means for confidential disposal of research data. Simply deleting a file will not completely remove it from your computer! Use a tool like BCWipe to delete files forever.

Finally, this may not be all you ever need to know about storage & security. Your sponsor's data security requirements may be more stringent!

Have other data management needs, particularly those that arise during the data collection phase of a research project? See Information Technology Policies.

Contact the Director of Information Technology

Documentation and Metadata Guidance

Metadata is a word that has only recently come into popular usage. It means "data about data."

Documentation

Documentation is a type of unstructured metadata, usually a supplementary file or document intended to accompany data. The most basic form of documentation is the README.txt file.

According to Data Dryad, a README.txt file for tabular data should include:

Definitions of column headings and row labels, data codes (including missing data), and measurement units
For each file name, a short description of what data it includes
Any processing steps, especially if not described in the publications, that may affect interpretation of results
A description of what associated datasets are stored elsewhere, if applicable
Whom to contact with questions
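
One way to make sure none of these items is forgotten is to generate the README.txt from a template. The Python sketch below is only an illustration: the function build_readme, the section headings and all example values are hypothetical and are not a format prescribed by Dryad.

```python
def build_readme(
    files: dict[str, str],
    columns: dict[str, str],
    missing_value_code: str,
    processing_steps: list[str],
    related_datasets: list[str],
    contact: str,
) -> str:
    """Assemble README.txt text covering the items listed above."""
    lines = ["FILES"]
    lines += [f"  {name}: {description}" for name, description in files.items()]
    lines += ["", "COLUMNS (definitions and units)"]
    lines += [f"  {name}: {definition}" for name, definition in columns.items()]
    lines += ["", f"MISSING DATA CODE: {missing_value_code}"]
    lines += ["", "PROCESSING STEPS"]
    lines += [f"  {i}. {step}" for i, step in enumerate(processing_steps, start=1)]
    lines += ["", "RELATED DATASETS STORED ELSEWHERE"]
    # Say "none" explicitly rather than leaving the section empty.
    lines += [f"  {item}" for item in related_datasets] or ["  none"]
    lines += ["", f"CONTACT: {contact}"]
    return "\n".join(lines) + "\n"


# Hypothetical example values.
print(build_readme(
    files={"sites.csv": "one row per sampling site"},
    columns={"temp_c": "water temperature, degrees Celsius"},
    missing_value_code="-999",
    processing_steps=["removed duplicate rows"],
    related_datasets=[],
    contact="Jane Researcher",
))
```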

A codebook is another type of documentation commonly used to describe data. According to the ICPSR, a codebook should contain:

Column locations and widths for each variable
Definitions of different record types
Response codes for each variable
Codes used to indicate nonresponse and missing data
Other indications of the content and characteristics of each variable
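
To show why a codebook records column locations, widths and codes, here is a minimal Python sketch that uses a codebook to decode one line of a fixed-width data file. The Variable class, the decode_record function, the variable names and the codes are all invented for this example; they are not an ICPSR format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Variable:
    name: str
    start: int   # column location, counting from 1
    width: int   # number of columns the variable occupies
    codes: dict[str, str] = field(default_factory=dict)  # response code -> label
    missing: frozenset[str] = frozenset()  # codes for nonresponse / missing data


def decode_record(line: str, codebook: list[Variable]) -> dict[str, Optional[str]]:
    """Slice one fixed-width line into labelled values using the codebook."""
    record: dict[str, Optional[str]] = {}
    for variable in codebook:
        begin = variable.start - 1
        raw = line[begin:begin + variable.width].strip()
        if raw in variable.missing:
            record[variable.name] = None  # nonresponse or missing data
        else:
            # Fall back to the raw value when no label is defined for it.
            record[variable.name] = variable.codes.get(raw, raw)
    return record


codebook = [
    Variable("respondent_id", start=1, width=4),
    Variable("employed", start=5, width=1,
             codes={"1": "yes", "2": "no"}, missing=frozenset({"9"})),
    Variable("age", start=6, width=2, missing=frozenset({"99"})),
]
print(decode_record("0042135", codebook))
```

Without the codebook, the line 0042135 is just seven digits; with it, the same line reads as respondent 0042, employed, age 35.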

Metadata

What is metadata? Simply put, metadata is "data about data."

In the digital world, it typically refers to structured information that exists alongside or is embedded in a digital file that describes its content, context and, as necessary, structure. Technical information about your data and descriptive information about your project should always be included in your metadata.

Metadata should make your data intelligible to someone outside of your project, and even outside of your discipline. Most data cannot meaningfully be full-text searched, so metadata is the primary means by which your data will be discovered on the web.

There are many metadata schemes for research data, ranging from the very simple (Dublin Core, DataCite), to the more complex (Content Standard for Digital Geospatial Metadata, Data Documentation Initiative).
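
As a rough illustration of one of the simple schemes, the Python sketch below writes a small Dublin Core record as XML using the standard library. It uses the Dublin Core element namespace http://purl.org/dc/elements/1.1/ and its fifteen element names. The wrapping metadata element, the function name dublin_core_record and the example values are assumptions; a repository will normally prescribe its own record format.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: dict[str, list[str]]) -> str:
    """Serialize element names and their values as a simple XML record."""
    ET.register_namespace("dc", DC_NAMESPACE)
    root = ET.Element("metadata")  # assumed wrapper element
    for element_name, values in fields.items():
        if element_name not in DC_ELEMENTS:
            raise ValueError(f"{element_name!r} is not a Dublin Core element")
        # Dublin Core elements are repeatable, so each value gets its own tag.
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NAMESPACE}}}{element_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values.
print(dublin_core_record({
    "title": ["Stream temperatures"],
    "creator": ["A. Researcher", "B. Researcher"],
    "format": ["text/csv"],
}))
```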

There are also accepted domain-local standards, ontologies and nomenclature, such as:

Darwin Core for biology
Astronomy Visualization Metadata Standard for astronomy
Ecological Metadata Language for ecology
And many more!

See the Digital Curation Centre's Disciplinary Metadata for more information on these domain-local standards and to search by your discipline.

Colectica is an Excel plug-in designed to help you clean up your data and create metadata for your tabular data. OpenRefine is another powerful open source tool for working with messy data.

Copyright & Privacy/Confidentiality Guidance

Copyright is the exclusive legal right, given to an originator or an assignee to print, publish, perform, film, or record literary, artistic, or musical material, and to authorize others to do the same.

Do you have the right to make data available? Should the data be embargoed for a certain period of time? How will you protect privacy, security, confidentiality and intellectual property? Can you think of any privacy, ethical or confidentiality concerns, particularly for human subjects data? Do you need to anonymize or aggregate data before sharing it? Do any regulations, such as HIPAA, apply to your data? See Information Privacy and Protection from University Compliance for more information.

Is your data covered by copyright? If it is, you can waive that copyright with the CC0 public domain dedication.

Have other copyright or licensing questions for your data? Wondering how to apply a CC license to your data set? See the University Libraries' Copyright Basics site, or:

Contact the Scholarly Communications Outreach Coordinator

Archiving & Sharing Data Guidance

A data management horror story by Karen Hanson, Alisa Surkis and Karen Yacobucci.

Archiving

Backups are great, but on their own are insufficient for long-term preservation. Digital preservation is much more comprehensive, and takes into account issues of metadata, security, documentation, auditing, weeding, sharing and discovery, format obsolescence, media corruption and failure, organizational risks, etc.
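
Media corruption is one of the issues a plain backup will not catch on its own. One common way to detect it is to record a checksum for each file and compare it again later. The Python sketch below shows the idea with SHA-256 from the standard library; the function names and the flat, single-folder layout are assumptions for this example, and this covers only one small part of what digital preservation involves.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Record a checksum for every file directly inside the folder."""
    return {p.name: sha256_of(p) for p in sorted(folder.iterdir()) if p.is_file()}


def changed_files(manifest: dict[str, str], folder: Path) -> list[str]:
    """Return names that are missing or whose checksum no longer matches."""
    changed = []
    for name, expected in sorted(manifest.items()):
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(name)
    return changed
```

In use, you would call build_manifest when the data are deposited, store the result alongside the data, and run changed_files periodically; any name it returns needs to be restored from another copy.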

Many researchers have a lot of experience managing their data, but most are not in a position to manage their research data long-term. Similarly, most network data storage providers are not in the business of long-term preservation. Instead, they operate under the assumption that data loses value over time.

Fortunately, a good data repository or archive will take care of all of this for you. In fact, as recently as March 21, 2014, the Office of Science and Technology Policy (OSTP), an office in the Executive Office of the President (EOP), issued a letter encouraging granting agencies to include "a strategy for leveraging existing archives, where appropriate" in their plans.

Sharing Data

In general, sharing your data will allow others to do follow-up research, do new research and scrutinize your findings. It is also a provision of the NSF Award and Administration Guide, Section VI.D.4. Sharing your data using a disciplinary data repository will allow you to share your data more widely by making it findable through search engines, track data impact metrics and receive credit for re-use of published data via data citation. And it's more effective than e-mail.

Need a place to share your data? ScholarWorks@GVSU is the University Libraries' format-agnostic platform for dissemination of scholarly and creative content. While it is not meant for long-term curation of research data, ScholarWorks@GVSU is available to help meet basic data sharing requirements.

Don't know where to start? Your sponsor likely has terms and conditions for sharing, and may even suggest a particular data repository. If not, the Registry of Research Data Repositories and the Open Access Directory of Data Repositories are great resources that can help you find discipline-specific data repositories for archiving, sharing and/or re-use. Figshare is a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner, and works well for individual researchers.

Re-use, Re-distribution

Identify who will be allowed to use your data, how they will be allowed to use your data, and whether they will be allowed to disseminate your data. This may refer to local users on your team or in the larger research community. Ask yourself:

Will any permission restrictions need to be placed on the data?
Who is your audience? Which bodies/groups are likely to be interested in the data?
Who may be interested in your data in the future and what might it be used for?

The PI is responsible for sharing data without violating federal law or regulation (Export Controls) or compromising individual rights (FERPA, FOIA, HIPAA or Intellectual Property).

Contact the Office of Sponsored Programs

Contact the University Counsel Office

Budget


You are encouraged to put data sharing costs in your budget proposal.

NIH guidelines state that:

NIH recognizes that it takes time and money to prepare data for sharing. Thus, applicants can request funds for data sharing and archiving in their grant application. (See also the section on What to Include in an NIH Application.) Investigators who incorporate data sharing in the initial design of the study may more readily and economically establish adequate procedures for protecting the identities of participants and share a useful dataset with appropriate documentation.

And, according to the NSF Social Sciences Directorate:

Any costs should be explained in the Budget Justification pages.

Need More Help With Your Research Data Management?

University Libraries

Contact the University Libraries to share your data, get help writing a DMP, or for help with data management generally.

Cara Cadena
Head of Collections

Phone: (616) 331-7337


Information Technology

Contact Information Technology for IT related policies and guidelines, especially during data collection.

Sue Korzinek
Director of Information Technology

Phone: (616) 331-2035

