PresQT Preservation Quality Tool

Needs Assessment Results

In the summer and fall of 2017, participants were invited to contribute answers to the PresQT research study "Data and Software Preservation Quality Tool Needs Assessment" (University of Notre Dame Study # 17-04-3850, DOI 10.17605/OSF.IO/D3JX7). Data collection closed Sept 1, 2017 at 5 PM EDT. Participants' answers to a series of questions about their past practice and anticipated future needs as researchers and/or software developers contribute to a better understanding of which tools and/or tool suites would benefit those preserving and/or sharing data and software.

The Needs Assessment questionnaire and response data are available on the project page.

Questionnaire (PDF)
Data

Tools/Usefulness/Sort

1. Indicate whether implementation or integration of tools like those below would ease your path to publishing, sharing, curating, or reusing data or software: (tools_use_matrix)

Provenance

Workflow

Fixity

Assignment

Profile Based Recommender

De-identification

Quality

2. Do you have a data or software preservation quality tool need this project could help you develop? If so, please describe: (tools_data_preserv)

Sample of responses:

4CeeD cloud-based system that collects data and metadata from microscopes, and then allows scientists to curate the data on the cloud.
A 'map of the data preservation' landscape showing canonical repositories, types of data it stores, metadata requirements, costs, etc.
A Thermodynamic model
A clear way to save large MD trajectory files that is cost-effective
A major challenge in my main field, which is molecular dynamics simulation, is the question of how to share and curate simulation trajectory files. These files are the base output of every molecular dynamics run - they are the equivalent of providing a sample of a material itself in experiment. The central challenge is that these files are very large (we have ca. 100 TB of trajectory files from our group alone). It is also extremely difficult to track their provenance and the precise metadata associated with the conditions under which they were generated. Nevertheless, a facile ability to share these files among the community could exponentially amplify the community's productivity by permitting reanalysis of existing trajectories, rather than a constant need to redo work someone else has done. There is presently no good solution.
A place to put large ([greater than] 1 TB) datasets (and associated metadata) for preservation at no cost to the data producer and that will remain publicly accessible. Preferably, this service would have an API so that datasets could be easily integrated into other services.
A suite of R packages for reproducibility
A tool that can assess what needs to be preserved / documented to maintain long-term access to data and software
A tool that would ensure that a proper lab-book/log entry was provided for each data recording session. A generalized and very fast/easy tool for checking in data analysis lab book entries. It's fine to have analysis code, but without a lab book and demo/docs it's nearly impossible to run or understand the code.
A version system that is good with videos or specific files with software such as Unity3D or Autodesk Maya
A way of identifying the right repository for my data.
All of my projects work with speech and natural language data, therefore I have extensive experience creating and deploying software solving all of the problems mentioned on previous page
All of the things on the previous page would be useful
Already have sufficient tools for this.
Anonymization of data; version control; tools to ensure integrity of data files
Any tools for developing and sharing ontologies in OWL.
Anything that helps tracking who, when or what data were added or changed. Overall tracking of projects and project-based metadata standards are needed. One huge problem in data preservation is lack of translation services that allow data in obsolete software or stored on obsolete hardware to be recovered more easily.
As part of a user facility: we are developing data & software preservation protocols right now, so suggestions of current best practices and canned tools could be very helpful
Assigning DOIs to datasets. Finding relationships among disparate datasets and mapping concepts across them
Basic spectral data format
Better integration of version control software and ids that describe what revision of a software was used for publications
Collecting data from social media
Comparison of hosted platforms
Computational and experimental data. Software.
Could use a tool that can apply anonymized identifiers across multiple data files.
Curate instances of code used in published papers where that code has now been developed further.
Currently using Samvera, which does not make it easy to export metadata to create an offsite archival copy. Currently my biggest need for our repository!
Currently using available tools to track and preserve software and data.
Currently we archive our data with the LTER Network Information System (NIS) Data Portal. However, as we move forward with new projects we will have other csv files that need to be archived/preserved. Right now we are planning to use curateND for those files.
Custom student data for educational research in STEM
Customers of an institute I run may have this need.
Data Mining on Big Data for Automated User Profiling
Data collected from children's learning app, so privacy/anonymization of students' identities is crucial. This data is currently stored in a set of SQL tables.
Data from engineering education study
Data from insurance claims that need to be made anonymous
Data from robotic telescope
Data need. Test and analysis data and softwares used to generate the analytical data.
Data tool for archiving evaluation plans for citation [see] the recent AERA-NSF workshop on data sharing and transparency.
Database of observational data
Databases of mathematical classifications (similar to crystallography tables) that need timeless and language independent modelling.
De-anonymization
De-identification of qualitative data (e.g., video or screencapture). Project level organization of data.
Different types of phylogenomic data (sequences, SNPs) and data processing and analysis pipelines.
Don't know - not sure what "this project" is and what resources are available to researchers outside of Notre Dame.
Don't think so
Drupal or Wordpress plug-ins for data quality/management tasks.
Easy metadata entry and retrieval
Easy to use database to identify projects, users, etc. associated with large batches of raw data.
Flexible system for creating meta data when uploading unstructured datasets
GitHub
How to make ethnographic interviews anonymous in multiple languages.
How to preserve linux command pipelines from bioinformatic analyses?
I am developing a search engine where the search index should be shared but it is too big to be simply uploaded.
I am involved in developing preservation of workflows in computational modeling
I am not sure but we are developing a software in Matlab as a part of the project
I am not sure what "this project" does, so that's hard to answer. In general, though, I think it is more a 'community' level issue, rather than an individual researcher need.
I am not sure. I am working on a global millipede species database, MilliBase.org. Currently, I think we are doing ok.
I collect experimental data and would like to do a better job of archiving it.
I do but I am not sure a general purpose tool could help.
I generate lots of data...
I have A LOT of different kinds of data (video, written, digital artifacts, etc.) - it would be really useful if there were some sort of tool that allowed me to organize all of that data so that it could be easier to analyze it.
I have a large amount of coal geochemical and petrographic data gathered over the past 39 years.
I have a project archiving information on survey data quality. contact me another way to discuss this
I have a stellar population synthesis code that is managed on bitbucket, but could probably be better managed. This is particularly true because part of the software makes use of large training sets that are themselves too large to be managed on bitbucket. At present they are just stored on our group's website without any real version control.
I have been developing modeling software for 10+years focused on a single project, and it has gone through many revisions and extensions.
I have data [and] software to preserve, but am wary of another source or group that wants to deal with this other than myself
I have data from electronic structure calculations (VASP) from which pieces are extracted and then stored and analyzed in Excel spreadsheets. Very inconvenient for long term storage.
I have many types of data from -omics to ecosystem fluxes to imagery.
I know of several tools that do parts of the process, but having a tool that navigates the expositions tools would be great (and meta!). It would also be easier for me to find the gaps and answer this question, since different projects require different tools.
I mainly need something that makes it easy to archive large numbers of large files (the lab can easily generate more than 1 GB of data per day if we wanted to) and ideally also makes it easy to tag them with various metadata.
I only do qualitative research and use NVivo or HyperResearch to manage my text based data
I produce artifacts for almost every one of my papers, so I have dozens
I recommend that imaging data be stored as FITS files
I run the Paleobiology Database, and we would like to create archivable snapshots of the database for reference.
I struggle with how to combine usage of GitHub for collaboration on data with long-term data storage. Ideally, I'd want only one copy of each dataset, but I'm not sure GitHub is the correct location for long-term storage, therefore I have to either have two copies of datasets, or temporarily move things to GitHub and then to long-term storage.
I use LTER Network tools.
I use a variety of existing software packages that address all or most of the issues in the previous question.
I wish there were tools for more easily documenting changes to query structure in relational databases.
I work in chemistry. I would like disparate data to carry an RFID like tag that can easily be collated by another software rather than forcing the student to always curate. Automate the coherent collection of all data into one location
I work with Google Docs. I'm not sure that is what you mean.
I work with fossils from other countries that I photograph for future use - these are not particularly well-preserved fossils and they are identified to different taxonomic levels. I'd like a place to deposit these images and have them be searchable and grouped based on stratigraphic relationships (I think Macrostrat and PaleoBioDB have attempted this, but they don't have a place for many many many photos)
I would like to digitize and organize data from cabinets at a field station.
I would like to have an easy to use tool to document comments on individual data bits in a large data file preserved in excel, a database format, or something similar.
I would need more information on what a "software preservation quality tool" is to answer this. I do develop software that produces large amounts of data for both scientific and educational settings.
I'd like to use version control in my group among team members, but be able to see who did what.
I'm collecting a large amount of RFC 4898 TCP stack data that could use effective de-identification and management tools.
I'm not completely sure what you mean. Maybe this - I have old stopped-flow data created on an old Mac-driven system. It'd be pretty tough to look at those data now.
I'm not sure - we have code ( openmd.org ) and data (large trajectory files) that could be curated and archived better than on CRC afs space.
I'm not sure what you mean by "preservation quality tool."
I've planned on using the widely used version control system, git, for my code
I've worked with PSLC's DataShop in the past, and would love a more general tool for education data that is not action-level from a system.
Identifying cases where de-anonymization is possible; reliable provenance and timeliness indicators.
If I understand the question correctly, what would be useful to know is who worked on what, in what sequence, and what the exact updates were. Currently, much of this is accomplished through Github, Dropbox, and Microsoft Word tracking which has a combination of these abilities.
Interwoven data sets, with some common variables shared. Which variables are shared across sets changes.
It's not something I've thought very hard about, beyond satisfying NSF preservation requirements and enabling replication.
Keeping track of multiple projects over years
Large data and video files that must be made available
Large data files. Software preservation is done through Github/open source.
Linux Provenance Modules
Many tools exist... I never know the best one to select, and they're all difficult to search practically. They need to be user-friendly from the perspective of a data user, rather than the provider of data and the data manager
Maybe, but we develop quality assessment tools ourselves and can tailor them to our specific needs.
Method to archive and assign DOIs to reusable workflows - see [our project]
Model My Watershed online GIS in the WikiWatershed.org toolkit
Montage image mosaic engine
More than I want to take the time to describe here
Much useful phylogenetic and microscope image analysis/control software becomes difficult or impossible to use only because Mac and PC operating systems change. Such software would still be useful if it were updated!
Multiple ad hoc bioinformatics pipelines vulnerable to changes in data and software versions.
My case is more on the data side, regarding workflow and provenance.
My main problem is WHERE to archive and preserve my data and software. I do not have a permanent (or even temporary) project website and I do not believe that NSF offers a place to archive project data. I often work with Stata, SAS or Excel, so I do not see a major problem with documentation and preservation unless, perhaps you wish to move everything to a flat ASCII file. A more serious preservation problem comes from using proprietary programs such as AnyLogic (www.anylogic.com) or TreeAge (www.treeage.com) where version changes can make it difficult to run older versions of these programs.
Need a better option for permanent archival and curation of datasets and minting of DOIs.
Need a tool that can help anonymize data prior to sharing.
Network-related experiments such as for Internet measurements or security/privacy are hard to undertake and replicate. Some tool to help "standardize" this type of experiment would be useful.
No, I am actually leading a data preservation project for NSF in my field of atmospheric chemistry so I wanted to see how these tools are being developed elsewhere.
No, I do mainly theoretical research and have minimal data needs.
No, at present our needs are met by existing tools (github, etc.)
No, but we have some advanced tools we've developed in house. We've been the subject of an ethnographic study on our software tool development, with significant overlap on this issue.
No, my field uses github which does some of this
No, the ones I checked above as most useful exist in Linux (e.g. version control systems)
No. But we anticipate a need in the next 2 years.
No. We make use of the OSF (osf.io) for collaboration and openness
No. We use github, containerization, FAIR practices.
Nope - we archive our educational materials on partner websites
Not at this time but data management is a focus overall
Not at this time. My current approach is to store my software in public/private GitHub repos. If we want to release software developed as part of our research, we add an open source license to it and switch the repo from private to public.
Not really. I typically work with data that is stored on a community server and is accessed as needed. A key aspect of the data (seismic) is that it is in a nearly raw state and uniformly saved to maximize its ability to be used in new ways.
Not sure. It is not a priority given other considerations.
Not sure. We have CODAP, an online open source platform/library. It logs data of student actions. Does that come under this project?
Nothing in particular. In general I need a place to archive code and data for journal articles.
Possibly - too early to know for sure. Ask again in 1 yr.
Possibly, but how would this be better than existing tools?
Possibly. I have many data sets that need to be archived.
Preservation of analytical data from archaeological assemblages
Preservation of mathematical software tools in a runnable, reusable form.
Preservation of software (and associated workflow) related to published papers
Preservation of theoretical software tools in HEP (FEWZ) http://gate.hep.anl.gov/fpetriello/FEWZ.html
Preservation of various levels of community data, and associated software
Preservation tools for qualitative data would be really useful!
Preserve large amounts of mysql dumps, python notebooks, etc
Preserving data
Preservation and tracking of custom data from annotated videos
Probably not at this time - we use some standard version control/tracking software for software tool development that works well for us currently
Probably not, but workflow preservation tools are interesting
Protocol navigator, it acquires meta data
Provenance tools
Providing permanent PI accounts on open git repos (e.g., bitbucket, github, etc.) would go a long way to preserving data. This is not so much the lack of a tool, but the lack of permanent funding to use a tool.
Public data hosting repository with permanent reference to be used in published articles. Perhaps a link/reference to the published article.
Quality tool, De-identification tool
SQL database
Secure, affordable/free long-term storage & data sharing option
So far I am posting my papers on ArXiv and don't have much other data that needs to be officially preserved.
Software stack "versioning" - ensuring software / scripts can be re-used long after release
Something like this [LIGO Document Control Center Portal url provided] but robust and distributable backed by a google strength team
Something to help with storage of experimental data in a single location, accessible from anywhere
Something to keep track of the multitude of different data-file types that our equipment produces, along with the metadata about experimental settings, etc. An easy way to organize and access this information that doesn't require a degree in computer science and only minimal understanding of databases is necessary for any tool to be adopted.
Standards for expressing and encoding provenance
System that can be used during development and quickly clean up and select what should be made public at the time of publication.
The biggest problem I have is that tools from prior work don't work due to software platforms evolving and the tool not getting updated or missing libraries.
The provenance tool would be most useful now, I'm doing some research on the history of glacier exploration in the western US
There are lots of places we could use help...
There is a great need to develop tools for capturing the provenance of biomodels (e.g. data sources and assumptions used to build models) to increase the comprehensibility of models and make models easier to modify and combine
There is a lot of development work out there already.
There is a strong need to be able to record corrections or supplemental data provided by users of our specimens (or their digital representations) as a layer separate from the original specimen metadata but searchable and displayable with original metadata
There is no standard for preserving/publishing NMR and other spectroscopy data for synthesized compounds, analogous to the CIF or PDB.
Too many! However, you might check out the work that Vicky Steeves at NYU has done. Although, you're probably already aware.
Tool to capture metadata with minimum user intervention would be of great help.
Tools that help provide appropriate metadata for computer simulation products I generate, following a template specified by a research program.
Tools that support the proposed Metabolomics Data Standards [respondent provided citations]
Undergraduate research projects that span multiple institutions and leverage course embedded research projects would greatly benefit from a shared data tool.
Unsure
Video and physiological data on orofacial movements in humans and animals
Virtual Reality applications
WE HAVE DEVELOPED AN APP THAT IS INTENDED TO GATHER CROWD SOURCED DATA. OUR CONCERN IS PERPETUATING THAT DATA
Way to easily organize, share, and preserve anonymity of data
We already have structures and processes in place for our data management.
We are curating (or helping programs curate) many biodiversity databases. Helping with metadata, provenance, all this would be useful.
We are struggling with the increase in the usage of VMs and Containers in research workflows. Developing a tool to aid in preservation/curation of these would be extremely helpful.
We deposit data in public repositories such as the Gene Expression Omnibus - [and] thus use their tools.
We don't have need at this time
We generate roughly 1 TB/yr of data, mostly in the form of stacks and arrays of 2D data (images) and spectra. We extensively use custom code to process this data. We co-develop technical manuscripts and presentations to communicate this data.
We have a good deal of cycle data from engines that must be preserved.
We have collected an archive of ~4000 images of 4th and 4th graders written work on fraction arithmetic problems. These images have each been tagged with ~10-15 identifiers. There are ~500 tags in total with a complex, nested structure. We are interested in developing tools to preserve and expand access to this archive.
We have some software we have developed on GitHub. We have research data that we have had to construct our own RDBM schema for. Deidentification of data and good metadata/ontologies would be helpful.
We have two sets of data, both involving numerous student papers as well as surveys, interviews, etc.
We just completed a full analysis of about 150 TB of particle physics data and are publishing the results. This effectively puts this data out in the public domain (partly by requirement of the journal). We do not currently have a way to do this efficiently. More broadly speaking, many other experiments at the national lab I am working at are in a similar situation and the lab itself does not provide for site-wide public data preservation solutions.
We maintain several data resources for both internal and external use and are interested in many such tools, more than a simple survey could cover.
We need a tool that will help our users better manage their data. Manage means - deposit, attach metadata, attach DOIs, etc.
We need a tool to preserve the workflow in either xml or json format.
We need both data and software preservation tools, as well as training to use them for our projects that develop and apply first-principles calculations to carrier dynamics in materials
We need tools for data consolidation and maintenance
We use GitHub to preserve software -- that tool is sufficient. GitHub does not handle large digital files.
We use google docs and leave suggestions/comments
We use the tools freely available to us at [our university]. De-identified data is stored electronically on secure [university] servers to which only project and research team members have access. Hard-copy data is housed and managed by project evaluation and research team and PI, and kept for only 1 year following the project year they were collected. For long-term use, data is housed in [departmental] secure file server. Access to all research files on the server are protected using NTFS permissions, thus restricted to only those individuals with appropriate individual or group-level permissions.
We would dearly love a tool that helps us with a project that collects reams of image data (confocal and widefield) and then quantifies images. An ability to move between Zen Black, Zen Blue, Fuji, and other image analysis software would be amazing - if this is possible?
We write code to perform our measurements (in Labview). It would be helpful to know what version of the software was running to take a particular dataset. We have thought about using SVN or hg but these are cumbersome solutions
We're planning to develop a software preservation portal that might be an avenue for collaboration.
We are creating databases of images of forams including manual segmentation; We are creating databases of wearable data for individuals including physiological and environmental sensing
What I really need is a way to share resources (e.g., research-based instructional materials) that will be available to the public in perpetuity, without having to worry about maintaining a server, fixing breakage as infrastructure software evolves and updates, etc. (It's a problem many of us in Physics Education Research have.)
Working on a website to store curriculum modules.
Wow, that's a really interesting question. My primary focus these days seems to be Visual Analytics. I would be very interested in a system for preserving and annotating visualizations. One of the most critical failures I see in final analyses of data, is that dozens of visualizations of the data may be generated, and these are so poorly annotated and cataloged that it becomes almost impossible to reproduce an identical visualization after even a few days of mental bit-rot. The result is an ever-growing stack of randomly stored visual analyses that are essentially useless, because it's impossible to completely understand their content. If you're actually interested in collaborating, this is a sufficiently interesting project that it might be worth talking to the NSF about specifically funding it.
Yes, I am part of a team working on the development of software for biodiversity specialists. We are trying to envision all those issues in our product.
Yes, data about pharmaceutical quality and about lead assay results
Yes, metadata and persistent identifiers for both individual data and data bundles
Yes, we have extensive longitudinal data files and could use many types of tools to make it easier to archive the data set
Yes, we have lots of DNA sequence data that are difficult to archive and distribute. We use Github to make reproducible scripts also for dissemination but I am concerned about longevity issues.
Yes. I am currently curating a metadata set for a five-year NSF project with multiple types of data in multiple formats. I would be grateful for a tool that would help me do this curation efficiently and effectively.
Yes. As our work is funded by NSF, we have to comply with their desires for data management.
Yes. Currently working on the best format for preserving the data collected by the project.
Yes. Database of atmospheric cloud measurements and data processing software
Yes. I need a tool to manage data and metadata and version control.
[respondent creates widely used software] Currently the data and software are archived in different ways in different places.
[respondent describes a need where they have] de-identified data by hand in order to be able to publish the data and [need] for additional tools that remind about potential decisions, or that could even take a data set and automatically de-identify for public presentation, are potentially useful.
[respondent describes developing] an organization standard for brain imaging data and [how] it would be great to get additional help with building out the validator and related tools
[respondent describes] large software project (Einstein Toolkit) which generates data (gravitational waveforms) which are used by other projects, where it would be important to have such tools.
[respondent has] Numerous investigators that have interest in facilitating the access, maintenance, and preservation of various kinds of science based models for managing water quality and living resources in the watershed, airshed, estuary, etc. and our investigators also of course have "data management" needs associated with their research grants and publications
[respondent provided URL to a paper on LACE2: Better Privacy-Preserving Data Sharing for Cross Project Defect Prediction] http://menzies.us/pdf/15lace2.pdf
[respondent provided grant number ]
[respondent] disseminates and archives modeling software, but relies on GitHub for version control. Preservation capability would be useful.
a GitHub.com plugin for scientific software
a tool that can keep software current/updated/working with the latest versions of OS and programming languages/compilers
a way to categorize and archive coding choices and decisions over organization of data
aerosol forcing data
all software stops working after a while because the environment changed
assistance with preservation of audio data
cloud-based storage associated with PI rather than institution.
collaboration tools. Robust and compatible with Word, as easy to use as Google Drive, and tracks changes by user as well as Word does.
converting data in old (no longer used) data formats into plain text
data: large scale proteomics, transcriptomics and metabolomics
data
de-identifying
detailed research data
different tools for each category listed above
easier ways to resolve conflicts in github so that more people will use it without fearing entering a state they cannot navigate
git and mercurial already do a good job of most of these things. What we need is for the NSF and other funding agencies to REQUIRE PIs to use good practices.
github
how to assess comprehensiveness of archive files
hydroshare.org could benefit from this.
i have data of all kinds and have not really thought about preservation
integrated use of identifiers from trusted sources (wikidata, orcid ...)
integration within Jupyter
jupyter notebooks on github
laboratory data on sediment transport experiments and computer models
large streams of time-series data that is interconnected even when in separate files
long-term data for several decades that contains information on multiple thousands of individuals
multivariate data analysis
no, but would be glad to use/test anything coming out of the project
no, our data quality is addressed at the analytical stage.
normally use github
not really - my research is mostly with VR tool-building and human subjects experiments in immersive virtual environments
not really because there are so many kinds of data and it's only a trained human eye that can tell whether the original data along with experimental conditions was recorded completely.
not sure
not that I am immediately aware of, but I would be interested in exploring what is developed
no
personal diligence is better than hard-to-use tools
previously uncurated data associated with astronomical journal publications
refining specimen lists based on precision/resolution of locality data would be cool
repositories of benchmarks for research in VLSI CAD
synchrotron tomography data processing
the concept/ keyword tagging would be amazing.
tool checking published metabolite tables for matching names to structure or DB identifiers
tool that helps to define and then confirm Climate and Forecast Metadata standards for unstructured grid ocean models
tool that summarizes the IP, terms of use, etc., from the data source
vr environment
we are struggling with project management tools due to a multi lab pipeline for generating data
will need to comply with NSF rules to make qualitative interview data publicly available to other scholars w anonymization
yeah, but it will be very complicated to develop.
yes massive amounts of field reconnaissance data collected by different PIs with different instruments across dates and locations.
yes, anonymizing data to share with others
yes, energy data that has been collected

3. Is there a tool gap in your digital ecosystem or workflow? (tools_gap)

A Mac/PC compatible research log that feels like the traditional laboratory notebook
A clearly appropriate repository for data relating to a publication.
A database that links raw data and all analyses, results, or publications associated with that data
A gap exists in porting data from platforms (or resources) that support active use, to archival or preservation-oriented platforms.
A huge gap i.e. from raw data to scripts to final statistics different software is used as raw data can be acquired in different tools. So most processes e.g. de-identifying is done semi-manually
A simple database to backup, store, and share data would be helpful. My data files are large, so TB of storage would be needed. Due to this, the database could run on a server such as Amazon web services
A tool for multi-site data sharing.
A tool to widely invite and manage data from multiple sources
A way to hide previous versions behind the current would be useful
A workflow versioning/cloning/preservation system that has a shallow learning curve for undergraduates
API i/o to gov databases like uniprot
Ability to interface/move data among different applications aimed at similar tasks (recording digital specimen data)
Acquiring matrices from users is manual, as is archiving them. It would be great to have a tool directly accessible from MATLAB that says 'save this data forever' in my collection.
All my data are stored piecemeal as it is generated, and I would like to eventually have a single database
An easy to use database to quickly search and retrieve data from multiple different types of simulations would be beneficial. Reproduction of literature description of simulations is also challenging.
An easy way to store data so that it is accessible.
An electronic lab book that actually works well
Analysis tools that don't result in proprietary files.
Anonymization is the main challenge. It would seem this is becoming an impossible goal however.
Are there any environments for preserving antiquated web applications?
As before, still working on selecting the system for storing the data.
At present, the Johns Hopkins University library has a data management team that is doing an excellent job of managing data in publications. The system also houses software, but these features are new and I am not as familiar with their capabilities and weaknesses. I do know that they are more for archiving and reference and do not facilitate active management with version control.
At this point, we are looking at ways to make data accessing, data management and software sharing more accessible across multiple institutions
Autogeneration of metadata, one-click upload to institutional repository.
Automatic acoustic speech analyses would be terrific. Unfortunately, that is tricky.
Automatic name replacement or removal doesn't provide the anonymous level required
Basically, we have been using fairly standard tools; Excel work sheets, basecamp, project directories, Oracle for more detailed work. What I miss is an effective template for organizing data from different sub-sets of the project to show its relation to other sub-sets and the whole project.
Better to ask our tDAR developers. Automated conversion of diverse digital objects to preservation formats is one.
Better tools for metadata creation. More user-friendly. Wizard-like.
Bitbucket, SourceTree
Capturing the whole workflow and changes across disparate things (scripts, programs, config files, ...)
Code is easy to maintain and share on GitHub. Making large datasets available is a bigger challenge. A free git for large data sets would be great.
Collapsing & expanding large data sets easily to see larger trends
Common collection site for student's and other internally developed code
Coupling models that use different variables, data formats, etc.
Curating digital workflows is time consuming. An independently verified workflow tool would be nice
Don't think I have a digital ecosystem. I have files and backups
Data backup does not happen immediately on our system but must be manually commanded.
Data file curation from CRC afs space would be wonderful.
Data upload/download
Data, derived data, metadata
Development of metadata for a data set that adheres to a research program's standard, including accessing a library of standard names, units for the data, its space-time descriptors and generation sources and methods.
Digital repositories, esp. for sharing / merging computational and experimental results
Digital scholarship workflows are not well defined and often rely on 3rd party software. Assessing those softwares would be helpful (i.e. Scalar, timeline.js)
Easily getting data out of the databases in a useable form for different end-users and making sure that quality issues are flagged in a way that will be apparent.
Easy version control for wet lab protocols
Expertise in using relational database software (Filemaker)
File conversions (netCDF -> ArcGIS raster, for example) remains a time sink. A tool to easily convert common filetypes would be useful.
Finding a way to track changes in data analysis-i.e. which data sets are most current, modified, ability to return to initial unedited data
For now Git(Hub) works pretty well for our needs + institutional library infrastructure.
For the moment there is no digital ecosystem support for research offered to professors at our institution
For this type of project, we don't have an end to end workflow. Much of the processing is ad hoc pieced together from tools available online.
Free and open electronic lab notebook software integrating chemical structures
Frequently, data capture, processing and analyses must be performed in different software packages, which often don't read the same file formats. I'm constantly looking for methods to simplify this. Currently, I'm diving into R as a one-stop shopping tool; however, the difficulties here are with identifying what tools to use when and learning how to use them.
GPU support for tools like Jenkins
Gap between raw data and stored annotated data - big data
Good and intuitive project management tools: task assignment and tracking and tagging across projects. Intuitive organization and easy to update.
Good latex sourcecode annotation
Good task management software
Had to get help with importing spreadsheet (Excel) data into a data matrix format
Hard to say whether what we now do by hand is automatable or not...
Having an interactive Gantt chart would be helpful - assigning tasks and checking off tasks would be functional
Heterogenous Data management
Huge gaps. We are an academic lab with limited resources, thus we cannot develop sophisticated data management tools from the ground up.
I am embarrassed to admit that we are way behind in the process of preparing and sharing our data with other members of the research community. One consideration that has kept us on the sidelines for too long is that we conduct research on public understanding of climate change, and we are concerned (and we have funders who are concerned) that our data, if made publicly available, will be used by opponents of climate action to confuse the public and delay climate action. So, selective sharing of data with only qualified researchers is an interest of ours.
I currently don't have any digital workflow or project coordination tools besides e-mail and dated files
I do event data development, which means content coding of (typically) news stories. Keeping track of multiple coder marks is critical to this process and evaluating inter-coder reliability
I do need better means to document procedures used to treat data and to release intermediate data products.
I do not even have a digital workflow yet.......
I do not have convenient storage open to the community for the large collection of computational data that was obtained in the last few years
I do systems research; reproducing those sorts of things requires specific HW and very controlled execution environments.
I don't even know enough to know where that gap would be. I use data management provided by my research site and/or institute... so it changes a lot. None of them are easily searchable when I'm looking for someone else's data.
I don't have a workflow ... so that's my gap
I don't have the resources (personnel, time, money, etc.) to diligently document all the data my research group generates.
I don't think so. There are a lot of good tools out there for analysis, version control, and digital note taking.
I have copies of questionnaires and SPSS files from many past surveys. Anything that would help me go beyond that stage with a minimum of effort would be a help
I have no idea what "digital ecosystem" or "workflow" mean.
I have tried a variety of tools to serve as "Digital Lab Notebooks"--and none of them are very good.
I move between many tools and those connections could be automated or at least turned into a standard workflow.
I need safe, easy-to-access cloud data storage. This is not as easy to find as you might think.
I share data via standard repositories (e.g., Dryad) but am otherwise old-school and wouldn't really know where to begin in response to this question.
I use Asana to track projects, but I don't want to pay for the professional version that has the tools I really need.
I use R for most of my data analysis, but am on several working groups for evaluation of STEM education programs. Would love a tool to let programs contribute data from common assessments and then run basic EDA and statistics.
I use RMarkdown for drafting manuscripts (good reproducibility), but find other tools (like google docs) better for collaborative editing, especially with collaborators who do not use R markdown. What's difficult is bridging the gap between the two without error. Current approach is to use a diff-check tool to ensure both versions are the same.
I use excel to prepare data files, put them into JMP for analysis, sometimes going back and forth between excel and JMP. Once the analysis is complete, I use Sigmaplot to graph the data and Word to make the tables. It would be nice to have one package that did all these things, and seamlessly and quickly. I don't use R, partly because it has a steep learning curve and I'm older and don't have time to devote to it, and from observing my students use it, it seems very clunky. JMP is ideal for my needs because it is quick, easy to learn, and interacts well with excel.
I use video data, so it is very tough to anonymize
I work with many others who do not necessarily possess the knowledge to work in relational or graph database environments, so I often get flat files which I have to import myself. I wish I could provide users with a friendly format for them to import their data into a useful structure and provide metadata so that I would have to do less of that on the back end.
I would like something that stores raw digital data, processed data, data graphs, and metadata all in one system.
I would like to be able to set up arbitrary scripts or data elements to aid in metadata creation.
I would like to bring tools for search together into one framework but all existing approaches are very isolated and challenging to apply.
I'd love a tool that allows me to track which projects I'm working on, since i work on so many. For example, I could toggle which project and it would keep track of the time over the course of the week so that I could have an accurate count of time.
I'm new to large data sets, so do not know
I'm not sure - it has been difficult to find a tool that is protected that allows me to upload all of the different types of data into it (especially large video files).
I'm not sure if this is what you are looking for, but we have a need for better notebooks that are easily updatable by multiple individuals but also can be backed up routinely and securely both to the cloud and locally.
I'm wishing to create digital excavation forms (archaeo) and better data archives.
I've started using git several times, but the difficulty of undoing mistakes led me to drop it every time
I've written most of the software that I use myself (since I also designed and built the main data-collecting instrument). Simple ways to securely archive the data are big gaps.
Identifying contributors and contributions at all stages; integrating provenance and identification with dropbox, google drive and other integrated shared folders
If I could export both data and metadata from Samvera, I would use the DC Package Tools to create bags for export, or run them through Archivematica. Automating this workflow would be useful, but not a big gap
In computational hydrology, there are a variety of data sources, formats, software tools, etc that scientists use. There is no simple solution that can handle this variety of data and software tools.
Integration between tools & development of standards are my two major challenges.
Ipad integration
It is difficult for students to keep track of the many files generated in computer simulations on many different platforms.
It is difficult to document the source of a parameter or observation in my datasets and models. For example, Stata now allows rather long labels but I would like an additional "Source of the data" field apart from the label for the variable.
It is hard to gauge compatibility of many existing repositories with future releases of Ubuntu OS and ROS version.
It is still very challenging to capture all the different forms of data we produce in the lab and the field.
It would be nice to be able to migrate between systems -- I currently use git, overleaf, dropbox, and svn, depending on the collaborator. It's not very transparent.
It would be useful to have better tools to manage and curate training data sets that are needed to go with software, but that are too large (~1 TB in size or more) for standard code management tools like git or mercurial to be useful.
Just time to learn new things
Knowing when and who changed what
LIGO has a proprietary data format supported by an extremely complicated and evolving reader available for a subset of data analysis environments. While powerful and comfortable for insiders, it limits future public access
Long term storage and archiving of very large datasets.
Long term sustainability - archiving material to live after the grant
Lots of gaps. Most tools are unrelated to each other so a lot of manual tracking.
Lots, but mostly we develop our own tools to fill those gaps. Tagging of right-to-left scripts was a recent example.
Main problem is getting students to follow procedures.
Maintaining version control. I need to be better about this.
Many. One critical one is record of the content of a link (e.g., from social media tweet, post)
Metadata tool
Metadata/tagging needs to be easier
More a cultural gap. Data are not often shared between labs.
More a gap in knowing how to use the tools...
Most of the "tools" listed on the first page do not exist in my current data collection, cleaning, and management environment.
Most research in the atmospheric sciences involves analysis of TB (or larger) datasets using custom-coded programs in languages ranging from FORTRAN and C to R and Python. Archival of such data sets and tracking of the software's evolution (over years or decades) has not been practical. The solutions I've seen advertised are for vastly smaller datasets and commercial-off-the-shelf (COTS) software. That's not relevant to the atmospheric sciences.
Most workflows are external and thus poorly captured, if at all
My university does not support workflows
Need a better option for permanent archival and curation of datasets and minting of DOIs.
Need a tool to help us collect data from multiple surveys and analyze the data
Need a way to easily and effectively navigate versions and cloned resources.
Need better ways to archive a variety of documents and resources in standardized and searchable forms for sharing. Need ways to archive collections of large data files for sharing.
Need digital notebook for lab members and students at reasonable cost
No easy way to publish & preserve python notebooks and data
No extensive metadata tool that is easy to use
No gap, because we are developing our own quality assessment tools in our data analysis pipelines.
No good tools for aggregating the data needed to build biomodels, organizing this data, using this data to build models, describing how models are built from data, or tracking the provenance of this data
No permanent archive that may be made available outside of the institution
No simple annotation file tracking
No standard databases used. No requirements from journals.
No. There are some challenging de-identification issues when we link our basic data sets to county-based data sets, but most of this requires judgment rather than automation.
No.
Not quite sure what you mean-but it would be helpful to have something that better helps us track data, enter data from multiple data collectors, see where we might be missing data etc-better than access or filemaker pro.
Not sure. Some of the things described can be done with standard version control software; intent of the others in the questionnaire was not clear from the brief description given.
Not sure. Currently, for researchers such as myself, I am not even aware of what tools are out there that could be useful.
Of course there are. It depends on which ecosystem or workflow, which project, which collaborators. You're oversimplifying with this question. Just building more tools isn't going to fix the myriad tool gaps! But there are hundreds if not thousands of researchers who desperately need open source cross-platform multi-user non-cloud-based content analysis software. It's a niche market with such terrible software options that some people actually use Office instead, which is really no better than using highlighters on printouts of transcripts, and quite possibly worse. Dedoose is the only tool that starts to meet the need, and it's simply terrible on a number of dimensions.
Often proper version control software, or its strict use.
Online help and maintenance templates
Only that we don't make good use of existing systems because folks who end up in social sciences are sometimes scared off by the fact that these are generally tailored for use by computer scientists, etc.
Organizing metadata
Our STEM student database. We have a disconnect between our Office of Institutional effectiveness data and the data we need to track our STEM students. We have a home-grown system to track students' activities and accomplishments, but it is limited in its capacity.
Our preservation of workflows and computer-generated results is very much ad hoc. We have difficulty associating specific research data with specific publications.
Our workflow occurs through continuous communication - so email, telephone, cloud storage, and video conferencing provide all we currently need.
Perhaps, but immediate problem is sorting through files that were not always in the same format.
Platforms that provide tools
Power and networking monitoring software
Preservation of workflow and generation of appropriate metadata are both problems.
Preservation and compatibility
Provenance and workflow capture/preservation tools are missing. No way to preserve large data sets in a re-useable way.
Provenance tools: right now, it's a set of versioned data files... not the greatest solution
Provenance tracking that includes software down to the last library
Provide metadata for others, store large quantities of data (Gb!)
Pushing paleomagnetic data to the cloud and searching the data internally
Reproducing other people's network experiments or understanding exactly what steps other researchers have done in their experiments.
Right now I'm doing a lot of simulation and find it a challenge to keep track of everything. I keep good notes and a spreadsheet of analyses and parameters but feel there must be a better way
Saving workflows...or giving a workflow to another person
See previous and I use Wiki for process documentation, but it can be difficult to access outside of the network at times.
Software to easily de-identify computed tomography images of patients acquired on multiple vendor CT scanners
Something that helps me keep up with recent related developments in the literature
Somewhat; tDAR has some tools but more would be welcome
Sort of, the accepted databases for paper/manuscript data on major publishers like Elsevier don't include GitHub or Zenodo. They exist, though, so it's not really a gap for me.
Standard way to create a data repository. Is this possible without professional/human support?
Standardization of software is lacking
Storage and annotating workflows.
Storage
Submission of DNA sequence data to GenBank
The biggest gap is not having an integrated task manager like Asana built into the OSF platform.
The entire workflow enterprise for phylogenomics is problematic for those of us who don't wish to invest huge sums of time in learning cryptic bioinformatics programs.
The gap is in merging data collected from multiple instruments to perform retrievals of meteorological parameters. Also the gap is in between processing collected data, cleaning it and plotting.
The gap, I think, is in my knowledge and experience. Better use of R, automation, github I think would address my needs. So, gap is in ease of learning and training. I know there are no shortcuts, but any tool to help learn those is great!
The issue for me is that we have an abundance of workflow and management tools, each project taking on different ones, making it difficult for an individual to manage between systems. E.G. one group loves Trello, another Google, and so on and so forth.
The main gap is due to the prevalence of high energy and nuclear physics specific file formats preventing adoption and reuse of data and algorithms. Translator tools could be very valuable.
The preservation piece is spotty at best
The software to compute and load data is proprietary and weakly maintained, an XML or public standard would be best.
The team has set up a protocol for data management.
There are good tools already around that we don't use, because I can't get enough of my colleagues to leave their "Email plus Word plus Excel" comfort zone. :(
There are many -- all the issues you asked about and more. I am in the social sciences which is in even worse shape than the geoscience and the biosciences.
There are many gaps - but I'm a tool-builder by trade, so that's not surprising.
There are many tool gaps in my research workflows. I've tried to fill these gaps by developing software, but the software is usually fragile and not very robust. I don't have research funding to sustain software development in the long run and so I haven't been able to keep all of the tools I have developed current. The tools I use include software for loading streaming sensor data into operational databases, data models for storing and managing the data, software for visualizing and performing quality control post-processing on the data, etc.
There are no readily available tools that can capture the metadata without manually curating the data
There is no simple, standard way to archive data in the way that NSF requires.
This may be out-of-scope, but it would be hugely useful to create some type of CV database tool that could export differing formats from the same set of information, so faculty would not have to constantly re-produce different CV's (i.e. 2-page, full, CV for a public website, CV for P&T)
Tool for organizing a data library (like mead library) but on multiple resources can be extremely useful
Tools could at least do a better job documenting what version of each library etc. is needed
Tools for analyzing qualitative data don't provide any support for preservation or data sharing.
Tools for brokering
Tools for easily sharing LARGE data-sets with data-sharing agreement requirements
Tools that link conceptual models from different tools.
Tools to automate meta data gathering as users/workflow do their work.
Tracking changes in data files (e.g. from data cleaning and preparation)
Transfer of hand written paper records to digital. Like it or not, hand written records are going to be a fact of life in both the lab and field.
Transferring data or files from our local computers to archival services is laborious and time consuming
Translation to standard data (electrophysiology, fluorescent images) files for open deposition
Understanding metadata and provenance completeness
User friendly bioinformatic tools.
User knowledge--I can build a database but I can't make people use it.
User-friendly database management information
Version numbers, or update notifications
Versioning of GIS datasets - when was something changed and by whom - across multi-institutional projects
Way too much manual effort in anonymizing across multiple data sources.
We are able to make backups of data on our campus, but we do not have checksums or point-in-time backups to ensure data have not become corrupt and become part of the backup data.
We are collecting data on game usage. Having a way to make that available to researchers at UVM in real time, without impacting our edu game servers would be helpful.
We are still lacking a tool to find and import georeferences in other databases for identical localities represented in our collection.
We could benefit from many of the suggested improvements to current system, all of which I checked as extremely useful.
We currently have to manually rework the data naming and organization to be more user friendly upon release. This is largely caused by our need to build custom data collection software for our experiments (autonomous robots).
We do not have effective workflow tools or curation tools
We have done well at coordinating between multiple tools and resources that allow a digital ecosystem to be organized and managed. Each has strengths and weaknesses, but also allows for flexibility as needs change. A single tool to do all this would be extremely useful, but not sure whether it is feasible in terms of the flexibility aspect.
We have found tools, but platforms change over time, and it all ends up being in an Excel file. So simple and stable would be better, but nothing specific now.
We haven't completely developed specific workflows yet for data distribution, but are in the process.
We implement various ad hoc strategies for tracking workflows and provenance
We need a way to database mass spec data in a searchable way. We also need a simple way to combine mass spec sample groups for simple comparisons.
We need better facilities for making software tools work together
We need to backup multiple versions of large datasets. Also it would be useful to identify the projects and workflow associated with the datasets.
We really need to move to an electronic notebook system; however, I have many concerns that need to be addressed before doing this.
We run huge numbers of experiments, some conceived of, started and ended within minutes. It would be great to somehow archive all of this (software and data), but if it slows down our workflow then that would be problematic.
We struggle to get our analysis pipeline working automatically
We use OSF, but some files (excel spreadsheets for example) won't open in that system.
We use nothing now.
What data? We're pretty well set for domain data, and human subjects are externally constrained
While there are certainly tool gaps, often a greater hurdle is linking all of the tools together into an actual workflow that can be successfully documented.
Workflow preservation and automation. Preservation of code build / linked libraries information, and preservation of files that were patched in a given build / calculation. Tools to improve code development would also be welcome
Workflow preservation and reuse
Would like to be able to preserve images and maps as well as other data, and make these searchable.
YES. WE ARE GATHERING A LOT OF DATA, BUT IT IS PURELY INTERNAL AND WHEN OUR FUNDING ENDS OR OUR INTEREST WANES THE DATA WILL BE LOST AND FORGOTTEN
Yes there are gaps in tools for that matter, but it is hard to describe them without going into too much specifics. We identify such gaps and try addressing them in our own work.
Yes, I generate high-throughput physical data that then needs to be transcribed into our current database.
Yes, but it is highly specialized based on our bioinformatics-drive research. Our workflows are constantly changing as new tools and approaches are developed (usually by others).
Yes, database updating.
Yes, few tools exist for provenance of data.
Yes, no standard tool for metadata management
Yes, of course. Anyone saving heterogeneous data has a gap, usually several. Again, more than a simple survey could cover.
Yes, the database is limited
Yes, there are significant differences among funding agencies as to the standards of digital preservation and sharing. This means that every dataset to be archived needs to be customized, along with metadata, often in an awkward (i.e., click-based web form) way.
Yes, tools for tracking data workflow and provenance are lacking.
Yes, tools to monitor, assign, record workflow for team members.
Yes, we don't have any real tools or processes for this at the moment, we are in the beginning stages of building a community data repository to start understanding the issues and providing basic capabilities.
Yes. For all the things I answered on previous question!
Yes. There are no community standards for documenting large data sets. HDF5 is adequate for storing data, but there should be standards for documenting what the data is, when it was acquired, what its units are, and so forth. There is also no way to guard against data theft (re-use without acknowledgement).
Yes. A database tool that integrates well with different platforms especially linux
Yes. Currently I only used manual reproducibility methods: scripts, notes, record of checksums, etc. Not aware of or tried any tools out there.
Yes. De-identification is not handled yet.
Yes. It is very hard to model 1000 of curves and keep the results secured, automated, comparable
Yes. It's completely ad hoc and varies even student to student (even as the group tries to come up with reasonable shared practices).
Yes: a workflow system for large, distributed-systems experiments
Yes; it is too hard to share our data because there are no standards; my lab is working on a standard framework for neuroscience
a better tool to search pdfs for content
a tool to update software; web-based computing
a) ability to make the data anonymous, b) linking person with file changes, c) easily accessing images and quantified data files, d) a way to make the data available online to invited folks
a better way to connect, preserve and indicate changes
archive data availability and distribution
archiving of software and OS used to create data
automated commenting tool for when code is updated
automated process that manages the change of custodianship / provenance, when data / software has become inactive.
better de-identification tools would be helpful
better document / todo list / annotation integration
better organisation of notes, preprints, data and program files
better segmenters for webpages to aid in analysis and indexing
better tools for creating wiki based documentation that can also produce document versions, and be linked to code versions
big data metadata archival tool
biggest gap is teaching more students about digital workflows
co-temporal collaboration of dozens of people in the same document in real time
collaborative databases
collection analysis
combining very very large sets of data with missing items by item numbers
concept / keyword nudging
connecting the data and code we develop in-house to the metadata and other standards of repositories. Always seems like a daunting slog to post data, thus easy to put it off.
cross-platform fixity
data lifetime management that integrates regeneration vs retrieval access methods
datasets currently do not have a good way of maintaining metadata
decision tracking
deidentifying quickly
document stored data and archiving analysis scripts across lab members
easily managing duplicate and superseded files
easy automatic backup on Windows + organization of data files
easy to use, freely available laboratory data analysis tool
embedded provenance collection and querying
field data that are periodically collected are added to a master data file. This file then has to be used to develop data sheets for the next round of data collection. This is a laborious process full of potential errors. This represents a gap in the workflow.
file format conversion to common
gaps: model provenance to model building; model execution and calibration; model analysis and relationship to data.
geological maps present a special challenge due to their inherent nature of being both data and interpretation. Currently we are in an electronic format but the general problems that have always existed remain.
git fills most gaps; the only problem we run into is students being afraid of branching/checking in/etc. because conflicts are hard to resolve
good reference management
have not implemented any version tracking for either software or map data
image processing
large enough data storage for large digital remote sensing files
lifecycle management
making data sets citable
making sure data from excel spreadsheets are entered into a structured, relational database
many; important ones are ways of entering and preserving metadata and data quality and uncertainty
my digital ecosystem is rather unorganized, with many parallel species running wild.
organize data to be easily accessible
paper data of various kinds (not all neat) to digital database
preserving provenance in Mathematica notebooks
preservation of old paper seismograms
provenance tool would be useful to keep track of who/when/how curated and changed data/metadata over the years (microscope data live a very long time).
see previous response. I think we need more standardized input of data at the community level; eg, Dryad (ie, more like Genbank)
seismology has multiple tools to do this. Mostly for data preservation and use, maybe not for workflows.
shared digital lab notebook?
sharing of essays with multiple drafts; use of Google docs helps but the drafts get lost
simulation configuration documentation, post-processing provenance
the gap is really about getting the tools to work in a reproducible containerized way with our data standard - which we are doing in the BIDS Apps project (http://bids-apps.neuroimaging.io/)
there are many academic tool gaps, given the existence of a very strong electronic design automation software industry. (Access to commercial tools and "design enablements" is not consistent across research groups.)
there may be. our data are stored by project, and at this point we have accumulated too much and need a way of organizing it at the lab level
tools for rapidly and easily identifying, labeling, and classifying files and documents associated with different experiments
tools that enable old code to still run without modification
tracking revisions to finite element software developed in research
unix pipelines and workflows need to be more accessible
updated FAQ or manual pages for all features put in by the users
version control tools
versioning software specifically for data rather than code
versioning, logging chain of used data+software combinations
we are looking into electronic lab book and collaboration software. Right now products are mostly biology focussed...
we have plenty of tools - what's lacking is user discipline and adoption
we really need a way to better facilitate metadata record creation, especially for software
we use Dropbox, so I definitely have a little fear that someone might delete or change something incorrectly.
yes we need a better platform for developing algorithms for use on sensitive data
yes, I need a system to help maintain and share resources
yes, microattribution (connected to multiple authoring of data)
yes-getting sequence data from the provider onto the mainframe takes too many steps
yes: I need anonymous Globus Online access to data at a supercomputing site

Researcher Behavior

4. How familiar are you with tools used to share, publish, cite and preserve data or software? (res)

5. Do you anticipate publishing or sharing your own data or code over the next five years? (res_pubshare_5)

6. In the past, how often have you made your research data free to access, reuse, repurpose, and redistribute? (res_open_data)

7. Is any of your data or code published or shared now on a repository or website? (res_pubshare)

8. In the past three years, have you or your research group made publicly accessible the following items through your or your institution's website or a third-party repository? (res_open_data3)

9. Do you need better tools to share, re-use, cite, publish or preserve your own or others' Data and/or Software? (res_tools_share)

10. When asked to submit keywords to describe your own or others' research is it usually: (res_keywords)

11. When asked to submit keywords to describe your own or others' research how accurately do the terms available usually describe your work? (res_keywords_avail)

12. Does your employer require you to make any of your publications or data openly available? (res_employ_open)

13. Do any of the organizations who fund your work require you to make any of your publications or data openly available? (res_fund_open)

14. Please name the software/tools you use most to share, publish or preserve your research (res_open_tool)

Sample of responses:

3i
AMBER
Academia.edu
Access
Adobe Acrobat
Adobe Photoshop/Raw
Agave
Alaska Arctic Geoecological Atlas (http://alaskaaga.gina.alaska.edu)
Apache
ArXiV
ArcGIS
Avizo
Axiell's KE EMu software
BCO-DMO
BEST
BamBam
Bitbucket
Blog
Box
Browser
CSDMS Model Repository
CUAHSI Hydrologic Information System
CVS/SVN/GIT
Cambridge Crystallographic Database
CurateND
Cureus
CyVerse Data Store
DLM (A LAKE MODEL)
DNASTAR to share DNA sequences
DOIs
DSpace
DataDryad
Databrary
Dataverse
DeDoose
DesignSafe
Digital commons
Dreamweaver
Drobo harddrive storage cluster
Dropbox
Dryad
ESRI
Earth System Grid
EarthChem
Endnote
Ensemble Computing Education Portal computingportal.org
Etsin
Excel
Fedora
Fedora/Samvera
FigShare
Filemaker Pro
Forge
GEO Omnibus
GEO subroutines
GenBank
Gene Expression Omnibus
Git
GitHub
GitLab
Google
Google Drive
Google Scholar
HTML
Hard drives and lab notebooks
Hub0
Hydroshare from CUAHSI
ICPSR site
IDEALS website with the help of the sediment experimentalists network
IDL
IRIS DMC
Ideally, SIUE SPARK http://spark.siue.edu/ but they are not yet well set up to accept data
In-house TaxonWorks
InDesign
Institution's website
Integrated Data Viewer
Joomla CMS - DocMan
Journals
LTER website
LaTeX
MATLAB
MDSplus
MS Word
Marine Geoscience Data System (MGDS)
Materials Project software and repository (DOE funded)
Mathematica
Matlab
Mendeley
MetabolomicsWorkbench.org
Miami University Scholarly Commons
Microsoft Excel
Microsoft Office
Microsoft SQL server
Microsoft Word
Microsoft access
Microstructural model
MongoDB
Morpho
Mozy in sync and other backup software
MySQL
NASA DAAC
NCAR research data archive
NCBI
NCBI / Genbank
NCBI database
NOAA NGDC database
NODC
NVivo
Neotoma Paleoecology Database www.neotomadb.org
Olex2/Shelx
OneDataShare
Open Context
Open Science Framework
Open access publishers
OpenDAP
OpenICPSR
OpenQBMM
Overleaf
PASTA
PDB files
Paleobiology Database
Pass 8 Fermi pipeline
Peer reviewed publications with extensive supporting information
Personal WWW page
Platform for Experimental Collaborative Ethnography
Protein Data Bank
PubMed
R
RAID software drivers
ROOT
RStudio
Research Gate
SAS
SPSS
STATA
Sakai-based learning management system
Scholarworks
SeaBASS
Sequin (GenBank)
ShareLatex
Skills Commons
Socaster
SourceForge
SourceTree
Specify Software (for museum data management)
Stata
Studiocode
Subversion
Symbiota
Time Machines (Mac OSX)
TopSpin
anecdata
apache
arXiv
authorea
bitbucket
box.com secure storage to share
browser
chemDraw
comses.net
cran.r-project.org (published code)
crawdad
design-safe
digital scholarship
drupal
dryad
einstein toolkit
email
endnote
erdapp
ftp
griidc
idigbio
igv
lab archives
lab website
matLab
nanohub.org
ncbi
netcdf
nextcloud
no "tools"--saved files, text processing.
oceans2.0 (ocean networks canada)
online journals
online publishers/archives
openfmri.org
pgadmin III
python
qualtrics
sequin - genbank
sfst
sftp
solidwork
sourceforge
spss
subversion
svn
sync programs such as Syncback Pro
tDAR (http://tdar.org)
university institutional repository
wiki
xBio:D
xppaut
zotero

15. In your estimation, which of the following currently have the infrastructure required to provide long-term public access to your research data? (res_pub_infrastruct)

16. Do you actively create and manage metadata to make sharing, finding, or documenting provenance of your own or others' data or code easier? (res_metadata)

17. How important to your work are Web-based applications that provide access to the following specialized resources? (res_tools)

18. Do you create and/or use software, code, or scripts in your research? (res_code)

19. Have you ever authored software, code or scripts to analyze or produce your data? (res_author_sw)

20. Have you ever hired or supervised someone to author software code or scripts to analyze or produce your data? (res_hire_swdev)

21. For others to reproduce your results would they need your software, code or scripts? (res_reproduce)

22. Do you use or revise commercial and/or freeware software, scripts, or tools to analyze or produce your data more often than writing new code or hiring someone to write custom software to analyze or produce data for your project(s)? (res_external_sw)

23. Please name some of the software/tools you use most to analyze or produce your data (res_external_sw_name)

ABAQUS
ACER Conquest
ADAM
AMBER
AMDIS
ANSYS
ANSYS FLUENT
ASE
Abaqus
Access
Adobe Photoshop
Aladin
Anaconda (Python 3.x)
Antelope
Anvil
ArcGIS
Astronomical Image Processing Software
Astropy
Atlas
Autosignal
Aviso
Axiell's Ke EMu
Axon PClamp
BASE-9
BBSketch
BLAST
BinBase
BisQue
Bowtie
Breseq
C/C++
CESM
CGAL
CHARMM
CIAO
CODAP
COMSOL
CellProfiler
Chemstation
CholMod
Cipres Science Gateway
Clojure
CloudCompare
Cobra
Common Linux utilities: xml_grep, sed, grep, awk, Python...
Community Earth System Model
Corinna
Custom software
DAVID
DFT Codes
Datavyu
DeDoose
Dipy
Domo
Drupal
EEGLab
ENVI
ESRI
EasySpin
Eclipse
EinsteinToolkit
Enzo, enzo-project.org
Excel
FEWZ
FLAC (a finite difference program for Fast Lagrangian Analysis of Continua)
Fast Holographic Deconvolution
Fermi Pass 8 pipeline
Filemaker Pro
Firebase
Fortran
GAFQMC
GAMESS, Molpro, Cfour
GAMS
GISMOTools
GMT
GNPS
GSAS
GXSM
Galaxy
Gaussian
Gaussian 16
GenAlEx
GeneSpring
Geneious
Gephi
GitHub
Gnu software, lots of different packages
Goldvarb
Google
Google Earth Engine
Google Sheets
Google spreadsheet
GrADS
GraphPad/ Segmentation based softwares
Grapher
Graphpad Prism
HDExaminer
HEAP
Hadoop
Home made Python and C++ code
HydroShare
IDL
IGOR
INHOUSE CODES
IRAF
IRAF (Astronomy data analysis SW)
IntelliJ IDEA
IgPet
Igor
Igor Pro
IgorPro
Image J
ImageJ
ImageJ/FIJI
ImageQuant
In house custom
InterBase XE
Interactive Data Language (IDL)
J.A. Woollam Company WVASE32
JMP
JMP (SAS)
JMP, sklearn, Matlab, CPLEX, ...
Jasp
Java Ocean Atlas
Javascript
Jmp
Julia
Jupyter python
Kaleidagraph
Kaliedagraph
Kinetasyst
Kinetic Studio
LAMMPS
LPILE
Lab View
LabView
Lammps
Landlab
Lapack
LaTeX
Lumerical
MAGMA
MANOA
MARS
MATLAB
MDSplus
MNova
MOOSE
MOTHUR
MRBAYES
MS Access
MS Excel
MS Office
Macaulay2
Magma
Make
MatLab
Matlab and its toolboxes
Materials Studio
Mathematica
Matlab
Matlab and Octave
Matplotlib
MestReNova
Mestranova
Microsoft Excel
Microsoft Excell
Microsoft Office
Minitab
Molcas
Mostly R libraries and custom scripts
Mplus
MySQL
N/A
NASA's Global Climate Models
NCAR command language
NCL
NCO and CDO operators
NEMO5
NLTK
NVivo
Naff for orbital frequency analysis
NetLogo
NetworkX
Nikon Elements
NovoExpress (ACEA)
Numpy
Nvivo
OPUS
ORCA
ORIGIN
OTO: ontology term organizer
Octave
Open Heat Map
OpenFOAM
OpenMD (our group code)
OpenRefine
Operational and research models
Origin
Oxcal (Oxford Radiocarbon Lab)
PASTA
PAUP
PAUP*
PC-Ord
PERTURBO
PHP and MySQL
PINY_MD
PORTA
PRIMER+
PYTHON
Paleobiology Database
Paleomag / and PmagPy
Panorama by Provue
ParaView
Paup*
Pgplot
Phenix, Coot, etc
Photoshop
Phyluce
Plexon Omniplex Software
Praat
PyMOL
PyMOL and ProMOL
PySB
Python, pandas, numpy, etc
Pymol
Pythin
Python
Python (Numpy, Scipy, Matplotlib, Pandas)
Python SciPy Stack
Python and Python open source Python libraries
Python and its libraries
Python packages such as NLTK, networkx
Python, C++ compilers
Python, Jupyter
Python-based code: scipy and specialized astronomical packages
Python/Numpy/Scipy
Python/SciKit
Python/anaconda
Q-Chem
QIIME
Qiime
QtiPlot
Qual software tools
Qualtrics
R
R (esp. packages psych and lavaan)
R (many packages)
R (statistics)
R AIC modeling
R Studio
R and R packages
R libraries
R packages
R statistical software
R statistical software packages
R statistics
R stats
R+ packages
R-based packages for phylogenetics (ape)
R/R Studio
RAMS (https://en.wikipedia.org/wiki/Regional_Atmospheric_Modeling_System)
ROMS
ROOT
ROOT data analysis software (root.cern.ch)
RapidMiner
RoboMongo
Root
Root/Roofit/PyRoot
Rstudio
Rstudio (esp. Bioconductor packages)
SAMtools
SAOimage/ds9
SAS
SCIRun
SPREAD SHEETS
SPSS
SPSS/AMOS
SPike 2
SQL Server
STATA
Sage
SageMath
Samtools
SciPy
SciStat
Scipy software stack
Seismic Analysis Tool
Sequel Pro
SigmaPlot
SigmaPlot / Sigma Stat
Sigmaplot
Simca
Software that I write
SolarSoft/IDL
Solo3
Spss
Stanford NER system
Star-CCM+
Stata
Statistica
StickWRLD
Studiocode
SurveyMonkey
Symbiota
Systat
Systematic Analysis of Language Transcripts
TEI markup
TEOS-10
TURBOVEG
Tableau
Tabulator
The R Project for Statistical Computing
The Unscrambler
The Weather Research and Forecasting (WRF) Model
This is too broad a question
Tilia
Tools at nanoinfo.org
TopSpin
TreeCorr
Trilinos
TurningPoint
Tuxedo suite
UCSC genome browser
VASP
VHDL
VMD
Varian/Agilent VNMRJ
Vasp
Vicon Blade
VisIT
VisIt (LLNL)
Vortex
WEKA tools
WRF-CHEM
Weka
Weka / Tensorflow
WinUV
Xcode
Zoho
a variety of programs in R
access
ambertools
anafora
arcGIS
arduino
atlas.ti
bash
bash scripts
bioformat
biopython
blast
bowtie
cern root, other cern libraries
command line
compilers & performance libraries, e.g., gfortran, gcc, mpi
comsol
custom data reduction code
custom python scripts
daisy lab
dcfldd
dedoose
don't understand
dplyr
elastic search and other indexers
electronic structure codes
excel
excel/spreadsheets
gcc compiler
github
gnuplot
homemade
iPython / Jupyter
iPython notebooks
idel/python
idl
igor
igor pro
imageJ
in-house developed software
in-house, custom
ipsumdump
iraf
iraf/pyraf
java
javascript
kaleidaGraph
keysight ICCAP
labview
latex
mathematica
matlab
matplotlib
matplotlib/python
mestrenova
microsoft excel
migrate-n
microsoft
mothur
my own code for solving kinetic equations of gas dynamics
my own software
ncl
ncss
netlogo
nipype
numerical packages
numpy, scipy, scikit-learn
nwchem
octave
origin
originpro
orion (radiation-magnetohydrodynamics code developed by a consortium)
our own Python code
ovito
own programs
paraview
perl
powerworld
program R
proprietary software
proprietary
pySPACE
python
python (numpy, scipy, matplotlib)
python and libraries
python custom codes
python notebooks
python scripts
python/numpy
python/numpy/matplotlib
python/scipy
r
r programming environment
real essi simulator
root
sage
sas
scikit-learn
sequence alignment programs
sigmaplot
snakemake
spreadsheet - Excel
sps
spss
stacks
standard tools like Excel, SigmaPlot, SAS
stata
suitesparse.com
survey monkey
tcptrace
too many to list
toolboxes in python
tools I have created
trackpy
tuxedo suite
various C/C++ libraries
various companies that produce sensors
various phylogenetic inference tools
velvet
visit
visual basic
vtk
website building tools
xBio:D
xmgrace
yt
zeiss axiovision

Developer Behavior

24. The next questions are for those who develop, administer or maintain software and/or systems used to share, publish or preserve data or software. Are these sorts of tasks your responsibility? (dev)

25. Do you collaboratively develop and/or publicly share code in a version control repository like GitHub or Bitbucket? (dev_code_verCon)

26. Do you develop, administer, maintain or support any (select any that apply): (dev_sw_dams)

27. How long do you expect people to use the software you develop, administer, maintain or support? (dev_sw_eol)

28. For the typical software you develop, administer, maintain or support, how many users are there: (dev_user_count)

29. How do you typically license the software you develop? Select all that apply: (dev_license)

Some of the licenses mentioned in Other:

BSD (includes BSD 3-clause, and BSD3)
Apache
Not licensed
n/a