Codata workshop


SESSION 3: DATA SHARING AND SPECIFIC DATA ISSUES








  • S&T data and information preservation, access and sharing successes and challenges

  • Development of a list of core datasets

  • Training and development of good practices


Chair: Dr Daisy Selematsela, National Research Foundation

Panellist: Dr Anthony Cooper (CSIR)




Some thoughts on data preservation, access and sharing




Standards

There are numerous kinds of standards, some of which are complex and others simple. Certain standards are used without people even thinking about them.



Research standards

There are discipline-specific standards for research and data, which are well defined in some disciplines but less so in others. These include methodologies and best practice, which do not have to be formally recognised as official standards. They could include formal descriptions of models, which make it easier to reproduce what others have done, and could take the form of notated mathematics. Since most models are now implemented on computers, they use specification languages for codification, e.g. the Unified Modeling Language (UML). Benchmarks and standard references are other forms of standards, including the compilation, evaluation and dissemination of reliable numerical data.


The quality of the research and data is another issue, and there are two aspects: (1) achieving quality (against norms) and (2) recording, disseminating and interpreting quality on the basis of the way in which it was documented.

Data standards

Data standards involve data formats and coded character sets (e.g. ASCII, Unicode, ISO 10646). Data organisation includes the eXtensible Markup Language (XML), databases, etc. Metadata is an important component; it can be described as ‘data about data’, or documentation describing the content of a data set.


Finally, there are standards for classification, data dictionaries and catalogues.
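As a simple illustration of the ‘data about data’ idea, the sketch below serialises a few descriptive fields as a flat XML record. The element names follow the Dublin Core style and all field values are invented for the example; real metadata standards prescribe their own schemas.

```python
import xml.etree.ElementTree as ET

# Illustrative sketch only: a minimal metadata record serialised as XML,
# using Dublin Core-style element names. The data set described is invented.
def build_metadata_record(fields):
    """Serialise a dict of descriptive fields as a flat XML metadata record."""
    record = ET.Element("metadata")
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return ET.tostring(record, encoding="unicode")

xml_record = build_metadata_record({
    "title": "Rainfall observations, Limpopo catchment",  # hypothetical data set
    "creator": "Example Research Group",
    "date": "2007-06-01",
    "format": "text/csv",
})
print(xml_record)
```

Even a record this small would let a catalogue answer the basic discovery questions: what the data set is, who produced it, when, and in what format.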

Data management

Data management is concerned primarily with archiving, which includes:



  • Preservation of original manuscripts

  • Preservation of oral histories of researchers, explorers, communities, etc

  • Preservation of artefacts

  • Preservation of digital data sets, which raises critical problems, including the persistence of media (CD-ROMs do not last indefinitely, contrary to general belief; the older type did, as lasers punched holes in the disc, but DVDs and rewritable CDs may last only 3–4 months, as lasers change the polarity of fields); the transience of hardware and of the knowledge to read old media; the reliability and integrity of the archive; security; and preventing contamination and destruction (which could be deliberate)

  • Intellectual property rights (IPR)

  • Appropriate norms (e.g. creative commons)

  • Documenting and protecting intellectual property rights

  • Embedding digital rights management in data sets.



Accessibility

The most important initial issue is awareness of what is available and where; for this, catalogues, indexes and portals are useful. It is important to prevent duplication of data capture; for this, cooperation is necessary. On the technical side, one needs interoperability (for integrating data sets), with which two considerations are associated: (1) adherence to common standards (there are open standards as well as those developed by proprietary organisations, which may be expensive to join but, because of their resources, are often able to develop standards more quickly) and (2) the availability of metadata. Unless the data can be interpreted, its fitness for use cannot be determined. For spatial data, the spatial referencing is critical, as there are differing spatial units of analysis (e.g. water-related data may be recorded according to catchment area, while soil data is captured for different units). With spatial data, people assume that a coordinate is accurate, but even GPS will not give accurate readings if the equipment is not properly configured or is influenced by atmospheric conditions (e.g. a stationary GPS can give a reading differing by 100 m overnight).
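The overnight drift of around 100 m mentioned above can be made concrete with the haversine great-circle formula, which turns two recorded fixes into a distance on the ground. This is only an illustrative sketch under assumed conditions: the coordinates are invented, and real GPS quality control would use proper geodetic software rather than a spherical-earth approximation.

```python
import math

# Sketch: quantifying the apparent drift between two fixes from a stationary
# GPS receiver using the haversine great-circle distance on a spherical earth.
# The coordinates are hypothetical; a latitude difference of about 0.0009
# degrees corresponds to roughly 100 m on the ground.
def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))

# Two overnight fixes from the same (hypothetical) stationary receiver:
drift = haversine_m(-25.7461, 28.1881, -25.7470, 28.1881)
print(f"apparent drift: {drift:.0f} m")
```

A calculation of this kind makes it easy to flag fixes whose apparent movement exceeds what a stationary instrument should report.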


With respect to cross-disciplinary data, it is necessary to be aware of data from other disciplines that can be used.
The cost and availability of access to data represent the digital divide. The costs include those of broadband Internet access to repositories, DVDs/CD-ROMs, etc. Then there is the cost and availability of tools for processing data, including open source tools, high-performance computing platforms, and grid and cluster computing.
In South Africa, the Promotion of Access to Information Act (Act No. 2 of 2000) has made much South African public data available at nominal cost. Copyright may be used to deny access to data. One of the problems with copyright relates to concern over original data being altered (e.g. digitising the 1:50 000 national mapping series) but branded as original. When users pick up errors, they complain to the holder of the original data, not to the commercial interests that altered the data and made it available. Ideally, the metadata should convey any changes to the data.

Development of a list of core data sets

This is being done for geospatial data, for instance:



  • Mapping Africa for Africa (MAFA), which is an initiative of the International Cartographic Association (ICA). At the ICA Congress in Durban in August 2003, the Durban Statement on Mapping Africa for Africa was made. It deals with issues of gathering the fundamental data sets for mapping Africa and gathering the necessary resources.

In pursuance of this goal the South African Chief Director: Surveys and Mapping launched a project. One of the reports produced was ‘Determination of the fundamental geospatial datasets for Africa through a user needs analysis – a synthesis report’. It was launched by the Committee for Development Information (CODI) Subcommittee on Geo-information (Geo) of the United Nations Economic Commission for Africa (UN ECA) and will soon be available at www.eis-africa.org/EIS-Africa/publications/gfd/gfd?sn=10. The project was driven from South Africa with partners across Africa. It was a difficult process because of the lack of metadata.


Fundamental geospatial data themes include:

  • The geodetic control network

  • Remotely sensed imagery (e.g. aerial photography and satellite imagery)

  • Hypsography (e.g. contours, digital elevation models (DEM), spot heights, etc.)

  • Hydrography (e.g. rivers, streams, water bodies, etc.)

  • Administrative boundaries (e.g. international, provincial, district, etc.)

  • Geographic names (i.e. gazetteer)

  • Land management units/areas

  • Transportation

  • Utilities and services

  • The natural environment.



Panellists: Dr Martie van Deventer (CSIR) & Dr Heila Pienaar (University of Pretoria)




Investigating the need for a Virtual Research Environment (VRE) for Malaria research in SA (interim findings)




Virtual research environments

Virtual research environments (VREs) comprise digital infrastructure and services that enable research to take place within the virtual multidisciplinary and multi-organisation partnership context. The VRE concept helps to broaden the popular definition of e-science from grid-based distributed computing for scientists with huge amounts of data to the development of online tools, content and middleware within a coherent framework for all disciplines and all types of research (Fraser 2005).


The specific aim of a VRE is to help researchers manage the increasingly complex range of tasks involved in carrying out research. Therefore a VRE provides a framework of resources to support the underlying processes of research on both small and large scales, particularly for those disciplines which are not well catered for by current (UK) infrastructure (JISC 2006).
VRE is much wider than data generation, management and/or curation. Our interest was VRE enablement, not becoming experts in biotechnology or even malaria.
The VRE environment functions in the wider context of the evolution of research. There are remnants of each of the stages of the evolution of research (the experimental science stage, the theoretical science stage, the computational stage), and it is necessary to take cognisance of these in working in a VRE. The data management cycle also has to be taken into account.
Malaria was identified as the area of investigation as it was considered to be a field that would meet the primary objectives:

  • To investigate how the VRE could improve researcher efficiency

  • To create a conceptual model of the entire research process in a South African context that could be shared with management

  • To surface the VRE needs and constraints across the organisational boundaries between the CSIR and the University of Pretoria (UP), linked to a specific project

  • To identify the conceptual requirements for developing a pilot VRE for the CSIR/University of Pretoria project.

There is already a relationship in place between the two organisations to manage such initiatives. The team approached the Southern Education and Research Alliance (SERA) relationship managers and received a positive response. The matter was discussed with executives and research managers responsible for research at both institutions, who identified the African Centre for Gene Technologies (ACGT) as having suitable cross-institution projects and suggested malaria research as a suitable topic. The ACGT manager confirmed malaria research as suitable, given that the South African malaria research effort is consortium based through the South African Malaria Initiative (SAMI) and wider than just the CSIR and the University of Pretoria.


The research was designed as a qualitative research project. Candidates for interviews were selected from a set list of SAMI members. Interviews were based on the naïve interview technique, in which the interviewer deliberately does not get to know the subject of research beforehand. Interviewees were asked to describe a typical day in the life of a researcher. This approach was based on research done for the Integrative Biology VRE hosted by Oxford University and followed a similar process. The investigators tried to identify all the tools, data sets and information required for each aspect of the research process. The presentation shares only the findings on the data generation aspect of the scientific workflow.

Interim findings: data




  • Much data is generated, and each analysis machine generates data sets (often in electronic format).

  • There appears to be very little integration within the discipline and virtually no integration across disciplines and even projects. Standards have been developed and metadata are available, but researchers generally do not apply them.

  • Many e-components already exist, but each is treated as a separate entity, which represents a silo rather than an integrated environment.

  • The fact that some infrastructure exists does not mean that the facility is widely known or that it is used to full potential.

  • Data management/curation appears to be done, but procedures are not transparent, nor are they consistent across the discipline.

  • The paper-based lab book is still the norm. The greatest improvement in efficiency (‘killer application’) might be related to improvements in practice related to the lab book.

  • With respect to long-term curation, there appears to be little evidence of standardisation in terms of metadata and annotation because the paper lab book and electronic files are not integrated.

  • Researchers need easy retrieval and management delivered via a simple platform.

  • Researchers acknowledge that there may be a problem and are willing to help solve the problem as long as it does not detract from their research.



Interim recommendations




  • Implementation of an electronic lab book system that is integrated with the current data curation files.

  • IP constraints do exist but should be managed rather than treated as a stumbling block.

  • Sharing data on a shared platform is key to progress.



Next steps

The next steps will entail confirming our findings with SERA management, developing the conceptual model for a malaria VRE (including tools and resources), addressing all steps in the research cycle, investigating the linkages with the National Biodiversity Network and the functional genome information system under development in the University of Pretoria’s Department of Bio-Informatics, and creating a focus group to validate the conceptual model.



Comments on presentations





  • Comment: A dedicated server with large capacity would be required to support a virtual research environment, given the need to upload the lab book.

  • Comment: Sophisticated instruments are available for digital manipulation, but researchers sometimes paste printed digital photographs and statistical analyses into their lab books. Lab books are sometimes even rewritten to make them neat, which is not allowed. If the lab book were integrated with the data, it would provide annotation in the metadata environment for understanding the data in subsequent decades.

  • Comment: The duplication referred to could be intentional as part of the research process in the interests of repeatability and not because of a lack of knowledge of what others are doing. In the university environment, duplication would be encouraged in developing and teaching methodologies, whereas the CSIR is more geared to solving problems and avoids duplication.

  • Comment: With respect to connecting research teams together, a possibility to consider would be to identify similar research projects and labs in Africa and connect them by means of a wiki on which documents and research findings could be uploaded. In the field of malaria, for instance, a wiki page could be created on the Internet and researchers and funders in the field invited to participate.

  • Comment: Malaria research involves health, geospatial and ecological elements.

  • Comment: With respect to the proposals on lab books, there will be a dividing line between what is private, what is preliminary, what is already published, and intellectual property that the researcher does not yet wish to reveal. Publication takes place when something is put into the public domain for general use after passing certain controls. It would be a matter of concern to practising scientists if all lab notebooks were to be put into an electronic repository to which others have access. However, once published, researchers would like to have the fullest access to what has been reliably tested through peer review.

  • Response: Dr van Deventer agreed with these sentiments and clarified that she had not been suggesting that the whole lab notebook should become available, although consideration could be given to making subsets available.



Panellist: Avinash Chuntharpursat (SAEON)




Data management in a collaborative GIS




What is SAEON?




  • The South African Environmental Observation Network (SAEON) is a research facility that establishes and maintains nodes (environmental observatories, field stations or sites) linked by an information management network to serve as research and education platforms for long-term studies of ecosystems that will provide for incremental advances in our understanding of ecosystems and our ability to detect, predict and react to environmental change.

  • SAEON has international linkages with the International Long-term Ecological Research Network (ILTER) with respect to planning, training, projects, fundraising, data & information sharing, as well as with the Environmental Long-Term Observatories of Southern Africa (ELTOSA)



Information management of SAEON

The needs are being met by a collaborative GIS network (CoGIS) for SAEON, the Department of Minerals and Energy and the CSIR (which acted as project manager). The system, which comprises a 2–4 CPU server and a 7-terabyte storage area network (SAN), is being hosted at the Innovation Hub, and features are included to overcome the problems of low bandwidth in South Africa. The consortium is looking into involving the Development Bank of Southern Africa (which has a large virtual private network covering a number of municipalities), with a view to increased bandwidth, as well as the Agricultural Research Council and the National Department of Agriculture.



Organisational infrastructure and costs

In developing CoGIS, the three partners (SAEON, CSIR and the Department of Minerals and Energy) have spent R1 050 000 (US$155 000). Hosting costs will entail a once-off payment of R100 000 (US$15 000) and a monthly payment of R50 000 (US$7 500), divided among the three partner organisations.



Comments on presentation





  • Comment: Recent institutional initiatives show good progress and opportunity to network. Initiatives include setting up a portal for remote sensing data.

  • Comment: The Global Earth Observation System of Systems (GEOSS) has a data-sharing policy. International CODATA has an action item for drafting amended data policy guidelines for GEOSS. GEOSS is trying to promote access free of charge for non-profit and educational institutions.

  • Comment: The South African Earth Observation System (SAEOS) has undertaken negotiations to make geospatial data from all South African government departments available to research and education institutions free of charge, and to others at a greatly reduced price, and is looking into similar arrangements with other data holders.



Questions and discussion

The following issues were raised in discussion:



  • All the initiatives have the potential for follow-up by CODATA in terms of establishing a data inventory for the region and promoting policy and practice for data management, stewardship, access and utilisation. They all have socially important applications and fall within the thematic areas of health and environment for development. These are examples of initiatives with the potential to attract funding and are consistent not only with the objectives of the CODATA task group but also with those of other organisations represented at the present workshop. It may be advisable to build on initiatives that have already seen substantial work, and there could be benefits in joining forces. Demonstrating near-term results is a means of generating support.

  • With respect to training, in the past schools of library and information science taught the skills and techniques of bibliographic control. In more recent years, students were taught cybernetics. Dr Cooper’s presentation identified the skills and techniques needed for maintaining data standards today, as well as the unique characteristics of certain discipline-specific skills and how to deal with issues of accessibility. One way forward is to consider what kinds of skills are immediately required to facilitate data archiving and curation, which may not necessarily be taught in universities.

  • One of the principal products of NISC (the National Inquiry Services Centre), based in Grahamstown, apart from African Journals Online, is bibliographic databases with both a longitudinal and a current trajectory. In compiling such databases, wisdom and insight are needed to identify scholarly papers from the past that may still be relevant.

  • Training is not just an Africa-specific problem, but a more general issue. Training in data management has been neglected everywhere and is perhaps an incipient area of education. The different approaches that disciplines take to data are a neglected but important area of research. Only in the last 20 years has research become data intensive. The university system has not caught up with the reality of the current and emerging education needs in information management. Many data managers and software developers are people who have changed professions and were not formally trained in data management. Bioinformatics practitioners, for example, have not been specifically trained but stepped into the niche in an ad hoc, bottom-up way. There is a huge need for more systematic education and training, at both graduate and postgraduate levels. Greater value should be placed on the importance of this work for research, for which it provides infrastructure, efficiency and support. This is a pedagogical, infrastructural issue. Approaches could be promoted at an institutional level. The possibility of online training through a data academy was one of the recommendations of the 2005 workshop, and consideration is being given to implementing this through UN GAID.

  • A hot topic for developing countries is the links between the data holder and data user, not only through online practice but also through sharing user experience. CODATA communities are sharing their experiences, which is another type of learning. Consideration could be given to organising a national CODATA committee meeting in prospective African member countries as a way of encouraging membership.

  • Online training has already been identified as an action. The means of identifying people to conduct the training and to participate are needed in order to translate the vision into action.

  • Universities and students represent supply and demand. If the demand for data management graduates and the possibilities of competitive remuneration for such positions can be demonstrated, universities will engage in data management training. Related courses already on offer could be identified and consideration could be given to ways in which lecturers could be persuaded to adapt these.

  • All SADC countries are ICSU members. ICSU South Africa could play a role in promoting CODATA membership, but an enthusiastic contact person is needed in each country to form a nucleus and mobilise this. The benefits of CODATA membership have to be made clear.



What can be done in the interim?





  • It is necessary to ensure that standards-generating bodies such as the South African Qualifications Authority (SAQA) develop the correct unit standards for data.

  • There needs to be cooperation among South African universities around data management.

  • Education is needed at two levels. At one level, it is necessary to educate data management professionals who will manage a data centre or archive and provide a service to the research community; and at another level, short courses are required to teach guidelines for data management best practice and principles for active researchers (including ethics, law, retrieval, access and management of data findings).

  • Another issue is the reward system associated with data work in higher education. Even though it has become an important aspect of research, it still tends to be undervalued, especially in evaluating people for promotion to more senior positions. Data management could arguably be regarded as being as important a part of the research process as research publication. This situation prevails across the world.

  • Could the compilation of quality and unique data sets that are used by many be an important indicator in assessing applications for research grants?

  • Setting such a criterion for evaluation could start with the normal publishing process, through the growing requirement of editors that the data sets on which a published paper is based be housed in an off-page repository to which reference is made in the paper. When referees review the paper, they should have access to the data set on which it is based. This would start to bring data sets into the peer-reviewed domain as a by-product of the process.

  • The background South African reality is the battle within education over the standards for degrees at universities, including postgraduate degrees and doctorates. Efforts have been made to promote information literacy as one of the requirements for a higher degree, in which the candidate would have to satisfy the examiner that he/she was proficient. The draft Higher Education Framework goes beyond the laissez-faire approach of allowing the individual university to determine the requirements for a PhD degree. The CHE HEQC establishes best practice guidelines for institutions to form the basis for testing quality in a quality assurance audit. It should be included in those guidelines that the award of a postgraduate degree depends on demonstrated competence by the candidate in listed areas that include data management, curation and integrity. This would be a possible approach to building these skills across the board.

  • Johan Mouton at the University of Stellenbosch runs a very successful postgraduate programme in information management, offering seven well-subscribed modular master’s degree programmes, with enrolment from a range of different disciplinary backgrounds.

  • Publishing in journals has a strong normative culture with respect to what the author may and may not do. There should be the same culture of norms and standards for data management, which would form the basis of an education and training programme.



