Why do we need to connect the data published on the Web?

In the first article of this series, we addressed the problem of data supply, which has been growing exponentially in the digital economy but with very low quality and reusability. As already explored, these data are predominantly in unstructured form, which limits their description and reuse by other applications and people. In addition, because of the poor quality of the available data, the reuse process has been expensive [1].

In this direction, new approaches to data have been developed over the years, and the current aim is to establish a concept of data that can be widely used, without restrictions on use or application, so that the knowledge production cycle becomes richer and better [2]. The concept of open data was established in this context: data that can be freely used, reused and redistributed by anyone, subject at most to the requirements of attribution and sharing under the same license.

Open data enables people and organizations to freely use public information to build applications, perform analyses, or even create marketable products. For a data set to be considered open, it must allow citizens to easily access, use and redistribute it without restrictions. In addition, the data needs to be easy to find in an indexed place, readable by machines, and free of legal restrictions [4].

At the governmental level, three laws have been established to define the conditions under which government data can be considered open [5]:
– If the data cannot be found and indexed on the Web, it does not exist;
– If it is not open and available in machine-readable form, it cannot be reused; and
– If any legal provision prevents its replication, it is not useful.

In addition, the Association for Computing Machinery has issued a recommendation for government data, which states that:

“The data published by the government should be in formats and approaches that promote the analysis and reuse of such data.” [6].

In this way, the concept of open government data emerged as a strong reference for the publication of data on the web, creating new channels of communication between governments and their citizens. Countless web portals and catalogs have been developed at the continental level, such as the European Union portal (which collects catalogs from 29 countries); at the national level, such as those of the United States, the United Kingdom and Brazil; and at the local level, such as that of the State of Alagoas, offering thousands of data sets online. These initiatives have also been strongly promoted globally, for example through the Open Government Partnership [7], which brings together some 65 countries (including Brazil) around building governments that are more transparent and participatory and that engage society in co-creation and collaboration on solutions of public interest.

Thus, the volume of data and information produced, together with the current decentralization of the structures that produce it, imposes ever greater challenges, since decision-making needs to be supported by integrated information, usually obtained by cross-referencing several databases. In this context, the current supply of data scattered across the web represents a great inconvenience for data consumers, since the data must first be obtained and stored locally before it can be used to produce relevant information [8].

It should also be noted that even when public sector information is available in an open format, it may be published in a chaotic way. In addition, the same information can be found on different websites without any connection between those sources that would indicate, for example, which one presents the most up-to-date information. Given this situation, in order to trust the available data, users try to analyze its origin, giving preference to data that comes from reliable sources. These reliable data, however, are naturally spread across distributed sources, and it is not uncommon for hyperlinks to related information to be missing, whether that information is stored in the same data repository or elsewhere [9].

The present challenge is to provide effective means of accessing data from distributed sources and to establish mechanisms through which they can be connected and integrated [8]. Another challenge lies in the limited capacity of human beings to process and connect the current supply of data and information, given that the internet makes the wealth of human knowledge available to anyone, anywhere. A further challenge is how to classify and effectively use the growing volume of available information to obtain the answers one needs.

An interesting initiative in the direction of this challenge was the proposal, by Tim Berners-Lee, of a data maturity scale known as the 5-Star Open Data scheme [10]. This scale was established when he defined the concept of Linked Data, as described below [2]:

1 Star: the data is available on the web, in any format (e.g., PDF, PNG, JPEG);
2 Stars: the data is available in a structured, machine-readable form (e.g., an Excel spreadsheet);
3 Stars: the data is available in a non-proprietary format (e.g., a CSV file);
4 Stars: the data is published using World Wide Web Consortium open standards, such as RDF and SPARQL, and uses Uniform Resource Identifiers (URIs);
5 Stars: all of the above apply, and the data is linked to data from other sources and enriched with semantics, that is, the data is connected to other data.
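To make the upper levels of the scale more concrete, the sketch below shows, in Python with the rdflib library (neither of which is prescribed by the scheme; they are assumptions made here for illustration), what 4- and 5-star data can look like: a hypothetical catalog resource is given a URI, described in RDF, linked to the corresponding DBpedia entity, and queried with SPARQL. The namespace and the DBpedia link are illustrative, not taken from the original text.

# A minimal sketch, assuming Python and the rdflib library, of 4- and 5-star data:
# every entity has a URI, the data uses the open W3C standards RDF and SPARQL,
# and it links to another source.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

# Hypothetical namespace for a local open-data catalog (illustration only).
EX = Namespace("http://dados.example.gov.br/resource/")

g = Graph()
state = EX["Alagoas"]

# 4-star: the resource is identified by a URI and described with RDF.
g.add((state, RDFS.label, Literal("Alagoas", lang="pt")))

# 5-star: the resource is linked to the same entity in an external dataset
# (DBpedia here), so a consumer can follow the link to obtain more data.
g.add((state, OWL.sameAs, URIRef("http://dbpedia.org/resource/Alagoas")))

print(g.serialize(format="turtle"))

# The same graph can be queried with SPARQL, the W3C query standard cited above.
results = g.query("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?external WHERE { ?local owl:sameAs ?external . }
""")
for row in results:
    print(row.external)

In this sketch, the owl:sameAs link is what moves the data from 4 to 5 stars: it lets a consumer hop from one dataset to related data in another source, without any prior local integration.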
