
Data in 30"

Data formats

You’re probably aware of different data formats from working on a computer and having to save your work. A file you create in a word processor can contain not only the text you write, but also information on how it is laid out on the page, how items in the document relate to each other (such as links from citations to references), and other items such as images. How all this information is structured and stored depends on the software package you used to create the file and on the options you selected when you saved it.

A very common format globally, thanks to the prevalence of Microsoft’s Office suite of applications, is Office Open XML (an agreed international standard, ISO/IEC 29500, since 2008); ‘docx’ is the specific format for the Word app. An alternative is the Open Document Format (ISO/IEC 26300 since 2006), which grew out of the file format Sun Microsystems developed for its free OpenOffice suite (Sun was later acquired by Oracle, and the suite is now maintained as Apache OpenOffice). It is used in many other programs, including the current LibreOffice. An image in the document could itself be in any number of different file formats, which, just like the word processor file format that carries it, could be more or less ‘open’ or ‘closed’, depending on the software used to make or process it and the options selected when it was saved.

Open vs. Closed
‘Open’ standards are preferred for academic research because the way files are structured to hold information and make it readable is freely documented, and usually very well documented too. Computer scientists can make use of online tools that encourage and facilitate documentation of code (including file formats and programs), such as GitHub, so that others can come along and understand why files are structured the way they are.
The question of open vs. closed standards is very much a part of the free vs. proprietary software debate. Free software means freedom to do with it as you please, not freedom from cost. This generally means that users have access to the code – that it is ‘open source’ – although some argue that free software also enshrines individual freedoms that go beyond being able to see the source code, and extend to the right to copy and distribute that software too. That’s quite a controversial issue. What is true is that many free software tools tend to favour free and open formats.
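To make the benefit of a documented format concrete, here is a minimal sketch in Python. It relies on two facts from the published Office Open XML specification: a ‘docx’ file is a ZIP archive, and its main text lives in a part named word/document.xml. The helper names docx_paragraphs and make_toy_docx are invented for this illustration, and the toy file is a stripped-down stand-in rather than a complete document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of WordprocessingML, the vocabulary used inside a .docx file.
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = "{" + W_URI + "}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Text is stored in <w:t> elements nested inside each paragraph.
        pieces = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def make_toy_docx(lines):
    """Build an in-memory stand-in for a .docx file (illustration only).

    A real file also carries content-type and relationship parts,
    so a word processor may refuse to open this one.
    """
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = f'<w:document xmlns:w="{W_URI}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


toy = make_toy_docx(["Open formats", "can be read by anyone."])
paragraphs = docx_paragraphs(toy)
print(paragraphs)
```

Nothing here depends on the software that created the file: the standard library and the public specification are enough. A closed format offers no such route.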

Why do closed formats exist at all?
Closed formats exist because they offer producers of commercial software a way to control the manner in which the data is stored – sometimes to optimise functionality, but often to compete for market share.
The downside of so-called ‘proprietary’ formats is that, if the company behind the software that uses them ceases to exist, files in that format may be left unreadable. This is unfortunate and annoying for individuals who can no longer access their own files, but catastrophic for scientists, historians and others working with digitally archived files.

Raw Data Now
Social and physical scientists around the world agree that choosing the right file format is crucial. For many, it’s not simply a case of choosing the right open format for each type of data file (image, sound recording, video), but of choosing the simplest possible implementation: plain text files, or formats that can be read as plain text, are preferred for both numerical and word-based data.
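As a rough illustration of why plain text is attractive for numerical data, the following Python sketch writes a few readings as comma-separated text and reads them back using only the standard library. The column names and values are made up for the example, and the functions write_readings and read_readings are hypothetical helpers, not part of any established tool.

```python
import csv
import io


def write_readings(rows, handle):
    """Write (sample, value) pairs as comma-separated plain text."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample", "temperature_c"])
    writer.writerows(rows)


def read_readings(handle):
    """Read the same text back into a list of (sample, float) pairs."""
    reader = csv.DictReader(handle)
    return [(row["sample"], float(row["temperature_c"])) for row in reader]


# Made-up values, used only to illustrate the round trip.
rows = [("a", 18.5), ("b", 21.0), ("c", 19.25)]

buffer = io.StringIO()  # stands in for a file on disk
write_readings(rows, buffer)
text = buffer.getvalue()
print(text)

restored = read_readings(io.StringIO(text))
print(restored)
```

The stored text is readable in any editor, on any operating system, with or without the program that wrote it.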

Tim Berners-Lee, inventor of the World Wide Web, gave a talk at TED (Technology, Entertainment, Design) in 2009, “The next Web”, in which he called for “Raw Data Now”. In it, he explains that the file formats used should be the simplest, most ‘open’ standards available, such as text files for text and numerical data, and other open standards such as Ogg Vorbis for audio and PNG for graphics. You can see the video here: