Data science is playing an increasingly prominent role in many businesses, bringing together data from a wide range of sources to create meaningful insights that can guide business strategy.
There is a real buzz around “big data” and the ways companies are using it. However, it’s also important to understand that not all data is created equal. Different data is acquired from different sources: the data generated from text messages, social media posts or emails, for example, is completely different to the data generated by point-of-sale or supply chain systems.
The main difference comes down to whether the data is structured or unstructured and, more often than not, this is dictated by whether the data is quantitative or qualitative.
What is the difference between structured and unstructured data?
Structured data is highly organised and formatted so that it’s easily searchable in relational databases. Unstructured data has no predefined format or organisation, making it much more difficult to collect, process, and analyse.
When we talk about structured vs unstructured data, we are not promoting a conflict between the two. Businesses choose between them based not on the structure of the data itself but on the applications that use it: relational databases for structured data, and almost any other type of application for unstructured data.
Whilst there is no conflict between the two types of data, analysing them is creating a growing tension for businesses, especially those dealing with a lot of unstructured data sources. The technology for analysing unstructured data is not yet mature and, whilst there has been much R&D in this area, businesses still face financial decisions as to whether analytics for unstructured data is worth the investment, or whether the two types can be aggregated into better business intelligence.
On top of this, there is simply much more unstructured data than structured. It is commonly estimated that unstructured data makes up 80% or more of enterprise data and is growing at a rate of 55% to 65% per year. Without the tools to analyse this massive amount of data, organisations are leaving a great deal of business insight on the table.
Understanding structured vs unstructured data
Structured vs unstructured data can be understood by considering the who, what, when, where, and how of the data:
- Who will be using the data?
- What type of data are you collecting?
- When does the data need to be prepared – before storage or when used?
- Where will the data be stored?
- How will the data be stored?
These five questions highlight the fundamentals of both structured and unstructured data and allow general users to understand how the two differ. They will also help users understand nuances like semi-structured data and guide us as we navigate the future of data in the cloud.
What is structured data?
Structured data is most often categorised as quantitative data, and it’s the type of data most of us are accustomed to working with. Think of data that fits neatly within fixed fields and columns in relational databases and spreadsheets.
Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation, and more.
Structured data is highly organised and easily processed by machines. Those working within relational databases can input, search, and manipulate structured data relatively quickly using a relational database management system (RDBMS). This is the most attractive feature of structured data.
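To make the point about fixed fields and quick querying concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an RDBMS. The sales table, its columns, and the sample rows are hypothetical and chosen purely for illustration.

```python
import sqlite3


def build_sales_db():
    """Create an in-memory table with fixed, typed columns (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, customer TEXT, sale_date TEXT, amount REAL)"
    )
    # Made-up sample rows; every row fits the same predefined columns.
    rows = [
        (1, "Alice", "2023-01-05", 120.0),
        (2, "Bob", "2023-01-06", 80.0),
        (3, "Alice", "2023-02-01", 50.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def total_by_customer(conn):
    """Aggregate spend per customer with a single declarative query."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


conn = build_sales_db()
totals = total_by_customer(conn)
print(totals)  # [('Alice', 170.0), ('Bob', 80.0)]
```

Because every row shares the same columns, a question such as "how much has each customer spent?" becomes a one-line query, with no need to inspect the individual records.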
Pros: structured data is easily used by machine learning algorithms and can be easily manipulated and queried. It is also easier for business users to understand and interpret.
Cons: structured data lacks flexibility and can only be used for its intended purposes, limiting its use cases. There are also limited storage options, as structured data is typically stored in data warehouses. Some of this is now being mitigated by cloud-based data warehousing. However, there are still limitations to the storage of structured data.
What is unstructured data?
Unstructured data is most often categorised as qualitative data, and it cannot be processed and analysed using conventional data tools and methods.
Examples of unstructured data include text, video files, audio files, mobile activity, social media posts, satellite imagery, surveillance imagery – the list goes on and on.
Unstructured data is difficult to deconstruct because it has no predefined data model, meaning it cannot be organised in relational databases. Instead, non-relational or NoSQL databases are the best fit for managing unstructured data.
Extracting insights buried within unstructured data isn’t an easy task. It requires advanced analytics and a high level of technical expertise to really penetrate the data and extract valuable insights. Such data analysis can be expensive for many companies.
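As a deliberately simple illustration of why this is harder than querying a table, the sketch below counts words across a few free-text customer messages. The messages are invented, and word counting is only a toy stand-in for the advanced analytics described above. Real projects would typically rely on dedicated natural language processing tools.

```python
import re
from collections import Counter

# Hypothetical free-text messages: there are no fields or columns to query.
messages = [
    "Love the new app, but delivery was late again.",
    "Delivery late. Support was helpful though!",
    "Great app, great price.",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(docs):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


counts = term_counts(messages)
print(counts["delivery"], counts["late"], counts["great"])  # 2 2 2
```

Even here, structure has to be imposed on the text (tokens and counts) before anything can be measured, and the counts alone say nothing about sentiment or intent.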
Pros: those able to harness unstructured data gain a competitive advantage, as it provides a much deeper understanding of customer behaviour and intent. Unstructured data also accumulates at much faster rates and can be stored in cloud data lakes, which allow for massive amounts of data storage.
Cons: we’ve already talked about the costs associated with analysing unstructured data and this is one of the biggest drawbacks. Ideally, you would need a specialised data scientist in order to maximise the opportunities presented by unstructured data. In addition to someone to interpret the data, you are also likely to need to invest in specialised tools for data analysis which are expensive. While the costs are coming down, they may still be prohibitive to many businesses.
What is semi-structured data?
Semi-structured data is a third category that falls somewhere between the other two. It’s a type of structured data that does not fit into the formal structure of a relational database. But while not matching the description of structured data entirely, it still employs tagging systems or other markers, separating different elements and enabling search.
A typical example of semi-structured data is a smartphone photo. Every photo taken with a smartphone contains unstructured image content as well as the tagged time, location, and other identifiable (and structured) information. JSON and XML are common semi-structured formats, and website developers use them to help Google understand more about the content on a particular web page.
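The short sketch below shows what working with such tagged data can look like, using Python's built-in json module. The photo record and its field names are hypothetical, as is the get_path helper, which is not part of any library.

```python
import json

# Hypothetical photo record: tagged metadata describing an image file.
raw = """
{
  "file": "IMG_0001.jpg",
  "taken_at": "2023-06-01T10:15:00",
  "location": {"lat": 51.5074, "lon": -0.1278},
  "tags": ["holiday", "london"]
}
"""

record = json.loads(raw)


def get_path(obj, path, default=None):
    """Walk nested dictionary keys, returning default if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


print(get_path(record, ["location", "lat"]))             # 51.5074
print(get_path(record, ["camera", "model"], "unknown"))  # unknown
```

The tags make individual elements searchable, but unlike a database table nothing guarantees that every record carries the same fields, which is why the lookup has to tolerate missing keys.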
The future of data analytics
As the volume of big data we collect continues to rise, investment is growing with it: the big data analytics market is set to reach $103 billion by 2023. The market for data scientists is also expected to grow, with a projected 2.72 million new jobs created in the field over the next few years.
Soon, data storage will no longer be an issue, with cloud-based storage set to shake up the way structured and unstructured data is currently stored. This will again create additional opportunities for data scientists to maximise the potential insights from the data they are collecting, creating new algorithms capable of leveraging the data in unique ways.
Regardless of whether data is structured or unstructured, businesses that can accurately collect and analyse data from the most relevant sources will gain a significant competitive advantage. It is said that “Data is the new oil”, and those organisations that can extract insights into customer behaviour, and align their products and services accordingly, will grow much more rapidly than organisations that rely on more traditional engagement methods.