What Is Semi-Structured Data?
Finally, semi-structured data is a hybrid of these two data types. This group could include Excel spreadsheets that hold important financial information in a form that is hard to extract. These data objects may have internal structure, but they lack the external structure needed for standard data management processes. Like unstructured data, they contain important insights that can be hard to extract and apply without an intelligent data governance strategy.
Semi-structured data refers to any information that uses a self-describing schema, such as XML or JSON. These types of data have an open-ended schema that enables application data flexibility. Sometimes, this type of data is combined with structured data to record additional properties for specific types of records within a structured data store.
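As a rough illustration of combining the two, the sketch below keeps fixed columns in a relational table and stores the open-ended properties as JSON text next to them. The table name `products`, the column `extra`, and the helper `extra_properties` are hypothetical names chosen for this example, not part of any particular product.

```python
import json
import sqlite3

# Structured part: every product has an id and a name (fixed columns).
# Semi-structured part: "extra" holds JSON whose keys vary per record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, extra TEXT)"
)

rows = [
    (1, "Laptop", json.dumps({"ram_gb": 16, "ports": ["usb-c", "hdmi"]})),
    (2, "T-shirt", json.dumps({"size": "M", "color": "blue"})),
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)


def extra_properties(product_id):
    """Return the JSON properties for one product, or {} if there are none."""
    row = conn.execute(
        "SELECT extra FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return json.loads(row[0]) if row and row[0] else {}
```

Each record describes its own additional fields, so a new property can be added to one product without altering the table definition.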
The open-ended schema means that semi-structured data does not rely on the application that created it to define the embedded structure. By contrast, an Oracle database would be considered structured data: the rules governing the data are bound and applied by the application that creates the file, in this case the database itself.
With semi-structured datasets, the definitions and constraints are embedded within the file itself, regardless of the application that created it. For example, XML files and cascading style sheets for web pages are both forms of semi-structured data. They can be created by almost any kind of application, such as Notepad, a website builder app or an Office app like Word, so there is no way for the application to apply structure or rules to these data types.
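To make the "embedded within the file" idea concrete, here is a minimal sketch that reads an XML document with Python's standard library and recovers its structure from the tags and attributes alone. The sample invoice and the helper name `describe` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A small sample document; its tags and attributes carry the structure.
doc = """
<invoice id="42">
  <customer><name>Ada</name></customer>
  <total currency="USD">19.99</total>
</invoice>
"""


def describe(element, depth=0):
    """Walk the tree and list (depth, tag, attributes, text) for every element."""
    entries = [
        (depth, element.tag, dict(element.attrib), (element.text or "").strip())
    ]
    for child in element:
        entries.extend(describe(child, depth + 1))
    return entries


structure = describe(ET.fromstring(doc))
for depth, tag, attrs, text in structure:
    print("  " * depth + tag, attrs, text)
```

The reading code needs no knowledge of which program wrote the file; the element names and nesting are enough to reconstruct the layout.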
Semi-structured data is challenging for organizations to manage because it does not necessarily have the same level of organization and predictability as structured data. It does not reside in fixed fields or records. At the same time, it is more rigid than unstructured data, because it contains elements that can separate the data into various hierarchies (think of comma-delimited or tab-delimited files).
Unlike structured data, which represents data as a flat table, semi-structured data can contain n-level hierarchies of nested information. This means that it can be difficult to apply standard data management processes to semi-structured data, and insights can be hard to extract from it. The real issue is making sure your business has the tools and technology necessary to load the data into structured or unstructured data models, which can then be managed via data governance.
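Loading nested data into a structured model usually involves flattening the hierarchy into columns. The sketch below shows one simple way to do that in Python, joining nested keys with dots; the function name `flatten` and the sample `order` record are assumptions made for this example, and real pipelines often need extra handling for repeated or missing fields.

```python
def flatten(value, prefix=""):
    """Turn nested dicts and lists into one flat dict keyed by dotted paths."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated path.
        flat[prefix[:-1]] = value
    return flat


# Hypothetical nested record with three levels of hierarchy.
order = {
    "id": 7,
    "customer": {"name": "Ada", "address": {"city": "London"}},
    "items": [{"sku": "A1", "qty": 2}],
}

row = flatten(order)
```

Here `row` has one entry per leaf value, with keys such as `customer.address.city` and `items.0.qty`, so it can be written to a single table row. Note that empty dicts and lists produce no columns in this simple version.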