While structured data, like the neat rows and columns in databases or CSV files we discussed earlier, is common, much of the data generated today doesn't fit cleanly into tables. This brings us to semi-structured data, a category that sits between the rigid format of structured data and the complete lack of organization found in unstructured data (like plain text documents or images).
Semi-structured data doesn't conform to a fixed, formal schema like relational database tables, but it does contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields. Think of it as having organizational clues, but not a strict blueprint. This flexibility makes it very useful for systems where the exact structure might evolve or vary.
Two widespread formats you'll frequently encounter when extracting semi-structured data are JSON and XML.
JSON is a lightweight text-based format that's easy for humans to read and write, and easy for machines to parse and generate. It became extremely popular with the rise of web APIs (Application Programming Interfaces) because it maps very naturally to data structures found in many programming languages.
JSON data is built on two primary structures:
Objects: collections of key/value pairs enclosed in curly braces {}. Keys are strings, and values can be strings, numbers, booleans (true/false), null, arrays, or even other nested objects.
Arrays: ordered lists of values enclosed in square brackets []. Values in an array can be of any valid JSON type.
Here's a simple example representing information about a person:
{
  "firstName": "Jane",
  "lastName": "Doe",
  "age": 30,
  "isStudent": false,
  "address": {
    "streetAddress": "123 Main St",
    "city": "Anytown",
    "postalCode": "12345"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "555-1234"
    },
    {
      "type": "work",
      "number": "555-5678"
    }
  ]
}
Notice how address is a nested object, and phoneNumbers is an array containing multiple phone number objects, each with its own type and number.
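As a minimal sketch of how this structure maps onto a programming language, the snippet below parses a shortened copy of the person document with Python's standard-library json module. The variable names (raw, person) are illustrative choices only, not part of any particular pipeline.

```python
import json

# A shortened copy of the person document shown above.
raw = """
{
  "firstName": "Jane",
  "age": 30,
  "isStudent": false,
  "address": {"streetAddress": "123 Main St", "city": "Anytown"},
  "phoneNumbers": [
    {"type": "home", "number": "555-1234"},
    {"type": "work", "number": "555-5678"}
  ]
}
"""

person = json.loads(raw)

# JSON objects become dicts, arrays become lists, and false becomes False.
print(type(person).__name__)                  # dict
print(type(person["phoneNumbers"]).__name__)  # list
print(person["isStudent"])                    # False

# Nested values are reached by chaining keys and list indexes.
print(person["address"]["city"])              # Anytown
print(person["phoneNumbers"][0]["number"])    # 555-1234
```

Because the parsed result is made of ordinary dicts and lists, no JSON-specific API is needed once parsing is done.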
XML is another text-based format designed to store and transport data. It uses tags to define elements and attributes to provide additional information about those elements. XML is hierarchical, meaning elements can be nested within other elements, creating a tree-like structure.
While perhaps less common now for new web APIs compared to JSON, XML is still widely used in configuration files, document formats (like Microsoft Office XML formats), and enterprise systems for data exchange.
Here's the same person information represented in XML:
<person>
  <firstName>Jane</firstName>
  <lastName>Doe</lastName>
  <age>30</age>
  <isStudent>false</isStudent>
  <address>
    <streetAddress>123 Main St</streetAddress>
    <city>Anytown</city>
    <postalCode>12345</postalCode>
  </address>
  <phoneNumbers>
    <phone type="home">
      <number>555-1234</number>
    </phone>
    <phone type="work">
      <number>555-5678</number>
    </phone>
  </phoneNumbers>
</person>
In this XML example, <person>, <firstName>, <address>, etc., are tags defining elements. Notice the type="home" within the <phone> tag; this is an attribute providing extra detail about the phone element.
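To make the difference between element text and attributes concrete, here is a small sketch that reads a shortened copy of the XML above with Python's standard-library xml.etree.ElementTree module. The variable names (raw_xml, root, phones) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the person document shown above.
raw_xml = """
<person>
  <firstName>Jane</firstName>
  <age>30</age>
  <phoneNumbers>
    <phone type="home"><number>555-1234</number></phone>
    <phone type="work"><number>555-5678</number></phone>
  </phoneNumbers>
</person>
"""

root = ET.fromstring(raw_xml.strip())

# Element text is read with findtext(); it is always a string,
# so numeric values need an explicit conversion.
first_name = root.findtext("firstName")
age = int(root.findtext("age"))

# Attributes are read with get(); child elements with findtext().
phones = {
    phone.get("type"): phone.findtext("number")
    for phone in root.iter("phone")
}

print(first_name, age)   # Jane 30
print(phones["work"])    # 555-5678
```

Unlike JSON, XML carries no type information for values, which is why age has to be converted from text by the extraction code.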
Extracting data from JSON or XML sources presents different challenges than pulling data from a relational database table. Values are often nested several levels deep, so the extraction logic has to navigate the hierarchy rather than simply select a column. To get the work phone number, for instance, you navigate to the phoneNumbers array, find the object where type is "work", and then get the value associated with the number key.
As another example, imagine extracting the street address from the JSON structure shown earlier. The process involves accessing the top-level object, then the address key, and finally the streetAddress key within the nested address object.
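The two lookups described above, the street address and the work phone number, might be sketched in Python as follows. The helper find_phone is a hypothetical name introduced here for illustration, and returning None for missing data is an assumption about how an extraction step might treat records that lack a field.

```python
import json

raw = """
{
  "firstName": "Jane",
  "address": {"streetAddress": "123 Main St", "city": "Anytown"},
  "phoneNumbers": [
    {"type": "home", "number": "555-1234"},
    {"type": "work", "number": "555-5678"}
  ]
}
"""

person = json.loads(raw)

# Street address: top-level object -> address -> streetAddress.
street = person["address"]["streetAddress"]


def find_phone(record, phone_type):
    """Return the first number of the given type, or None if absent.

    Hypothetical helper: uses .get() so that records without a
    phoneNumbers array, or without a matching type, do not raise.
    """
    for phone in record.get("phoneNumbers", []):
        if phone.get("type") == phone_type:
            return phone.get("number")
    return None


print(street)                      # 123 Main St
print(find_phone(person, "work"))  # 555-5678
print(find_phone(person, "fax"))   # None
```

Using .get() with a default keeps the lookup tolerant of records whose structure varies, which is the situation semi-structured data tends to produce.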
This diagram illustrates the nested structure common in JSON and XML data, showing objects containing other objects or arrays.
Understanding how to handle these semi-structured formats is essential in modern ETL, as they are common outputs from web services, IoT devices, and various application logs. The extraction stage needs to be equipped to parse these formats and pull out the relevant information before the transformation stage can begin.
© 2025 ApX Machine Learning