Producing structured output from a model is a common requirement, but the raw response is often not immediately usable. Large language models are text generators at their core, and their output can be wrapped in conversational text, enclosed in markdown blocks, or even contain minor formatting errors. Directly parsing this with a standard library like Python's json module can be brittle and lead to application failures.
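To see the failure mode concretely, here is a small illustrative example (the response text is made up) of passing a typical wrapped reply straight to `json.loads`:

```python
import json

# A typical LLM reply: valid JSON buried in prose and a markdown fence
raw_response = (
    "Sure! Here is the data:\n"
    "```json\n"
    '{"name": "Alice", "age": 28}\n'
    "```\n"
    "Let me know if you need anything else."
)

try:
    json.loads(raw_response)
except json.JSONDecodeError as e:
    # The very first character is prose, so strict parsing fails immediately
    print(f"Standard parsing fails: {e}")
```

The JSON itself is perfectly valid; it is the surrounding text that breaks a strict parser.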
This is where specialized parsers become indispensable. They are designed to find and extract structured data from the noisy, unpredictable text that LLMs sometimes produce, giving you a reliable way to get from a raw string to a usable data structure.
A frequent pattern you'll encounter is an LLM providing a JSON object wrapped in explanatory text or a markdown code block. For instance, you might ask for user data and receive a response like this:
Of course! Here is the user information you requested, formatted as a JSON object:
```json
{
    "name": "Alice Johnson",
    "email": "[email protected]",
    "age": 28,
    "roles": ["developer", "team_lead"]
}
```
Let me know if you need any other details.
A standard JSON parser would fail on this text. The extract_json function, however, is built to handle this exact scenario. It scans the text, identifies the JSON block, and extracts it for parsing.
````python
from kerb.parsing import extract_json

llm_output = """Here's the user data you requested:

```json
{
    "name": "Alice Johnson",
    "email": "[email protected]",
    "age": 28,
    "roles": ["developer", "team_lead"]
}
```

This data was extracted from the user database."""

result = extract_json(llm_output)

if result.success:
    user_data = result.data
    print(f"Successfully extracted user: {user_data.get('name')}")
else:
    print(f"Failed to extract JSON: {result.error}")
````
The function returns a ParseResult object containing the outcome. The success attribute tells you if the operation worked, and if so, the data attribute holds the parsed Python dictionary. This makes it simple to build resilient workflows that can handle variations in LLM verbosity.
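To build intuition for what an extractor like this does, here is a minimal standard-library sketch. It is illustrative only, not the library's actual implementation: it prefers the contents of a fenced code block and falls back to parsing the raw text.

```python
import json
import re

def extract_json_sketch(text: str):
    """Illustrative sketch: pull JSON out of a markdown code fence, if present."""
    # Prefer the contents of a ```json ... ``` fence; fall back to the whole text
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

wrapped = (
    "Here's the data:\n"
    "```json\n"
    '{"name": "Alice Johnson", "age": 28}\n'
    "```\n"
)
print(extract_json_sketch(wrapped))
```

A production implementation would also report success, errors, and repairs, as the ParseResult object does, but the core idea is the same: locate the structured span before parsing it.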
Sometimes, an LLM's output is almost correct but contains minor syntax errors that would break a strict parser. Common issues include trailing commas, missing quotes around keys, or single quotes instead of double quotes.
Additionally, a model might embed a JSON object directly within a sentence without any special formatting. For example: The configuration is: {"api_key": "sk-xxx", "timeout": 30} which should work.
To handle these cases, you can use parse_json or extract_json with a more forgiving parse mode.
- ParseMode.STRICT: Requires the input to be a perfectly formatted JSON string.
- ParseMode.LENIENT: Attempts to fix common errors such as trailing commas or missing quotes before parsing. It still expects the entire string to be JSON.
- ParseMode.BEST_EFFORT: Scans the string for the first valid-looking JSON object or array, even if it's embedded in other text.

Let's see how ParseMode.LENIENT can automatically fix a malformed JSON string.
```python
from kerb.parsing import parse_json, ParseMode

# This JSON has unquoted keys and a trailing comma
malformed_json = """{
    name: "Bob",
    age: 35,
    active: true,
}"""

result = parse_json(malformed_json, mode=ParseMode.LENIENT)

if result.success:
    print(f"Data: {result.data}")
    print(f"Was the JSON fixed? {result.fixed}")
```
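A lenient repair pass can be approximated with a couple of targeted substitutions. The sketch below handles only the two mistakes shown above, unquoted keys and trailing commas, and is deliberately not a general repair tool (it would misfire on strings containing those patterns):

```python
import json
import re

def lenient_fix(text: str) -> str:
    """Illustrative repairs for two common LLM mistakes (not exhaustive)."""
    # Quote bare keys: `name:` -> `"name":` (only after `{` or `,`)
    fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
    # Drop trailing commas before a closing brace or bracket
    fixed = re.sub(r',(\s*[}\]])', r'\1', fixed)
    return fixed

malformed = '{name: "Bob", age: 35, active: true,}'
print(json.loads(lenient_fix(malformed)))
```

Real lenient parsers track whether they are inside a string literal while rewriting, which is why a library call is preferable to ad hoc regexes in production code.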
When dealing with JSON embedded in text, ParseMode.BEST_EFFORT is your best option. It will find and pull out the structured part, ignoring the rest.
```python
from kerb.parsing import extract_json, ParseMode

llm_output = 'The configuration is: {"api_key": "sk-xxx", "timeout": 30, "retries": 3} which should work.'

# BEST_EFFORT will find the JSON within the sentence
result = extract_json(llm_output, mode=ParseMode.BEST_EFFORT)

if result.success:
    print(f"Extracted config: {result.data}")
```
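The scanning behavior itself can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. This is an illustration of the idea, not how the library necessarily implements it:

```python
import json

def best_effort_sketch(text: str):
    """Illustrative scan: try to decode JSON starting at each '{' or '['."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # Not a valid JSON value at this position; keep scanning
    return None

sentence = 'The configuration is: {"api_key": "sk-xxx", "timeout": 30} which should work.'
print(best_effort_sketch(sentence))
```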
Using these modes provides a safety net, making your application more resilient to the small but common variations in LLM output quality.
Extracting code follows a similar pattern. While LLMs are good at generating code, they often wrap it in markdown blocks and add explanations. A reliable way to extract code is to instruct the model to place it inside a JSON object. This turns the problem of parsing code into the more manageable problem of parsing JSON.
For instance, you can prompt the model to return its output in a specific JSON structure.
Prompt:
Analyze the following Python function and convert it to a JSON representation with these fields: "name", "docstring", "parameters", and "return_type".
```python
def calculate_bmi(weight: float, height: float) -> float:
"""Calculate Body Mass Index."""
return weight / (height ** 2)
```
Return only a single, valid JSON object.
The model would ideally respond with a clean JSON object, which you can then parse using the techniques we've discussed.
````python
from kerb.parsing import extract_json

# Simulated LLM response to the prompt above
llm_response = """
```json
{
    "name": "calculate_bmi",
    "docstring": "Calculate Body Mass Index.",
    "parameters": [
        {"name": "weight", "type": "float"},
        {"name": "height", "type": "float"}
    ],
    "return_type": "float"
}
```
"""

result = extract_json(llm_response)

if result.success:
    code_info = result.data
    print(f"Function Name: {code_info.get('name')}")
    print(f"Docstring: {code_info.get('docstring')}")
````
This approach uses the model's ability to follow formatting instructions, making the output predictable and easy to parse with tools you already have. By treating code as a string value within a larger JSON structure, you can reliably extract it for use in code analysis, generation, or execution workflows.