Handling Non-Standard JSON with json.JSONDecodeError

It’s astonishing how often developers encounter broken JSON in the wild. You’d think that with all the standards and best practices floating around, people would get it right. But no, it seems like a rite of passage for a programmer to deal with malformed JSON. Whether it’s a missing comma, an unquoted string, or an extra bracket, these issues crop up just when you least expect them.

One of the main culprits is the human factor. JSON is meant to be easy to read and write, but that simplicity can lead to sloppiness. A developer might be in a hurry and forget to validate their output. Or perhaps they are generating JSON from a database query, and the data isn’t properly sanitized. This kind of oversight can lead to a cascade of errors that are difficult to track down.

Another issue arises when different systems interact. If you’re pulling data from an API that someone else maintains, you have to expect the unexpected. Different programming languages and frameworks handle data serialization in their own ways, which can lead to inconsistencies. You might receive a JSON response that works perfectly in one context but breaks in another due to differences in how data types are interpreted.

import json

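def load_json(data):
    """Parse a JSON string; return None if it is malformed."""
    try:
        return json.loads(data)
    except json.JSONDecodeError as e:
        # The message includes the line, column and character offset
        print(f"Error decoding JSON: {e}")
        return None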

# Example of broken JSON
broken_json = '{"name": "John", "age": 30, "city": "New York",}'
result = load_json(broken_json)

In this case, the trailing comma after “New York” is the problem. The JSON grammar does not allow a comma before a closing brace, so json.loads raises json.JSONDecodeError. It’s a trivial mistake, but it can create a frustrating debugging experience. To mitigate such issues, validate JSON at the point where it enters your system, before any other code touches it. Tools and libraries exist to help with this, and they can save you a lot of time and headaches.
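When logging a failure like this, it helps to know that json.JSONDecodeError carries more than a message: it exposes msg, lineno, colno, pos and doc attributes. The following is a minimal sketch of how those might be turned into a log-friendly summary; the helper name describe_json_error and the 20-character context window are illustrative choices, not part of the json module.

```python
import json

def describe_json_error(text, context=20):
    """Return a dict describing why `text` failed to parse, or None if it is valid.

    Hypothetical helper: only the exception attributes come from the json module.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset where the parser gave up
        start = max(err.pos - context, 0)
        return {
            "message": err.msg,
            "line": err.lineno,
            "column": err.colno,
            "context": text[start:err.pos + context],
        }
    return None

info = describe_json_error('{"name": "John", "age": 30, "city": "New York",}')
print(info)
```

The exact message and column reported for a trailing comma differ between Python versions, so it is safer to log these fields than to match on them.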

When working with third-party APIs, it’s essential to have a robust error handling mechanism. You can implement retries with exponential backoff for transient errors and log the details of any failures for later analysis. This way, you maintain a certain level of resilience against the inevitable imperfections in your data sources.

import requests

def fetch_data(url):
    # A timeout keeps a stalled server from hanging the caller indefinitely
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        return load_json(response.text)

    print(f"Error fetching data: {response.status_code}")
    return None
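The retry-with-exponential-backoff idea mentioned above can be sketched without tying it to any particular HTTP library. This is one possible shape, not a definitive implementation: the function name retry_with_backoff, its parameters, and the default delays are all assumptions made for illustration.

```python
import time

def retry_with_backoff(operation, attempts=4, base_delay=0.5,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call `operation()` until it succeeds or `attempts` is exhausted.

    The wait doubles after each failure: base_delay, 2*base_delay, 4*base_delay...
    `sleep` is injectable so the function can be tested without real waiting.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on as exc:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the last error
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

With requests you would pass its own exception classes, for example retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout), and wrap the call as retry_with_backoff(lambda: fetch_data(url), retry_on=...). Note that a malformed payload is not a transient error, so retrying on JSONDecodeError rarely helps.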

Ultimately, you need to be prepared to face the reality of broken JSON. Building a system that can gracefully handle these imperfections is key to keeping your application running smoothly. You’ll find that a little extra effort in error handling can lead to a much more stable application overall.

Wrangling broken data back into shape

So, your try-except block caught an error. Great. Now what? You can’t always go back to the API provider and tell them to fix their shoddy work. Sometimes, you’re stuck with the data you’re given, and your job is to make it work. This is where things get interesting. Instead of just failing, we can try to programmatically clean up the most common JSON syntax errors. It feels a bit like being a digital janitor, but it’s a necessary skill.

Let’s start with that classic, the trailing comma. It’s valid in many modern programming languages, so it’s an easy mistake for developers to make when hand-crafting JSON. While the json library is strict about it, a simple regular expression can wipe it away before the parser even sees it.

import re
import json

def fix_trailing_commas(json_string):
    # This regex finds a comma, followed by optional whitespace,
    # right before a closing brace or bracket, and removes it.
    json_string = re.sub(r',\s*([}\]])', r'\1', json_string)
    return json_string

# Our old enemy, the JSON with a trailing comma
broken_json = '{"name": "John", "age": 30, "skills": ["Python", "SQL",],}'
fixed_json = fix_trailing_commas(broken_json)

try:
    data = json.loads(fixed_json)
    print("Successfully parsed JSON after fixing trailing commas.")
    print(data)
except json.JSONDecodeError as e:
    print(f"Still failed to parse: {e}")

The magic here is in the regex: ,\s*([}\]]). It looks for a literal comma, followed by any amount of whitespace (\s*), which is then followed by either a closing brace or a closing bracket ([}\]]). We capture the closing character in a group and then put it back in the replacement string using \1. The comma and whitespace are simply discarded. It’s a targeted fix with one caveat: the pattern knows nothing about string boundaries, so a string value that happens to contain a comma directly before } or ] will be altered as well.
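If your string values might contain a comma followed by a closing brace or bracket, one option is to make the pattern string-aware: match whole double-quoted strings first and leave them alone, and only delete commas found outside them. The sketch below is an illustration under that assumption; remove_trailing_commas is a hypothetical name, not a library function.

```python
import json
import re

# Either a complete double-quoted string (group 1), or a comma plus
# optional whitespace that sits directly before a closing } or ].
_STRING_OR_TRAILING_COMMA = re.compile(r'("(?:\\.|[^"\\])*")|,\s*(?=[}\]])')

def remove_trailing_commas(text):
    # Strings are put back unchanged; trailing commas are replaced with nothing
    return _STRING_OR_TRAILING_COMMA.sub(lambda m: m.group(1) or '', text)

messy = '{"note": "a,}", "items": [1, 2,],}'
print(json.loads(remove_trailing_commas(messy)))
```

Because the regex engine consumes each string as a single match, the comma inside "a,}" is never examined on its own.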

Another frequent offender is JSON that looks suspiciously like a JavaScript object literal. You’ll see keys without quotes and strings delimited by single quotes instead of the required double quotes. Again, we can’t just throw our hands up. We can build a fixer for this, too. It’s a bit more involved, but the principle is the same: use regex to transform the non-standard syntax into something the strict JSON parser will accept.

import re
import json

def fix_js_object_literal(json_string):
    # 1. Add quotes to unquoted keys.
    # Looks for a brace or comma, whitespace, a word, then a colon.
    # e.g., {key: or , key:
    json_string = re.sub(r'([{,]\s*)(\w+)(\s*:)', r'\1"\2"\3', json_string)

    # 2. Replace single quotes with double quotes.
    # This is a bit naive and can fail on strings with escaped single quotes,
    # but it works for many common cases.
    json_string = json_string.replace("'", '"')
    
    return json_string

# A string that looks more like a JS object than JSON
broken_json = "{name: 'Jane Doe', city: 'London', active: true}"
fixed_json = fix_js_object_literal(broken_json)

try:
    data = json.loads(fixed_json)
    print("Successfully parsed JS-like object.")
    print(data)
except json.JSONDecodeError as e:
    print(f"Parsing failed: {e}")
    print(f"Attempted to parse: {fixed_json}")

This two-step process first wraps any unquoted keys in double quotes and then swaps all single quotes for double quotes. Be warned: the single-quote replacement is a blunt instrument. If your string values contain apostrophes, like in 'O'Malley', a simple replace() call can create invalid JSON. But for the common case where an entire system was built using single quotes instead of double quotes, this simple fix is often sufficient. You have to know your data and decide if a simple heuristic is worth the risk.
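When the data does escape its apostrophes (as in 'O\'Malley'), a small character-by-character scanner can convert single-quoted strings more carefully than replace(). What follows is a sketch under that assumption; the function name convert_single_quoted_strings is made up for this example, and an unescaped apostrophe inside a single-quoted string remains ambiguous and is not handled.

```python
import json

def convert_single_quoted_strings(text):
    """Rewrite 'single-quoted' strings as "double-quoted" JSON strings."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Copy an existing double-quoted string verbatim, skipping escapes
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == '\\' else 1
            out.append(text[i:j + 1])
            i = j + 1
        elif ch == "'":
            j = i + 1
            buf = []
            while j < n and text[j] != "'":
                if text[j] == '\\' and j + 1 < n:
                    # \' becomes a plain apostrophe; other escapes are kept as-is
                    buf.append("'" if text[j + 1] == "'" else text[j:j + 2])
                    j += 2
                else:
                    # A bare double quote must be escaped in the new string
                    buf.append('\\"' if text[j] == '"' else text[j])
                    j += 1
            out.append('"' + ''.join(buf) + '"')
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

source = r"""{"name": 'O\'Malley', "quote": 'say "hi"'}"""
print(json.loads(convert_single_quoted_strings(source)))
```

It is more code than a one-line replace(), but it leaves apostrophes and embedded double quotes intact instead of silently producing invalid JSON.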

Finally, let’s talk about comments. The JSON standard, in its infinite wisdom, has no syntax for comments. But does that stop developers from adding them? Of course not. You’ll find // and /* ... */ style comments littered throughout JSON configuration files across the globe. Fortunately, these are also easy to strip out before parsing.

import re
import json

def strip_comments(json_string):
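    # Strip C-style /* ... */ comments (non-greedy, may span lines)
    json_string = re.sub(r'/\*.*?\*/', '', json_string, flags=re.DOTALL)
    # Strip C++-style // comments up to the end of the line.
    # Caveat: this also matches // inside string values such as URLs.
    json_string = re.sub(r'//.*', '', json_string)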
    return json_string

json_with_comments = """
{
    // User ID from the legacy system
    "id": 42,
    "username": "testuser", /* This should be migrated to email */
    "roles": ["admin", "editor"]
}
"""
fixed_json = strip_comments(json_with_comments)

try:
    data = json.loads(fixed_json)
    print("Successfully parsed JSON after stripping comments.")
    print(data)
except json.JSONDecodeError as e:
    print(f"Parsing failed: {e}")
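One thing to watch for with regex-based comment stripping: a value such as "http://example.com" contains //, and a naive pattern will cut the string in half. A possible workaround, sketched here with the hypothetical name strip_comments_safely, is to match strings and comments in a single pattern and discard only the comments.

```python
import json
import re

# A double-quoted string, a /* ... */ block comment, or a // line comment
_STRING_OR_COMMENT = re.compile(
    r'"(?:\\.|[^"\\])*"|/\*.*?\*/|//[^\n]*',
    re.DOTALL,
)

def strip_comments_safely(text):
    def keep_strings(match):
        token = match.group(0)
        # Strings start with a double quote and are returned untouched
        return token if token.startswith('"') else ''
    return _STRING_OR_COMMENT.sub(keep_strings, text)

config = """
{
    "homepage": "http://example.com/a", // where users land
    /* retained for the legacy client */
    "id": 42
}
"""
print(json.loads(strip_comments_safely(config)))
```

Because each string is consumed as one token, the // inside the URL is never treated as the start of a comment.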

By chaining these cleaning functions together—first stripping comments, then fixing quotes and keys, then removing trailing commas—you can build a surprisingly robust ingestion pipeline. It allows you to be lenient on input while remaining strict in your application’s internal data representation. Each function acts as a layer of defense against the chaos of real-world data, making your application more resilient and saving you from late-night pages about a trivial syntax error in a third-party data feed.
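Here is one way the chain might be wired together. The sketch tries a strict parse first, so valid JSON is never touched by the heuristics, and then applies each fixer in turn, re-parsing after every step. The name lenient_loads and the FIXERS tuple are illustrative assumptions; the three fixers are the simple regex versions from this article, repeated so the example runs on its own.

```python
import json
import re

def strip_comments(text):
    text = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)
    return re.sub(r'//.*', '', text)

def fix_js_object_literal(text):
    text = re.sub(r'([{,]\s*)(\w+)(\s*:)', r'\1"\2"\3', text)
    return text.replace("'", '"')

def fix_trailing_commas(text):
    return re.sub(r',\s*([}\]])', r'\1', text)

# Order matters: comments first, then quotes and keys, then trailing commas
FIXERS = (strip_comments, fix_js_object_literal, fix_trailing_commas)

def lenient_loads(text, fixers=FIXERS):
    """Parse strictly if possible; otherwise apply fixers one at a time."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as first_error:
        original_error = first_error

    candidate = text
    for fixer in fixers:
        candidate = fixer(candidate)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue  # not enough yet, try the next fixer

    # Nothing worked: report the error from the untouched input
    raise original_error

messy = """{ // legacy id
  id: 42, 'roles': ['admin', 'editor',], /* note */ }"""
print(lenient_loads(messy))
```

Raising the original error, rather than the one from the last repair attempt, keeps the reported line and column meaningful relative to the data you actually received.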
