streamfeed-parser - A Python library for memory-efficient processing of large CSV and XML files over both HTTP and FTP protocols.

Maintained · Open Source

A lightweight streaming parser for CSV and XML feeds over HTTP/FTP with automatic compression handling. Designed to efficiently process large data feeds without loading the entire file into memory.

Technologies Used

Big Data · Python · Streaming · Data Feeds · ETL · Data Pipelines · Open Source

StreamFeed Parser


Background

While working on the Trendbook platform, we faced major issues syncing product catalogs: enormous file sizes, unstructured data formats, compression, and different protocols such as HTTP and FTP. I built StreamFeed Parser, which lets us stream feeds in these formats over various protocols with only a few lines of code, drastically reducing our memory demands.

I decided to make it open source to give other developers in the same situation a shortcut, and hopefully to get help improving the library.

Features

  • Memory-efficient streaming approach - process gigabytes of data with minimal memory usage
  • Multiple format support - seamlessly handle both CSV and XML feed formats
  • Automatic detection - intelligently detects file formats and compression types
  • Multi-protocol support - works with HTTP, HTTPS, and FTP protocols
  • Compression handling - supports ZIP, GZIP, and BZ2 compressed files
  • Data transformation - expand fields with multiple values into separate records

Installation

pip install streamfeed-parser

Quick Start

from streamfeed import stream_feed, preview_feed

# Preview the first 10 rows from a feed
preview_data = preview_feed('https://example.com/large-feed.csv', limit_rows=10)
print(preview_data)

# Stream and process a large feed without memory constraints
for record in stream_feed('https://example.com/large-feed.csv'):
    # Process each record individually
    print(record)
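
Because stream_feed yields records one at a time, you can transform or persist them incrementally instead of collecting them in memory. Below is a minimal sketch (the feed URL and output file name are illustrative, not part of the library) that copies a streamed feed into a local CSV file row by row:

import csv

from streamfeed import stream_feed

FEED_URL = 'https://example.com/large-feed.csv'  # illustrative URL

with open('feed_copy.csv', 'w', newline='') as out_file:
    writer = None
    for record in stream_feed(FEED_URL):
        if writer is None:
            # Build the CSV header from the first record's keys
            writer = csv.DictWriter(out_file, fieldnames=list(record.keys()),
                                    extrasaction='ignore')
            writer.writeheader()
        writer.writerow(record)  # one row at a time, memory stays flat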

Detailed Usage

Streaming Feeds

The main function for streaming data is stream_feed:

from streamfeed import stream_feed

# Stream a CSV feed
for record in stream_feed('https://example.com/products.csv'):
    print(record)  # Record is a dictionary with column names as keys

# Stream an XML feed (default item tag is 'product')
for record in stream_feed('https://example.com/products.xml'):
    print(record)  # Record is a dictionary with XML elements as keys

Preview Feeds

To preview the first few records without processing the entire feed:

from streamfeed import preview_feed

# Get the first 100 records (default)
preview_data = preview_feed('https://example.com/large-feed.csv')

# Customize the number of records
preview_data = preview_feed('https://example.com/large-feed.csv', limit_rows=10)

Feed Logic Configuration

You can customize how feeds are processed with the feed_logic parameter:

from streamfeed import stream_feed

# Specify the XML item tag for XML feeds
feed_logic = {
    'xml_item_tag': 'item'  # Default is 'product'
}

for record in stream_feed('https://example.com/feed.xml', feed_logic=feed_logic):
    print(record)

# Explode comma-separated values into multiple records
feed_logic = {
    'explode_fields': ['size', 'color'],  # Fields to explode
    'divider': ','  # Character that separates values (default is ',')
}

# Input: {'id': '123', 'size': 'S,M,L', 'color': 'red,blue,green'}
# Output: Multiple records with each size-color combination
for record in stream_feed('https://example.com/feed.csv', feed_logic=feed_logic):
    print(record)
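
Assuming explode_fields expands the listed fields as a cartesian product, as the comment above describes, the example input row would yield output along these lines (illustrative, not verified against the library):

# {'id': '123', 'size': 'S', 'color': 'red'}
# {'id': '123', 'size': 'S', 'color': 'blue'}
# {'id': '123', 'size': 'S', 'color': 'green'}
# {'id': '123', 'size': 'M', 'color': 'red'}
# ...
# {'id': '123', 'size': 'L', 'color': 'green'}  # 3 sizes x 3 colors = 9 records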

FTP Support

The library handles FTP URLs seamlessly:

from streamfeed import stream_feed

# Basic FTP
for record in stream_feed('ftp://example.com/path/to/feed.csv'):
    print(record)

# FTP with authentication (included in URL)
for record in stream_feed('ftp://username:password@example.com/feed.csv'):
    print(record)

Compression Handling

The library automatically detects and handles compressed feeds:

from streamfeed import stream_feed

# These will automatically be decompressed
for record in stream_feed('https://example.com/feed.csv.gz'):  # GZIP
    print(record)

for record in stream_feed('https://example.com/feed.csv.zip'):  # ZIP
    print(record)

for record in stream_feed('https://example.com/feed.xml.bz2'):  # BZ2
    print(record)

Advanced Features

Row Count Limiting

Limit the number of rows processed:

from streamfeed import stream_feed

# Only process the first 1000 rows
for record in stream_feed('https://example.com/large-feed.csv', limit_rows=1000):
    print(record)

Field Length Limiting

Limit the maximum length of fields to prevent memory issues:

from streamfeed import stream_feed

# Limit each field to 10,000 characters
for record in stream_feed('https://example.com/feed.csv', max_field_length=10000):
    print(record)

Low-Level Access

For more specialized needs, you can access the underlying functions:

from streamfeed import detect_compression
from streamfeed import stream_csv_lines
from streamfeed import stream_xml_items_iterparse
from streamfeed import stream_from_ftp

# Example: Check compression type
compression = detect_compression('https://example.com/feed.csv.gz')
print(compression)  # 'gz'

Error Handling

The library gracefully handles many common errors in feeds:

  • Broken CSV lines (including quoted fields with newlines)
  • Missing columns
  • Inconsistent delimiters
  • XML parsing errors

Errors are logged but processing continues when possible.
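
If you want those log messages to show up while a feed is being processed, configuring Python's standard logging in your application is typically enough. The sketch below assumes streamfeed emits its messages through the standard logging module (the feed URL is illustrative):

import logging

from streamfeed import stream_feed

# Show warnings/errors emitted while malformed rows are handled
logging.basicConfig(level=logging.WARNING)

processed = 0
for record in stream_feed('https://example.com/messy-feed.csv'):
    processed += 1  # problem rows are logged; processing continues when possible

print(f'Processed {processed} records')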

Contributing

Contributions are welcome! Please feel free to submit a Pull Request to the GitHub repository.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the terms included in the LICENSE file.

Author

Hans-Christian Bøge Pedersen - devwithhans