Performance Optimization¶
This guide provides tips and strategies for optimizing the performance of Medium Converter.
General Performance Tips¶
Hardware Considerations¶
- CPU: Multi-core processors help with parallel processing
- Memory: 8GB+ recommended, especially for PDF generation and LLM enhancement
- Disk: SSD improves file I/O performance
- Network: Stable internet connection required for fetching articles
Software Considerations¶
- Python Version: Python 3.11+ offers better performance than older versions
- Dependencies: Keep dependencies updated for performance improvements
- Virtual Environment: Use a dedicated virtual environment for consistent performance
Optimizing Article Fetching¶
HTTP Client Configuration¶
from medium_converter.core.fetcher import fetch_article
# Configure HTTP client for better performance
await fetch_article(
url="https://medium.com/example",
http_options={
"timeout": 30, # Increase timeout for slow connections
"follow_redirects": True,
"http2": True, # Enable HTTP/2 for better performance
"limits": {
"max_connections": 10,
"max_keepalive_connections": 5,
"keepalive_expiry": 60.0 # seconds
}
}
)
Caching¶
Enable caching to avoid refetching the same articles:
from medium_converter import convert_article
# Enable caching
await convert_article(
url="https://medium.com/example",
cache_options={
"enable": True,
"ttl": 86400, # Cache for 24 hours
"cache_dir": "~/.medium-converter/cache"
}
)
Optimizing Export Process¶
Format-Specific Optimizations¶
PDF Optimization¶
await convert_article(
url="https://medium.com/example",
output_format="pdf",
export_options={
"compress_images": True, # Optimize image size
"defer_images": True, # Defer image loading for faster initial processing
"optimize_for": "size" # Options: "size", "quality", "speed"
}
)
Markdown Optimization¶
await convert_article(
url="https://medium.com/example",
output_format="markdown",
export_options={
"lazy_image_loading": True, # Use lazy loading for images
"minimize_whitespace": True, # Reduce file size
"skip_metadata": True # Skip frontmatter for faster processing
}
)
LLM Enhancement Optimization¶
Model Selection¶
Different models offer different performance characteristics:
| Model | Processing Speed | Quality | Token Limit |
|---|---|---|---|
| GPT-3.5-Turbo | Fast | Good | 4K-16K |
| GPT-4 | Slow | Excellent | 8K-32K |
| Claude 3 Haiku | Very Fast | Good | 200K |
| Claude 3 Sonnet | Medium | Very Good | 200K |
| Claude 3 Opus | Slow | Excellent | 200K |
| Mistral Medium | Fast | Good | 32K |
| Local 7B Model | Varies | Good | Varies |
Chunking and Batching¶
For large articles, process content in chunks:
from medium_converter.llm.enhancer import enhance_article
from medium_converter.llm.config import LLMConfig
llm_config = LLMConfig(
# Provider settings
chunking={
"enable": True,
"max_chunk_size": 1000, # Characters per chunk
"overlap": 100, # Overlap between chunks for context
"batch_size": 5 # Process 5 chunks at once
}
)
await enhance_article(article, llm_config)
Parallel Processing¶
For multiple articles, use parallel processing:
from medium_converter import batch_convert
await batch_convert(
urls=["url1", "url2", "url3"],
output_format="markdown",
output_dir="./articles",
max_concurrent=3, # Process 3 articles at once
enhance=True,
enhancement_options={
"parallel_blocks": True # Process multiple content blocks in parallel
}
)
Memory Management¶
Reducing Memory Usage¶
# Configure for lower memory usage
await convert_article(
url="https://medium.com/example",
memory_options={
"low_memory": True, # Use more disk, less RAM
"cleanup_temporary": True, # Aggressively clean temp files
"stream_output": True # Stream output rather than building in memory
}
)
Handling Large Articles¶
For very large articles, use streaming mode:
from medium_converter import convert_article_stream
async for chunk in convert_article_stream(
url="https://medium.com/example-large-article",
output_format="markdown",
chunk_size=1000 # Process 1000 lines at a time
):
# Process each chunk as it becomes available
with open("large_article.md", "a") as f:
f.write(chunk)
Profiling and Benchmarking¶
Performance Logging¶
Enable performance logging to identify bottlenecks:
import logging
from medium_converter import convert_article
# Set up logging
logging.basicConfig(level=logging.INFO)
performance_logger = logging.getLogger("medium_converter.performance")
performance_logger.setLevel(logging.DEBUG)
# Add a handler to log to file
fh = logging.FileHandler("performance.log")
fh.setLevel(logging.DEBUG)
performance_logger.addHandler(fh)
# Now perform conversion with performance logging
await convert_article(
url="https://medium.com/example",
output_format="markdown",
debug_options={
"profile": True, # Enable profiling
"measure_time": True, # Measure execution time of each step
"log_memory": True, # Log memory usage
}
)
Comparing Configurations¶
import time
import asyncio
from medium_converter import convert_article
async def benchmark():
url = "https://medium.com/example"
configs = [
{"name": "Default", "options": {}},
{"name": "Optimized I/O", "options": {"http2": True, "cache": True}},
{"name": "Low Memory", "options": {"low_memory": True}},
{"name": "High Performance", "options": {
"http2": True,
"cache": True,
"parallel_blocks": True,
"optimize_for": "speed"
}}
]
results = []
for config in configs:
start_time = time.time()
await convert_article(
url=url,
output_format="markdown",
**config["options"]
)
end_time = time.time()
results.append({
"name": config["name"],
"time": end_time - start_time
})
# Print results
for result in results:
print(f"{result['name']}: {result['time']:.2f} seconds")
asyncio.run(benchmark())
Advanced Configurations¶
Configuration for High-Performance Servers¶
from medium_converter import convert_article
# Configuration for high-performance servers
await convert_article(
url="https://medium.com/example",
server_options={
"workers": 8, # Number of worker processes
"thread_pool_size": 20, # Size of thread pool
"use_uvloop": True, # Use uvloop for better asyncio performance
"use_process_pool": True # Use process pool for CPU-bound tasks
}
)
Distributed Processing¶
For very large batch operations, you can distribute work:
import asyncio
from medium_converter import batch_convert
from concurrent.futures import ProcessPoolExecutor
async def process_batch(batch_urls):
return await batch_convert(
urls=batch_urls,
output_format="markdown",
output_dir="./articles"
)
async def distributed_processing(all_urls, num_processes=4):
# Split URLs into batches
batch_size = len(all_urls) // num_processes
batches = [all_urls[i:i+batch_size] for i in range(0, len(all_urls), batch_size)]
# Process each batch in a separate process
with ProcessPoolExecutor(max_workers=num_processes) as executor:
loop = asyncio.get_event_loop()
tasks = [
loop.run_in_executor(executor, asyncio.run, process_batch(batch))
for batch in batches
]
results = await asyncio.gather(*tasks)
return [item for sublist in results for item in sublist] # Flatten results
# Usage
all_urls = [f"https://medium.com/article{i}" for i in range(1, 101)]
asyncio.run(distributed_processing(all_urls, num_processes=4))
Final Performance Checklist¶
- ✅ Choose the right output format for your needs
- ✅ Configure HTTP client appropriately
- ✅ Enable caching for repeated access
- ✅ Select appropriate LLM model for your quality/speed requirements
- ✅ Use batching and parallel processing when appropriate
- ✅ Monitor memory usage for large articles
- ✅ Keep dependencies updated
- ✅ Use Python 3.11+ for best performance