Archiving Medium Blogs: AI-Assisted Content Migration
A few weeks ago, I felt the desire to archive my Medium blog posts and integrate them into my personal website. Like many content creators, I'd been publishing on Medium for years but wanted to consolidate everything under my own domain. I have close to 50 posts there, so manually downloading them would be predictably tedious, and converting from Medium's format to the format I want to use going forward also looked like a pain.
Of course, I could spend a long time figuring it out... or I could ask AI to do exactly what I require. Lo and behold, I got the exact outcome I was after. It's almost as though AI is much better at coding than I am. Who'd have thought it?
AI wrote a set of custom scripts using Node.js, Puppeteer, and Turndown.
The Problem: Why Archive Medium Posts?
Medium is a decent enough platform for reaching new audiences, but there are compelling reasons to maintain your own copy of your content:
- Ownership: You have complete control over your content and its presentation
- Longevity: Your content isn't dependent on a third-party platform's policies or existence
- Customization: You can organize and present your content exactly how you want
- SEO: Your content contributes to your own domain's authority
- No Paywall: Medium promotes paywalled content, but I don't see a direct benefit to me personally
I had 47 posts on Medium covering blockchain, cryptocurrency, and various technology topics that I wanted to bring over to my personal site.
The First Attempt: When Aggressive Parsing Goes Wrong
My initial approach was to create a single script that would:
- Scrape Medium posts directly
- Parse the HTML aggressively to extract content
- Convert everything to markdown in one step (AI being overly zealous ;-) )
- Generate the final blog files
This seemed efficient, but it created several problems:
- Content corruption: The aggressive HTML parsing mangled formatting, especially code blocks and lists
- Poor descriptions: The script extracted descriptions from "tl;dr" sections, which included markdown formatting artifacts
- Incorrect categorization: Posts were being assigned to wrong categories
- Difficult debugging: When something went wrong, it was hard to tell where in the process the issue occurred
The results were disappointing. Posts looked messy, descriptions were littered with asterisks and formatting characters, and the overall quality was poor.
The Solution: A Three-Step Approach
Instead of trying to do everything at once, I redesigned the process as three separate, focused steps (which is what I actually asked for in Cursor ;-) ):
Step 1: Raw Data Collection
First, I created a script that uses Puppeteer (a browser automation tool) to:
- Save complete HTML files exactly as they appear on Medium
- Download all associated images
- Create timestamped directories for organization
- Preserve everything without any parsing or modification
This step is like making a perfect photocopy before you start making changes.
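To give a flavour of what that first script does, here is a minimal sketch of the capture step. It assumes Node 18+ (for the built-in fetch) and a recent Puppeteer; the function name, directory layout, and image naming are illustrative rather than the exact code the AI produced.

```js
// Minimal sketch of step 1: save each post's rendered HTML plus its images
// into a timestamped directory, without parsing or modifying anything.
const puppeteer = require('puppeteer');
const fs = require('fs/promises');
const path = require('path');

async function archivePost(url, outRoot) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // One directory per run, named by timestamp, so nothing gets overwritten.
  const dir = path.join(outRoot, new Date().toISOString().replace(/[:.]/g, '-'));
  await fs.mkdir(dir, { recursive: true });

  // Save the fully rendered HTML exactly as the browser sees it.
  await fs.writeFile(path.join(dir, 'post.html'), await page.content());

  // Download every image next to the HTML, keeping the original file name.
  const imgUrls = await page.$$eval('img', imgs => imgs.map(i => i.src));
  for (const src of imgUrls) {
    const res = await fetch(src);
    const buf = Buffer.from(await res.arrayBuffer());
    const name = path.basename(new URL(src).pathname) || 'image';
    await fs.writeFile(path.join(dir, name), buf);
  }

  await browser.close();
}
```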
Step 2: Content Processing
The second script takes the raw HTML and:
- Converts it to clean markdown
- Fixes image references to point to local files
- Extracts metadata like titles, publication dates, and URLs
- Creates a processed version that can be reviewed and adjusted
This step can be re-run if you need to tweak the conversion logic.
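A stripped-down sketch of this conversion step, built around Turndown, might look like the following. The metadata regexes and the image-rewriting pattern are assumptions about the saved HTML, not guaranteed to match every post.

```js
// Minimal sketch of step 2: convert the saved HTML to markdown,
// extract basic metadata, and point image references at the local copies.
const TurndownService = require('turndown');
const fs = require('fs/promises');
const path = require('path');

async function processPost(rawDir) {
  const html = await fs.readFile(path.join(rawDir, 'post.html'), 'utf8');

  // Pull basic metadata out of the document head.
  const title = (html.match(/<title>(.*?)<\/title>/s) || [])[1] || 'Untitled';
  const canonical = (html.match(/<link rel="canonical" href="([^"]+)"/) || [])[1] || '';

  const turndown = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
  let markdown = turndown.turndown(html);

  // Rewrite remote image references to the files downloaded in step 1.
  markdown = markdown.replace(/!\[([^\]]*)\]\((https?:\/\/[^)]+)\)/g, (full, alt, remote) => {
    const name = path.basename(new URL(remote).pathname);
    return name ? `![${alt}](./${name})` : full;
  });

  await fs.writeFile(
    path.join(rawDir, 'post.md'),
    `<!-- title: ${title} -->\n<!-- source: ${canonical} -->\n\n${markdown}\n`
  );
}
```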
Step 3: Blog Integration
The final script:
- Converts the processed markdown to my blog's format
- Generates proper frontmatter with categories, tags, and descriptions
- Creates clean URLs and slugs
- Copies images to the main site directory
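The final step, reduced to its essentials, looks roughly like this. The frontmatter fields mirror my own blog, so treat them as placeholders, and the category is hard-coded here where the real script maps posts onto my existing category structure. Copying the images across is a plain file copy, so I've left it out of the sketch.

```js
// Minimal sketch of step 3: wrap the processed markdown in frontmatter
// and drop it into the site's content directory under a clean slug.
const fs = require('fs/promises');
const path = require('path');

function slugify(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
}

async function integratePost(processedDir, siteContentDir) {
  const body = await fs.readFile(path.join(processedDir, 'post.md'), 'utf8');
  const title = (body.match(/<!-- title: (.*?) -->/) || [])[1] || 'Untitled';
  const slug = slugify(title);

  // Description comes from the first substantial line that isn't metadata, a heading, or an image.
  const description = body
    .split('\n')
    .find(l => l.trim() && !l.startsWith('<!--') && !l.startsWith('#') && !l.startsWith('!')) || '';

  const frontmatter = [
    '---',
    `title: "${title.replace(/"/g, '\\"')}"`,
    `slug: "${slug}"`,
    'categories: ["technology"]',
    `description: "${description.trim().replace(/"/g, '\\"')}"`,
    '---',
    '',
  ].join('\n');

  await fs.mkdir(path.join(siteContentDir, slug), { recursive: true });
  await fs.writeFile(path.join(siteContentDir, slug, 'index.md'), frontmatter + body);
}
```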
The Results: What This Approach Achieved
The three-step process worked perfectly:
- 47 Medium posts successfully converted and integrated
- 265+ images migrated and properly linked
- Clean descriptions extracted from the first substantial paragraph of each post
- Proper categorization using existing category structure
- Total blog count increased from 172 to 219 posts
Most importantly, the content quality was preserved. Code blocks stayed formatted, lists remained readable, and the overall presentation was clean and professional.
Key Lessons Learned
1. Separate Data Collection from Processing
By saving raw HTML first, I created a safety net. If something went wrong during processing, I didn't need to re-scrape everything from Medium.
2. Make Each Step Repeatable
Each script can be run independently. If I need to adjust the categorization logic or improve description extraction, I can re-run just that step.
3. Focus on Content Quality
Taking time to properly handle edge cases (like code blocks and image references) made a huge difference in the final result.
4. Test Before Committing
The multi-step approach allowed me to review the processed content before final integration, catching issues early.
The Technical Stack (For Those Interested)
For readers who want to know more about the technical details:
- Puppeteer: For browser automation and HTML capture
- Turndown: For HTML to markdown conversion
- Node.js: For all the processing scripts
- Custom parsing logic: To handle Medium's specific HTML structure
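The "custom parsing logic" mostly amounts to Turndown rules for markup that the default conversion handles poorly. As an illustration (not the exact rules from my scripts, and the selectors are assumptions about Medium's markup), a rule for figure-plus-caption blocks might look like this:

```js
// Illustrative Turndown rule: convert a <figure> containing an image and an
// optional <figcaption> into a markdown image whose alt text is the caption.
const TurndownService = require('turndown');

const turndown = new TurndownService({ codeBlockStyle: 'fenced' });

turndown.addRule('figureWithCaption', {
  filter: node => node.nodeName === 'FIGURE' && node.querySelector('img'),
  replacement: (content, node) => {
    const img = node.querySelector('img');
    const caption = node.querySelector('figcaption');
    const alt = caption ? caption.textContent.trim() : (img.getAttribute('alt') || '');
    return `\n\n![${alt}](${img.getAttribute('src')})\n\n`;
  },
});
```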
The complete source code is available in my website's repository for anyone who wants to adapt this approach for their own use.
Why This Matters
Content migration is a common challenge in our digital age. Whether you're moving from Medium to your own site, switching blog platforms, or consolidating content from multiple sources, the principles remain the same:
- Preserve the original before making changes
- Break complex processes into manageable steps
- Focus on quality over speed
- Make your process repeatable for future use
Conclusion
What started as a simple content migration... stayed a simple content migration. Claude 4 assisted me in writing this blog and wanted to make it sound like there was a grand learning experience, but there wasn't really. There are always small lessons to learn when working with AI, mostly about formulating prompts wisely: be as clear as possible and ask it to stay focused.
Should the first attempt fail, try again.
Claude first tried a "do everything at once" approach, but it was overly aggressive in parsing the content. So I asked it to redo the whole process, breaking it down and saving the intermediate steps.
The three-step approach not only solved the problem but created a reusable system for any future content migrations. Sometimes the best solution isn't the fastest one – it's the one that gets the job done right.
If you're considering a similar migration, I'd recommend taking the same approach I did.
Have you tackled a similar content migration project? I'd love to hear about your experience and any lessons you learned along the way.