Archiving Medium Blogs: AI-Assisted Content Migration
A few weeks ago, I felt the desire to archive my Medium blog posts and integrate them into my personal website. Like many content creators, I'd been publishing on Medium for years but wanted to consolidate everything under my own domain. I have close to 50 posts there, so manually downloading them would be predictably tedious, and converting from Medium's format to the format I want to use going forward also looked like a pain.
Of course, I could spend a long time figuring it out... or I could ask AI to do exactly what I require. Lo and behold, I got the exact outcome I was after. It's almost as though AI is much better at coding than I am. Who'd have thought it?
AI wrote a set of custom scripts using Node.js, Puppeteer, and Turndown.
The Problem: Why Archive Medium Posts?
Medium is a decent enough platform for reaching new audiences, but there are compelling reasons to maintain your own copy of your content:
- Ownership: You have complete control over your content and its presentation
- Longevity: Your content isn't dependent on a third-party platform's policies or existence
- Customization: You can organize and present your content exactly how you want
- SEO: Your content contributes to your own domain's authority
- No Paywall: Medium promotes paywalled content, but I don't see a direct benefit to me personally
I had 47 posts on Medium covering blockchain, cryptocurrency, and various technology topics that I wanted to bring over to my personal site.
The First Attempt: When Aggressive Parsing Goes Wrong
My initial approach was to create a single script that would:
- Scrape Medium posts directly
- Parse the HTML aggressively to extract content
- Convert everything to markdown in one step (AI being overly zealous ;-) )
- Generate the final blog files
This seemed efficient, but it created several problems:
- Content corruption: The aggressive HTML parsing mangled formatting, especially code blocks and lists
- Poor descriptions: The script extracted descriptions from "tl;dr" sections, which included markdown formatting artifacts
- Incorrect categorization: Posts were being assigned to wrong categories
- Difficult debugging: When something went wrong, it was hard to tell where in the process the issue occurred
The results were disappointing. Posts looked messy, descriptions were littered with asterisks and formatting characters, and the overall quality was poor.
The Solution: A Three-Step Approach
Instead of trying to do everything at once, I redesigned the process as three separate, focused steps (which is what I actually asked for in Cursor ;-) ):
Step 1: Raw Data Collection
First, I created a script that uses Puppeteer (a browser automation tool) to:
- Save complete HTML files exactly as they appear on Medium
- Download all associated images
- Create timestamped directories for organization
- Preserve everything without any parsing or modification
This step is like making a perfect photocopy before you start making changes.
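To give a flavour of what that first script does, here is a minimal sketch of the capture step. It assumes Node 18+ (for the built-in fetch) and a recent Puppeteer; the function name, directory layout, and image naming are illustrative rather than the exact code the AI produced.

```js
// Minimal sketch of step 1: save each post's rendered HTML plus its images
// into a timestamped directory, without parsing or modifying anything.
const puppeteer = require('puppeteer');
const fs = require('fs/promises');
const path = require('path');

async function archivePost(url, outRoot) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // One directory per run, named by timestamp, so nothing gets overwritten.
  const dir = path.join(outRoot, new Date().toISOString().replace(/[:.]/g, '-'));
  await fs.mkdir(dir, { recursive: true });

  // Save the fully rendered HTML exactly as the browser sees it.
  await fs.writeFile(path.join(dir, 'post.html'), await page.content());

  // Download every image next to the HTML, keeping the original file name.
  const imgUrls = await page.$$eval('img', imgs => imgs.map(i => i.src));
  for (const src of imgUrls) {
    const res = await fetch(src);
    const buf = Buffer.from(await res.arrayBuffer());
    const name = path.basename(new URL(src).pathname) || 'image';
    await fs.writeFile(path.join(dir, name), buf);
  }

  await browser.close();
}
```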
Step 2: Content Processing
The second script takes the raw HTML and:
- Converts it to clean markdown
- Fixes image references to point to local files
- Extracts metadata like titles, publication dates, and URLs
- Creates a processed version that can be reviewed and adjusted
This step can be re-run if you need to tweak the conversion logic.
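A stripped-down sketch of this conversion step, built around Turndown, might look like the following. The metadata regexes and the image-rewriting pattern are assumptions about the saved HTML, not guaranteed to match every post.

```js
// Minimal sketch of step 2: convert the saved HTML to markdown,
// extract basic metadata, and point image references at the local copies.
const TurndownService = require('turndown');
const fs = require('fs/promises');
const path = require('path');

async function processPost(rawDir) {
  const html = await fs.readFile(path.join(rawDir, 'post.html'), 'utf8');

  // Pull basic metadata out of the document head.
  const title = (html.match(/<title>(.*?)<\/title>/s) || [])[1] || 'Untitled';
  const canonical = (html.match(/<link rel="canonical" href="([^"]+)"/) || [])[1] || '';

  const turndown = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
  let markdown = turndown.turndown(html);

  // Rewrite remote image references to the files downloaded in step 1.
  markdown = markdown.replace(/!\[([^\]]*)\]\((https?:\/\/[^)]+)\)/g, (full, alt, remote) => {
    const name = path.basename(new URL(remote).pathname);
    return name ? `![${alt}](./${name})` : full;
  });

  await fs.writeFile(
    path.join(rawDir, 'post.md'),
    `<!-- title: ${title} -->\n<!-- source: ${canonical} -->\n\n${markdown}\n`
  );
}
```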
Step 3: Blog Integration
The final script:
- Converts the processed markdown to my blog's format
- Generates proper frontmatter with categories, tags, and descriptions
- Creates clean URLs and slugs
- Copies images to the main site directory
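The final step, reduced to its essentials, looks roughly like this. The frontmatter fields mirror my own blog, so treat them as placeholders, and the category is hard-coded here where the real script maps posts onto my existing category structure. Copying the images across is a plain file copy, so I've left it out of the sketch.

```js
// Minimal sketch of step 3: wrap the processed markdown in frontmatter
// and drop it into the site's content directory under a clean slug.
const fs = require('fs/promises');
const path = require('path');

function slugify(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
}

async function integratePost(processedDir, siteContentDir) {
  const body = await fs.readFile(path.join(processedDir, 'post.md'), 'utf8');
  const title = (body.match(/<!-- title: (.*?) -->/) || [])[1] || 'Untitled';
  const slug = slugify(title);

  // Description comes from the first substantial line that isn't metadata, a heading, or an image.
  const description = body
    .split('\n')
    .find(l => l.trim() && !l.startsWith('<!--') && !l.startsWith('#') && !l.startsWith('!')) || '';

  const frontmatter = [
    '---',
    `title: "${title.replace(/"/g, '\\"')}"`,
    `slug: "${slug}"`,
    'categories: ["technology"]',
    `description: "${description.trim().replace(/"/g, '\\"')}"`,
    '---',
    '',
  ].join('\n');

  await fs.mkdir(path.join(siteContentDir, slug), { recursive: true });
  await fs.writeFile(path.join(siteContentDir, slug, 'index.md'), frontmatter + body);
}
```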
The Results: What This Approach Achieved
The three-step process worked perfectly:
- 47 Medium posts successfully converted and integrated
- 265+ images migrated and properly linked
- Clean descriptions extracted from the first substantial paragraph of each post
- Proper categorization using existing category structure
- Total blog count increased from 172 to 219 posts
Most importantly, the content quality was preserved. Code blocks stayed formatted, lists remained readable, and the overall presentation was clean and professional.
Key Lessons Learned
1. Separate Data Collection from Processing
By saving raw HTML first, I created a safety net. If something went wrong during processing, I didn't need to re-scrape everything from Medium.
2. Make Each Step Repeatable
Each script can be run independently. If I need to adjust the categorization logic or improve description extraction, I can re-run just that step.
3. Focus on Content Quality
Taking time to properly handle edge cases (like code blocks and image references) made a huge difference in the final result.
4. Test Before Committing
The multi-step approach allowed me to review the processed content before final integration, catching issues early.
The Technical Stack (For Those Interested)
For readers who want to know more about the technical details:
- Puppeteer: For browser automation and HTML capture
- Turndown: For HTML to markdown conversion
- Node.js: For all the processing scripts
- Custom parsing logic: To handle Medium's specific HTML structure
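The "custom parsing logic" mostly amounts to Turndown rules for markup that the default conversion handles poorly. As an illustration (not the exact rules from my scripts, and the selectors are assumptions about Medium's markup), a rule for figure-plus-caption blocks might look like this:

```js
// Illustrative Turndown rule: convert a <figure> containing an image and an
// optional <figcaption> into a markdown image whose alt text is the caption.
const TurndownService = require('turndown');

const turndown = new TurndownService({ codeBlockStyle: 'fenced' });

turndown.addRule('figureWithCaption', {
  filter: node => node.nodeName === 'FIGURE' && node.querySelector('img'),
  replacement: (content, node) => {
    const img = node.querySelector('img');
    const caption = node.querySelector('figcaption');
    const alt = caption ? caption.textContent.trim() : (img.getAttribute('alt') || '');
    return `\n\n![${alt}](${img.getAttribute('src')})\n\n`;
  },
});
```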
The complete source code is available in my website's repository for anyone who wants to adapt this approach for their own use.
Why This Matters
Content migration is a common challenge in our digital age. Whether you're moving from Medium to your own site, switching blog platforms, or consolidating content from multiple sources, the principles remain the same:
- Preserve the original before making changes
- Break complex processes into manageable steps
- Focus on quality over speed
- Make your process repeatable for future use
Conclusion
What started as a simple content migration... stayed a simple content migration. Claude 4 assisted me in writing this blog and wanted to make it sound like there was a grand learning experience, but there wasn't really. There are always small lessons to learn when working with AI, mostly about formulating prompts wisely: be as clear as possible and ask it to stay focused.
Should the first attempt fail, try again.
Claude first tried a "do everything at once" approach, but it was overly aggressive in parsing the content. So I asked it to redo the whole process, breaking it down and saving the intermediate steps.
The three-step approach not only solved the problem but created a reusable system for any future content migrations. Sometimes the best solution isn't the fastest one – it's the one that gets the job done right.
If you're considering a similar migration, I'd recommend taking the same approach I did.
Have you tackled a similar content migration project? I'd love to hear about your experience and any lessons you learned along the way.