Train A Large Language Model With Custom WordPress Posts
Many large language model (LLM) products, such as OpenAI’s ChatGPT and Anthropic’s Claude, now let users feed their own data into the model. With this feature, you can upload your blog posts, code, or datasets to customize the model’s responses. In this post, you will learn how to extract the content from your WordPress website and feed it to Claude, then have it write or translate new posts in your writing style.
Collect Your Posts
Several plugins can export posts from WordPress in CSV or JSON format. If you don’t want to install a plugin, you can export directly from the database. Log in to the MySQL database, find the table `wp_posts` (the `wp_` prefix may differ if your site uses a custom table prefix), then run the following SQL command to get all published posts in the database.
SELECT *
FROM wp_posts
WHERE post_type = 'post'
  AND post_status = 'publish';
If everything goes well, click the “Export” button in the menu, set the format to CSV, then download the file.
You can also open the admin dashboard, go to Tools > Export, then download the file. However, that file is in XML format (WXR, WordPress eXtended RSS), which is harder to parse than CSV.
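If you do end up with the XML export, the sketch below shows one way to read titles and post bodies with Python's standard library. It is a minimal illustration and not part of the original workflow. `SAMPLE_WXR` is a made-up, heavily trimmed stand-in for a real export, and `read_wxr_items` is a hypothetical helper name. A real export also contains pages and attachments, and this sketch does not filter them out.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the <content:encoded> element that holds the post body.
NS = {'content': 'http://purl.org/rss/1.0/modules/content/'}

# Hypothetical, trimmed-down sample; a real export has many more fields.
SAMPLE_WXR = '''<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello World</title>
      <content:encoded><![CDATA[<p>First post.</p>]]></content:encoded>
    </item>
    <item>
      <title>Second Post</title>
      <content:encoded><![CDATA[<p>More text.</p>]]></content:encoded>
    </item>
  </channel>
</rss>'''


def read_wxr_items(xml_text):
    """Return one dict per <item>, using the same column names as wp_posts."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter('item'):
        rows.append({
            'post_title': item.findtext('title', default=''),
            'post_content': item.findtext(
                'content:encoded', default='', namespaces=NS),
        })
    return rows


for row in read_wxr_items(SAMPLE_WXR):
    print(row['post_title'], '->', row['post_content'])
```

The keys mirror the `post_title` and `post_content` columns of the CSV export, so the rows could be turned into a DataFrame and cleaned in the same way.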
Clean WordPress Comments And HTML Tags
Once the CSV file is downloaded, the next step is to clean the content. Here are examples of before and after cleaning.
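To give a rough idea of the before and after, the sketch below cleans a small made-up snippet with the standard library only. `RAW_POST` is a hypothetical sample of block-editor markup, and `TextOnly` and `preview_clean` are illustrative names; this is a simplified preview, not the full cleaning routine.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample of raw post_content as stored by the block editor.
RAW_POST = (
    '<!-- wp:heading -->\n'
    '<h2>Collect Your Posts</h2>\n'
    '<!-- /wp:heading -->\n'
    '<!-- wp:paragraph -->\n'
    '<p>Export the <strong>published</strong> posts first.</p>\n'
    '<!-- /wp:paragraph -->'
)


class TextOnly(HTMLParser):
    """Collect non-empty text nodes and ignore the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def preview_clean(raw):
    # Drop block delimiters such as <!-- wp:paragraph --> and <!-- /wp:paragraph -->
    without_comments = re.sub(r'<!--\s*/?wp:.*?-->', '', raw, flags=re.DOTALL)
    parser = TextOnly()
    parser.feed(without_comments)
    parser.close()
    return ' '.join(parser.parts)


print('Before:', RAW_POST, sep='\n')
print('After:', preview_clean(RAW_POST), sep='\n')
```

The "before" text is full of block comments and tags that carry no information about your writing style; the "after" text keeps only the words.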
You can load the CSV file as a pandas DataFrame, then extract the content using the following code. A sample Jupyter Notebook is available on my GitHub.
import re
import pandas as pd
from bs4 import BeautifulSoup


def extract_wordpress_content(content):
    # Remove WordPress block comments,
    # e.g. <!-- wp:paragraph --> and <!-- /wp:paragraph -->
    content_without_comments = re.sub(r'<!-- wp:.*?-->', '', content)
    content_without_comments = re.sub(
        r'<!-- /wp:.*?-->', '', content_without_comments)
    # Parse the HTML
    soup = BeautifulSoup(content_without_comments, 'html.parser')
    # Extract text from paragraphs and headings
    extracted_text = []
    for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        extracted_text.append(element.get_text())
    # Extract code blocks
    for code_block in soup.find_all('pre', class_='wp-block-code'):
        extracted_text.append(f"Code:\n{code_block.get_text()}")
    # Join the extracted text
    full_text = ' '.join(extracted_text)
    return full_text


def extract_post_content(df=None, file=None, limit=None, output_path=None):
    '''
    Extract post content from a CSV file.
    '''
    assert df is not None or file is not None, 'Either df or file must be provided'
    if file is not None:
        df = pd.read_csv(file)
    # Newest posts first
    df = (df
          .sort_values(by='post_date', ascending=False)
          .reset_index(drop=True))
    print('Total posts:', len(df))
    if limit:
        # Copy so the column assignment below does not touch a slice
        df = df.head(limit).copy()
    # Convert to string to avoid error caused by NaN
    df['post_title'] = df['post_title'].astype(str)
    print(df.head())
    post_content = ''
    collected_posts = 0
    for i in range(len(df)):
        # Continue when post_content is NaN
        if pd.isna(df.loc[i, 'post_content']):
            continue
        extracted_content = extract_wordpress_content(
            df.loc[i, 'post_content'])
        # Newlines keep the title, body, and separator on separate lines
        post_content += df.loc[i, 'post_title'] + '\n'
        post_content += extracted_content + '\n'
        post_content += '=' * 20 + '\n'
        collected_posts += 1
    print(f'Successfully collected {collected_posts} posts')
    if output_path:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(post_content)
    return post_content
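If the combined text turns out to be too large for a single upload, one option is to split it into several files on the `'=' * 20` separator written between posts. The sketch below is only an illustration: `split_posts_into_chunks` is a hypothetical helper, and `max_chars` is an arbitrary budget rather than an actual limit of any service. It also assumes no post body contains a run of 20 equals signs.

```python
SEPARATOR = '=' * 20  # same delimiter written between posts by the code above


def split_posts_into_chunks(text, max_chars=200_000):
    """Group whole posts into chunks of at most max_chars where possible.

    A single post longer than max_chars still becomes its own chunk.
    """
    posts = [p for p in text.split(SEPARATOR) if p.strip()]
    chunks, current = [], ''
    for post in posts:
        piece = post + SEPARATOR
        # Start a new chunk when adding this post would exceed the budget
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ''
        current += piece
    if current:
        chunks.append(current)
    return chunks


# Tiny demonstration with three fake posts of 50 characters each
demo_text = ('A' * 50 + SEPARATOR) + ('B' * 50 + SEPARATOR) + ('C' * 50 + SEPARATOR)
demo_chunks = split_posts_into_chunks(demo_text, max_chars=150)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

Each chunk can then be written to its own text file and uploaded separately, so no post is cut in half.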
Create A New Project In Claude And Upload Documents
Log in to your Claude account and click the “Projects” button in the left panel, then click “Create New Project”.
On the project page, you can upload the cleaned post files.
Everything is ready now! You can start a new conversation to create new content. For example, you can ask Claude:
“Write a new blog post about applying machine learning to stock market prediction in my writing style. Use similar language, tone, and structure to my existing posts. Include typical elements like code samples, headings, and my common phrases.”
Or you can ask Claude to translate your post into another language, like:
“Translate the blog post about machine learning in the stock market into Traditional Chinese. The translation should maintain the original meaning and style of the post.”
I hope this post is helpful and improves your content creation workflow!