SEO IRL
Pipeline
Step 1: Import & Embed
Step 2: Deduplicate
Step 3: Cluster
Step 4: Denoise
Step 5: Generate Topics
Step 6: Deduplicate Topics
Step 7: Label Parents
Visualizers: Treemap View, Radial Tree
Generator: Content Brief
Presented by
Amin Foroutan
Step 6 of 7

Topic Deduplication

Remove duplicate topic clusters using semantic similarity and AI-powered comparison

Configuration

No Step 5 topic files found. Looking for files like: "Step5 - filename - topics - date.csv"
Complete Step 5 first, or check the browser console for file list.

Similarity threshold: topic pairs whose embedding similarity is above this threshold are passed on for AI comparison.

How Do We Remove Duplicate Topics?

Using embeddings and AI comparison, we identify and eliminate duplicate topic clusters, keeping only the highest-quality version of each topic.

C1: EV Charging Infrastructure
12 keywords | 8,500
ev charging stations, electric vehicle chargers, charging network, +9

C2: Electric Vehicle Charging Stations
10 keywords | 7,200
ev charger installation, charging station locations, public charging points, +7

C3: EV Battery Technology
15 keywords | 9,800
electric vehicle battery life, ev battery capacity, battery replacement cost, +12

C4: Electric Car Battery Performance
11 keywords | 6,900
ev battery range, battery degradation, battery warranty, +8

C5: Tesla Model Comparison
18 keywords | 12,400
tesla model 3 vs model y, best tesla model, tesla price comparison, +15

Step 6: Backend Implementation

Topic deduplication using semantic similarity and GPT-powered comparison

Detailed Processing Pipeline

1. Extract Cluster Labels
Load Step 5 output → Extract unique cluster labels from all topics
cluster_label.unique()
2. Generate OpenAI Embeddings
Call OpenAI API to generate semantic embeddings for each label
embeddings.create()
3. Calculate Cosine Similarity
Normalize embeddings → Compute similarity matrix → Find pairs above threshold
cosine_similarity()
4. Group Similar Clusters (DFS)
Build graph from similar pairs → Find connected components using DFS
dfs(node, group)
5. GPT-Powered Comparison
For each group, use GPT to compare clusters and select the best one
chat.completions.create()
6. Remove Duplicates & Save
Filter out duplicate clusters → Save deduplicated CSV & removed clusters report
to_csv()
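
Steps 3 and 4 above can be illustrated with a minimal sketch. This is not the tool's actual backend code: it uses plain Python instead of a numerical library, the helper names `find_similar_pairs` and `group_similar` are hypothetical, and the three-dimensional vectors are invented stand-ins for the embeddings that step 2 would return. Only the `dfs(node, group)` shape follows the signature shown in the pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_pairs(embeddings, threshold):
    """Return every pair of labels whose similarity is above the threshold."""
    labels = list(embeddings)
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            score = cosine_similarity(embeddings[labels[i]], embeddings[labels[j]])
            if score > threshold:
                pairs.append((labels[i], labels[j]))
    return pairs


def group_similar(labels, pairs):
    """Group labels into connected components of the similarity graph using DFS."""
    graph = {label: set() for label in labels}
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    visited = set()

    def dfs(node, group):
        visited.add(node)
        group.append(node)
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                dfs(neighbor, group)

    groups = []
    for label in labels:
        # Labels with no similar partner are not duplicate candidates.
        if label not in visited and graph[label]:
            group = []
            dfs(label, group)
            groups.append(group)
    return groups


# Invented toy vectors; real ones would come from the embeddings API.
toy_embeddings = {
    "EV Charging Infrastructure": [0.9, 0.1, 0.0],
    "Electric Vehicle Charging Stations": [0.85, 0.15, 0.05],
    "EV Battery Technology": [0.1, 0.9, 0.0],
    "Electric Car Battery Performance": [0.15, 0.85, 0.1],
    "Tesla Model Comparison": [0.0, 0.1, 0.95],
}

pairs = find_similar_pairs(toy_embeddings, threshold=0.9)
groups = group_similar(list(toy_embeddings), pairs)
for group in groups:
    print(group)
```

With these toy vectors the two charging topics form one group and the two battery topics form another, while the Tesla topic has no partner above the threshold and is left out of the comparison step. Because grouping uses connected components, two labels can land in the same group through a chain of similar neighbors even if their direct similarity is below the threshold.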

Key Features

  • OpenAI embeddings for semantic similarity
  • GPT-powered cluster comparison
  • DFS-based connected component grouping
  • Tournament-style selection for large groups
  • Complete API logs for transparency
  • Configurable similarity threshold
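
The page does not describe how tournament-style selection works, so the following is only one plausible reading: a large group is split into small batches, one winner is picked per batch, and the winners compete again until a single cluster remains. The function name `tournament_select`, the `batch_size` parameter, and the `pick_best` callback are all hypothetical. In the real pipeline the per-batch decision would presumably be a GPT comparison; here a keyword-count rule stands in for it so the sketch runs offline, and the last three labels and their counts are invented for illustration.

```python
from typing import Callable, List, Sequence


def tournament_select(
    candidates: Sequence[str],
    pick_best: Callable[[List[str]], str],
    batch_size: int = 3,
) -> str:
    """Reduce a group to one winner by comparing small batches round by round."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")

    current = list(candidates)
    while len(current) > 1:
        winners = []
        for start in range(0, len(current), batch_size):
            batch = current[start:start + batch_size]
            # A batch of one advances without a comparison.
            winners.append(batch[0] if len(batch) == 1 else pick_best(batch))
        current = winners
    return current[0]


# Hypothetical group of near-duplicate clusters with keyword counts.
keyword_counts = {
    "EV Charging Infrastructure": 12,
    "Electric Vehicle Charging Stations": 10,
    "EV Charger Networks": 7,
    "Public EV Charging": 9,
    "EV Charging Points": 5,
}


def pick_by_keyword_count(batch: List[str]) -> str:
    """Placeholder for the GPT comparison: keep the cluster with the most keywords."""
    return max(batch, key=lambda label: keyword_counts[label])


winner = tournament_select(list(keyword_counts), pick_by_keyword_count, batch_size=2)
print(winner)
```

Batching keeps each comparison small, which matters when a single prompt cannot reasonably hold every cluster in a large group. The trade-off is that the number of comparisons grows with the number of rounds.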

Response Model

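from pydantic import BaseModel


class TopicDeduplicationResponse(BaseModel):
    # File names for the input, the deduplicated output, and the removed-clusters report
    input_filename: str
    output_filename: str
    removed_duplicates_filename: str
    # Download links for the same three files
    input_download_url: str
    output_download_url: str
    removed_duplicates_download_url: str
    # Cluster counts before and after deduplication
    original_clusters: int
    final_clusters: int
    removed_clusters: int
    # Keyword counts before and after deduplication
    original_keywords: int
    final_keywords: int
    removed_keywords: int
    # Number of groups of similar clusters that were found
    similar_groups_found: int
    created_at: str
    # Logs of the API calls made during the run
    api_logs: list[dict]
    preview_clusters: list[dict]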

API Endpoint

POST /keywords/deduplicate-topics
Deduplicate similar topic clusters using semantic embeddings and GPT comparison