Pipeline

Visualizers

grid_viewTreemap View hubRadial Tree

Generator

descriptionContent Brief

Presented by

Amin Foroutan

content_copyStep 6 of 7

chevron_right

Topic Deduplication

Remove duplicate topic clusters using semantic similarity and AI-powered comparison

settingsConfiguration

Select Topic File (Step 5 Output)

No Step 5 topic files found. Looking for files like: "Step5 - filename - topics - date.csv"
Complete Step 5 first, or check the browser console for file list.

Similarity Threshold: 96%

Topics with embedding similarity above this threshold will be compared

How Do We Remove Duplicate Topics?

Using embeddings and AI comparison, we identify and eliminate duplicate topic clusters, keeping only the highest-quality version of each topic.

EV Charging Infrastructure

article12 keywordsinsights8,500

ev charging stationselectric vehicle chargerscharging network+9

Electric Vehicle Charging Stations

article10 keywordsinsights7,200

ev charger installationcharging station locationspublic charging points+7

EV Battery Technology

article15 keywordsinsights9,800

electric vehicle battery lifeev battery capacitybattery replacement cost+12

Electric Car Battery Performance

article11 keywordsinsights6,900

ev battery rangebattery degradationbattery warranty+8

Tesla Model Comparison

article18 keywordsinsights12,400

tesla model 3 vs model ybest tesla modeltesla price comparison+15

codeStep 6: Backend Implementation

Topic deduplication using semantic similarity and GPT-powered comparison

account_treeDetailed Processing Pipeline

label

1. Extract Cluster Labels

Load Step 5 output → Extract unique cluster labels from all topics

cluster_label.unique()

psychology

2. Generate OpenAI Embeddings

Call OpenAI API to generate semantic embeddings for each label

embeddings.create()

analytics

3. Calculate Cosine Similarity

Normalize embeddings → Compute similarity matrix → Find pairs above threshold

cosine_similarity()

hub

4. Group Similar Clusters (DFS)

Build graph from similar pairs → Find connected components using DFS

dfs(node, group)

compare

5. GPT-Powered Comparison

For each group, use GPT to compare clusters and select the best one

chat.completions.create()

save

6. Remove Duplicates & Save

Filter out duplicate clusters → Save deduplicated CSV & removed clusters report

to_csv()

starKey Features

psychologyOpenAI embeddings for semantic similarity
compare_arrowsGPT-powered cluster comparison
hubDFS-based connected component grouping
sports_kabaddiTournament-style selection for large groups
historyComplete API logs for transparency
tuneConfigurable similarity threshold

datasetResponse Model

class TopicDeduplicationResponse(BaseModel):
    input_filename: str
    output_filename: str
    removed_duplicates_filename: str
    input_download_url: str
    output_download_url: str
    removed_duplicates_download_url: str
    original_clusters: int
    final_clusters: int
    removed_clusters: int
    original_keywords: int
    final_keywords: int
    removed_keywords: int
    similar_groups_found: int
    created_at: str
    api_logs: list[dict]
    preview_clusters: list[dict]

apiAPI Endpoint

POST /keywords/deduplicate-topics

Deduplicate similar topic clusters using semantic embeddings and GPT comparison