
Building Applications with Vector Databases

Below are my study notes for this short course: https://www.deeplearning.ai/short-courses/building-applications-vector-databases/

Learn to create six exciting applications of vector databases and implement them using Pinecone.

Build a hybrid search app that combines both text and images for improved multimodal search results.

Learn how to build an app that measures and ranks facial similarity.

Lesson 3 - Recommender Systems

Import the Needed Packages

import warnings
warnings.filterwarnings('ignore')
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from tqdm.auto import tqdm, trange
from DLAIUtils import Utils

import pandas as pd
import time
import os

Get the API keys

utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()
OPENAI_API_KEY = utils.get_openai_api_key()
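If you are running these notes outside the course environment, the DLAIUtils helper may not be available. A minimal stand-in (an assumption about what the helper does) is to read the keys from environment variables:

# Hypothetical stand-in for DLAIUtils: assume the keys are exposed as environment variables.
import os

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]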

Load the Dataset

with open('./data/all-the-news-3.csv', 'r') as f:
    header = f.readline()
    print(header)

Output

date,year,month,day,author,title,article,url,section,publication

Next, load the first 99 rows into a DataFrame and preview them:

df = pd.read_csv('./data/all-the-news-3.csv', nrows=99)
df.head()

Output: the first five rows of the DataFrame (not reproduced here)

Setup Pinecone

openai_client = OpenAI(api_key=OPENAI_API_KEY)
util = Utils()
INDEX_NAME = utils.create_dlai_index_name('dl-ai')
pinecone = Pinecone(api_key=PINECONE_API_KEY)

if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
  pinecone.delete_index(INDEX_NAME)

pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine',
  spec=ServerlessSpec(cloud='aws', region='us-west-2'))

index = pinecone.Index(INDEX_NAME)
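Serverless index creation can take a few seconds. An optional sketch (assuming the v3 Python client's describe_index status fields) that waits until the index is ready before upserting:

# Optional: poll until the newly created index reports ready.
while not pinecone.describe_index(INDEX_NAME).status['ready']:
    time.sleep(1)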

1. Create Embeddings of the News Titles

Create embeddings from the article titles.

The helper below converts a list of input texts into embedding vectors:

def get_embeddings(articles, model="text-embedding-ada-002"):
   return openai_client.embeddings.create(input = articles, model=model)
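A quick sanity check of the helper; the 1536-dimensional output of text-embedding-ada-002 is why the index above is created with dimension=1536 (the string "hello world" is just an arbitrary example):

# Sanity check: embed one string and confirm the vector length matches the index dimension.
sample = get_embeddings(["hello world"])
print(len(sample.data[0].embedding))  # 1536 for text-embedding-ada-002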

Generate embeddings for the titles, package each one as a record (id, vector, title metadata), and upsert them into the Pinecone index:

CHUNK_SIZE=400
TOTAL_ROWS=10000
progress_bar = tqdm(total=TOTAL_ROWS)
chunks = pd.read_csv('./data/all-the-news-3.csv', chunksize=CHUNK_SIZE, 
                     nrows=TOTAL_ROWS)
chunk_num = 0
for chunk in chunks:
    titles = chunk['title'].tolist()
    embeddings = get_embeddings(titles)
    prepped = [{'id':str(chunk_num*CHUNK_SIZE+i), 
                'values':embeddings.data[i].embedding,
                'metadata':{'title':titles[i]},} for i in range(0,len(titles))]
    chunk_num = chunk_num + 1
    if len(prepped) >= 200:
      index.upsert(prepped)
      prepped = []
    progress_bar.update(len(chunk))

The code above produces a total of TOTAL_ROWS / CHUNK_SIZE chunks, i.e. the number of rows read divided by the chunk size. Here, with 10,000 rows and a chunk size of 400, that gives 25 chunks.

What this code does:

  1. Read the CSV file './data/all-the-news-3.csv' in chunks of 400 rows each, up to 10,000 rows in total.

  2. For each chunk, extract the title column and pass the titles to get_embeddings to obtain embedding vectors.

  3. Package each title's embedding together with the title as a record and give it a unique id.

  4. Once at least 200 records have been prepared, upsert them into the index as a single batch.

  5. Update the progress bar after each chunk to show how far processing has gotten.

In short, the code reads the CSV in chunks, converts the title column into embeddings, and upserts those vectors into the Pinecone index referenced by index.
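Because each record's id is str(chunk_num*CHUNK_SIZE + i), i.e. its global row number, you can spot-check an upserted record by id (a small sketch, assuming the v3 client's fetch API):

# Spot-check: fetch the record upserted for the first row (id '0') and print its title metadata.
fetched = index.fetch(ids=['0'])
print(fetched.vectors['0'].metadata['title'])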

index.describe_index_stats()

Output

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10000}},
 'total_vector_count': 10000}

Build the Recommender System

def get_recommendations(pinecone_index, search_term, top_k=10):
  embed = get_embeddings([search_term]).data[0].embedding  # the OpenAI embeddings API
  res = pinecone_index.query(vector=embed, top_k=top_k, include_metadata=True) 
  return res

Here is an explanation of embed = get_embeddings([search_term]).data[0].embedding:

It is equivalent to calling openai_client.embeddings.create(input=[search_term], model=model).data[0].embedding.

Below is an example of what an OpenAI embeddings response can look like.

Note that the embeddings live under the data key. When several texts are embedded at once, data contains one entry per input: data[0] is the first embedding, data[1] the second, and so on.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0011064255,
        -0.0093271292,
        .... (1536 floats total for ada-002)
        -0.0033842222,
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
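The same fields are reached through attribute access on the Python client's response object, e.g. (a minimal sketch; the query string is arbitrary):

# Accessing the fields of the embeddings response shown above.
resp = get_embeddings(["an example news title"])
vector = resp.data[0].embedding            # the list of 1536 floats
print(resp.model, resp.usage.total_tokens, len(vector))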

Get the titles of the recommended articles, here for the query 'obama':

reco = get_recommendations(index, 'obama')
for r in reco.matches:
    print(f'{r.score} : {r.metadata["title"]}')

Output: each line shows the Pinecone similarity score followed by the article title.

These are the news titles retrieved by searching against the title embeddings:

0.84992218 : Barack Obama just stepped off the sidelines to defend Obamacare
0.848674893 : President Obama has a new plan to fight the opioid epidemic
0.848271608 : “Our democracy is at stake”: Obama delivers his first post-presidency campaign speech
0.848052 : Obama: if you were fine with big government until it served black people, rethink your biases
0.845821619 : President Obama: Michelle & I Are Gonna Be Renters
0.844207942 : Obama meets with national security team on Syria, Islamic State
0.843172133 : Vox Sentences: Obama got a warmer welcome in Hiroshima than the Japanese prime minister
0.84271574 : Watch President Obama dance the tango in Argentina
0.840892255 : Obama and Supreme Court Tag Team on Juvenile Justice Reform
0.839049876 : Clinton, Obama pledge unity behind Trump presidency

2. Create Embeddings of All News Content

Create embeddings from the full article content.

Some Pinecone setup: recreate the index and store a handle to it in articles_index.

if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
  pinecone.delete_index(name=INDEX_NAME)

pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine',
  spec=ServerlessSpec(cloud='aws', region='us-west-2'))
# Create an index object for INDEX_NAME, used to store and retrieve vector data.
articles_index = pinecone.Index(INDEX_NAME) 

The embed helper is defined as follows:

def embed(embeddings, title, prepped, embed_num):
  # Append one record per chunk embedding; every chunk carries its parent article's title as metadata.
  for embedding in embeddings.data:
    prepped.append({'id':str(embed_num), 
                    'values':embedding.embedding, 
                    'metadata':{'title':title}})
    embed_num += 1
    if len(prepped) >= 100:
        articles_index.upsert(prepped)  # flush a batch of 100 records to the index
        prepped.clear()
  return embed_num

(Note: news_data_rows_num = 100): In this lab, we've initially set news_data_rows_num to 100 for speedier results, allowing you to observe the outcomes faster. Once you've done an initial run, consider increasing this value to 200, 400, 700, and 1000. You'll likely notice better and more relevant results.

news_data_rows_num = 100

embed_num = 0 #keep track of embedding number for 'id'
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, 
    chunk_overlap=20) # how to chunk each article
prepped = []
df = pd.read_csv('./data/all-the-news-3.csv', nrows=news_data_rows_num)
articles_list = df['article'].tolist()
titles_list = df['title'].tolist()

# Process each article
for i in range(0, len(articles_list)):
    print(".",end="")
    art = articles_list[i]
    title = titles_list[i]
    if art is not None and isinstance(art, str):
      texts = text_splitter.split_text(art)  # split the article body into small chunks
      embeddings = get_embeddings(texts)     # embed each chunk
      embed_num = embed(embeddings, title, prepped, embed_num)  # upsert into the Pinecone index

# Flush any leftover records (fewer than 100) so every chunk reaches the index
if len(prepped) > 0:
    articles_index.upsert(prepped)

Explanation of text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20):

  • chunk_size=400: the maximum size of each text chunk, here 400 characters.

  • chunk_overlap=20: the amount of overlap between adjacent chunks, here 20 characters. Neighbouring chunks share some text so that context is not lost at chunk boundaries.

This line therefore creates a text splitter that breaks each article into smaller chunks for downstream processing, such as converting them into embeddings.
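A tiny illustration of the splitter's behaviour on a long string (demo_splitter and demo_chunks are just illustrative names):

# Illustration: split a ~1,500-character string and inspect the resulting chunk sizes.
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20)
demo_chunks = demo_splitter.split_text("word " * 300)
print(len(demo_chunks), [len(c) for c in demo_chunks])  # a few chunks, each at most 400 characters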

Inspect the resulting index statistics

articles_index.describe_index_stats()

Output

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}

Build the Recommender System

reco = get_recommendations(articles_index, 'obama', top_k=100)
seen = {}
for r in reco.matches:
    title = r.metadata['title']
    if title not in seen:
        print(f'{r.score} : {title}')
        seen[title] = '.'

Output: titles of related articles retrieved by searching against the article-content embeddings

0.821158946 : Why Obama is vetting Nevada's Republican governor for the Supreme Court
0.818882763 : U.S. lawmakers ask for disclosure of number of Americans under surveillance
0.812377512 : NYPD Honcho Insulted by 'Hamilton' Star Lin-Manuel Miranda Celebrating Obama's Controversial Prisoner Release
0.806862772 : Why Jews Are Getting Themselves Arrested at ICE Centers Around the Country
0.806241512 : Trump keeping options open as Republican feud rages
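As an aside, the seen dict above is used purely as a membership test; an equivalent sketch using a Python set works just as well:

# Equivalent de-duplication of recommended titles using a set instead of a dict.
seen_titles = set()
for r in reco.matches:
    title = r.metadata['title']
    if title not in seen_titles:
        print(f'{r.score} : {title}')
        seen_titles.add(title)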
