Jieba库

简介：

jieba是优秀的中文分词第三方库
-中文文本需要通过分词获得单个的词语
-jieba是优秀的中文分词第三方库，需要额外安装

Jieba分词原理：

jieba分词依靠中文词库
-利用一个中文词库，确定汉字之间的关联概率
-汉字间概率大的组成词组，形成分词结果

Jieba分词的三种模式：

-精确模式：把文本精确的切分开，不存在冗余
-全模式：把文本中所有可能的次元都扫描出来，有冗余
-搜索引擎模式：在精确模式基础上，对场次再次切分

函数及方法	描述
jieba.lcut(s)	精确某事，返回一个列表类型的分词结果
jieba.lcut(s,cut_all = True)	全模式，返回一个列表类型的分词结果，存在冗余
jieba.lcut_for_searh(s)	搜索引擎模式，返回一个列表类型的分词结果，存在冗余
jieba.add_word(w)	向分词词典增加新词w

实例1:

#中文文本词频统计--三国演义人物出场统计
import jieba as j
import os
def getText():  #打开文本
    txt = open("D:\Python\素材\THREEKINGDOMS.txt","r",encoding="utf-8").read()
    words = j.lcut(txt) #调用jieba库中的lcut函数进行分词
    return words        #返回一个列表，包含分词后的数据
words = getText()
excludes = {"将军","却说","荆州","二人","不可","不能","如此","军士","主公","如何","商议","军马","左右","引兵","次日","大喜","天下","东吴"\
    ,"于是","今日","不敢","魏兵","陛下","一人","都督","人马","不知"}
counts = dict() #新建一个字典
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1     #对字典中键对应的值进行赋值
for word in excludes:
    del(counts[word])
items = list(counts.items())                        #将字典中的键值对放入序列items
items.sort(key=lambda x: x[1],reverse=True) #根据值进行排序，因为默认从小到大，所以在排序后反转
print("三国演义人物出场统计TOP10:")
for i in range(10):
    word,count = items[i]
    print("{:<10}{:>5}".format(word,count))
os.system('pause')

实例2：

#英文文本词频统计--哈姆雷特
import os
def getText():
    txt = open("D:\Python\素材\HAMLET.txt","r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,_-./:;<=>?@[\\]^_{|}~':
        txt = txt.replace(ch," ")
    return txt
hamletTxt = getText()
words = hamletTxt.split()
counts = dict()
for word in words:
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1],reverse=True)
print("哈姆雷特词频统计:")
for i in range(10):
    word,count = items[i]
    print("{:<10}{:>5}".format(word,count))
os.system('pause')

I'm so cute. Please give me money.

本文作者：柠檬
本文链接：http://yoursite.com/2019/05/17/Python-Jieba%E5%BA%93/
版权声明：本博客所有文章除特别声明外，均默认采用许可协议。