

A Brief Introduction to NLP with Python

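This article is a beginner's tutorial on natural language processing (NLP) in Python using the NLTK library. NLTK is Python's natural language toolkit and one of the most widely used Python libraries in the NLP field. The snippets below walk through the basic text preprocessing steps.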

# Import NLTK and open its downloader to install the corpora and models used below
import nltk
nltk.download()

# Fetch the raw HTML of a web page
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)

# Strip the HTML tags with BeautifulSoup (requires the html5lib package)
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)

# Split the cleaned text into tokens on whitespace
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = text.split()
print(tokens)

# Count how often each token occurs
from bs4 import BeautifulSoup
import urllib.request
import nltk

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = text.split()
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
# Plot the 20 most frequent tokens (requires matplotlib)
freq.plot(20, cumulative=False)
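
The counting step can also be tried offline. The following is a minimal sketch, assuming a short sample string in place of the fetched page text (the sample is made up for illustration). It uses `collections.Counter` from the standard library, which `nltk.FreqDist` is built on, so `most_common` works the same way on a `FreqDist`.

```python
from collections import Counter

# Sample text standing in for the page text (an assumption, so the sketch runs offline).
text = "the cat sat on the mat and the dog sat too"

tokens = text.split()
freq = Counter(tokens)

# most_common(n) returns (token, count) pairs, highest counts first.
for token, count in freq.most_common(2):
    print(token + ':' + str(count))
```

Sorting by count this way is usually easier to read than iterating over `freq.items()`, which yields tokens in the order they were first seen.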

# List NLTK's English stop words
from nltk.corpus import stopwords
stopwords.words('english')

# Keep only the tokens that are not stop words
clean_tokens = list()
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)

# Complete example: token frequencies after removing stop words
from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = text.split()
clean_tokens = list()
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
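
One detail of the loop above is worth a sketch: NLTK's English stop word list is all lowercase, so a capitalized token such as "The" passes the filter unless it is lowercased first. The example below is an illustration under assumptions: `STOP_WORDS` is a tiny hypothetical list and `remove_stop_words` is a helper name invented here. In practice the set would be built with `set(stopwords.words('english'))`.

```python
# Hypothetical mini stop word list for illustration only; in practice
# build it with set(stopwords.words('english')).
STOP_WORDS = {"the", "is", "a", "of", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The history of PHP is a story in the making".split()
clean_tokens = remove_stop_words(tokens)
print(clean_tokens)
```

Using a set instead of a list also makes each membership test constant time, which matters when a page has many tokens.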

# Split text into sentences
from nltk.tokenize import sent_tokenize

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

# "Mr." contains a period but does not end a sentence
from nltk.tokenize import sent_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

# Split text into words and punctuation marks
from nltk.tokenize import word_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
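
To see why a tokenizer is preferable to `text.split()`, here is a rough standard-library sketch. `simple_tokenize` is a hypothetical helper, not part of NLTK, and it is much cruder than `word_tokenize`: for instance, it would split "Mr." into "Mr" and ".". It only shows the basic idea of separating punctuation from words.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

mytext = "Hello Adam, how are you?"

# Whitespace splitting leaves punctuation attached to the words.
print(mytext.split())

# The regex keeps each punctuation mark as its own token.
print(simple_tokenize(mytext))
```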

# Tokenize non-English text by passing the language name
from nltk.tokenize import sent_tokenize

mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))

# Look up a word's definition and usage examples in WordNet
from nltk.corpus import wordnet

syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

# Definitions for two more words
from nltk.corpus import wordnet

syn = wordnet.synsets("NLP")
print(syn[0].definition())
syn = wordnet.synsets("Python")
print(syn[0].definition())

# Collect synonyms from every WordNet synset of a word
from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

# Collect antonyms: only some lemmas have them, so check before appending
from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("small"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print(antonyms)

# Stem words with the Porter algorithm
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('working'))
print(stemmer.stem('worked'))

# List the languages supported by SnowballStemmer
from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)

# Output:
# 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish'

# Stem a non-English word; replace "French word" with an actual French word
from nltk.stem import SnowballStemmer
french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem("French word"))

# Stem a word with Porter
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('increases'))

# Lemmatize the same word with WordNet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))

# The default part of speech is noun; pass pos="v" to lemmatize as a verb
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))

# Lemmatize the same word as a verb, noun, adjective and adverb
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))

# Compare stemming and lemmatization on the same words
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('bedroom'))
print(stemmer.stem('jokes'))
print(stemmer.stem('lisa'))
print(stemmer.stem('purple'))
print('----------------------')
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))

Output:
stone
speak
bedroom
joke
lisa
purpl
----------------------
stone
speaking
bedroom
joke
lisa
purple

Stemming does not take context into account, which is why it is faster but less accurate than lemmatization.

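In my opinion, lemmatization is better than stemming. Lemmatizing a word returns a real word: even if it is not the same word, it may be a synonym, and at the very least it is a word that actually exists.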

If you only care about speed and not accuracy, stemming is the option to choose.
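
The trade-off can be illustrated with a toy sketch. Everything in it is hypothetical: `SUFFIXES`, `LEMMAS`, `toy_stem` and `toy_lemmatize` are invented for this example and are far simpler than the Porter algorithm or WordNet. The point is only that a stemmer applies blind string rules, while a lemmatizer looks words up in a vocabulary.

```python
# Toy suffix list and lemma table: hypothetical, for illustration only.
SUFFIXES = ("ing", "es", "s", "e")
LEMMAS = {"stones": "stone", "jokes": "joke", "increases": "increase"}

def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words come back unchanged."""
    return LEMMAS.get(word, word)

for w in ("purple", "increases"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer needs only a few string comparisons per word but can return fragments such as "purpl", while the lookup always returns an entry from its table or the original word.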

All of the steps discussed in this NLP tutorial are text preprocessing only. A later article will use Python NLTK to carry out text analysis.
