Parallel corpus preprocessing

Splitting an English paragraph into sentences with NLTK's pre-trained Punkt tokenizer:

import nltk
import nltk.data

def splitSentence(paragraph):
    # Load the English Punkt sentence tokenizer shipped with nltk_data
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(paragraph)
    return sentences

if __name__ == '__main__':
    print(splitSentence("My name is Tom, I am a boy. I like soccer!"))

https://blog.csdn.net/weixin_43228814/article/details/88898300

https://zhuanlan.zhihu.com/p/98808960
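The Punkt model above only covers the English side. For the Chinese half of a parallel corpus, a minimal sketch (my own illustration, not code from the linked posts) is to split on common sentence-ending punctuation; the helper name split_chinese_sentences and the hard-coded terminator set 。！？； are assumptions, and the zhon package installed later in these notes should provide fuller Chinese punctuation constants.

import re

def split_chinese_sentences(paragraph):
    # Split immediately after each sentence terminator (。！？；).
    # Splitting on a zero-width lookbehind requires Python 3.7+.
    parts = re.split(r'(?<=[。！？；])', paragraph)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

if __name__ == '__main__':
    print(split_chinese_sentences("我叫汤姆，是个男孩。我喜欢足球！"))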

Homebrew Python 3.9 installs its executables under /usr/local/Cellar/python@3.9/3.9.10/bin/. With this Python, pip prints the following deprecation warning:

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621

Commonly used domestic PyPI mirrors:

(1) Aliyun http://mirrors.aliyun.com/pypi/simple/
(2) Douban http://pypi.douban.com/simple/
(3) Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/
(4) University of Science and Technology of China http://pypi.mirrors.ustc.edu.cn/simple/
(5) Huazhong University of Science and Technology http://pypi.hustunique.com/

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple zhon

pip config set global.index-url http://mirrors.aliyun.com/pypi/simple/

Reading and writing large files
https://zhuanlan.zhihu.com/p/138015908
https://zhuanlan.zhihu.com/p/41095700
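The gist of the two links above, as a minimal sketch (my own code, not theirs; large_corpus.txt is a placeholder path): iterate over the file line by line, or read fixed-size chunks, so the whole file never has to fit in memory.

def count_lines(path):
    # The file object is a lazy iterator: only one line is in memory at a time.
    n = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            n += 1
    return n

def read_in_chunks(path, chunk_size=1024 * 1024):
    # Yield fixed-size binary chunks; useful when lines may be very long.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

if __name__ == '__main__':
    print(count_lines('large_corpus.txt'))  # placeholder file name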

https://blog.csdn.net/routing666/article/details/113126201