Parallel corpus preprocessing

Splitting an English paragraph into sentences with NLTK's pre-trained Punkt tokenizer:

import nltk
import nltk.data

def splitSentence(paragraph):
    # Load the English Punkt sentence tokenizer shipped with nltk_data
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(paragraph)
    return sentences

if __name__ == '__main__':
    print(splitSentence("My name is Tom, I am a boy. I like soccer!"))

https://blog.csdn.net/weixin_43228814/article/details/88898300

https://zhuanlan.zhihu.com/p/98808960
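The Punkt model above only covers the English side. For the Chinese half of a parallel corpus, a minimal sketch (my own illustration, not code from the linked posts) is to split on common sentence-ending punctuation; the helper name split_chinese_sentences and the hard-coded terminator set 。！？； are assumptions, and the zhon package installed later in these notes should provide fuller Chinese punctuation constants.

import re

def split_chinese_sentences(paragraph):
    # Split immediately after each sentence terminator (。！？；).
    # Splitting on a zero-width lookbehind requires Python 3.7+.
    parts = re.split(r'(?<=[。！？；])', paragraph)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

if __name__ == '__main__':
    print(split_chinese_sentences("我叫汤姆，是个男孩。我喜欢足球！"))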

Homebrew Python 3.9 installs its executables under /usr/local/Cellar/python@3.9/3.9.10/bin/. With this Python, pip prints the following deprecation warning:

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621

Commonly used domestic PyPI mirrors:

(1) Aliyun http://mirrors.aliyun.com/pypi/simple/
(2) Douban http://pypi.douban.com/simple/
(3) Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/
(4) University of Science and Technology of China http://pypi.mirrors.ustc.edu.cn/simple/
(5) Huazhong University of Science and Technology http://pypi.hustunique.com/

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple zhon

pip config set global.index-url http://mirrors.aliyun.com/pypi/simple/

Reading and writing large files
https://zhuanlan.zhihu.com/p/138015908
https://zhuanlan.zhihu.com/p/41095700
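The gist of the two links above, as a minimal sketch (my own code, not theirs; large_corpus.txt is a placeholder path): iterate over the file line by line, or read fixed-size chunks, so the whole file never has to fit in memory.

def count_lines(path):
    # The file object is a lazy iterator: only one line is in memory at a time.
    n = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            n += 1
    return n

def read_in_chunks(path, chunk_size=1024 * 1024):
    # Yield fixed-size binary chunks; useful when lines may be very long.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

if __name__ == '__main__':
    print(count_lines('large_corpus.txt'))  # placeholder file name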

https://blog.csdn.net/routing666/article/details/113126201