Summary

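In NLP text processing, large amounts of text often have to be converted from one format to another to meet a model's input requirements. Every time I do such a conversion I end up working it out by trial and error, which costs a lot of time, even though the functions involved are nearly the same each time. To avoid repeating that work, I have collected some common type conversions here so I can reuse them later.
Commonly used functions: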

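split:
	str.split(sep=None, maxsplit=-1)
	split() slices a string at the given separator. If maxsplit is given, at most maxsplit splits are made, so the result has at most maxsplit+1 substrings.
replace:
	str.replace(old, new[, count]): replaces occurrences of the substring old with new. If count is given, only the first count occurrences are replaced.
join:
	"sep".join(iterable): joins a sequence of strings. The elements of a string, tuple, or list are joined with the given separator into a new string.
strip:
	str.strip([chars]): removes the given characters from both ends of the string. With no argument, it removes leading and trailing whitespace (often used to drop the newline left after reading lines from a txt file).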
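
The four functions above can be tried on a small example. This is only a minimal sketch; the sample strings are made up for illustration and do not come from any real dataset.

```python
# Sample strings are hypothetical, chosen only to illustrate each call.
line = "  natural language processing\n"

# strip() with no argument removes leading/trailing whitespace, including "\n"
clean = line.strip()

# split() with no argument splits on runs of whitespace
tokens = clean.split()

# maxsplit=1 performs at most one split, giving at most two pieces
head_tail = clean.split(" ", 1)

# replace(old, new, count) replaces at most `count` occurrences
replaced = "a,b,c".replace(",", ";", 1)

# join() concatenates the items of an iterable with the separator string
joined = "_".join(tokens)

# strip(chars) removes any of the given characters from both ends
unbracketed = "[a, b]".strip("[]")

print(clean, tokens, head_tail, replaced, joined, unbracketed, sep="\n")
```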

1. Form 1: Removing one level of list nesting


all_words2 = []
for sentence in all_words:
    all_words2.append("".join(sentence))
print(all_words2)
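
To make the effect concrete, here is the same idea on a small made-up input, assuming each sentence is wrapped in its own one-element list. If an inner list held several strings, "".join would concatenate them with no separator, so this sketch fits only the one-element case.

```python
# Hypothetical input: each sentence is wrapped in its own one-element list.
all_words = [["the cat sat"], ["on the mat"]]

# "".join(...) on a one-element list returns that element unchanged,
# so the result is a flat list of strings.
all_words2 = ["".join(sentence) for sentence in all_words]
print(all_words2)  # ['the cat sat', 'on the mat']
```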

2. Form 2: Merging the strings in each list into a single string (suitable as input for one-hot or TF-IDF vectorization)

# Convert a list of lists into a list of strings, the format CountVectorizer expects
all_data_str = []
for i in range(len(all_data)):
    sentence= ''
    for j in range(len(all_data[i])): 
        word = all_data[i][j]
        if j>0:
            sentence = sentence+' '+word
        else:
            sentence = sentence + word
    all_data_str.append(sentence)
print(all_data_str[:2])
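
The nested loop above gives the same result as joining each inner list with a single space. A shorter sketch, using a made-up all_data for illustration:

```python
# Hypothetical tokenized corpus: one list of words per sentence.
all_data = [["i", "like", "nlp"], ["text", "processing"]]

# " ".join(words) puts one space between consecutive words,
# with no leading or trailing space.
all_data_str = [" ".join(words) for words in all_data]
print(all_data_str[:2])  # ['i like nlp', 'text processing']
```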

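3. A more complex form: what sometimes shows up when a pd.DataFrame is saved and then read back (list-valued cells come back as plain strings)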

Step 1 of the conversion:

B = A[0:5]  # work on the first five rows as a sample
all_words = []
for i in range(len(B)):
    # Strip the surrounding brackets, drop quotes and commas, then split on newlines
    sentence = B[i].strip("[]").replace("'", "").replace(",", "").split("\n")
    all_words.append(sentence)
print(all_words)

Step 2 of the conversion:

all_words2 = []
for sentence in all_words:
    all_words2.append("".join(sentence))
print(all_words2)
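
As an alternative to stripping brackets and quotes by hand, the standard library's ast.literal_eval can parse such a string back into a real list. This is a different technique from the two steps above, and it assumes each cell is a valid Python list literal such as "['a', 'b']". The sample values in A are made up.

```python
import ast

# Hypothetical cell values, as they might look after a list column is
# written to CSV and read back as plain strings.
A = ["['i', 'like', 'nlp']", "['text', 'processing']"]

# ast.literal_eval safely evaluates a string containing a Python literal
all_words = [ast.literal_eval(cell) for cell in A]

# Join each recovered word list into one space-separated string
all_words2 = [" ".join(words) for words in all_words]
print(all_words)
print(all_words2)
```

Unlike the replace-based version, this keeps commas and quotes that are part of a word, but it raises an error on cells that are not valid literals.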