
How do I split a large file?

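jasson_luck posted on 2011-10-01 20:58, 1082 views
Hi all,
I have a question. I have a txt file with 6 million lines. Each line starts with a word (possibly several words), followed by a series of numbers. I want to save the words and their corresponding numbers separately to another file. How can I do this in Python? Thanks!
The data is in the attachment.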
3 replies
#2
jasson_luck 2011-10-01 20:58
(Attachment, visible to logged-in forum members only.)
#3
jasson_luck 2011-10-10 17:21
Can anyone help?
#4
nm_0011 2012-07-09 14:38

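I wouldn't call myself an expert, I'm still learning and just threw this together. I don't know exactly what the lines look like, so adjust the regular expression to fit your data.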

import os
import re

class jassonjack():
    """
    Copy every line that looks like word(s) followed by a number
    from a big file into a separate "<name>_word_number.txt" file.
    """
    # Placeholder pattern: a letter, some non-digits, then digits up to the
    # end of the line. Adjust it to match the real data.
    pattern_ = re.compile(r'^[a-zA-Z]\D+\d+$')

    def __init__(self, srcfile):
        self.srcfile = srcfile
        parentname, filename = os.path.split(srcfile)
        # splitext keeps the output name different from the input name
        # even when the input does not end in ".txt"
        basename, _ = os.path.splitext(filename)
        self.dstfile = os.path.join(parentname, basename + '_word_number.txt')

    def split(self):
        ret = False

        try:
            with open(self.srcfile, 'r') as srcfd, open(self.dstfile, 'w') as dstfd:
                # iterate over the file object instead of readlines(), so the
                # 6 million lines are never held in memory at once
                for content in srcfd:
                    m = jassonjack.pattern_.match(content)
                    if m:
                        dstfd.write(m.group(0))
                        dstfd.write("\n")
            ret = True
        except OSError as e:
            print("cannot read or write the file:", e)

        return ret


if __name__ == "__main__":
    srcfile = input("input file:")
    instance = jassonjack(srcfile)
    print(instance.split())
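
If the goal is one output file for the words and another for the numbers, something along these lines might work. It is only a sketch under assumptions, since the attachment is not visible: each line is assumed to be one or more space-separated alphabetic words followed by whitespace-separated numbers (digits and decimal points only). The names LINE_RE, split_line and split_file are made up for this example and do not come from the thread.

```python
import re

# Assumed layout: "word [word ...] number [number ...]".
# Group 1 captures the words, group 2 captures the numbers.
LINE_RE = re.compile(r'^([A-Za-z]+(?: [A-Za-z]+)*)\s+(\d[\d.\s]*)$')


def split_line(line):
    """Return (words, numbers) for a matching line, or None otherwise."""
    m = LINE_RE.match(line.strip())
    return (m.group(1), m.group(2)) if m else None


def split_file(src_path, words_path, numbers_path):
    """Stream src_path once, writing words and numbers to separate files.

    Returns the number of lines that matched the assumed layout.
    """
    matched = 0
    with open(src_path, 'r') as src, \
            open(words_path, 'w') as words_out, \
            open(numbers_path, 'w') as numbers_out:
        for line in src:  # the file object yields one line at a time
            parts = split_line(line)
            if parts is None:
                continue  # skip lines that do not fit the assumed layout
            words_out.write(parts[0] + "\n")
            numbers_out.write(parts[1] + "\n")
            matched += 1
    return matched
```

Line N of the words file then corresponds to line N of the numbers file. Lines that do not match are skipped silently, so comparing the returned count with the total line count is a quick way to see whether the pattern needs adjusting.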