C beginner asking for help: how to delete duplicate words from a text (txt) file - 编程论坛

#2

rjsp 2022-03-13 17:53

The question is vague, so please give an example. Say the original text is
Xiao Ming's father walks into the cafe. Someone shouted, "Xiao Ming, your family is here."
What result do you want?

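Does "Ming's" count as containing the word Ming? Does "Ming," count as containing the word Ming?
After a duplicate word is removed, are the spaces and other symbols around it kept?
Should the words in the result be in arbitrary order, in order of first appearance, or in dictionary order?
Roughly how many words does the text contain? With a few dozen or a few hundred, a plain linear scan is simplest; with hundreds of millions or more, a solution that does not use a fast lookup structure such as a balanced tree is not really a solution.
What is the maximum word length? With an upper bound you write it one way; with no bound you write it in a completely different way.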
…………
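
To make the first question concrete, here is a minimal sketch in C of one possible definition: a word is a maximal run of ASCII letters, and everything else is a separator. The helper name `next_word` is an assumption for illustration, not something from the thread. Under this definition "Ming's" splits into Ming and s, and "Ming-chao" into Ming and chao, so both contain Ming.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: finds the next word (a run of letters) at or after
   *pos, copies it into out, and advances *pos past it.
   Returns the number of letters stored, or 0 when no word is left.
   In the default "C" locale isalpha() accepts only A-Z and a-z. */
size_t next_word( const char* text, size_t* pos, char* out, size_t out_size )
{
    assert( text != NULL && pos != NULL && out != NULL && out_size > 0 );

    size_t i = *pos;
    while( text[i] != '\0' && !isalpha((unsigned char)text[i]) )
        ++i;                        /* skip separators */

    size_t len = 0;
    while( isalpha((unsigned char)text[i]) )
    {
        if( len+1 < out_size )      /* letters that do not fit are dropped */
            out[len++] = text[i];
        ++i;
    }
    out[len] = '\0';
    *pos = i;
    return len;
}
```

Called repeatedly on `Xiao Ming's father` with `*pos` starting at 0, it yields Xiao, Ming, s, father and then returns 0. A different definition (for example, treating the apostrophe as part of a word) would change only the two loop conditions.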

#3

加菲不成猫 2022-03-13 21:11

Reply to #2 rjsp

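Taking this example: Xiao Ming's father walks into the cafe. Someone shouted, "Xiao Ming, your father is here." (assuming Ming has already appeared earlier). I think both Ming's and the later Ming contain the word Ming. The result I want after deletion: Xiao father walks into the cafe. Someone shouted, " your is here." (I changed the later "family" to "father".)
After duplicates are removed, the spaces and symbols around the removed word are left untouched, and the remaining words stay in order of first appearance.
The text has about 700 to 800 words, and no word is longer than 20 characters.
I had not thought it through that deeply. I just want to keep the first occurrence of each word in the text, delete the later repeats, and print the result......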
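
With 700 to 800 words of at most 20 characters each, the file is at most a few tens of kilobytes, so one simple option is to read it into memory in one go and then treat it as an ordinary string. A rough sketch, assuming a hypothetical helper named `read_all` (the name and the 4096-byte starting capacity are illustrative choices, not from the thread):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a freshly
   allocated, NUL-terminated buffer. Returns NULL on any failure.
   The caller must free() the result. */
char* read_all( const char* path )
{
    assert( path != NULL );

    FILE* fp = fopen( path, "rb" );
    if( fp == NULL )
        return NULL;

    size_t cap = 4096, len = 0;
    char* buf = malloc( cap );
    while( buf != NULL )
    {
        /* always leave one byte free for the terminating '\0' */
        len += fread( buf+len, 1, cap-1-len, fp );
        if( len < cap-1 )
            break;                  /* short read: end of file or read error */

        cap *= 2;                   /* buffer is full: double it and continue */
        char* bigger = realloc( buf, cap );
        if( bigger == NULL )
            free( buf );
        buf = bigger;
    }

    if( buf != NULL && ferror(fp) )
    {
        free( buf );
        buf = NULL;
    }
    fclose( fp );
    if( buf != NULL )
        buf[len] = '\0';
    return buf;
}
```

The returned string can then be scanned word by word. The output is never longer than the input, so a second buffer of the same size is enough to hold the de-duplicated text.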

#4

rjsp 2022-03-14 12:58

Code:

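/*
If all the files are small, read each one entirely into memory.
If even the largest file to process is not too big, use ordinary file I/O.
If most files are large, turn off stdio buffering and read into memory block by block, or use memory mapping; in that case the boundaries between blocks need care, since a word may straddle two blocks.

If there are few words, a linear scan is enough.
If there is a moderate number of words, use a red-black tree, a hash table, or similar.
If there are very many words, use a trie or similar.

The hardest part is defining a "word": most people are not grammarians and do not know the complete set of English word-splitting rules.
For example, if ming has already appeared, ming's is removed entirely. Should ming-chao then be removed entirely too, or should -chao be kept?
*/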

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

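enum { MAX_WORDS = 1000 };  // capacity of the table of distinct words

struct word
{
    const char* data;
    size_t length;
};

// Returns true if the word was seen before. Otherwise records it and returns false.
// If the table is already full, the word is reported as new but is not recorded.
bool found_inbuf( struct word words[], size_t* words_size, const char* search_data, size_t search_length )
{
    for( size_t i=0; i!=*words_size; ++i )
        if( words[i].length==search_length && memcmp(words[i].data,search_data,search_length)==0 )
            return true;

    if( *words_size == MAX_WORDS )
        return false;
    words[*words_size].data = search_data;
    words[*words_size].length = search_length;
    ++*words_size;
    return false;
}

void foo( const char* restrict s, char* restrict d )
{
    struct word words[MAX_WORDS];
    size_t words_size = 0;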

    size_t idx_s=0, idx_d=0;
    for( ; ; )
    {
        // Scan the non-word part
        size_t idx_t = 0;
        int n = sscanf( s+idx_s, "%*[^A-Za-z]%zn", &idx_t );
        if( n == EOF )
            break;
        // Copy the non-word part to the destination string unchanged
        if( idx_t != 0 )
        {
            memcpy( d+idx_d, s+idx_s, idx_t );
            idx_d += idx_t;
            idx_s += idx_t;
        }

        // Scan the word part
        idx_t = 0;
        n = sscanf( s+idx_s, "%*[A-Za-z]%zn", &idx_t );
        if( n == EOF )
            break;
        // If the word is not a duplicate, copy it to the destination string
        bool bfound = found_inbuf( words, &words_size, s+idx_s, idx_t );
        if( !bfound )
        {
            memcpy( d+idx_d, s+idx_s, idx_t );
            idx_d += idx_t;
        }
        idx_s += idx_t;

        // Does this word have a tail (such as the 's in Ming's)? If so, treat it the same way as the word itself
        for( ; s[idx_s]>' '; ++idx_s )
        {
            if( !bfound )
                d[idx_d++] = s[idx_s];
        }
    }

    d[idx_d] = '\0';
    return;
}

int main( void )
{
    const char* s = "Xiao Ming's story: Xiao Ming's father walks into the cafe. Someone shouted, \"Xiao Ming, your father is here.\"";
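    char* d = malloc( strlen(s)+1 );  // the output is never longer than the input
    if( d == NULL )
        return EXIT_FAILURE;
    foo( s, d );
    puts( s );
    puts( d );
    free( d );
    return 0;
}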

Output

Xiao Ming's story: Xiao Ming's father walks into the cafe. Someone shouted, "Xiao Ming, your father is here."
Xiao Ming's story:   father walks into the cafe. Someone shouted, "  your  is here."
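
The comment at the top of the code suggests a red-black tree or hash table once the number of words becomes moderate. As a hedged sketch of what that could look like, here is a small open-addressing hash set with the same contract as `found_inbuf`. The names `word_set`, `seen_before`, and `TABLE_SIZE`, and the choice of the 32-bit FNV-1a hash, are assumptions made for this example and are not part of the posted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TABLE_SIZE = 2048 };     /* assumed to be well above the number of distinct words */

struct slice { const char* data; size_t length; };

/* Must start zero-initialized: data == NULL marks an empty slot. */
struct word_set { struct slice slots[TABLE_SIZE]; };

/* 32-bit FNV-1a hash over the bytes of the word */
static uint32_t hash_bytes( const char* data, size_t length )
{
    uint32_t h = 2166136261u;
    for( size_t i=0; i!=length; ++i )
    {
        h ^= (unsigned char)data[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns true if the word is already in the set. Otherwise stores a pointer
   to it (no copy is made) and returns false. A completely full table also
   returns false, without storing anything. */
bool seen_before( struct word_set* set, const char* data, size_t length )
{
    assert( set != NULL && data != NULL );

    size_t i = hash_bytes( data, length ) % TABLE_SIZE;
    for( size_t probes=0; probes!=TABLE_SIZE; ++probes, i=(i+1)%TABLE_SIZE )
    {
        struct slice* slot = &set->slots[i];
        if( slot->data == NULL )    /* empty slot: the word is new */
        {
            slot->data = data;
            slot->length = length;
            return false;
        }
        if( slot->length==length && memcmp(slot->data,data,length)==0 )
            return true;
    }
    return false;
}
```

With a set declared as `struct word_set set = {0};`, a call such as `seen_before( &set, s+idx_s, idx_t )` could stand in for the `found_inbuf` call. For the 700 to 800 words mentioned in this thread the linear scan is already adequate; a hash set only starts to pay off at much larger word counts.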

#5

加菲不成猫2022-03-14 17:27

Reply to #4 rjsp

Thanks, I will try to work through it.