Handling spaces in the title lines of text files

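abababa posted on 2024-2-7 15:41
As the title says: I have some novels stored as text files. Take Romance of the Three Kingdoms (《三国演义》) as an example: the files are named 001.txt, 002.txt, and so on. The first line of each file is the chapter title, but the titles are not formatted consistently. Here are three examples: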
第 三 回 议温明董卓叱丁原 馈金珠李肃说吕布
第三十八回 定三分隆中决策 战长江孙氏报仇
第四十八回 宴长江曹操赋诗  锁战船北军用武

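In the titles above, the main inconsistency is the spacing. Call a part like "定三分隆中决策 战长江孙氏报仇" the "main title" and a part like "第三十八回" the "chapter number". I want to achieve the following:
1. Remove every space (full-width or half-width) from the chapter number.
2. Put exactly one full-width space between the chapter number and the main title.
3. Convert any spaces inside the main title to full-width spaces, keeping only one where several occur in a row.

How can I do this?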

kuing posted on 2024-2-10 00:34
What tools do you have on hand? Python, for example?

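abababa (original poster) posted on 2024-2-10 15:41
kuing wrote on 2024-2-10 00:34:
What tools do you have on hand? Python, for example?

I have an Ubuntu virtual machine. Does it come with Python? I can't find a Python icon, and typing python at the command line reports that the command was not found, followed by a hint that "the python3 command comes from the Debian package python3". Typing python3 does start it.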

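kuing posted on 2024-2-10 17:07
abababa wrote on 2024-2-10 15:41:
I have an Ubuntu virtual machine. Does it come with Python? I can't find a Python icon, and typing python at the command line reports that the command was not found ...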
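
import os
import re

FULLWIDTH_SPACE = '\u3000'  # written as an escape so it cannot be confused with ' '

def di_hui(matched):
    # Strip the spaces inside the chapter number ("第…回"),
    # then append one full-width space as the separator.
    text = matched.group()
    return text.replace(FULLWIDTH_SPACE, '') + FULLWIDTH_SPACE

# Walk through every txt file in the given directory
directory = './xiaoshuo/'
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(directory, filename)
        with open(file_path, 'r') as f:
            lines = f.readlines()
        if not lines:
            continue  # empty file, nothing to process
        # Process the first line: half-width spaces become full-width first
        lines[0] = lines[0].replace(' ', FULLWIDTH_SPACE)
        lines[0] = re.sub('^第.*?回', di_hui, lines[0])
        lines[0] = re.sub(FULLWIDTH_SPACE + '+', FULLWIDTH_SPACE, lines[0])
        # Rewrite the file, overwriting the previous contents
        with open(file_path, 'w') as f:
            f.writelines(lines)

Save the code above as test.py (make sure the file is encoded as UTF-8).
In the same directory, create a folder named xiaoshuo and put all the txt files to be processed in it.
Then run test.py (for example with python3 test.py) and it should work.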
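
Since the script overwrites files in place, it can help to try the title logic on plain strings first. The following is a minimal sketch of the same three substitutions with no file access; the function name `normalize_title` and the constant `FULLWIDTH_SPACE` are made up for this illustration and are not part of the script above.

```python
import re

FULLWIDTH_SPACE = "\u3000"  # ideographic (full-width) space


def normalize_title(line: str) -> str:
    """Apply the three substitutions from the script to a single string."""
    # Step 1: turn every half-width space into a full-width one.
    line = line.replace(" ", FULLWIDTH_SPACE)
    # Step 2: strip the spaces inside "第…回" and append one full-width space.
    line = re.sub(
        "^第.*?回",
        lambda m: m.group().replace(FULLWIDTH_SPACE, "") + FULLWIDTH_SPACE,
        line,
    )
    # Step 3: collapse runs of full-width spaces into a single one.
    return re.sub(FULLWIDTH_SPACE + "+", FULLWIDTH_SPACE, line)


samples = [
    "第 三 回 议温明董卓叱丁原 馈金珠李肃说吕布",
    "第三十八回 定三分隆中决策 战长江孙氏报仇",
    # Mixed half-width and full-width spaces, written with an escape.
    "第四十八回\u3000宴长江曹操赋诗 \u3000锁战船北军用武",
]
for s in samples:
    # repr() makes the full-width spaces visible as \u3000.
    print(repr(normalize_title(s)))
```

Printing with `repr()` is a deliberate choice here: a full-width space is hard to tell apart from a half-width one on screen, while the `\u3000` escape is unambiguous.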

Comments

I picked up just enough to write this on the spot, so it may not work ☺️ (posted on 2024-2-10 17:11)

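abababa (original poster) posted on 2024-2-10 19:00
kuing wrote on 2024-2-10 17:07:
Save the code above as test.py (make sure the file is encoded as UTF-8).
In the same directory, create a folder named xiaoshuo and put the tx ...

Thanks, it really works. The only problem is that some titles begin with a space, and those were not converted. Looking at the code, I noticed '^第.*?回' and looked it up: the caret means "start of the line". Would deleting it fix the problem? I made that change and ran the script again, and it worked.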

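A netizen also sent me a version, saying I only need to open a command line in the current folder, paste it in, and run it. I tried it and it works too. Their code is as follows:

for file in ./*.txt; do title=$(sed -n '1p' "$file" | sed -r 's/(.*回\s+)(.*)/\1X\2/' | awk -F 'X' '{gsub(/\s/,"",$1);gsub(/\s+/," ",$2);print $1" "$2}'); sed -i '1s/.*/'"${title}"'/' "$file"; done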
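
For readers who do not know sed and awk, here is a rough Python rendering of what the pipeline does to one title. It is only a sketch under two assumptions: the separator is taken to be a full-width space (U+3000), to match the stated goal, since the one-liner as shown does not make clear which kind of space it inserts; and a title that the sed pattern does not match is returned unchanged, which is a simplification rather than what the shell version does. The function name `one_liner_equivalent` is invented for this example.

```python
import re

FULLWIDTH_SPACE = "\u3000"  # assumed separator, see the note above


def one_liner_equivalent(title: str) -> str:
    # sed stage: split the line after the last "回" that is followed by
    # whitespace (the shell version marks the split point with an "X").
    m = re.match(r"(.*回\s+)(.*)", title)
    if m is None:
        return title  # simplification: leave unmatched titles alone
    head, tail = m.group(1), m.group(2)
    # awk stage, field 1: delete every whitespace character.
    head = re.sub(r"\s", "", head)
    # awk stage, field 2: squeeze each run of whitespace into one separator.
    tail = re.sub(r"\s+", FULLWIDTH_SPACE, tail)
    # awk print: chapter number, one separator, main title.
    return head + FULLWIDTH_SPACE + tail


print(repr(one_liner_equivalent("第 三 回 议温明董卓叱丁原 馈金珠李肃说吕布")))
```

The final `sed -i` in the shell version then writes the result back as line 1 of the file; that step is left out here so the sketch stays free of file access.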

Comments

What kind of code is this? 😳 I can't read it (posted on 2024-2-10 20:58)
I searched around; it seems to be sed, awk, bash and the like (posted on 2024-2-11 13:07)

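kuing posted on 2024-2-10 21:48
abababa wrote on 2024-2-10 19:00:
The only problem is that some titles begin with a space, and those were not converted. Looking at the code, I noticed '^第.*?回' and looked it up: the caret means "start of the line". Would deleting it fix the problem? I made that change and ran the script again, and it worked. ...

If you only delete the ^ from '^第.*?回', the substitution will go through, but the leading spaces will be kept.

You can change it to '^\u3000*第.*?回' (the space before the * has to be a full-width one, U+3000, because by that point the script has already turned every half-width space into a full-width space).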
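
The difference between the three patterns can be checked on a title that starts with a space. This is a small sketch, not the script itself: the helper `fix` is a made-up name that repeats the script's three substitutions with the pattern passed in as a parameter, and the full-width space is written as `\u3000` so that it stays visible.

```python
import re

FW = "\u3000"  # full-width space


def fix(pattern: str, line: str) -> str:
    # Same three steps as the script, with the chapter pattern as a parameter.
    line = line.replace(" ", FW)
    line = re.sub(pattern, lambda m: m.group().replace(FW, "") + FW, line)
    return re.sub(FW + "+", FW, line)


title = " 第 三 回 议温明董卓叱丁原"  # note the leading space

anchored = fix("^第.*?回", title)          # original pattern
unanchored = fix("第.*?回", title)         # caret deleted
leading = fix("^" + FW + "*第.*?回", title)  # caret plus optional leading spaces

for name, result in [("anchored", anchored),
                     ("unanchored", unanchored),
                     ("leading", leading)]:
    print(name, repr(result))
```

With the original pattern nothing inside the chapter number is touched, with the caret deleted the chapter number is cleaned but the leading space survives, and only the third pattern removes the leading space as well.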

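abababa (original poster) posted on 2024-2-11 13:05
kuing wrote on 2024-2-10 21:48:
If you only delete the ^ from '^第.*?回', the substitution will go through, but the leading spaces will be kept.

You can change it to ...

Oh, now I understand. When I saw the title that began with a space, I deleted that space by hand and then ran the code with the ^ removed. Since there was no longer any leading space by then, the result still came out right.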

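hbghlyj posted on 2024-2-12 00:53
abababa wrote on 2024-2-10 11:00:
saying I only need to open a command line in the current folder, paste it in, and run it. I tried it and it works too. Their code is as follows:


Yes, Linux Bash feels much easier to use than the Windows command line (DOS).
For example, I wanted to use GS (Ghostscript) to merge all files whose names match sheet2?B*. I tried for a long time and still could not work out how to pass the arguments in DOS, with things like
for /f "tokens=* delims=" %A in ('dir /b sheet2?B*') do @echo "%A"
which I never managed to understand. In the end Bash solved it quickly:

gswin64c.exe -sDEVICE=pdfwrite -dBATCH -o output.pdf sheet2?B*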
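
The reason the Bash version is so short is that Bash expands the wildcard itself and hands the resulting file names to the program, whereas the Windows command line passes the pattern through unexpanded. As an illustration of which names sheet2?B* selects, here is a sketch using Python's `fnmatch` module, which implements the same shell-style wildcards; the file names in the list are invented for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, invented for illustration only.
names = [
    "sheet21B.pdf",
    "sheet2B.pdf",
    "sheet2aB-extra.pdf",
    "sheet22A.pdf",
    "output.pdf",
]

# "?" stands for exactly one character, "*" for any run of characters.
pattern = "sheet2?B*"
matches = sorted(n for n in names if fnmatchcase(n, pattern))
print(matches)
```

Note that sheet2B.pdf is not selected: the ? must consume one character before the B, so there has to be something between "sheet2" and "B".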

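kuing posted on 2024-2-12 02:06
hbghlyj wrote on 2024-2-12 00:53:
Yes, Linux Bash feels much easier to use than the Windows command line (DOS).
For example, I wanted to use GS to merge all files whose names match sheet2?B*. I tried for a long time ...

No wonder I couldn't run the code from #5 in the Windows command line. So it needs Linux or something like that...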

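hbghlyj posted on 2024-2-12 02:09
kuing wrote on 2024-2-11 18:06:
No wonder I couldn't run the code from #5 in the Windows command line. So it needs Linux or something like that...


Actually, #3 did mention it:
abababa wrote on 2024-2-10 07:41:
I have an Ubuntu virtual machine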

stackoverflow.com/questions/771756/what-is-the-difference-between-cygwin-and-mingw
As a simplification, it's like this:
  • Compile something in Cygwin and you are compiling it for Cygwin.
  • Compile something in MinGW and you are compiling it for Windows.
What is Cygwin?
Cygwin is a compatibility layer that makes it easy to port simple Unix-based applications to Windows, by emulating many of the basic interfaces that Unix-based operating systems provide, such as pipes, Unix-style file and directory access, and so on as documented by the POSIX standards. Cygwin is also bundled with a port of the GNU Compiler Collection and some other tools to the Cygwin environment.
If you have existing source code that uses POSIX interfaces, you may be able to compile it for use with Cygwin after making very few or even no changes, greatly simplifying the process of porting simple IO based Unix code for use on Windows.
Disadvantages of Cygwin
Compiling with Cygwin involves linking your program with the Cygwin run-time environment, which will typically be distributed with your program as the dynamically linked library cygwin1.dll. This library is open source and requires software using it to share a compatible open source license, even if you distribute the dll separately, because the header files and interface are included. This therefore imposes some restrictions on how you can license your code.
What is MinGW?
MinGW is a distribution of the GNU compiler tools for native Windows, including the GNU Compiler Collection, GNU Binutils and GNU Debugger. Also included are header files and libraries allowing development of native Windows applications. This therefore will act as an open source alternative to the Microsoft Visual C++ suite.
It may be possible to use MinGW to compile something that was originally intended for compiling with Microsoft Visual C++ with relatively minor modifications.
By default, code compiled in MinGW's GCC will compile to a native Windows target, including .exe and .dll files, though you could also cross-compile with the right settings, since you are basically using the GNU compiler tools suite.
Even though MinGW includes some header files and interface code allowing your code to interact with the Windows API, as with the regular standard libraries this doesn't impose licensing restrictions on software you have created.
Disadvantages of MinGW
Software compiled for Windows using MinGW has to use Windows' own API for file and IO access. If you are porting a Unix/Linux application to Windows this may mean significant alteration to the code because the POSIX type API can no longer be used.
Other considerations
For any non-trivial software application, such as one that uses a graphical interface, multimedia or accesses devices on the system, you leave the boundary of what Cygwin can do for you and further work will be needed to make your code cross-platform. But, this task can be simplified by using cross-platform toolkits or frameworks that allow coding once and having your code compile successfully for any platform. If you use such a framework from the start, you can not only reduce your headaches when it comes time to port to another platform but you can use the same graphical widgets - windows, menus and controls - across all platforms if you're writing a GUI app, and have them appear native to the user.
For instance, the open source Qt framework is a popular and comprehensive cross-platform development framework, allowing the building of graphical applications that work across operating systems including Windows. There are other such frameworks too. In addition to the large frameworks there are thousands of more specialized software libraries in existence which support multiple platforms allowing you to worry less about writing different code for different platforms.
When you are developing cross-platform software from the start, you would not normally have any reason to use Cygwin. When compiled on Windows, you would usually aim to make your code able to be compiled with either MinGW or Microsoft Visual C/C++, or both. When compiling on Linux/*nix, you'd most often compile it with the GNU compilers and tools directly.

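abababa (original poster) posted on 2024-2-12 15:26
kuing wrote on 2024-2-12 02:06:
No wonder I couldn't run the code from #5 in the Windows command line. So it needs Linux or something like that...

Yes. I haven't installed Python or anything like that, because I wouldn't know how; normally I only use LaTeX. That virtual machine is something a netizen recommended I install a long time ago. I was told back then that it came with a lot of tools, and when I tried it, sure enough, Python was there (python3 in my case).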

hbghlyj posted on 2024-2-12 19:34
abababa wrote on 2024-2-10 07:41:
typing python at the command line reports that the command was not found


How do I make the "python" command run "python3"?
A simple safe way would be to use an alias. Place this into ~/.bashrc or ~/.bash_aliases file:

alias python=python3

悠闲数学娱乐论坛(第3版)

GMT+8, 2025-3-4 16:02
