Topic: DONE: Delete double lines (all but the first) in a text file

korad

  • Participant
  • Joined in 2006
  • Posts: 1
Hello,

Sorry for my poor English.

I am looking for an application that can delete duplicate entries in a large text file. I only found a macro for UltraEdit, but it hangs if the file is larger than 1 MB. I am sure such an app already exists, but I couldn't find it with Google. I could only find other people looking for such a piece of software :) Sometimes I code little things in VBS, but I am an absolute beginner. I know I have to create two further files:

File 1: the master file (already available)
File 2: temporary file
File 3: results file

cut first (not empty) line from file 1 and paste it to file 2
delete all lines in file 1 that are equal to this line
cut line 1 in file 2 and paste it to file 3
etc. etc.
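
The steps above can be sketched in a few lines. This is only an illustration in Python, assuming the whole file fits in memory as a list of lines; the name `dedupe_by_steps` is made up here and does not come from any script in this thread. It follows the described procedure literally, so it rescans the remaining lines once per distinct line.

```python
def dedupe_by_steps(master):
    """Take the first remaining line, drop all of its copies, repeat."""
    # The description only ever picks non-empty lines, so empty ones are skipped.
    master = [line for line in master if line != ""]
    results = []                      # plays the role of "File 3"
    while master:
        current = master[0]           # the line moved to the temporary file
        # Delete every line in the master list that equals the current line.
        master = [line for line in master if line != current]
        results.append(current)       # move it on to the results
    return results
```

Calling `dedupe_by_steps(["b", "a", "", "b", "c", "a"])` keeps one copy of each non-empty line in order of first appearance.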

I would appreciate some help.

Many thanks :)

chrisi
« Last Edit: March 17, 2006, 01:48 PM by brotherS »

jgpaiva

  • Global Moderator
  • Joined in 2006
  • *****
  • Posts: 4,727
    • View Profile
    • Donate to Member
Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #1 on: March 06, 2006, 01:10 PM »
I recently made a script for a similar request.
I'll modify it and post it here so you can test it.
But I have to warn you: it'll be slow, and it'll be limited to a max of 64 MB of text.
Anyway, I'll give it a go.
(It's a script in AHK. If it were written in C, I'm sure it'd be about a million times faster, but I don't remember C too well, and I can't compile it for Windows.)

skrommel

  • Fastest code in the west
  • Developer
  • Joined in 2005
  • Posts: 933
Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #2 on: March 06, 2006, 05:03 PM »
 :-[ I really should start reading the whole posts!

Here's one, but it sorts the file, and it's limited to 1 GB.

Skrommel


;DelDuplicates.ahk
; Removes duplicate lines from a text file
; To run, download and install AutoHotkey from www.autohotkey.com
;Skrommel @2006

infile=C:\Temp\in.txt
outfile=C:\Temp\out.txt

#MaxMem 1024
SetBatchLines,-1
FileRead,file,%infile%
If ErrorLevel=0
{
  Sort,file,U
  FileDelete,%outfile%
  FileAppend,%file%,%outfile%
  file=
}




Try this one!

Skrommel


;DelDouble.ahk
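; Removes double lines from text files
; Note: each line is compared only with the line read just before it,
; so only adjacent repeats are removed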
; To run, download and install AutoHotkey from www.autohotkey.com
;Skrommel @2006

fromfile=C:\Temp\in.txt
tofile=C:\Temp\out.txt

SetBatchLines,-1
FileDelete,%tofile%
prevline=
Loop,Read,%fromfile%
{
  If A_LoopReadLine<>%prevline%
    FileAppend,%A_LoopReadLine%`n,%tofile%
  prevline=%A_LoopReadLine%
}
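
As far as the code shows, the two scripts above behave differently: the first sorts the whole file and keeps one copy of each line, while the second keeps the file order but compares each line only with the one before it. A rough Python sketch of that difference follows. The function names are invented for this illustration, and the sketch does not reproduce AutoHotkey's own sort order or case handling.

```python
def sorted_unique(lines):
    """Like the sorting approach: one copy of each line, original order lost."""
    return sorted(set(lines))


def drop_adjacent_repeats(lines):
    """Like the line-by-line approach: drop a line only if it repeats the previous one."""
    result = []
    prev = None                       # nothing has been read yet
    for line in lines:
        if line != prev:
            result.append(line)
        prev = line
    return result
```

For the input `["b", "a", "a", "b"]` the first function returns `["a", "b"]`, while the second returns `["b", "a", "b"]`, because the two `"b"` lines are not next to each other.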
« Last Edit: March 06, 2006, 06:21 PM by skrommel »

PhilKC

  • Charter Member
  • Joined in 2005
  • Posts: 117
Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #3 on: March 06, 2006, 05:23 PM »
Pseudo code:

using System.Collections;
using System.IO;

class DelDouble
{
    static void Main(string[] args)
    {
        // args[0] = input file, args[1] = output file
        string[] lines = File.ReadAllLines(args[0]);
        ArrayList checker = new ArrayList();
        foreach (string line in lines)
            if (!checker.Contains(line))   // linear search, so O(n^2) overall
                checker.Add(line);
        using (StreamWriter writer = new StreamWriter(args[1]))
            foreach (string line in checker)
                writer.WriteLine(line);
    }
}

That was from memory, so, I have no idea if it would compile... (It's C# :P)

PhilKC
It's not a bug, it's an undocumented and unexplainable feature.

jgpaiva

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #4 on: March 06, 2006, 05:57 PM »
Here's the modified version I mentioned.
Its algorithm is quite good, but AHK is a scripting language, so it takes more time than C, for sure.
It took 1 minute 45 seconds to find the repeated entries in a 9000-line file on my Centrino 2.0 laptop.
Still, it does solve your problem.
It doesn't alter the initial file, but the file it creates doesn't have the repeated entries.
It has a small bug: the progress bar doesn't reflect the real progress. Near the end of the file it's way faster than at the beginning. Just a heads-up, in case you start thinking about giving up at the beginning.
It is supposed to be able to handle 64 MB of plain text, according to the AHK documentation.

Hope it solves your problem.
(btw: the .ahk file needs AutoHotkey to run, and the exe file only accepts a file called "textfile.txt" as input, and only outputs to a file called "out.txt". Both are in the attached compressed file)

.exe version
.ahk version
« Last Edit: May 02, 2006, 04:48 PM by jgpaiva »

TWmailrec

  • Charter Member
  • Joined in 2005
  • Posts: 130
Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #5 on: March 13, 2006, 07:10 PM »
The solution from jgpaiva (RepeatedEntries.ahk) solves a problem I had, but can it be modified to ignore blank lines (CR only, to aid intelligibility)?

jgpaiva

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #6 on: March 14, 2006, 06:32 PM »
Here is a new version that checks for blank lines.
Note: a line that contains only spaces or tabs is considered a blank line. I hope this is what you were asking for.
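
To make the blank-line rule concrete, here is a small Python sketch of how repeated lines could be reported while lines made only of spaces or tabs are left out of the report. It is an assumption about the intent, not a translation of the attached script, and `repeated_lines` is a hypothetical name.

```python
def repeated_lines(lines):
    """Return the non-blank lines that occur more than once, in order of first appearance."""
    counts = {}
    for line in lines:
        # A line with nothing but spaces and tabs (or nothing at all) counts as blank.
        if line.strip(" \t") == "":
            continue
        counts[line] = counts.get(line, 0) + 1
    # Dictionaries keep insertion order (Python 3.7+), so first appearance is preserved.
    return [line for line, n in counts.items() if n > 1]
```

With this rule, a file containing several empty lines no longer reports the empty line as a repeated entry.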

.exe version
.ahk version
« Last Edit: May 02, 2006, 04:48 PM by jgpaiva »

TWmailrec

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #7 on: March 15, 2006, 10:53 PM »
Many thanks to jgpaiva for the new program mod.
The repeated strings msgbox now works well, but the output file
did not copy blank lines.
Is there any way to replicate the blank lines in the output file?
I'm new to the AutoHotkey language & can't cope with loops.

TWmailrec

jgpaiva

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #8 on: March 16, 2006, 06:39 AM »
EhEh TW..
You had some work adapting the script.
There are a few "return"s missing, though.
I didn't get what you meant: was the problem only in the message box?
Did you only want the msgbox fixed, while still keeping the blank lines in the file?

Gerome

  • Charter Honorary Member
  • Joined in 2006
  • Posts: 154
Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #9 on: March 16, 2006, 01:42 PM »
Yo !
Quote from: jgpaiva (Reply #4)
> Here's the modified version I mentioned.
> Its algorithm is quite good, but AHK is a scripting language, so it takes more time than C, for sure.
> It took 1 minute 45 seconds to find the repeated entries in a 9000-line file on my Centrino 2.0 laptop.
> Still, it does solve your problem.
> It doesn't alter the initial file, but the file it creates doesn't have the repeated entries.
> It has a small bug: the progress bar doesn't reflect the real progress. Near the end of the file it's way faster than at the beginning. Just a heads-up, in case you start thinking about giving up at the beginning.
> It is supposed to be able to handle 64 MB of plain text, according to the AHK documentation.
>
> Hope it solves your problem.
> (btw: the .ahk file needs AutoHotkey to run, and the exe file only accepts a file called "textfile.txt" as input, and only outputs to a file called "out.txt". Both are in the attached compressed file)

I took your script's source and copied it onto itself 2520 times: that gave a 3.2 MB text file...
I tested your script under Win2k SP4 with 256 MB of RAM and no other program running, and after 1 hour it had found only 50% of the duplicates...
There were only 168,840 lines... and it took 35 MB of RAM trying to aggregate...
Draw your own conclusions, man...
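
A back-of-the-envelope calculation helps explain the gap between the 9000-line test and this one. The sketch below assumes the script compares every line with every later line (a nested loop), which gives n(n-1)/2 comparisons at most; the real script may skip lines it has already flagged, so treat the figures as an upper bound. `pair_comparisons` is a name invented for this illustration.

```python
def pair_comparisons(n):
    """Number of distinct line pairs among n lines: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# The two file sizes mentioned in the thread.
for n in (9000, 168840):
    print(n, "lines ->", pair_comparisons(n), "pairwise comparisons at most")
```

Because the count grows with the square of the number of lines, a file with roughly 19 times as many lines needs roughly 350 times as many comparisons.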
Yours,
(¯`·._.·[Gerome GUILLEMIN]·._.·´¯)
http://www.fbsl.net [FBSL Author]
http://gedd123.free.fr/FBSLv3.zip [FBSL Help file]
(¯`·._.·[If you need help... just ask]·._.·´¯)

jgpaiva

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #10 on: March 16, 2006, 01:57 PM »
Quote from: jgpaiva
> Its algorithm is quite good, but AHK is a scripting language, so it takes more time than C, for sure.
Can you read?
I do know that it can't handle a big input file.
And I also know that the same script in C would take about 10 seconds to solve that problem.
Implemented with a hash table in C, it would probably take even less.
And I also know how to implement it in C. I even have the code, because I did it for school.
I could even use the solution that PhilKC presented.
If I wanted to do that search in an efficient way, I'd code it in C and use it under Linux.
My problem is that I've never compiled any C program in Windows, and the original post in this thread required something that I thought AHK could solve.
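
The hash-table idea can be sketched without C. The Python version below is an illustration only: it streams the input line by line and uses a `set` (a hash table) to remember the lines already written, so each membership check is constant time on average. The function name and the UTF-8 encoding are assumptions made for this sketch.

```python
def dedupe_file(src_path, dst_path):
    """Copy src to dst, writing each distinct line only the first time it appears."""
    seen = set()                      # hash-based lookup instead of rescanning the file
    # Encoding is an assumption; adjust it to match the actual file.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            key = line.rstrip("\r\n")  # compare lines without their line endings
            if key not in seen:
                seen.add(key)
                dst.write(key + "\n")
```

Memory use grows with the number of distinct lines, since every unique line is kept in the set.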

Quote from: TWmailrec
> Many thanks to jgpaiva for the new program mod.
And it did.
The point here is that no one else presented a better solution. I presented mine.
I did the best I can do in Windows. It sure does run faster than any other executable presented in this thread.

Gerome

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #11 on: March 16, 2006, 02:04 PM »
Hey!

If you did it under Linux, you can compile it the same way under Windows.
Simply compile with GCC and it'll work the same way...
Yours,

TWmailrec

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #12 on: March 16, 2006, 02:31 PM »
To jgpaiva from TW

No, what I meant was the message box now works fine,
but the blank lines are still stripped out of the output file.
Was this supposed to happen?
I was hoping to preserve the blank lines in the output file (repeated or not). I can't see why they are stripped out, but they are.

TW

jgpaiva

Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #13 on: March 16, 2006, 02:33 PM »
Quote from: TWmailrec
> To jgpaiva from TW
>
> No, what I meant was the message box now works fine,
> but the blank lines are still stripped out of the output file.
> Was this supposed to happen?
> I was hoping to preserve the blank lines in the output file (repeated or not). I can't see why they are stripped out, but they are.
They are because I thought that was what you wanted ;)
I'll make it copy blank lines. :D

@Gerome I can't install GCC under Windows, only through Cygwin, and that's not worth the effort..

Gerome

Re: IDEA: Delete double lines (all but the first) in a text file
« Reply #14 on: March 16, 2006, 02:41 PM »
Hi,

Quote from: jgpaiva
> @Gerome I can't install GCC under Windows, only through Cygwin, and that's not worth the effort..

????????????
Install MinGW or the like: DevCPP does this for you excellently...
Yours,

jgpaiva

Re: IDEA: Delete double lines (all but the first) in a text file
« Reply #15 on: March 16, 2006, 02:51 PM »
Quote from: Gerome
> Install MinGW or the like: DevCPP does this for you excellently...
That's good, I'll give it a go next time I need to compile C. But for now, and for the next 4 months, I only see Lisp, POV-Ray, VRML and Java ;)
Maybe next semester. But thanks for the pointer, it'll surely be useful!!

jgpaiva

Re: DONE: Delete double lines (all but the first) in a text file
« Reply #16 on: March 20, 2006, 02:27 PM »
OK, I finally got around to updating this script.
The blank-lines bug is fixed, and the msgbox is right too. I think it's now as you requested, TW ;)

.exe version
.ahk version
« Last Edit: May 02, 2006, 04:49 PM by jgpaiva »

TWmailrec

Re: DONE: Delete double lines (all but the first) in a text file
« Reply #17 on: March 21, 2006, 06:44 PM »
To jgpaiva Charter Member, RepeatedEntries.ahk

Program is now perfect!
(I added back in the %1% check for drag
and drop and command line parameter.)

Many thanks
TW

lanux128

  • Global Moderator
  • Joined in 2005
  • Posts: 6,277
Re: DONE: Delete double lines (all but the first) in a text file
« Reply #18 on: April 05, 2006, 12:37 AM »
jgpaiva & TWmailrec:
I liked your script so much that I modified it a bit for my own use and added a GUI...

but I'm not too clear on why your two scripts differ... e.g. jgpaiva's script adds a linefeed to the end of every line while TWmailrec's deletes all empty lines. Is this on purpose?

In any case, here's the code and a screenshot of the GUI.

the modified code
; Date: Apr. 03, 2006
#Persistent
#SingleInstance force
SetBatchLines,-1
Title=Delete Duplicate Lines

GoSub, ShowMain
Return

ShowMain:
Gui, -SysMenu +MinimizeBox
Gui, Add, GroupBox, x6 y6 w360 h172, %Title%
Gui, Font, s8 CDefault, Tahoma
Gui, Add, Text, x16 y25 w180 h20, Original File:
Gui, Add, Button, x326 y45 w30 h20 gSelectFile, ...
If File =
  Gui, Add, Edit, x16 y45 w300 h20 readonly vFile,
Else
  GuiControl,, File, %file%
;---0---
Gui, Add, Text, x16 y80 w180 h20, Output File:  ;Must not be the same...
If FileOut =
  Gui, Add, Edit, x16 y100 w300 h20 readonly vFileOut,
Else
  GuiControl,, FileOut, %FileOut%
Gui, Add, Button, x100 y140 w75 h25 gProcess, Process
Gui, Add, Button, x200 y140 w75 h25, Quit
Gui, Show, x270 y110 h185 w375, %Title%
Return

SelectFile:
FileSelectFile, File, 1, %A_MyDocuments%, Select text-file for processing, Text Files (*.csv; *.txt)
If File =   ;user presses Cancel...
  Return
GuiControl,, File, %file%
SplitPath, File,CurFile,CurFolder,CurExt,CurFileNoExt,
FileOut=%CurFolder%\%CurFileNoExt%_after.%CurExt%
GuiControl,, FileOut, %FileOut%
Return

Process:
If File =
  Return
FileToRead=%File%
filetowrite=%FileOut%
;To add check-box option, to overwrite existing output file?
;IfExist,%filetowrite%
;  {
;   FileDelete,%filetowrite%
;  }
FileRead,CompleteFile,%FileToRead%
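; Split on LF and trim CR from each element; using both CR and LF as delimiters
; yields an empty element between the CR and LF of every line in a CRLF file
StringSplit,index,CompleteFile,`n,`r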
found=
count:=index0
count2:=count
GoSub,CreateGui2
ProgressFlag:=false
loop,%count%
  {
  GuiControl,2:,bar,%A_Index%
  If ProgressFlag
    break
  position:=A_Index
  Word:=index%position%
  If word is space
  {
    FileAppend,%word%`n,%filetowrite%
    continue
  }
  IfInString,found,%Word%
    continue
  count2-=1
  loop, %count2%
    {
    position2:=position+a_index
    Word2:=index%position2%
    if Word=%word2%
      found=%found% %Word% ,
    }
  Fileappend,%Word%`n,%filetowrite%
  }
if found=
  {
  Msgbox,, %Title%, No duplicate lines were found.
  GoSub, 2GuiEscape
  }
else
  {
  StringTrimRight,found2,found,2
  Msgbox,, %Title%, The following strings were repeated: %found2%
  GoSub, 2GuiEscape
  }

Return

CreateGui2:
  Gui, 2:Add,Text,,Now checking for duplicate entries. Press esc to skip.
  Gui, 2:Add, Progress,vbar w300 h20 -smooth Range0-%count%,
  Gui, 2:Show, ,%Title%
  Return
 
2GuiClose:
  GoSub, ShowMain
  ;exitapp
 
2GuiEscape:
  Gui, 2:destroy
  ProgressFlag:=true
  GoSub, ShowMain
  Return

ButtonQuit:
GuiEscape:
GuiClose:
ExitApp


http://img229.imageshack.us/img229/8475/delduplines37gz.png
« Last Edit: April 05, 2006, 01:34 AM by brotherS »

jgpaiva

Re: DONE: Delete double lines (all but the first) in a text file
« Reply #19 on: April 05, 2006, 04:43 AM »
Quote from: lanux128
> jgpaiva & TWmailrec:
> I liked your script so much that I modified it a bit for my own use and added a GUI...
>
> but I'm not too clear on why your two scripts differ... e.g. jgpaiva's script adds a linefeed to the end of every line while TWmailrec's deletes all empty lines. Is this on purpose?
TW's script is the same one I made before, but with a few modifications he introduced to suit him better. My latest script doesn't remove blank lines because TW asked me not to remove them ;)
But now, the script works well for you, right? :)

lanux128

Re: DONE: Delete double lines (all but the first) in a text file
« Reply #20 on: April 05, 2006, 09:45 PM »
Quote from: jgpaiva
> ...My latest script doesn't remove blank lines because TW asked me not to remove them ;)
> But now, the script works well for you, right? :)

Yes, it works for me. :up: So I'm getting a bit ambitious, but with my limited skill in AHK, I need your help...
I want to add a check-box that overwrites the output file (see screenshot). I've managed to add the check-box, but I can't get it to work in the code.

Gui, Add, Checkbox, x16 y115 CheckedGray vOverwrite_File, Overwrite output file?
...
;Refer check-box option, to overwrite existing output file?
If Overwrite_File
 IfExist,%filetowrite%
   {
    FileDelete,%filetowrite%
   }

now the above code doesn't overwrite the existing file, it only appends it. do you know why?
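
For comparison, the intended behaviour (delete the old output first when the option is ticked, then append) can be sketched in Python. This only illustrates the goal and does not diagnose the AutoHotkey snippet; `write_output` and its `overwrite` parameter are hypothetical names, and the UTF-8 encoding is an assumption.

```python
import os


def write_output(path, lines, overwrite):
    """Append lines to path; if overwrite is true, remove any old file first."""
    if overwrite and os.path.exists(path):
        os.remove(path)               # same idea as FileDelete before FileAppend
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
```

With `overwrite` false, repeated runs keep adding to the same file, which matches the appending behaviour described above.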
« Last Edit: April 05, 2006, 09:57 PM by lanux128 »

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 9,153
Re: DONE: Delete double lines (all but the first) in a text file
« Reply #21 on: April 06, 2006, 12:56 AM »
Quote from: jgpaiva
> Quote from: Gerome
> > Install MinGW or the like: DevCPP does this for you excellently...
> That's good, I'll give it a go next time I need to compile C. But for now, and for the next 4 months, I only see Lisp, POV-Ray, VRML and Java ;)
> Maybe next semester. But thanks for the pointer, it'll surely be useful!!
Or better, install the Microsoft Visual C++ 2003 toolkit. It's a better compiler, and it's free (as in money, not as in source... but who cares, I bet the majority of you haven't made tweaks to gcc or binutils :)).
- carpe noctem

jgpaiva

Re: DONE: Delete double lines (all but the first) in a text file
« Reply #22 on: April 06, 2006, 03:52 AM »
@lanux: Please try "If Overwrite_File = 1" instead of "If Overwrite_File". I guess that has to do with the fact that the checkbox can have 3 states.

Quote from: f0dder
> Or better, install the Microsoft Visual C++ 2003 toolkit. It's a better compiler, and it's free (as in money, not as in source... but who cares, I bet the majority of you haven't made tweaks to gcc or binutils :)).
MSVCpp is free? I thought it was paid... It is a full bundle, with IDE included, right?
What's the difference between MSVCpp and DevCpp?

f0dder

Re: DONE: Delete double lines (all but the first) in a text file
« Reply #23 on: April 06, 2006, 04:03 AM »
The free vc2003 toolkit is just the compiler+linker+libc - you need to get the Platform SDK (includes+libs for GUI development) too, but that's also free. And yes, it's free - even for commercial or non-Windows development, and it's the full optimizing compiler.

DevCpp is a GUI + the GNU GCC compiler. vc2003 typically produces better code than GCC, and IIRC it's even more C++ conformant than the versions of GCC that have been ported to win32. If you need a GUI for it, you can check out Code::Blocks... or the (free) Express edition of vc2005 can probably be modified to be used with it.

DanD

  • Charter Member
  • Joined in 2006
  • Posts: 8
Re: DONE: Delete double lines (all but the first) in a text file
« Reply #24 on: April 14, 2006, 11:59 AM »
I have Perl from http://www.activestate.com/ on my (Windows) PC.  From a command line prompt with Perl you can do something like

D:\Dan\Perl>perl -a -n -e "if (@F) { print unless $h{$_}; $h{$_} = 1 } else { print }" < dc1.txt
line one
line two

line three



line four

D:\Dan\Perl>type dc1.txt
line one
line one
line two

line three



line two
line one
line four
line three

D:\Dan\Perl>

(use output redirection
... > result.txt
to capture the result).
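
For readers who do not use Perl, the one-liner can be read roughly as follows: `-n` loops over the input lines, `-a` splits each line into whitespace-separated fields in `@F`, and a line with no fields (blank or whitespace only) is always printed, while any other line is printed only the first time it is seen. A Python sketch of the same logic, with the invented name `dedupe_keep_blank`:

```python
def dedupe_keep_blank(lines):
    """Yield lines, dropping repeats of non-blank lines; blank lines always pass through."""
    seen = set()
    for line in lines:
        if line.split():              # at least one whitespace-separated field, like @F
            if line not in seen:
                seen.add(line)
                yield line
        else:
            yield line                # blank or whitespace-only line: keep every copy
```

To use it as a filter, something like `sys.stdout.writelines(dedupe_keep_blank(sys.stdin))` (after `import sys`) would read from standard input and write the result to standard output.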

Dan