Loading a Large File in Python
I’m using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines), line by line, using a Python script.
I tried the methods below, and I find they use a very large amount of memory (~3GB).
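(The original snippets are not preserved in this post. Purely as an illustration of the kind of code in question, the classic pattern that behaves this way reads every line up front; datafile and process() here are hypothetical stand-ins:)

    # Hypothetical reconstruction, not the exact original code.
    # readlines() builds a list holding every line of the ~1GB file,
    # so the whole file plus per-string overhead sits in memory at once.
    for line in open(datafile).readlines():
        process(line)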
Is there a better way to load a large file line by line, one that does not hold the whole file in memory at once?
Several suggestions propose the methods I mentioned above, which I have already tried; I’m trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.
P/S: I have done some memory profiling using Heapy.
Update 20 August 2012, 16:41 (GMT+1)
Tried both approaches suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable).
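The exact snippets are not reproduced here, but the standard forms of those two suggestions look like this (process() again stands in for my per-line work):

    # Approach 1: iterate over the file object directly;
    # the file is read one buffered chunk at a time, not all at once.
    with open(datafile) as f:
        for line in f:
            process(line)

    # Approach 2: the fileinput module, which also iterates lazily.
    import fileinput
    for line in fileinput.input([datafile]):
        process(line)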
Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB and consists of 21,181,079 lines. I can see memory usage climbing over time, in steps of roughly 40-80MB, before it stabilizes at 3GB.
An elementary doubt remains: where is all that memory actually going? I did memory profiling using Heapy to understand this better.
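For reference, a minimal sketch of how Heapy (from the guppy package) produces this kind of multi-level drill-down; the index and classifier choices below are illustrative rather than my exact calls:

    from guppy import hpy

    hp = hpy()
    heap = hp.heap()        # Level 1: the whole heap, partitioned by kind
    print heap
    print heap[0]           # Level 2: drill into the largest partition
    print heap[0].byrcs     # Level 3: re-partition it by referrer pattern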
Level 1 Profiling
Level 2 Profiling for Level 1-Index 0
Level 2 Profiling for Level 1-Index 1
Level 2 Profiling for Level 1-Index 2
Level 2 Profiling for Level 1-Index 3
Level 3 Profiling for Level 2-Index 0, Level 1-Index 0
Level 3 Profiling for Level 2-Index 0, Level 1-Index 1
Level 3 Profiling for Level 2-Index 0, Level 1-Index 2
Level 3 Profiling for Level 2-Index 0, Level 1-Index 3
Still troubleshooting this.
Please share if you have faced this before.
Thanks for your help.
Update 21 August 2012, 01:55 (GMT+1)
Unfortunately, the memory usage is still at 3GB, and the output (snippet) is as below.
I did the same memory profiling as before:
Level 1 Profiling
Comparing the previous memory-profiling output with the one above: str has lost 45 objects (17376 bytes), tuple has lost 25 objects (3440 bytes), and dict (no owner), though unchanged in object count, has shrunk by 1536 bytes. All other objects are the same, including the dict of __main__.NodeStatistics. The total number of objects is 35474. The small reduction in objects (0.2%) produced a 99.3% saving in the memory Heapy reports (from 3GB down to 22MB). Very strange.
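One way to make such before/after comparisons less manual is Heapy's relative heap, which reports only what was allocated after a checkpoint. A small sketch, assuming the profiling runs in the same process (run_processing() is a hypothetical stand-in for my file-reading loop):

    from guppy import hpy

    hp = hpy()
    hp.setrelheap()     # checkpoint: count only allocations from here on
    run_processing()    # hypothetical stand-in for the per-line work
    print hp.heap()     # shows just the objects created since the checkpoint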
As you can see, although I know where the memory starvation is occurring, I am not yet able to narrow down which part is causing the bleed.
Will continue to investigate this.
Thanks for all the pointers; I’m using this opportunity to learn a lot about Python, as I’m no expert. I appreciate the time you have taken to assist me.
Update 23 August 2012, 00:01 (GMT+1) — SOLVED
For the record, I am using 3 classes with 136 counters in total.
I am happy with the result and glad to close this issue.
Thanks for all your guidance; I truly appreciate it.
This works because files are iterators that yield one line at a time until there are no more lines to yield.
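A quick way to see that behaviour is to pull lines one at a time with the built-in next(); only the lines actually consumed are materialized as strings (datafile is a stand-in path):

    f = open(datafile)
    first = next(f)     # reads exactly one line from the buffer
    second = next(f)    # each call advances the iterator by one more line
    f.close()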