Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Import CSV files with column headers and correct column data types #4450

Open
cmbarton opened this issue Oct 4, 2024 · 9 comments
Open
Labels
enhancement New feature or request

Comments

@cmbarton
Copy link
Contributor

cmbarton commented Oct 4, 2024

CSV files are the most common and widely readable file format for tabular data. But GRASS has no easy way to import them correctly.

db.in.ogr now imports CSV files with correct column headers (see #943). But it imports all columns as strings, regardless of the data in the columns. Because tables imported this way do not have any numeric columns they cannot be used to generate points or to connect with an existing spatial dataset via a common cat field

v.in.ascii can import CSV files and create vector points. It does correctly recognize data types in CSV columns--including a CAT column. But for unknown reasons it does not import the column headers. Of course, you can create new column headers, but this becomes a PITA with many columns--especially as it requires an arcane format to do so.

A grass extension v.in.csv does some of this. However, it requires a separate Python package, pyproj, to do so. It also does not automatically recognize data types in columns like v.in.ascii does, although you can specify which columns are numeric (integer or real). It also seems to assume that the CSV table only has point coordinates in latlon.

So db.in.ogr needs to recognize column data types, or allow a user to specify them, and v.in.ascii needs to recognize column headers. Ideally, perhaps, would be a combination of these--something that could import a CSV table correctly and optionally create vector points from that table, optionally identify a CAT column, and optionally allow a user to define column data types to override the automatic data types recognition.

Related issue: #943

@cmbarton cmbarton added the enhancement New feature or request label Oct 4, 2024
@neteler
Copy link
Member

neteler commented Oct 4, 2024

Did you see https://grass.osgeo.org/grass-stable/manuals/db.in.ogr.html#import-csv-file (using .cvst file for column type recognition)?

@cmbarton
Copy link
Contributor Author

cmbarton commented Oct 4, 2024

Yes. I've done that. It is equally a pain, especially with multi-column tables and sometimes does not work correctly. Most other data analytical programs manage to open and read these ASCII text files pretty well. v.in.ascii comes very close and is only missing recognizing column names for creating points. If it had the ability to do this and the option to import the table without creating points, it would cover >95% of the cases.

@rouault
Copy link
Member

rouault commented Oct 5, 2024

The OGR CSV driver has a AUTODETECT_TYPE=YES open option

$ printf "id,float_val,my_str\n1,2.5,foo\n" > test.csv
$ ogrinfo test.csv -oo AUTODETECT_TYPE=YES  -al -q

Layer name: test
OGRFeature(test):1
  id (Integer) = 1
  float_val (Real) = 2.5
  my_str (String) = foo

@cmbarton
Copy link
Contributor Author

cmbarton commented Oct 5, 2024

Interesting. Could this be a flag in db.in.ogr? To test, is this something that can be entered in the ogr_doo argument?

@rouault
Copy link
Member

rouault commented Oct 5, 2024

To test, is this something that can be entered in the ogr_doo argument?

don't know the GRASS side of things, but from https://grass.osgeo.org/grass84/manuals/v.in.ogr.html, I assume this should be gdal_doo=AUTODETECT_TYPE=YES

@cmbarton
Copy link
Contributor Author

cmbarton commented Oct 6, 2024

Yes. This works!. It should be the default in db.in.ogr, perhaps with a flag to disable it and use a *.cvst instead.

Too bad that we can't do this with v.in.ogr, since it won't accept cvs files with defined x and y coord. columns.

But if v.in.ascii can just keep the header field, then that would be a good alternative too.

@ecodiv
Copy link
Contributor

ecodiv commented Oct 6, 2024

@cmbarton I fully agree with your points. One thing is that the column names are stored in the history of the vector layer.
I remember I used that in a Python script to rename the names of the attribute table back to the original names. It was something similar to the code below. An admittedly quick and dirty solution :-/

# Get the history, filter out the column names, split them into a list
a = list(gs.parse_command("v.info", map="AAA", flags="h").keys())[-1].split("|")

# Get a list with the column names of the vector layer imported using v.in.ascii
# and remove 'cat' from the list.
b = list(gs.parse_command("db.columns", table="AAA").keys())[1:]

# Replace the column names with the original names.
for id, name in enumerate(b):
    gs.parse_command("v.db.renamecolumn", map="AAA", column=f"{name},{a[id]}")

@cmbarton
Copy link
Contributor Author

cmbarton commented Oct 9, 2024

Thanks for the workaround .

@cmbarton
Copy link
Contributor Author

Splitting this issue into 2 separate feature requests #4593 and #4594 because it deals with 2 different GRASS commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants