Use this tag for questions related to the distributed computing framework published by Apache.
Questions tagged [apache-spark]
19 questions
2
votes
1 answer
Can we use Cassandra in place of Hadoop with Spark?
Considering we have a backend written in Node.js that uses MySQL and Cassandra as its databases, if we want to add Spark to the system to do some data analysis such as recommendations, can we do it with Cassandra (I mean using Spark + Cassandra)…
user20551429
- 69
- 2
2
votes
1 answer
Best practices for large JOINs - Warehouse or External Compute (e.g. Spark)
I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward, but the amount of data I am processing is very large. I am wondering whether there is a preferred type of technology for very large JOINs. For example, is…
Arthur Putnam
- 553
- 2
- 6
- 12
2
votes
2 answers
Check the time that PostgreSQL is taking to automatically create existing indexes when you do bulk insert using copy command from Spark
I am doing a bulk insert from Spark into a Postgres table. The amount of data I am ingesting is huge: around 120-130 million records.
I am first saving the records as multiple CSV files in a distributed storage location, i.e. S3…
Nikunj Kakadiya
- 123
- 6
1
vote
0 answers
Writing large dataset from spark dataframe
We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are writing this out to a PostgreSQL DB (also in Azure). Currently we are using the JDBC connector to write row out…
Kyle Chamberlin
- 13
- 2
1
vote
0 answers
Why does MySQL select offset not work in pyspark?
select distinct day
from t1
order by 1 desc limit 1,1
should return the second latest date in t1, but this error is thrown:
ParseException: mismatched input ',' expecting {<EOF>, ';'}
uhmosdhsjxpbcrstis
- 11
- 1
1
vote
1 answer
Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting
I have a fact table and an SCD type 2 dimension table. I want to produce a sales report by region and year.
I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little…
Kashyap
- 145
- 6
1
vote
2 answers
When will the Spark Cassandra connector with support for Spark 3.3 be released?
The master branch has a merged pull request supporting Spark 3.3. How long before this gets built and published to the Maven repository?
Rubber Duck
- 111
- 3
1
vote
2 answers
Cassandra-4-Update: Multiple Schema Versions
After upgrading our first node, it has a different schema version (according to nodetool describecluster). This caused Spark jobs to hang because of a recurring "schema agreement not reached" from metadata.SchemaAgreementChecker.
Is this different…
Sven
- 11
- 2
1
vote
0 answers
MySQL is not showing the history queries after the ETL is complete
I am using Docker containers for MySQL and Spark. Both containers are on an AWS EC2 instance.
The PySpark ETL connects to MySQL over JDBC and starts extracting the data. When the ETL is running, I can see the long-running threads/queries in MySQL with command…
nomad123
- 19
- 1
- 3
1
vote
0 answers
PySpark error inserting into a MySQL database table that exists
I am trying to insert into an existing MySQL table using a PySpark JDBC connection; however, I get the following error (picture attached).
Can I get assistance with this error? The table exists in the MySQL database, and I was successful in inserting with a…
IIShriyaII
- 11
- 3
1
vote
0 answers
HIVE + understanding the hive-metastore logs
We have an HDP cluster, version 2.6.4, and we run a Spark Streaming app,
and we use a Presto cluster to run Hive queries.
When we look at the Hive metastore logs (under /var/log/hive), we can see the following warnings, which repeat…
King David
- 111
- 1
- 4
0
votes
0 answers
Py4JJavaError while creating SparkSession
I encountered this error while trying to create a Spark session in a Python virtual environment using VS Code as the IDE. The code I ran and its output are below; please help.
Code
spark = SparkSession.builder\
.master("local[1]")\
…
tomiealff
- 1
- 2
0
votes
1 answer
spark-cassandra-connector read throughput unpredictable
A user reports that the range query throughput is far higher than expected when setting spark.cassandra.input.readsPerSec in the spark-cassandra-connector.
Job dependencies. The Java driver version is set to 4.13.0.
…
Paul
- 416
- 2
- 6
0
votes
0 answers
Is there an open source implementation of QGM (Query Graph Model)?
I am building a new system that needs to interact essentially as an SQL backend. We would like to import logical queries into it (e.g. from Apache Spark or Postgres and related things) and want to develop an internal representation (IR) for them. …
intel_chris
- 141
- 3
0
votes
3 answers
Data storage for analytics
I have to store some amount of data for analytical purposes.
The data source produces 2 TB of data per month.
Data is collected on a monthly basis (not real-time).
Data is fully structured.
There are 100+ different columns of data.
Availability of SQL…
Leeloo
- 111
- 5